Video Object Segmentation Algorithm Based on Multi-scale Feature Enhancement and Global-Local Feature Aggregation

HOU Zhiqiang; DONG Jiale; MA Sugang; WANG Chenxu; YANG Xiaobao; WANG Yunchen

doi:10.11999/JEIT231394

cstr: 32379.14.JEIT231394

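1. School of Computer Science, Xi'an University of Posts and Telecommunications, Xi'an 710121, China
2. Shaanxi Key Laboratory of Network Data Analysis and Intelligent Processing, Xi'an University of Posts and Telecommunications, Xi'an 710121, China

Funds: National Natural Science Foundation of China (62072370); Natural Science Foundation of Shaanxi Province (2023-JC-YB-598)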

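About the authors:
HOU Zhiqiang: male, Ph.D., professor. Research interests: computer vision and object tracking.

DONG Jiale: male, master's student. Research interests: computer vision and video object segmentation.

MA Sugang: male, Ph.D., professor. Research interests: computer vision and machine learning.

WANG Chenxu: male, master's student. Research interests: computer vision and video object segmentation.

YANG Xiaobao: male, Ph.D., lecturer. Research interests: computer graphics and artificial intelligence.

WANG Yunchen: female, Ph.D., lecturer. Research interests: computer graphics and image classification.

Corresponding author:
DONG Jiale, djl112299@163.com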

CLC number: TN911.73; TP391.41
Publication history
- Received: 2023-12-18
- Revised: 2024-09-25
- Available online: 2024-09-30
- Published: 2024-11-01


Abstract

Abstract: To address the insufficient multi-scale feature representation and the underuse of shallow features in memory-network algorithms, this paper proposes a Video Object Segmentation (VOS) algorithm based on multi-scale feature enhancement and global-local feature aggregation. First, a multi-scale feature enhancement module fuses feature information at different scales from the reference mask branch and the reference RGB branch, which strengthens the multi-scale feature representation. Second, a global-local feature aggregation module extracts features with convolutions of different receptive-field sizes and adaptively fuses the features of global and local regions, so that both the global characteristics and the fine details of the target are captured and segmentation accuracy improves. Finally, a cross-layer fusion module uses the spatial detail in shallow features to refine the segmentation mask: fusing shallow features with deep features better preserves the details and edges of the target. Experiments on the public datasets DAVIS2016, DAVIS2017, and YouTube-2018 show that the proposed algorithm reaches overall scores of 91.8%, 84.5%, and 83.0%, respectively, and runs in real time on both single-object and multi-object segmentation tasks.
Keywords:
- Video Object Segmentation (VOS) /
- Memory network /
- Siamese network /
- Feature fusion /
- Mask refinement
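The abstract describes the global-local feature aggregation module only at a high level. The following toy sketch illustrates the general idea of combining a small-receptive-field branch with a large-receptive-field branch through an adaptive gate. It is not the authors' module: the box filters stand in for learned convolutions, and the function names, the radii, and the sigmoid gate with `gate_scale` are all assumptions made for illustration.

```python
import math


def box_filter(feat, radius):
    """Average every position over a square window of half-width `radius`,
    clipped at the borders. A larger radius means a larger receptive field."""
    h, w = len(feat), len(feat[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            window = [
                feat[y][x]
                for y in range(max(0, i - radius), min(h, i + radius + 1))
                for x in range(max(0, j - radius), min(w, j + radius + 1))
            ]
            out[i][j] = sum(window) / len(window)
    return out


def aggregate_global_local(feat, local_radius=1, global_radius=3, gate_scale=1.0):
    """Fuse a local branch and a global-context branch with a per-position
    sigmoid gate (a hypothetical stand-in for learned fusion weights)."""
    local = box_filter(feat, local_radius)
    global_ctx = box_filter(feat, global_radius)
    fused = []
    for row_loc, row_glo in zip(local, global_ctx):
        fused_row = []
        for loc, glo in zip(row_loc, row_glo):
            # The gate favours the local response where it exceeds the context.
            gate = 1.0 / (1.0 + math.exp(-gate_scale * (loc - glo)))
            fused_row.append(gate * loc + (1.0 - gate) * glo)
        fused.append(fused_row)
    return fused


# A 5x5 single-channel feature map with a small bright object.
feature_map = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
fused_map = aggregate_global_local(feature_map)
```

Because the gate produces a convex combination, every fused value lies between the local and the global response at that position. A trained network would learn both the filters and the fusion weights instead of fixing them as done here.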

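Figure 1 Overall framework of the video object segmentation algorithm with multi-scale feature enhancement and global-local feature aggregation

Figure 2 Multi-scale feature enhancement module

Figure 3 Global-local feature aggregation module

Figure 4 Cross-layer fusion module

Figure 5 Comparison of accuracy and speed between the proposed algorithm and recent algorithms on the DAVIS2016 and DAVIS2017 validation sets

Figure 6 Comparison of selected segmentation results of the proposed algorithm and competing algorithms on the DAVIS2017 dataset

Figure 7 Selected qualitative results of the proposed algorithm on the DAVIS2017 and YouTube-2018 datasets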

Table 1 Performance comparison of different algorithms on the DAVIS2016 and DAVIS2017 validation sets

Algorithm	Source	DAVIS2016					DAVIS2017
Algorithm	Source	J&F	J	F	Speed (fps)	Time (s)	J&F	J	F	Speed (fps)	Time (s)
OSVOS^[5]	CVPR2017	80.2	79.8	80.6	0.10	10.00	60.3	56.6	63.9	0.1	10.00
OnAVOS^[7]	CVPRW2017	85.5	86.1	84.9	0.08	12.50	63.6	61.0	66.1	0.05	22.0
OSVOS-S^[25]	TPAMI2018	86.6	85.6	87.5	0.20	5.00	68.0	64.7	71.3	0.1	10.00
OSNM^[26]	CVPR2018	73.5	74.0	72.9	7.70	0.13	54.8	52.5	57.1	7.0	0.14
FAVOS^[27]	CVPR2018	82.4	79.5	80.9	0.60	1.67	58.2	54.6	61.8	5.6	0.18
AGAME^[14]	CVPR2019	82.1	82.0	82.2	14.00	0.07	70.0	67.4	72.6	14.0	0.07
RANet^[28]	ICCV2019	85.5	85.5	85.4	33.00	0.03	65.7	63.2	68.2	33.0	0.03
FTMU^[29]	CVPR2020	78.9	77.5	80.3	11.00	0.09	70.6	69.1	72.1	11.0	0.09
SSM^[19]	T-CSVT2021	85.9	86.2	85.6	37.00	0.03	77.6	75.3	79.9	--	--
TMO^[20]	TCSVT2023	86.1	85.6	86.6	43.20	0.02	72.3	69.9	74.7	37.0	0.03
STM^[11]	ICCV2019	89.3	88.7	89.9	10.30	0.10	81.8	79.2	84.3	8.8	0.11
FRTM^[21]	CVPR2020	83.6	83.7	83.4	21.9	0.05	76.7	73.8	79.6	21.9	0.05
GC^[15]	ECCV2020	86.6	87.6	85.7	25.00	0.04	71.4	69.3	73.5	--	--
KMN^[16]	ECCV2020	90.5	89.5	83.6	9.00	0.11	82.8	80.0	85.6	8.0	0.13
TransVOS^[22]	CVPR2021	90.5	89.8	91.2	--	--	83.9	81.4	86.4	--	--
MTMFI^[23]	Neurocomputing2022	85.2	84.9	85.5	13.70	0.07	77.6	74.6	80.6	13.7	0.07
ILTR^[24]	Chinese Journal of Computers 2022	84.6	84.9	84.3	18.00	0.06	72.9	70.0	75.8	--	--
KMN^M^[17]	TPAMI2022	91.2	90.2	92.1	8.00	0.13	83.5	80.9	86.1	8.0	0.13
LLB^[30]	AAAI2023	--	--	--	--	--	84.6	81.5	87.7	8.3	0.12
MGLAS	This paper	91.8	90.6	93.0	33.45	0.03	84.5	81.6	87.3	26.6	0.04
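To make the columns of Table 1 easier to read, the sketch below shows how the reported quantities relate under the standard DAVIS definitions: J is the region similarity (intersection over union of predicted and ground-truth masks), J&F is the mean of J and the contour accuracy F, and the time column is the reciprocal of the speed in frames per second. The function names are illustrative, the masks are toy data, and returning 1.0 for two empty masks is a convention assumed here.

```python
def region_similarity(pred, gt):
    """J: intersection over union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j, f):
    """J&F: arithmetic mean of region similarity J and contour accuracy F."""
    return (j + f) / 2


def seconds_per_frame(fps):
    """Per-frame processing time implied by a speed in frames per second."""
    return 1.0 / fps


# Toy masks: the prediction covers two pixels, one of which is correct.
toy_j = region_similarity([[1, 1], [0, 0]], [[1, 0], [0, 0]])

# Reproduce the MGLAS row of Table 1 on DAVIS2016 from its J, F and fps values.
mglas_jf = round(j_and_f(90.6, 93.0), 1)
mglas_time = round(seconds_per_frame(33.45), 2)
```

With the values in the MGLAS row, the mean of J = 90.6 and F = 93.0 gives the listed J&F of 91.8, and 33.45 fps corresponds to roughly 0.03 s per frame.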

Table 2 Performance comparison of different algorithms on the YouTube-2018 validation set

Algorithm	Source	G	J_s	J_u	F_s	F_u
MSK^[13]	CVPR2017	53.1	59.9	45.0	59.5	47.9
OnAVOS^[7]	CVPRW2017	55.2	60.1	46.6	62.7	51.4
OSVOS^[5]	CVPR2017	58.8	59.8	54.2	60.5	60.7
OSNM^[26]	CVPR2018	51.2	60.0	40.6	60.1	44.0
RGMP^[8]	CVPR2018	53.8	59.5	45.2	--	--
AGAME^[14]	CVPR2019	66.0	66.9	61.2	--	--
STM^[11]	ICCV2019	78.9	78.6	73.3	82.8	80.9
FRTM^[21]	CVPR2020	65.7	68.6	58.4	71.3	64.5
SSM^[19]	T-CSVT2021	66.5	72.3	57.8	73.3	62.6
TransVOS^[22]	CVPR2021	81.8	82.0	75.0	86.7	83.4
ILTR^[24]	Chinese Journal of Computers 2022	73.8	73.9	67.5	77.9	75.7
KMN^M^[17]	TPAMI2022	81.4	81.4	75.3	85.6	83.3
LLB^[30]	AAAI2023	83.8	82.1	79.1	87.0	87.0
MGLAS	This paper	83.0	81.9	77.9	86.5	85.7
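The overall score G in Table 2 follows the usual YouTube-VOS convention: it is the average of J and F over seen (s) and unseen (u) object categories. The short sketch below checks this against two rows of the table; the function name is illustrative and not taken from the paper.

```python
def overall_score(j_seen, j_unseen, f_seen, f_unseen):
    """G: mean of J and F over seen and unseen categories."""
    return (j_seen + j_unseen + f_seen + f_unseen) / 4


# Values copied from the MGLAS and STM rows of Table 2.
mglas_g = round(overall_score(81.9, 77.9, 86.5, 85.7), 1)
stm_g = round(overall_score(78.6, 73.3, 82.8, 80.9), 1)
```

Both results match the G column (83.0 for MGLAS and 78.9 for STM). The gap between the seen and unseen columns indicates how well a method generalizes to categories absent from training.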

Table 3 Ablation study of the proposed algorithm on the DAVIS2017 validation set

Baseline	MFEM	GLFAM	CFM	J&F	J	F
√				81.8	79.2	84.3
√	√			83.2	79.9	86.5
√		√		83.5	80.6	86.4
√			√	83.5	80.0	86.9
√	√	√	√	84.5	81.6	87.3
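The ablation results in Table 3 are easier to compare as gains over the baseline. The sketch below computes the J&F improvement of each configuration. The column abbreviations presumably stand for the three modules named in the abstract (multi-scale feature enhancement, global-local feature aggregation, and cross-layer fusion); the dictionary keys and the function name are illustrative.

```python
def gains_over_baseline(baseline, variants):
    """Return the J&F improvement of each variant relative to the baseline."""
    return {name: round(score - baseline, 1) for name, score in variants.items()}


# J&F values copied from Table 3 (DAVIS2017 validation set).
baseline_jf = 81.8
variant_jf = {
    "MFEM": 83.2,
    "GLFAM": 83.5,
    "CFM": 83.5,
    "MFEM+GLFAM+CFM": 84.5,
}
gains = gains_over_baseline(baseline_jf, variant_jf)
```

Each module alone adds between 1.4 and 1.7 points of J&F, while all three together add 2.7 points. The individual gains therefore do not add up linearly, which suggests that the modules partly address overlapping weaknesses of the baseline.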

References (30)

[1]	ERDÉLYI A, BARÁT T, VALET P, et al. Adaptive cartooning for privacy protection in camera networks[C]. 2014 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Seoul, Korea (South), 2014: 44–49. doi: 10.1109/AVSS.2014.6918642.
[2]	WANG Wenguan, SHEN Jianbing, PORIKLI F, et al. Semi-supervised video object segmentation with super-trajectories[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(4): 985–998. doi: 10.1109/TPAMI.2018.2819173.
[3]	SALEH K, HOSSNY M, and NAHAVANDI S. Kangaroo vehicle collision detection using deep semantic segmentation convolutional neural network[C]. 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA), Gold Coast, Australia, 2016: 1–7. doi: 10.1109/DICTA.2016.7797057.
[4]	LU Xiankai, WANG Wenguan, SHEN Jianbing, et al. Learning video object segmentation from unlabeled videos[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 8957–8967. doi: 10.1109/CVPR42600.2020.00898.
[5]	CAELLES S, MANINIS K K, PONT-TUSET J, et al. One-shot video object segmentation[C]. The IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 5320–5329. doi: 10.1109/CVPR.2017.565.
[6]	CHENG H K, TAI Y W, and TANG C K. Modular interactive video object segmentation: Interaction-to-mask, propagation and difference-aware fusion[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 5555–5564. doi: 10.1109/CVPR46437.2021.00551.
[7]	VOIGTLAENDER P and LEIBE B. Online adaptation of convolutional neural networks for video object segmentation[C]. British Machine Vision Conference 2017, London, UK, 2017.
[8]	OH S W, LEE J Y, SUNKAVALLI K, et al. Fast video object segmentation by reference-guided mask propagation[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7376–7385. doi: 10.1109/CVPR.2018.00770.
[9]	XU Jindong, ZHAO Tianyu, FENG Guozheng, et al. Image segmentation algorithm based on context fuzzy C-means clustering[J]. Journal of Electronics & Information Technology, 2021, 43(7): 2079–2086. doi: 10.11999/JEIT200263.
[10]	HANG Hao, HUANG Yingping, ZHANG Xurui, et al. Design of swin transformer for semantic segmentation of road scenes[J]. Opto-Electronic Engineering, 2024, 51(1): 230304. doi: 10.12086/oee.2024.230304.
[11]	OH S W, LEE J Y, XU Ning, et al. Video object segmentation using space-time memory networks[C]. The IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 9225–9234. doi: 10.1109/ICCV.2019.00932.
[12]	LUITEN J, VOIGTLAENDER P, and LEIBE B. PReMVOS: Proposal-generation, refinement and merging for video object segmentation[C]. 14th Asian Conference on Computer Vision, Perth, Australia, 2019: 565–580. doi: 10.1007/978-3-030-20870-7_35.
[13]	PERAZZI F, KHOREVA A, BENENSON R, et al. Learning video object segmentation from static images[C]. The IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 3491–3500. doi: 10.1109/CVPR.2017.372.
[14]	JOHNANDER J, DANELLJAN M, BRISSMAN E, et al. A generative appearance model for end-to-end video object segmentation[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 8945–8954. doi: 10.1109/CVPR.2019.00916.
[15]	LI Yu, SHEN Zhuoran, and SHAN Ying. Fast video object segmentation using the global context module[C]. 16th European Conference on Computer Vision, Glasgow, UK, 2020: 735–750. doi: 10.1007/978-3-030-58607-2_43.
[16]	SEONG H, HYUN J, and KIM E. Kernelized memory network for video object segmentation[C]. 16th European Conference on Computer Vision, Glasgow, UK, 2020: 629–645. doi: 10.1007/978-3-030-58542-6_38.
[17]	SEONG H, HYUN J, and KIM E. Video object segmentation using kernelized memory network with multiple kernels[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(2): 2595–2612. doi: 10.1109/TPAMI.2022.3163375.
[18]	KINGMA D P and BA J. Adam: A method for stochastic optimization[C]. 3rd International Conference on Learning Representations, San Diego, USA, 2015.
[19]	ZHU Wencheng, LI Jiahao, LU Jiwen, et al. Separable structure modeling for semi-supervised video object segmentation[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(1): 330–344. doi: 10.1109/TCSVT.2021.3060015.
[20]	CHO S, LEE M, LEE S, et al. Treating motion as option to reduce motion dependency in unsupervised video object segmentation[C]. The IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, USA, 2023: 5129–5138. doi: 10.1109/WACV56688.2023.00511.
[21]	ROBINSON A, LAWIN F J, DANELLJAN M, et al. Learning fast and robust target models for video object segmentation[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 7404–7413. doi: 10.1109/CVPR42600.2020.00743.
[22]	MEI Jianbiao, WANG Mengmeng, LIN Yeneng, et al. TransVOS: Video object segmentation with transformers[J]. arXiv: 2106.00588, 2021. doi: 10.48550/arXiv.2106.00588.
[23]	GAO Bocong, ZHAO Yuqian, ZHANG Fan, et al. Video object segmentation based on multi-level target models and feature integration[J]. Neurocomputing, 2022, 492: 396–407. doi: 10.1016/j.neucom.2022.04.042.
[24]	XU Kai, LI Guorong, HONG Dexiang, et al. A fast video object segmentation method based on inductive learning and transductive reasoning[J]. Chinese Journal of Computers, 2022, 45(10): 2117–2132. doi: 10.11897/SP.J.1016.2022.02117.
[25]	MANINIS K K, CAELLES S, CHEN Yuhua, et al. Video object segmentation without temporal information[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(6): 1515–1530. doi: 10.1109/TPAMI.2018.2838670.
[26]	YANG Linjie, WANG Yanran, XIONG Xuehan, et al. Efficient video object segmentation via network modulation[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 6499–6507. doi: 10.1109/CVPR.2018.00680.
[27]	CHENG Jingchun, TSAI Y H, HUNG W C, et al. Fast and accurate online video object segmentation via tracking parts[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7415–7424. doi: 10.1109/CVPR.2018.00774.
[28]	WANG Ziqin, XU Jun, LIU Li, et al. RANet: Ranking attention network for fast video object segmentation[C]. The IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 3977–3986. doi: 10.1109/ICCV.2019.00408.
[29]	SUN Mingjie, XIAO Jimin, LIM E G, et al. Fast template matching and update for video object tracking and segmentation[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 10788–10796. doi: 10.1109/CVPR42600.2020.01080.
[30]	LAN Meng, ZHANG Jing, ZHANG Lefei, et al. Learning to learn better for video object segmentation[C]. The AAAI Conference on Artificial Intelligence, Washington, USA, 2023: 1205–1212. doi: 10.1609/aaai.v37i1.25203.