集成全局局部特征交互与角动量机制的端到端多目标跟踪算法

计忠平; 王相威; 何志伟; 杜晨杰; 金冉; 柴本成

doi:10.11999/JEIT240277

集成全局局部特征交互与角动量机制的端到端多目标跟踪算法

doi: 10.11999/JEIT240277

1.
杭州电子科技大学计算机学院杭州 310018
2.
浙江省装备电子研究重点实验室杭州 310018
3.
浙江万里学院大数据与软件工程学院宁波 315100

基金项目: 国家自然科学基金(61671192)，中国博士后科学基金(2017M114)，浙江省自然科学基金(LY22F020025)，浙江省教育厅一般项目(Y202351320)

详细信息

作者简介:
计忠平：男，教授，研究方向为计算机视觉、机器学习

王相威：男，硕士，研究方向为计算机视觉、目标跟踪

何志伟：男，教授，研究方向为计算机视觉、汽车电子技术

杜晨杰：男，讲师，研究方向为计算机视觉、目标跟踪

金冉：男，教授，研究方向为跨媒体分析、多模态信息处理

柴本成：男，副教授，研究方向为目标跟踪、智能信息处理

通讯作者:
杜晨杰　ducj@hdu.edu.cn

中图分类号: TN911.73;TP391.41
计量
- 文章访问数: 329
- HTML全文浏览量: 147
- PDF下载量: 47
- 被引次数: 0
出版历程
- 收稿日期: 2024-04-15
- 修回日期: 2024-08-25
- 网络出版日期: 2024-08-30
- 刊出日期: 2024-09-26

End-to-end Multi-Object Tracking Algorithm Integrating Global Local Feature Interaction and Angular Momentum Mechanism

1.
School of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China
2.
Zhejiang Provincial Key Laboratory of Equipment Electronics, Hangzhou 310018, China
3.
College of Big Data and Software Engineering, Zhejiang Wanli University, Ningbo 315100, China

Funds: The National Natural Science Foundation of China (61671192), China Postdoctoral Science Foundation (2017M114), The Natural Foundation of Zhejiang Province (LY22F020025), The General Project of the Zhejiang Provincial Department of Education (Y202351320)

摘要

摘要: 针对多目标跟踪(MOT)算法性能对于检测准确度和数据关联策略的依赖性问题，该文提出一种新的端到端算法。在检测方面，首先基于特征金字塔网络，提出空间残差特征金字塔模块(SRFPN)，以提升特征融合和信息传递的效率。随后，引入全局局部特征交互模块(GLFIM)来平衡局部细节和全局上下文信息，增强多尺度特征的专注度，提高模型对目标尺度变化的适应性。在关联方面，引入角动量机制(AMM)，充分考虑目标运动方向，以提升连续帧之间目标匹配的精确性。在MOT17和UAVDT数据集上进行实验验证，所提跟踪器的检测性能和关联性能均显著提升，并且在目标遮挡、尺度变化和杂乱背景等复杂场景下表现出良好的鲁棒性。
- 目标跟踪 /
- 特征金字塔网络 /
- 全局局部特征交互 /
- 角动量
Abstract: A novel end-to-end algorithm is proposed to tackle the dependency of Multi-Object Tracking (MOT) algorithm performance on detection accuracy and data association strategies. Concerning detection, the Spatial Residual Feature Pyramid Network (SRFPN) is introduced based on feature pyramid networks to enhance feature fusion and information propagation efficiency. Subsequently, a Global Local Feature Interaction Module (GLFIM) is introduced to balance local details and global contextual information, thereby improving the focus of multi-scale feature outputs and the model’s adaptability to target scale variations. Regarding the association, an Angular Momentum Mechanism (AMM) is introduced to consider target motion direction, thereby enhancing the accuracy of target matching between consecutive frames. Experimental validation on MOT17 and UAVDT datasets demonstrates significant enhancements in both detection and association performance of the proposed tracker, showcasing robustness in complex scenarios such as target occlusion, scale variation, and cluttered backgrounds.
- Object tracking /
- Feature Pyramid Network(FPN) /
- Global local feature interactions /
- Angular momentum

HTML全文

图 1 本文跟踪算法框图

下载: 全尺寸图片幻灯片

图 2 空间残差特征金字塔网络

下载: 全尺寸图片幻灯片

图 3 全局局部特征交互模块

下载: 全尺寸图片幻灯片

图 4 SRFPN消融结果

下载: 全尺寸图片幻灯片

图 5 GLFIM消融结果

下载: 全尺寸图片幻灯片

图 6 AMM修正代价矩阵

下载: 全尺寸图片幻灯片

图 7 AMM消融结果

下载: 全尺寸图片幻灯片

图 8 MOT17数据集与UAVDT数据上的可视化结果

下载: 全尺寸图片幻灯片

表 1 MOT17数据集上的实验结果对比

算法	MOTA↑(%)	IDF1↑(%)	MOTP↑	MT↑(%)	ML↓(%)	IDs↓	FN↓	FP↓	Hz↑
MPNTrack^[19]	58.8	61.7	–	28.8	33.5	1 185	213 594	17 413	6.5
CRF-RNN^[20]	53.1	53.7	76.1	24.2	30.7	2 518	234 991	27 194	1.4
DSORT^[5]	60.3	61.2	79.1	31.5	20.3	2 442	185 301	36 111	20.0
Tracktor++^[21]	53.5	52.3	78.0	19.5	36.6	4 611	248 047	12 201	1.5
CSE-FF^[22]	67.7	56.1	77.9	33.7	22.8	5 706	–	–	30.0
CTracker^[11]	66.6	57.4	78.2	32.2	24.2	5 529	160 491	22 284	34.4
本文	68.1	58.7	78.6	34.0	21.2	5 535	159 474	15 104	29.8
注：粗体为每列最优，斜体为每列次优

下载: 导出CSV

表 2 UAVDT数据集上的实验结果对比

算法	MOTA↑(%)	IDF1↑(%)	MOTP↑	MT↑(%)	ML↓(%)	IDs↓	FN↓	FP↓
IOUT^[23]	36.6	23.7	72.1	534	357	9 938	163 881	42 245
SORT^[3]	39	43.7	74.3	484	400	2 350	172 628	33 037
DAN^[4]	41.6	29.7	72.5	648	367	12 902	–	–
ByteTrack^[24]	39.1	44.7	74.3	–	–	2 341	–	–
CTracker^[11]	40.3	49.2	78.3	296	449	4 280	152 568	50 292
本文	41.4	50.3	78.4	301	441	3 841	154 938	44 330

下载: 导出CSV

表 3 SRFPN不同感受野扩展方法的性能对比

算法	MOTA↑(%)	IDF1↑(%)	Para↓	Hz↑
RFB^[25]	67.4	57.0	13.8	31.5
DAPPM^[26]	67.2	57.5	16.7	26.3
SPPF(本文)	67.5	57.2	13.5	31.7

下载: 导出CSV

表 4 不同算法在上下文信息利用上的性能对比

算法	MOTA↑(%)	IDF1↑(%)	Para↓	Hz↑
DANet^[27]	66.8	57.6	45.6	31.7
ERGM^[28]	67.0	58.0	45.4	32.3
GLFIM(本文)	67.1	58.3	45.7	32.9

下载: 导出CSV

表 5 MOT17数据集上的消融实验效果

算法模块			评价指标
SRFPN	GLFIM	AMM	MOTA↑(%)	IDF1↑(%)	MOTP↑	MT↑(%)	ML↓(%)	IDs↓	FN↓	FP↓	Hz↑
			66.6	57.4	78.2	32.2	24.2	5 529	160 491	22 284	34.4
√			67.5	57.2	78.7	33.9	20.7	5 489	152 579	22 169	31.7
	√		67.1	58.3	78.3	33.3	21.5	5 360	154 915	15 118	32.9
		√	66.6	57.7	78.1	32.5	24.0	5 093	160 497	22 298	33.5
√	√		67.3	58.3	78.5	33.6	21.5	5 499	154 426	21 452	30.9
	√	√	67.1	58.5	78.4	33.5	21.5	5 365	154 936	21 610	32.4
√		√	67.3	57.8	78.5	33.7	21.5	5 415	153 268	23 473	30.6
√	√	√	68.1	58.7	78.6	34.0	21.2	5 535	159 474	15 104	29.8

下载: 导出CSV

表 6 GLFIM引入位置消融实验结果

层级	MOTA↑(%)	IDF1↑(%)	Para↓	Hz↑
Baseline	66.6	57.4	43.6	34.4
P7	66.8	57.6	44.0	34.1
P7,P6	66.8	57.9	44.4	33.9
P7,P6,P5	67.2	57.0	44.9	33.6
P7,P6,P5,P4	67.0	57.8	45.3	33.4
All(本文)	67.1	58.3	45.7	32.9

下载: 导出CSV

参考文献(28)

[1]	张红颖, 贺鹏艺. 基于卷积注意力模块和无锚框检测网络的行人跟踪算法[J]. 电子与信息学报, 2022, 44(9): 3299–3307. doi: 10.11999/JEIT210634. ZHANG Hongying and HE Pengyi. Pedestrian tracking algorithm based on convolutional block attention module and anchor-free detection network[J]. Journal of Electronics & Information Technology, 2022, 44(9): 3299–3307. doi: 10.11999/JEIT210634.
[2]	伍瀚, 聂佳浩, 张照娓, 等. 基于深度学习的视觉多目标跟踪研究综述[J]. 计算机科学, 2023, 50(4): 77–87. doi: 10.11896/jsjkx.220300173. WU Han, NIE Jiahao, ZHANG Zhaowei, et al. Deep learning-based visual multiple object tracking: A review[J]. Computer Science, 2023, 50(4): 77–87. doi: 10.11896/jsjkx.220300173.
[3]	BEWLEY A, GE Zongyuan, OTT L, et al. Simple online and realtime tracking[C]. 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, USA, 2016: 3464–3468. doi: 10.1109/ICIP.2016.7533003.
[4]	SUN Shijie, AKHTAR N, SONG Huansheng, et al. Deep affinity network for multiple object tracking[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(1): 104–119. doi: 10.1109/TPAMI.2019.2929520.
[5]	WOJKE N, BEWLEY A, and PAULUS D. Simple online and realtime tracking with a deep association metric[C]. 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 2017: 3645–3649. doi: 10.1109/ICIP.2017.8296962.
[6]	WANG Zhongdao, ZHENG Liang, LIU Yixuan, et al. Towards real-time multi-object tracking[C]. 16th European Conference on Computer Vision, Glasgow, UK, 2020: 107–122. doi: 10.1007/978-3-030-58621-8_7.
[7]	ZHANG Yifu, WANG Chunyu, WANG Xinggang, et al. FairMOT: On the fairness of detection and re-identification in multiple object tracking[J]. International Journal of Computer Vision, 2021, 129(11): 3069–3087. doi: 10.1007/s11263-021-01513-4.
[8]	YU En, LI Zhuoling, HAN Shoudong, et al. RelationTrack: Relation-aware multiple object tracking with decoupled representation[J]. IEEE Transactions on Multimedia, 2023, 25: 2686–2697. doi: 10.1109/TMM.2022.3150169.
[9]	CHU Peng, WANG Jiang, YOU Quanzeng, et al. TransMOT: Spatial-temporal graph transformer for multiple object tracking[C]. Proceedings of 2023 IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, USA, 2023: 4859–4869. doi: 10.1109/WACV56688.2023.00485.
[10]	XU Yihong, BAN Yutong, DELORME G, et al. TransCenter: Transformers with dense representations for multiple-object tracking[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(6): 7820–7835. doi: 10.1109/TPAMI.2022.3225078.
[11]	PENG Jinlong, WANG Changan, WAN Fangbin, et al. Chained-tracker: Chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking[C]. 16th European Conference on Computer Vision, Glasgow, UK, 2020: 145–161. doi: 10.1007/978-3-030-58548-8_9.
[12]	PANG Bo, LI Yizhuo, ZHANG Yifan, et al. TubeTK: Adopting tubes to track multi-object in a one-step training model[C]. Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 6307–6317. doi: 10.1109/CVPR42600.2020.00634.
[13]	ZHANG Chuang, ZHENG Sifa, WU Haoran, et al. AttentionTrack: Multiple object tracking in traffic scenarios using features attention[J]. IEEE Transactions on Intelligent Transportation Systems, 2024, 25(2): 1661–1674. doi: 10.1109/TITS.2023.3315222.
[14]	OGAWA T, SHIBATA T, and HOSOI T. FRoG-MOT: Fast and robust generic multiple-object tracking by IoU and motion-state associations[C]. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, USA, 2024: 6549–6558. doi: 10.1109/WACV57701.2024.00643.
[15]	HUANG Huimin, XIE Shiao, LIN Lanfen, et al. ScaleFormer: Revisiting the transformer-based backbones from a scale-wise perspective for medical image segmentation[C]. Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, Vienna, Austria, 2022: 964–971. doi: 10.24963/ijcai.2022/135.
[16]	MILAN A, LEAL-TAIXÉ L, REID I, et al. MOT16: A benchmark for multi-object tracking[EB/OL]. https://arxiv.org/abs/1603.00831, 2016.
[17]	DU Dawei, QI Yuankai, YU Hongyang, et al. The unmanned aerial vehicle benchmark: Object detection and tracking[C]. Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 2018: 375–391. doi: 10.1007/978-3-030-01249-6_23.
[18]	LUO Yutong, ZHONG Xinyue, ZENG Minchen, et al. CGLF-Net: Image emotion recognition network by combining global self-attention features and local multiscale features[J]. IEEE Transactions on Multimedia, 2024, 26: 1894–1908. doi: 10.1109/TMM.2023.3289762.
[19]	BRASÓ G and LEAL-TAIXÉ L. Learning a neural solver for multiple object tracking[C]. Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 6246–6256. doi: 10.1109/CVPR42600.2020.00628.
[20]	XIANG Jun, XU Guohan, MA Chao, et al. End-to-end learning deep CRF models for multi-object tracking deep CRF models[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(1): 275–288. doi: 10.1109/TCSVT.2020.2975842.
[21]	BERGMANN P, MEINHARDT T, and LEAL-TAIXÉ L. Tracking without bells and whistles[C]. Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 941–951. doi: 10.1109/ICCV.2019.00103.
[22]	ZHOU Yan, CHEN Junyu, WANG Dongli, et al. Multi-object tracking using context-sensitive enhancement via feature fusion[J]. Multimedia Tools and Applications, 2024, 83(7): 19465–19484. doi: 10.1007/s11042-023-16027-z.
[23]	BOCHINSKI E, EISELEIN V, and SIKORA T. High-speed tracking-by-detection without using image information[C]. 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 2017: 1–6. doi: 10.1109/AVSS.2017.8078516.
[24]	ZHANG Yifu, SUN Peize, JIANG Yi, et al. ByteTrack: Multi-object tracking by associating every detection box[C]. 17th European Conference on Computer Vision, Tel Aviv, Israel, 2022: 1–21. doi: 10.1007/978-3-031-20047-2_1.
[25]	LIU Songtao, HUANG Di, and WANG Yunhong. Receptive field block net for accurate and fast object detection[C]. Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 2018: 404–419. doi: 10.1007/978-3-030-01252-6_24.
[26]	PAN Huihui, HONG Yuanduo, SUN Weichao, et al. Deep dual-resolution networks for real-time and accurate semantic segmentation of traffic scenes[J]. IEEE Transactions on Intelligent Transportation Systems, 2023, 24(3): 3448–3460. doi: 10.1109/TITS.2022.3228042.
[27]	FU Jun, LIU Jing, TIAN Haijie, et al. Dual attention network for scene segmentation[C]. Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 3141–3149. doi: 10.1109/CVPR.2019.00326.
[28]	XIE Yakun, ZHU Jun, LAI Jianbo, et al. An enhanced relation-aware global-local attention network for escaping human detection in indoor smoke scenarios[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2022, 186: 140–156. doi: 10.1016/j.isprsjprs.2022.02.006.