A Spatial-semantic Combine Perception for Infrared UAV Target Tracking
Abstract: Existing infrared UAV target tracking methods suffer from the loss of spatial and semantic information during multi-scale feature fusion, which prevents the tracker from precisely locating the UAV target and lowers the success rate of the tracking task. To address this problem, this paper proposes an infrared UAV target tracking method based on spatial-semantic combine perception. First, a spatial-semantic combine attention module is proposed: a spatial multi-scale attention module extracts multi-scale long-range dependency features to strengthen attention to spatial contextual information, and a global-local channel semantic attention module lets global and local channel features interact to ensure that important semantic information is captured. Second, a dual-branch global feature interaction module is designed to integrate the template- and search-branch features effectively, which markedly improves the overall performance of the network. Extensive experiments on the infrared UAV dataset Anti-UAV show that, compared with existing methods, the proposed method achieves better tracking performance, with an average state accuracy of 0.769, a success rate of 0.743, and a precision of 0.935, all exceeding the compared methods; its effectiveness, generalization ability, and superiority are also verified.
Objective  In recent years, infrared image-based UAV target tracking has attracted widespread attention. In real-world scenarios, infrared UAV target tracking still faces significant challenges from complex backgrounds, UAV target deformation, and camera movement. Siamese network-based tracking methods have made breakthroughs in balancing tracking accuracy and efficiency. However, existing approaches rely solely on the high-level feature outputs of deep networks to predict target positions and neglect the effective use of low-level features, which leads to the loss of spatial detail features of infrared UAV targets and severely degrades tracking performance. To exploit low-level features more efficiently, some methods incorporate a Feature Pyramid Network (FPN) into the tracking framework and progressively fuse cross-layer feature maps in a top-down manner, thereby improving tracking performance on multi-scale targets. Nevertheless, these methods directly adopt the channel-reduction operations of the traditional FPN, which causes a significant loss of spatial contextual information and channel semantic information. To address these issues, a novel infrared UAV target tracking method based on spatial-semantic combine perception is proposed. By capturing spatial multi-scale features and channel semantic information, the proposed approach enhances the model's ability to track infrared UAV targets against complex backgrounds.

Methods  The proposed method comprises four main components: a backbone network, multi-scale feature fusion, template-search feature interaction, and a detection head. First, the template and search images containing infrared UAV targets are fed into a weight-sharing backbone network to extract features. Subsequently, an FPN is constructed, into which a Spatial-semantic Combine Attention Module (SCAM) is integrated to fuse multi-scale features efficiently. Finally, a Dual-branch global Feature interaction Module (DFM) performs feature interaction between the template and search branches, and the final tracking results are obtained through the detection head. The proposed SCAM strengthens the network's focus on spatial and semantic information by jointly leveraging spatial and channel attention mechanisms, thereby mitigating the loss of spatial and semantic information in low-level features caused by channel dimensionality reduction in the traditional FPN. SCAM consists of two components: the Spatial Multi-scale Attention module (SMA) and the Global-Local Channel Semantic Attention module (GCSA). The SMA captures long-range multi-scale dependencies efficiently through axial positional embedding and multi-branch grouped feature extraction, improving the network's perception of global contextual information. The GCSA adopts a dual-branch design to integrate global and local information across feature channels, suppress irrelevant background noise, and weight feature channels more rationally. The proposed DFM treats the template-branch features as the query source for the search branch and applies global cross-attention to capture more comprehensive features of infrared UAV targets, enhancing the tracking network's attention to the spatial location and boundary details of infrared UAV targets.
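To make the structure of SCAM concrete, the PyTorch sketch below shows one plausible arrangement of the two sub-modules described above. It is an illustrative reading of the description, not the authors' released implementation: the axial pooling, the per-branch 1-D kernel sizes (3, 5, 7, 9, taken from Table 3), the SE-style global channel branch, the 1-D local channel convolution, and the serial SMA-then-GCSA ordering are all assumptions.

```python
import torch
import torch.nn as nn

class SMA(nn.Module):
    """Spatial Multi-scale Attention (illustrative sketch)."""
    def __init__(self, channels, kernels=(3, 5, 7, 9)):
        super().__init__()
        assert channels % len(kernels) == 0
        self.group = channels // len(kernels)
        # One depth-wise 1-D conv branch per kernel size (multi-branch grouped extraction).
        self.branches = nn.ModuleList([
            nn.Conv1d(self.group, self.group, k, padding=k // 2, groups=self.group)
            for k in kernels
        ])

    def forward(self, x):
        b, c, h, w = x.shape
        # Axial positional descriptors: average over W gives an H-axis signal, and vice versa.
        x_h = x.mean(dim=3)                                   # (B, C, H)
        x_w = x.mean(dim=2)                                   # (B, C, W)
        att_h, att_w = [], []
        for i, conv in enumerate(self.branches):
            s = slice(i * self.group, (i + 1) * self.group)
            att_h.append(conv(x_h[:, s]))
            att_w.append(conv(x_w[:, s]))
        att_h = torch.cat(att_h, dim=1).sigmoid().unsqueeze(3)  # (B, C, H, 1)
        att_w = torch.cat(att_w, dim=1).sigmoid().unsqueeze(2)  # (B, C, 1, W)
        return x * att_h * att_w                                # long-range axial re-weighting

class GCSA(nn.Module):
    """Global-Local Channel Semantic Attention (illustrative sketch)."""
    def __init__(self, channels, reduction=16, local_kernel=3):
        super().__init__()
        # Global branch: squeeze-and-excitation style bottleneck over pooled channel statistics.
        self.global_fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Local branch: 1-D convolution over the channel axis for local cross-channel interaction.
        self.local_conv = nn.Conv1d(1, 1, local_kernel, padding=local_kernel // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        pooled = x.mean(dim=(2, 3))                              # (B, C)
        g = self.global_fc(pooled)                               # global channel weights
        l = self.local_conv(pooled.unsqueeze(1)).squeeze(1)      # local channel weights
        att = torch.sigmoid(g + l).view(b, c, 1, 1)              # fuse the two branches
        return x * att

class SCAM(nn.Module):
    """Spatial-semantic Combine Attention Module: SMA followed by GCSA (assumed ordering)."""
    def __init__(self, channels):
        super().__init__()
        self.sma = SMA(channels)
        self.gcsa = GCSA(channels)

    def forward(self, x):
        return self.gcsa(self.sma(x))
```

In this reading, one SCAM instance would be applied to each lateral feature map of the FPN before top-down fusion, e.g. `scam = SCAM(256); y = scam(x)` for a (B, 256, H, W) feature map.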
Results and Discussions  The proposed method is validated on the infrared UAV benchmark dataset Anti-UAV. Quantitative analysis (Table 1) demonstrates that, compared with 10 state-of-the-art methods, the proposed approach achieves the highest average state accuracy of 76.9%, surpassing the second-best method, LGTrack, by 4.4%. In terms of success rate and localization precision (Fig. 6), the proposed method also outperforms LGTrack by 4.7% and 2.1%, respectively, evidencing its superiority in infrared UAV target tracking. Qualitative analysis (Figs. 7-11) further confirms that the proposed method exhibits strong adaptability and robustness under typical challenges in infrared UAV tracking, such as occlusion, distracting objects, complex backgrounds, scale variations, and rapid deformations. The collaborative design of the individual modules significantly enhances the model's ability to perceive and represent small targets and dynamic scenes. In addition, qualitative experiments (Fig. 12) on a self-constructed infrared UAV tracking dataset demonstrate the effectiveness and generalization capability of the proposed method in real-world tracking scenarios. Ablation studies (Tables 2-6) show that integrating any individual proposed module consistently improves tracking performance. Compared with the baseline tracker, integrating all sub-modules improves average state accuracy by 14.3%, success rate by 12.5%, and localization precision by 14.0%, verifying the effectiveness of the proposed components.

Conclusions  This paper presents a systematic theoretical analysis and experimental validation of the spatial and semantic information loss problem in infrared UAV target tracking. Focusing on the limitations of existing FPN-based infrared UAV tracking methods, particularly the drawbacks of channel reduction on multi-scale low-level features, a novel infrared UAV target tracking method based on spatial-semantic combine perception is proposed, which fully leverages the complementary advantages of spatial and channel attention mechanisms. The method enhances the network's focus on spatial context and critical semantic information, thereby improving overall tracking performance. The main conclusions are as follows:
(1) The proposed SCAM combines SMA and GCSA: SMA captures spatial long-range feature dependencies through position coordinate embedding and one-dimensional convolution operations, ensuring the acquisition of multi-scale contextual information, while GCSA achieves more comprehensive semantic feature attention by interacting local and global channel features.
(2) The designed DFM realizes feature interaction between the search-branch and template-branch features through global cross-attention, enabling the dual-branch features to complement each other and enhancing tracking performance (a cross-attention sketch is given after these conclusions).
(3) Extensive experimental results demonstrate that the proposed algorithm outperforms existing advanced methods in both quantitative evaluation and qualitative analysis, with an average state accuracy of 0.769, a success rate of 0.743, and a precision of 0.935, achieving more accurate tracking of infrared UAV targets.
Although the algorithm has been optimized for efficient use of computing resources, efficient deployment strategies for embedded and mobile devices still require further research to improve real-time performance and computational adaptability.
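Conclusion (2) summarizes the DFM as global cross-attention between the two branches. The sketch below is a minimal single-layer version under the assumption that the search-branch tokens act as queries while the template-branch tokens supply keys and values, followed by a residual connection and layer normalization; the published module may reverse these roles, stack several layers, or add feed-forward blocks.

```python
import torch
import torch.nn as nn

class DFM(nn.Module):
    """Dual-branch global Feature interaction Module (illustrative sketch)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, search_feat, template_feat):
        # search_feat: (B, C, Hs, Ws), template_feat: (B, C, Ht, Wt)
        b, c, hs, ws = search_feat.shape
        q = search_feat.flatten(2).transpose(1, 2)       # (B, Hs*Ws, C) search tokens as queries
        kv = template_feat.flatten(2).transpose(1, 2)    # (B, Ht*Wt, C) template tokens as keys/values
        out, _ = self.cross_attn(q, kv, kv)              # global cross-attention over all token pairs
        out = self.norm(q + out)                         # residual fusion of the two branches
        return out.transpose(1, 2).reshape(b, c, hs, ws)
```

For a 256-channel backbone, `dfm = DFM(256); fused = dfm(search_feat, template_feat)` would yield a search-sized feature map that the detection head can consume; this matches the Add/Concat baselines in Table 5 only in input/output shape, not in the exact fusion rule.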
Table 1  Quantitative comparison of all tracking methods (bold: best result; underline: second-best result)
Tracking method     Average state accuracy   Success rate   Precision   FPS
SiamCAR [2]         0.250                    0.236          0.289       55.7
Ocean [3]           0.248                    0.235          0.291       43.1
OSTrack [22]        0.352                    0.334          0.423       46.4
GRM [21]            0.366                    0.344          0.429       13.5
AiAtrack [6]        0.481                    0.459          0.584       39.2
GlobalTrack [20]    0.553                    0.532          0.711       9.7
SiamYOLO [9]        0.617                    0.589          0.789       37.1
Unicorn [7]         0.637                    0.621          0.801       29.2
EANTrack [23]       0.698                    0.677          0.868       43.6
LGTrack [24]        0.725                    0.696          0.914       25
Proposed method     0.769                    0.743          0.935       34.8

Table 2  Comparison of performance, parameter count, and computational cost of different attention modules
Table 3  Effect of different convolution kernel settings in SMA on network performance
Kernel setting   Average state accuracy   Success rate   Precision
(3,3,3,3)        0.665                    0.636          0.827
(5,5,5,5)        0.662                    0.633          0.826
(7,7,7,7)        0.654                    0.627          0.813
(3,5,7,9)        0.677                    0.653          0.849
(3,7,11,15)      0.657                    0.632          0.817

Table 4  Ablation study of GCSA
No.   Single/dual branch   Global   Local   Average state accuracy   Success rate   Precision
1     Single               √        ×       0.629                    0.617          0.808
2     Single               ×        √       0.633                    0.626          0.812
3     Dual                 √        ×       0.641                    0.630          0.822
4     Dual                 ×        √       0.646                    0.635          0.827
5     Dual                 √        √       0.659                    0.647          0.838

Table 5  Effectiveness validation of DFM
Module setting   Average state accuracy   Success rate   Precision
Add              0.626                    0.618          0.795
Concat           0.630                    0.623          0.803
DFM              0.664                    0.650          0.851
[1] NIE Wei, ZHANG Zhongyang, YANG Xiaolong, et al. Unmanned aerial vehicles detection and recognition method based on Mel frequency cepstral coefficients[J]. Journal of Electronics & Information Technology, 2025, 47(4): 1076–1084. doi: 10.11999/JEIT241111.
[2] GUO Dongyan, WANG Jun, CUI Ying, et al. SiamCAR: Siamese fully convolutional classification and regression for visual tracking[C]. Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 6268–6276. doi: 10.1109/CVPR42600.2020.00630.
[3] ZHANG Zhipeng, PENG Houwen, FU Jianlong, et al. Ocean: Object-aware anchor-free tracking[C]. Proceedings of the 16th European Conference on Computer Vision – ECCV 2020, Glasgow, UK, 2020: 771–787. doi: 10.1007/978-3-030-58589-1_46.
[4] HOU Zhiqiang, WANG Zhuo, MA Sugang, et al. Target drift discriminative network based on dual-template Siamese structure in long-term tracking[J]. Journal of Electronics & Information Technology, 2024, 46(4): 1458–1467. doi: 10.11999/JEIT230496.
[5] JIANG Nan, WANG Kuiran, PENG Xiaoke, et al. Anti-UAV: A large multi-modal benchmark for UAV tracking[J]. arXiv preprint arXiv: 2101.08466, 2021. doi: 10.48550/arXiv.2101.08466.
[6] GAO Shenyuan, ZHOU Chunluan, MA Chao, et al. AiATrack: Attention in attention for transformer visual tracking[C]. Proceedings of the 17th European Conference on Computer Vision – ECCV 2022, Tel Aviv, Israel, 2022: 146–164. doi: 10.1007/978-3-031-20047-2_9.
[7] YAN Bin, JIANG Yi, SUN Peize, et al. Towards grand unification of object tracking[C]. Proceedings of the 17th European Conference on Computer Vision – ECCV 2022, Tel Aviv, Israel, 2022: 733–751. doi: 10.1007/978-3-031-19803-8_43.
[8] JI Zhongping, WANG Xiangwei, HE Zhiwei, et al. End-to-end multi-object tracking algorithm integrating global local feature interaction and angular momentum mechanism[J]. Journal of Electronics & Information Technology, 2024, 46(9): 3703–3712. doi: 10.11999/JEIT240277.
[9] FANG Houzhang, WANG Xiaolin, LIAO Zikai, et al. A real-time anti-distractor infrared UAV tracker with channel feature refinement module[C]. Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops, Montreal, Canada, 2021: 1240. doi: 10.1109/ICCVW54120.2021.00144.
[10] LI Huayao, ZHONG Xiaoyong, YANG Zhineng, et al. A lightweight UAV tracking algorithm combining Siamese network with Transformer[J]. Electronics Optics & Control, 2025, 32(6): 31–37. doi: 10.3969/j.issn.1671-637X.2025.06.005.
[11] QI Yongsheng, JIANG Zhengting, LIU Liqiang, et al. SiamMT: A modifiable RGBT target tracking algorithm based on adaptive feature fusion mechanism[J]. Control and Decision, 2025, 40(4): 1312–1320. doi: 10.13195/j.kzyjc.2024.0205.
[12] SHAN Yunxiao, ZHOU Xiaomei, LIU Shanghua, et al. SiamFPN: A deep learning method for accurate and real-time maritime ship tracking[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(1): 315–325. doi: 10.1109/TCSVT.2020.2978194.
[13] LIN T Y, DOLLÁR P, GIRSHICK R, et al. Feature pyramid networks for object detection[C]. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 936–944. doi: 10.1109/CVPR.2017.106.
[14] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770–778. doi: 10.1109/CVPR.2016.90.
[15] HU Jie, SHEN Li, and SUN Gang. Squeeze-and-excitation networks[C]. Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7132–7141. doi: 10.1109/CVPR.2018.00745.
[16] SHEN Zhuoran, ZHANG Mingyuan, ZHAO Haiyu, et al. Efficient attention: Attention with linear complexities[C]. Proceedings of the 2021 IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, USA, 2021: 3530–3538. doi: 10.1109/WACV48630.2021.00357.
[17] JOCHER G, STOKEN A, BOROVEC J, et al. YOLOv5. https://github.com/ultralytics/yolov5, 2021.
[18] LIN T Y, GOYAL P, GIRSHICK R, et al. Focal loss for dense object detection[C]. Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 2017: 2999–3007. doi: 10.1109/ICCV.2017.324.
[19] REZATOFIGHI H, TSOI N, GWAK J, et al. Generalized intersection over union: A metric and a loss for bounding box regression[C]. Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 658–666. doi: 10.1109/CVPR.2019.00075.
[20] HUANG Lianghua, ZHAO Xin, and HUANG Kaiqi. GlobalTrack: A simple and strong baseline for long-term tracking[C]. The Thirty-Fourth AAAI Conference on Artificial Intelligence, Palo Alto, USA, 2020: 11037–11044. doi: 10.1609/aaai.v34i07.6758.
[21] GAO Shenyuan, ZHOU Chunluan, and ZHANG Jun. Generalized relation modeling for transformer tracking[C]. Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 18686–18695. doi: 10.1109/CVPR52729.2023.01792.
[22] YE Botao, CHANG Hong, MA Bingpeng, et al. Joint feature learning and relation modeling for tracking: A one-stream framework[C]. Proceedings of the 17th European Conference on Computer Vision – ECCV 2022, Tel Aviv, Israel, 2022: 341–357. doi: 10.1007/978-3-031-20047-2_20.
[23] GU Fengwei, LU Jun, CAI Chengtao, et al. EANTrack: An efficient attention network for visual tracking[J]. IEEE Transactions on Automation Science and Engineering, 2024, 21(4): 5911–5928. doi: 10.1109/TASE.2023.3319676.
[24] LIU Chang, ZHAO Jie, BO Chunjuan, et al. LGTrack: Exploiting local and global properties for robust visual tracking[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(9): 8161–8171. doi: 10.1109/TCSVT.2024.3390054.
[25] WOO S, PARK J, LEE J Y, et al. CBAM: Convolutional block attention module[C]. Proceedings of the 15th European Conference on Computer Vision – ECCV 2018, Munich, Germany, 2018: 3–19. doi: 10.1007/978-3-030-01234-2_1.
[26] HUANG Hejun, CHEN Zuguo, ZOU Ying, et al. Channel prior convolutional attention for medical image segmentation[J]. Computers in Biology and Medicine, 2024, 178: 108784. doi: 10.1016/j.compbiomed.2024.108784.