A Spatial-semantic Combine Perception for Infrared UAV Target Tracking
Abstract: Existing infrared UAV target tracking methods suffer from the loss of spatial and semantic information during multi-scale feature fusion, which prevents the tracker from precisely locating the UAV target and lowers the success rate of the tracking task. To address this problem, this paper proposes an infrared UAV target tracking method based on spatial-semantic combine perception. First, a spatial-semantic combine attention module is proposed: a spatial multi-scale attention module extracts multi-scale long-range dependency features to strengthen attention to spatial contextual information, and a global-local channel semantic attention module lets global and local channel features interact to ensure that important semantic information is captured. Second, a dual-branch global feature interaction module is designed to integrate the template- and search-branch features effectively, which markedly improves the overall performance of the network. Extensive experiments on the infrared UAV dataset Anti-UAV show that, compared with existing methods, the proposed method achieves better tracking performance, with an average state accuracy of 0.769, a success rate of 0.743, and a precision of 0.935, all exceeding the compared methods; its effectiveness, generalization ability, and superiority are also verified.
Objective  In recent years, infrared image-based UAV target tracking has attracted widespread attention. In real-world scenarios, infrared UAV target tracking still faces significant challenges from complex backgrounds, UAV target deformation, and camera movement. Siamese network-based tracking methods have made breakthroughs in balancing tracking accuracy and efficiency. However, existing approaches rely solely on the high-level feature outputs of deep networks to predict target positions and neglect the effective use of low-level features, which leads to the loss of spatial detail features of infrared UAV targets and severely degrades tracking performance. To exploit low-level features more efficiently, some methods incorporate a Feature Pyramid Network (FPN) into the tracking framework and progressively fuse cross-layer feature maps in a top-down manner, thereby improving tracking performance on multi-scale targets. Nevertheless, these methods directly adopt the channel-reduction operations of the traditional FPN, which causes a significant loss of spatial contextual information and channel semantic information. To address these issues, a novel infrared UAV target tracking method based on spatial-semantic combine perception is proposed. By capturing spatial multi-scale features and channel semantic information, the proposed approach enhances the model's ability to track infrared UAV targets against complex backgrounds.

Methods  The proposed method comprises four main components: a backbone network, multi-scale feature fusion, template-search feature interaction, and a detection head. First, the template and search images containing infrared UAV targets are fed into a weight-sharing backbone network to extract features. Subsequently, an FPN is constructed, into which a Spatial-semantic Combine Attention Module (SCAM) is integrated to fuse multi-scale features efficiently. Finally, a Dual-branch global Feature interaction Module (DFM) performs feature interaction between the template and search branches, and the final tracking results are obtained through the detection head. The proposed SCAM strengthens the network's focus on spatial and semantic information by jointly leveraging spatial and channel attention mechanisms, thereby mitigating the loss of spatial and semantic information in low-level features caused by channel dimensionality reduction in the traditional FPN. SCAM consists of two components: the Spatial Multi-scale Attention module (SMA) and the Global-Local Channel Semantic Attention module (GCSA). The SMA captures long-range multi-scale dependencies efficiently through axial positional embedding and multi-branch grouped feature extraction, improving the network's perception of global contextual information. The GCSA adopts a dual-branch design to integrate global and local information across feature channels, suppress irrelevant background noise, and weight feature channels more rationally. The proposed DFM treats the template-branch features as the query source for the search branch and applies global cross-attention to capture more comprehensive features of infrared UAV targets, enhancing the tracking network's attention to the spatial location and boundary details of infrared UAV targets.
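To make the structure of SCAM concrete, the PyTorch sketch below shows one plausible arrangement of the two sub-modules described above. It is an illustrative reading of the description, not the authors' released implementation: the axial pooling, the per-branch 1-D kernel sizes (3, 5, 7, 9, taken from Table 3), the SE-style global channel branch, the 1-D local channel convolution, and the serial SMA-then-GCSA ordering are all assumptions.

```python
import torch
import torch.nn as nn

class SMA(nn.Module):
    """Spatial Multi-scale Attention (illustrative sketch)."""
    def __init__(self, channels, kernels=(3, 5, 7, 9)):
        super().__init__()
        assert channels % len(kernels) == 0
        self.group = channels // len(kernels)
        # One depth-wise 1-D conv branch per kernel size (multi-branch grouped extraction).
        self.branches = nn.ModuleList([
            nn.Conv1d(self.group, self.group, k, padding=k // 2, groups=self.group)
            for k in kernels
        ])

    def forward(self, x):
        b, c, h, w = x.shape
        # Axial positional descriptors: average over W gives an H-axis signal, and vice versa.
        x_h = x.mean(dim=3)                                   # (B, C, H)
        x_w = x.mean(dim=2)                                   # (B, C, W)
        att_h, att_w = [], []
        for i, conv in enumerate(self.branches):
            s = slice(i * self.group, (i + 1) * self.group)
            att_h.append(conv(x_h[:, s]))
            att_w.append(conv(x_w[:, s]))
        att_h = torch.cat(att_h, dim=1).sigmoid().unsqueeze(3)  # (B, C, H, 1)
        att_w = torch.cat(att_w, dim=1).sigmoid().unsqueeze(2)  # (B, C, 1, W)
        return x * att_h * att_w                                # long-range axial re-weighting

class GCSA(nn.Module):
    """Global-Local Channel Semantic Attention (illustrative sketch)."""
    def __init__(self, channels, reduction=16, local_kernel=3):
        super().__init__()
        # Global branch: squeeze-and-excitation style bottleneck over pooled channel statistics.
        self.global_fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Local branch: 1-D convolution over the channel axis for local cross-channel interaction.
        self.local_conv = nn.Conv1d(1, 1, local_kernel, padding=local_kernel // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        pooled = x.mean(dim=(2, 3))                              # (B, C)
        g = self.global_fc(pooled)                               # global channel weights
        l = self.local_conv(pooled.unsqueeze(1)).squeeze(1)      # local channel weights
        att = torch.sigmoid(g + l).view(b, c, 1, 1)              # fuse the two branches
        return x * att

class SCAM(nn.Module):
    """Spatial-semantic Combine Attention Module: SMA followed by GCSA (assumed ordering)."""
    def __init__(self, channels):
        super().__init__()
        self.sma = SMA(channels)
        self.gcsa = GCSA(channels)

    def forward(self, x):
        return self.gcsa(self.sma(x))
```

In this reading, one SCAM instance would be applied to each lateral feature map of the FPN before top-down fusion, e.g. `scam = SCAM(256); y = scam(x)` for a (B, 256, H, W) feature map.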
Results and Discussions  The proposed method is validated on the infrared UAV benchmark dataset Anti-UAV. Quantitative analysis (Table 1) demonstrates that, compared with 10 state-of-the-art methods, the proposed approach achieves the highest average state accuracy of 76.9%, surpassing the second-best method, LGTrack, by 4.4%. In terms of success rate and localization precision (Fig. 6), the proposed method also outperforms LGTrack by 4.7% and 2.1%, respectively, evidencing its superiority in infrared UAV target tracking. Qualitative analysis (Figs. 7-11) further confirms that the proposed method exhibits strong adaptability and robustness under typical challenges in infrared UAV tracking, such as occlusion, distracting objects, complex backgrounds, scale variations, and rapid deformations. The collaborative design of the individual modules significantly enhances the model's ability to perceive and represent small targets and dynamic scenes. In addition, qualitative experiments (Fig. 12) on a self-constructed infrared UAV tracking dataset demonstrate the effectiveness and generalization capability of the proposed method in real-world tracking scenarios. Ablation studies (Tables 2-6) show that integrating any individual proposed module consistently improves tracking performance. Compared with the baseline tracker, integrating all sub-modules improves average state accuracy by 14.3%, success rate by 12.5%, and localization precision by 14.0%, verifying the effectiveness of the proposed components.

Conclusions  This paper presents a systematic theoretical analysis and experimental validation of the spatial and semantic information loss problem in infrared UAV target tracking. Focusing on the limitations of existing FPN-based infrared UAV tracking methods, particularly the drawbacks of channel reduction on multi-scale low-level features, a novel infrared UAV target tracking method based on spatial-semantic combine perception is proposed, which fully leverages the complementary advantages of spatial and channel attention mechanisms. The method enhances the network's focus on spatial context and critical semantic information, thereby improving overall tracking performance. The main conclusions are as follows:
(1) The proposed SCAM combines SMA and GCSA: SMA captures spatial long-range feature dependencies through position coordinate embedding and one-dimensional convolution operations, ensuring the acquisition of multi-scale contextual information, while GCSA achieves more comprehensive semantic feature attention by interacting local and global channel features.
(2) The designed DFM realizes feature interaction between the search-branch and template-branch features through global cross-attention, enabling the dual-branch features to complement each other and enhancing tracking performance (a cross-attention sketch is given after these conclusions).
(3) Extensive experimental results demonstrate that the proposed algorithm outperforms existing advanced methods in both quantitative evaluation and qualitative analysis, with an average state accuracy of 0.769, a success rate of 0.743, and a precision of 0.935, achieving more accurate tracking of infrared UAV targets.
Although the algorithm has been optimized for efficient use of computing resources, efficient deployment strategies for embedded and mobile devices still require further research to improve real-time performance and computational adaptability.
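Conclusion (2) summarizes the DFM as global cross-attention between the two branches. The sketch below is a minimal single-layer version under the assumption that the search-branch tokens act as queries while the template-branch tokens supply keys and values, followed by a residual connection and layer normalization; the published module may reverse these roles, stack several layers, or add feed-forward blocks.

```python
import torch
import torch.nn as nn

class DFM(nn.Module):
    """Dual-branch global Feature interaction Module (illustrative sketch)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, search_feat, template_feat):
        # search_feat: (B, C, Hs, Ws), template_feat: (B, C, Ht, Wt)
        b, c, hs, ws = search_feat.shape
        q = search_feat.flatten(2).transpose(1, 2)       # (B, Hs*Ws, C) search tokens as queries
        kv = template_feat.flatten(2).transpose(1, 2)    # (B, Ht*Wt, C) template tokens as keys/values
        out, _ = self.cross_attn(q, kv, kv)              # global cross-attention over all token pairs
        out = self.norm(q + out)                         # residual fusion of the two branches
        return out.transpose(1, 2).reshape(b, c, hs, ws)
```

For a 256-channel backbone, `dfm = DFM(256); fused = dfm(search_feat, template_feat)` would yield a search-sized feature map that the detection head can consume; this matches the Add/Concat baselines in Table 5 only in input/output shape, not in the exact fusion rule.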
Table 1  Quantitative comparison of all tracking methods (bold: best result; underline: second-best result)
Tracking method     Average state accuracy   Success rate   Precision   FPS
SiamCAR [2]         0.250                    0.236          0.289       55.7
Ocean [3]           0.248                    0.235          0.291       43.1
OSTrack [22]        0.352                    0.334          0.423       46.4
GRM [21]            0.366                    0.344          0.429       13.5
AiAtrack [6]        0.481                    0.459          0.584       39.2
GlobalTrack [20]    0.553                    0.532          0.711       9.7
SiamYOLO [9]        0.617                    0.589          0.789       37.1
Unicorn [7]         0.637                    0.621          0.801       29.2
EANTrack [23]       0.698                    0.677          0.868       43.6
LGTrack [24]        0.725                    0.696          0.914       25
Proposed method     0.769                    0.743          0.935       34.8

Table 2  Comparison of performance, parameter count, and computational cost of different attention modules
Table 3  Effect of different convolution kernel settings in SMA on network performance
Kernel setting   Average state accuracy   Success rate   Precision
(3,3,3,3)        0.665                    0.636          0.827
(5,5,5,5)        0.662                    0.633          0.826
(7,7,7,7)        0.654                    0.627          0.813
(3,5,7,9)        0.677                    0.653          0.849
(3,7,11,15)      0.657                    0.632          0.817

Table 4  Ablation study of GCSA
No.   Single/dual branch   Global   Local   Average state accuracy   Success rate   Precision
1     Single               √        ×       0.629                    0.617          0.808
2     Single               ×        √       0.633                    0.626          0.812
3     Dual                 √        ×       0.641                    0.630          0.822
4     Dual                 ×        √       0.646                    0.635          0.827
5     Dual                 √        √       0.659                    0.647          0.838

Table 5  Effectiveness validation of DFM
Module setting   Average state accuracy   Success rate   Precision
Add              0.626                    0.618          0.795
Concat           0.630                    0.623          0.803
DFM              0.664                    0.650          0.851
[1] NIE Wei, ZHANG Zhongyang, YANG Xiaolong, et al. Unmanned aerial vehicles detection and recognition method based on Mel frequency cepstral coefficients[J]. Journal of Electronics & Information Technology, 2025, 47(4): 1076–1084. doi: 10.11999/JEIT241111.
[2] GUO Dongyan, WANG Jun, CUI Ying, et al. SiamCAR: Siamese fully convolutional classification and regression for visual tracking[C]. Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 6268–6276. doi: 10.1109/CVPR42600.2020.00630.
[3] ZHANG Zhipeng, PENG Houwen, FU Jianlong, et al. Ocean: Object-aware anchor-free tracking[C]. Proceedings of the 16th European Conference on Computer Vision – ECCV 2020, Glasgow, UK, 2020: 771–787. doi: 10.1007/978-3-030-58589-1_46.
[4] HOU Zhiqiang, WANG Zhuo, MA Sugang, et al. Target drift discriminative network based on dual-template Siamese structure in long-term tracking[J]. Journal of Electronics & Information Technology, 2024, 46(4): 1458–1467. doi: 10.11999/JEIT230496.
[5] JIANG Nan, WANG Kuiran, PENG Xiaoke, et al. Anti-UAV: A large multi-modal benchmark for UAV tracking[J]. arXiv preprint arXiv: 2101.08466, 2021. doi: 10.48550/arXiv.2101.08466.
[6] GAO Shenyuan, ZHOU Chunluan, MA Chao, et al. AiATrack: Attention in attention for transformer visual tracking[C]. Proceedings of the 17th European Conference on Computer Vision – ECCV 2022, Tel Aviv, Israel, 2022: 146–164. doi: 10.1007/978-3-031-20047-2_9.
[7] YAN Bin, JIANG Yi, SUN Peize, et al. Towards grand unification of object tracking[C]. Proceedings of the 17th European Conference on Computer Vision – ECCV 2022, Tel Aviv, Israel, 2022: 733–751. doi: 10.1007/978-3-031-19803-8_43.
[8] JI Zhongping, WANG Xiangwei, HE Zhiwei, et al. End-to-end multi-object tracking algorithm integrating global local feature interaction and angular momentum mechanism[J]. Journal of Electronics & Information Technology, 2024, 46(9): 3703–3712. doi: 10.11999/JEIT240277.
[9] FANG Houzhang, WANG Xiaolin, LIAO Zikai, et al. A real-time anti-distractor infrared UAV tracker with channel feature refinement module[C]. Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops, Montreal, Canada, 2021: 1240. doi: 10.1109/ICCVW54120.2021.00144.
[10] LI Huayao, ZHONG Xiaoyong, YANG Zhineng, et al. A lightweight UAV tracking algorithm combining Siamese network with Transformer[J]. Electronics Optics & Control, 2025, 32(6): 31–37. doi: 10.3969/j.issn.1671-637X.2025.06.005.
[11] QI Yongsheng, JIANG Zhengting, LIU Liqiang, et al. SiamMT: A modifiable RGBT target tracking algorithm based on adaptive feature fusion mechanism[J]. Control and Decision, 2025, 40(4): 1312–1320. doi: 10.13195/j.kzyjc.2024.0205.
[12] SHAN Yunxiao, ZHOU Xiaomei, LIU Shanghua, et al. SiamFPN: A deep learning method for accurate and real-time maritime ship tracking[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(1): 315–325. doi: 10.1109/TCSVT.2020.2978194.
[13] LIN T Y, DOLLÁR P, GIRSHICK R, et al. Feature pyramid networks for object detection[C]. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 936–944. doi: 10.1109/CVPR.2017.106.
[14] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770–778. doi: 10.1109/CVPR.2016.90.
[15] HU Jie, SHEN Li, and SUN Gang. Squeeze-and-excitation networks[C]. Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7132–7141. doi: 10.1109/CVPR.2018.00745.
[16] SHEN Zhuoran, ZHANG Mingyuan, ZHAO Haiyu, et al. Efficient attention: Attention with linear complexities[C]. Proceedings of the 2021 IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, USA, 2021: 3530–3538. doi: 10.1109/WACV48630.2021.00357.
[17] JOCHER G, STOKEN A, BOROVEC J, et al. YOLOv5. https://github.com/ultralytics/yolov5, 2021.
[18] LIN T Y, GOYAL P, GIRSHICK R, et al. Focal loss for dense object detection[C]. Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 2017: 2999–3007. doi: 10.1109/ICCV.2017.324.
[19] REZATOFIGHI H, TSOI N, GWAK J, et al. Generalized intersection over union: A metric and a loss for bounding box regression[C]. Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 658–666. doi: 10.1109/CVPR.2019.00075.
[20] HUANG Lianghua, ZHAO Xin, and HUANG Kaiqi. GlobalTrack: A simple and strong baseline for long-term tracking[C]. The Thirty-Fourth AAAI Conference on Artificial Intelligence, Palo Alto, USA, 2020: 11037–11044. doi: 10.1609/aaai.v34i07.6758.
[21] GAO Shenyuan, ZHOU Chunluan, and ZHANG Jun. Generalized relation modeling for transformer tracking[C]. Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 18686–18695. doi: 10.1109/CVPR52729.2023.01792.
[22] YE Botao, CHANG Hong, MA Bingpeng, et al. Joint feature learning and relation modeling for tracking: A one-stream framework[C]. Proceedings of the 17th European Conference on Computer Vision – ECCV 2022, Tel Aviv, Israel, 2022: 341–357. doi: 10.1007/978-3-031-20047-2_20.
[23] GU Fengwei, LU Jun, CAI Chengtao, et al. EANTrack: An efficient attention network for visual tracking[J]. IEEE Transactions on Automation Science and Engineering, 2024, 21(4): 5911–5928. doi: 10.1109/TASE.2023.3319676.
[24] LIU Chang, ZHAO Jie, BO Chunjuan, et al. LGTrack: Exploiting local and global properties for robust visual tracking[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(9): 8161–8171. doi: 10.1109/TCSVT.2024.3390054.
[25] WOO S, PARK J, LEE J Y, et al. CBAM: Convolutional block attention module[C]. Proceedings of the 15th European Conference on Computer Vision – ECCV 2018, Munich, Germany, 2018: 3–19. doi: 10.1007/978-3-030-01234-2_1.
[26] HUANG Hejun, CHEN Zuguo, ZOU Ying, et al. Channel prior convolutional attention for medical image segmentation[J]. Computers in Biology and Medicine, 2024, 178: 108784. doi: 10.1016/j.compbiomed.2024.108784.