Vision-Language Tracking Method Combining Bi-level Routing Perception and Scattered Vision Transformation

LIU Zhongmin, LI Zhenhua, HU Wenjin

Citation: LIU Zhongmin, LI Zhenhua, HU Wenjin. Vision-Language Tracking Method Combining Bi-level Routing Perception and Scattered Vision Transformation[J]. Journal of Electronics & Information Technology, 2024, 46(11): 4236-4246. doi: 10.11999/JEIT240257


doi: 10.11999/JEIT240257
Funds: The National Natural Science Foundation of China (62061042), The Natural Science Foundation of Gansu Province (23JRRA796), The Open Fund Project of the Key Laboratory of Gansu Advanced Control for Industrial Processes (2022KX10)
Article information
    Author biographies:

    LIU Zhongmin: Male, Associate Professor. Research interests: pattern recognition, image inpainting, and image captioning

    LI Zhenhua: Male, M.S. candidate. Research interest: object tracking

    HU Wenjin: Female, Professor. Research interests: image inpainting and image quality assessment

    Corresponding author:

    LIU Zhongmin, liuzhmx@163.com

  • CLC number: TN911.73; TP391.4

  • Abstract: To address the limited receptive field and insufficient feature interaction in vision-language relation modeling, this paper proposes BPSVTrack, a vision-language tracking framework that combines bi-level routing perception with scattered vision transformation. First, a Bi-level Routing Perception Module (BRPM) is designed: it couples Efficient Additive Attention (EAA) with a Dual Dynamic Adaptive Module (DDAM) in parallel for bidirectional interaction, enlarging the receptive field so that the model integrates features across different windows and scales more efficiently and perceives targets more reliably in complex scenes. Second, a Scattered Vision Transform Module (SVTM) based on the Dual-Tree Complex Wavelet Transform (DTCWT) decomposes the image into low- and high-frequency components, capturing both target structure and fine-grained detail and thereby improving robustness and accuracy in complex environments. The framework reaches 86.1%, 64.4%, and 63.2% precision on the OTB99, LaSOT, and TNL2K tracking datasets, respectively, and 70.21% accuracy on the RefCOCOg grounding dataset, outperforming the baseline model in both tracking and grounding.
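The BRPM described in the abstract runs Efficient Additive Attention in parallel with the Dual Dynamic Adaptive Module. As a rough, hedged illustration of that design, the PyTorch sketch below pairs a SwiftFormer-style EAA branch with a simple stand-in for DDAM; the layer sizes, the DDAM body, and the fusion layer are assumptions for illustration, not the authors' released code.

```python
# Minimal sketch of the BRPM idea: an Efficient Additive Attention (EAA) branch
# in parallel with a placeholder for the Dual Dynamic Adaptive Module (DDAM),
# fused with a residual connection. Shapes and the DDAM body are assumed.
import torch
import torch.nn as nn

class EfficientAdditiveAttention(nn.Module):
    """Linear-complexity additive attention (SwiftFormer-style formulation)."""
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.w_g = nn.Parameter(torch.randn(dim, 1))  # learned global-query weights
        self.scale = dim ** -0.5
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                              # x: (B, N, C) tokens
        q, k = self.to_q(x), self.to_k(x)
        attn = torch.softmax(q @ self.w_g * self.scale, dim=1)  # (B, N, 1)
        global_q = (attn * q).sum(dim=1, keepdim=True)          # (B, 1, C)
        return self.proj(global_q * k) + q              # broadcast interaction

class BRPMSketch(nn.Module):
    """Parallel EAA + DDAM-placeholder branches with a learned fusion."""
    def __init__(self, dim):
        super().__init__()
        self.eaa = EfficientAdditiveAttention(dim)
        self.ddam = nn.Sequential(                      # stand-in for DDAM
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x):
        a, b = self.eaa(x), self.ddam(x)                # two parallel views
        return x + self.fuse(torch.cat([a, b], dim=-1)) # residual fusion

tokens = torch.randn(2, 196, 256)                       # e.g. 14x14 patches, C=256
print(BRPMSketch(256)(tokens).shape)                    # torch.Size([2, 196, 256])
```

The point of the parallel layout is that the attention branch supplies global context at linear cost while the adaptive branch handles local, window-level variation; the learned fusion then lets the two exchange information, a much-simplified version of what the abstract calls bidirectional interaction.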
  • Figure 1  Vision-language tracking framework combining bi-level routing perception and scattered vision transformation

    Figure 2  Bi-level routing perception module

    Figure 3  Efficient additive attention

    Figure 4  Dual dynamic adaptive module

    Figure 5  Token compression-enhancer

    Figure 6  Scattered vision transform module (a code sketch follows this list)

    Figure 7  Qualitative comparison on three challenging sequences
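The SVTM of Figure 6 rests on the Dual-Tree Complex Wavelet Transform (DTCWT) [19] to split an input into low-frequency structure and high-frequency detail. Below is a minimal sketch of just that decomposition step, assuming the third-party pytorch_wavelets package rather than the authors' code; how the two branches are re-fused into tracking features is the paper's contribution and is omitted here.

```python
# DTCWT frequency decomposition as used conceptually inside SVTM (Figure 6),
# sketched with the pytorch_wavelets package (an assumption, not the paper's code).
import torch
from pytorch_wavelets import DTCWTForward, DTCWTInverse

xfm = DTCWTForward(J=2, biort='near_sym_b', qshift='qshift_b')  # 2-level transform
x = torch.randn(1, 3, 224, 224)           # e.g. a search-region crop

yl, yh = xfm(x)
# yl: low-frequency approximation -> coarse target structure
# yh: list of J=2 tensors, each (B, C, 6, H_j, W_j, 2): six complex
#     orientation subbands carrying fine-grained edge detail
print(yl.shape, [h.shape for h in yh])

# The transform is near-perfectly invertible, so the low/high split loses
# essentially no information before downstream blocks recombine the branches.
ifm = DTCWTInverse(biort='near_sym_b', qshift='qshift_b')
x_rec = ifm((yl, yh))
print((x - x_rec).abs().max())            # close to zero
```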

    Table 1  AUC and Pre of the three model variants on LaSOT and TNL2K

| Variant           | LaSOT AUC | LaSOT Pre | TNL2K AUC | TNL2K Pre |
|-------------------|-----------|-----------|-----------|-----------|
| JointNLT          | 0.569     | 0.593     | 0.546     | 0.550     |
| JointNLT+BRPM     | 0.547     | 0.569     | 0.521     | 0.516     |
| JointNLT+SVT      | 0.562     | 0.580     | 0.543     | 0.539     |
| JointNLT+BRPM+SVT | 0.574     | 0.612     | 0.550     | 0.563     |

    Table 2  AUC and Pre of the bi-level routing perception module variants on LaSOT and TNL2K

| Model    | LaSOT AUC | LaSOT Pre | TNL2K AUC | TNL2K Pre |
|----------|-----------|-----------|-----------|-----------|
| BRPM     | 0.547     | 0.569     | 0.521     | 0.516     |
| BRPM-FI  | 0.538     | 0.560     | 0.517     | 0.504     |
| EAA-O    | 0.537     | 0.554     | 0.513     | 0.507     |
| DDAM-O   | 0.540     | 0.559     | 0.517     | 0.510     |
| BRPM-STE | 0.539     | 0.564     | 0.515     | 0.512     |

    Table 3  Pre and parameter count (P) of the token compression-enhancer module on LaSOT and TNL2K

| Model  | LaSOT Pre | TNL2K Pre | P      |
|--------|-----------|-----------|--------|
| STE-S  | 0.569     | 0.516     | 155.4M |
| STE-NS | 0.563     | 0.511     | 155.9M |

    Table 4  Comparison between separated and joint methods on grounding and tracking

| Metric            | VLTVG+STARK (separated) | VLTVG+OSTrack (separated) | SepRM (joint) | JointNLT (joint) | BPSVTrack (joint) |
|-------------------|-------------------------|---------------------------|---------------|------------------|-------------------|
| FLOPs (grounding) | 39.6G   | 39.6G   | 34.7G   | 34.9G   | 35.9G   |
| FLOPs (tracking)  | 20.4G   | 48.3G   | 38.5G   | 42.0G   | 43.1G   |
| Speed (grounding) | 28.2 ms | 28.2 ms | 26.4 ms | 34.8 ms | 36.0 ms |
| Speed (tracking)  | 22.9 ms | 8.3 ms  | 20.6 ms | 25.3 ms | 28.4 ms |
| P (total)         | 169.8M  | 214.7M  | 214.4M  | 153.0M  | 155.4M  |
| AUC (LaSOT)       | 0.446   | 0.524   | 0.518   | 0.569   | 0.574   |
| AUC (TNL2K)       | 0.373   | 0.399   | 0.491   | 0.546   | 0.550   |

    Table 5  AUC and Pre of different methods on OTB99, LaSOT, and TNL2K (BB: bounding box; NL: natural language)

| Method          | Source    | Init. | OTB99 AUC | OTB99 Pre | LaSOT AUC | LaSOT Pre | TNL2K AUC | TNL2K Pre |
|-----------------|-----------|-------|-----------|-----------|-----------|-----------|-----------|-----------|
| AutoMatch[27]   | ICCV21    | BB    | –         | –         | 0.583     | 0.599     | 0.472     | 0.435     |
| TrDiMP[28]      | CVPR21    | BB    | –         | –         | 0.639     | 0.663     | 0.523     | 0.528     |
| TransT[29]      | CVPR21    | BB    | –         | –         | 0.649     | 0.690     | 0.507     | 0.517     |
| STARK[26]       | ICCV21    | BB    | –         | –         | 0.671     | 0.712     | –         | –         |
| KeepTrack[30]   | ICCV21    | BB    | –         | –         | 0.671     | 0.702     | –         | –         |
| SwinTrack-B[31] | NeurIPS22 | BB    | –         | –         | 0.696     | 0.741     | –         | –         |
| OSTrack-384[14] | ECCV22    | BB    | –         | –         | 0.711     | 0.776     | 0.559     | –         |
| TNLS-II[15]     | CVPR17    | NL    | 0.250     | 0.290     | –         | –         | –         | –         |
| RTTNLD[17]      | WACV20    | NL    | 0.540     | 0.780     | 0.280     | 0.280     | –         | –         |
| GTI[16]         | TCSVT20   | NL    | 0.581     | 0.732     | 0.478     | 0.476     | –         | –         |
| TNL2K-1[3]      | CVPR21    | NL    | 0.190     | 0.240     | 0.510     | 0.490     | 0.110     | 0.060     |
| CTRNLT[4]       | CVPR22    | NL    | 0.530     | 0.720     | 0.520     | 0.510     | 0.140     | 0.090     |
| JointNLT        | CVPR23    | NL    | 0.592     | 0.776     | 0.569     | 0.593     | 0.546     | 0.550     |
| BPSVTrack       | Ours      | NL    | 0.603     | 0.786     | 0.574     | 0.612     | 0.550     | 0.563     |
| TNLS-III[15]    | CVPR17    | NL+BB | 0.550     | 0.720     | –         | –         | –         | –         |
| RTTNLD          | WACV20    | NL+BB | 0.610     | 0.790     | 0.350     | 0.350     | 0.250     | 0.270     |
| TNL2K-2[3]      | CVPR21    | NL+BB | 0.680     | 0.880     | 0.510     | 0.550     | 0.420     | 0.420     |
| SNLT[5]         | CVPR21    | NL+BB | 0.666     | 0.804     | 0.540     | 0.576     | 0.276     | 0.419     |
| VLTTT[3]        | NeurIPS22 | NL+BB | 0.764     | 0.931     | 0.673     | 0.721     | 0.531     | 0.533     |
| JointNLT        | CVPR23    | NL+BB | 0.653     | 0.856     | 0.604     | 0.636     | 0.569     | 0.581     |
| BPSVTrack       | Ours      | NL+BB | 0.664     | 0.861     | 0.621     | 0.644     | 0.609     | 0.632     |

    Table 6  Comparison with visual grounding algorithms on the RefCOCOg validation set

| Method   | NMTree[32] | LBYL-Net[33] | ReSC-Large[34] | TransVG[35] | VLTVG[25] | JointNLT | BPSVTrack |
|----------|------------|--------------|----------------|-------------|-----------|----------|-----------|
| Accuracy | 0.6180     | 0.6270       | 0.6312         | 0.6702      | 0.7298    | 0.7007   | 0.7021    |
  • [1] GUO Mingzhe, ZHANG Zhipeng, JING Liping, et al. Divert more attention to vision-language object tracking[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence. doi: 10.1109/TPAMI.2024.3409078.
    [2] XU Tingfa, WANG Ying, SHI Guokai, et al. Research progress in fundamental architecture of deep learning-based single object tracking method[J]. Acta Optica Sinica, 2023, 43(15): 1510003. doi: 10.3788/AOS230746.
    [3] WANG Xiao, SHU Xiujun, ZHANG Zhipeng, et al. Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 13758–13768. doi: 10.1109/cvpr46437.2021.01355.
    [4] LI Yihao, YU Jun, CAI Zhongpeng, et al. Cross-modal target retrieval for tracking by natural language[C]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, New Orleans, USA, 2022: 4927–4936. doi: 10.1109/cvprw56347.2022.00540.
    [5] FENG Qi, ABLAVSKY V, BAI Qinxun, et al. Siamese natural language tracker: Tracking by natural language descriptions with Siamese trackers[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 5847–5856. doi: 10.1109/cvpr46437.2021.00579.
    [6] ZHENG Yaozong, ZHONG Bineng, LIANG Qihua, et al. Toward unified token learning for vision-language tracking[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(4): 2125–2135. doi: 10.1109/TCSVT.2023.3301933.
    [7] ZHAO Haojie, WANG Xiao, WANG Dong, et al. Transformer vision-language tracking via proxy token guided cross-modal fusion[J]. Pattern Recognition Letters, 2023, 168: 10–16. doi: 10.1016/j.patrec.2023.02.023.
    [8] ZHOU Li, ZHOU Zikun, MAO Kaige, et al. Joint visual grounding and tracking with natural language specification[C]. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 23151–23160. doi: 10.1109/cvpr52729.2023.02217.
    [9] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. The 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
    [10] SONG Zikai, LUO Run, YU Junqing, et al. Compact transformer tracker with correlative masked modeling[C]. The 37th AAAI Conference on Artificial Intelligence, Washington, USA, 2023: 2321–2329. doi: 10.1609/aaai.v37i2.25327.
    [11] WANG Yuanyun, ZHANG Wenshuang, LAI Changwang, et al. Adaptive temporal feature modeling for visual tracking via cross-channel learning[J]. Knowledge-Based Systems, 2023, 265: 110380. doi: 10.1016/j.knosys.2023.110380.
    [12] ZHAO Moju, OKADA K, and INABA M. TrTr: Visual tracking with transformer[J]. arXiv: 2105.03817, 2021. doi: 10.48550/arXiv.2105.03817.
    [13] TANG Chuanming, WANG Xiao, BAI Yuanchao, et al. Learning spatial-frequency transformer for visual object tracking[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2023, 33(9): 5102–5116. doi: 10.1109/tcsvt.2023.3249468.
    [14] YE Botao, CHANG Hong, MA Bingpeng, et al. Joint feature learning and relation modeling for tracking: A one-stream framework[C]. The 17th European Conference on Computer Vision, Tel Aviv, Israel, 2022: 341–357. doi: 10.1007/978-3-031-20047-2_20.
    [15] LI Zhenyang, TAO Ran, GAVVES E, et al. Tracking by natural language specification[C]. 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 7350–7358. doi: 10.1109/cvpr.2017.777.
    [16] YANG Zhengyuan, KUMAR T, CHEN Tianlang, et al. Grounding-tracking-integration[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(9): 3433–3443. doi: 10.1109/tcsvt.2020.3038720.
    [17] FENG Qi, ABLAVSKY V, BAI Qinxun, et al. Real-time visual object tracking with natural language description[C]. 2020 IEEE Winter Conference on Applications of Computer Vision, Snowmass, USA, 2020: 689–698. doi: 10.1109/wacv45572.2020.9093425.
    [18] ZHANG Xin, SONG Yingze, SONG Tingting, et al. AKConv: Convolutional kernel with arbitrary sampled shapes and arbitrary number of parameters[J]. arXiv: 2311.11587, 2023. doi: 10.48550/arXiv.2311.11587.
    [19] SELESNICK I W, BARANIUK R G, and KINGSBURY N C. The dual-tree complex wavelet transform[J]. IEEE Signal Processing Magazine, 2005, 22(6): 123–151. doi: 10.1109/MSP.2005.1550194.
    [20] ROGOZHNIKOV A. Einops: Clear and reliable tensor manipulations with Einstein-like notation[C]. The 10th International Conference on Learning Representations, 2022: 1–21.
    [21] MAO Junhua, HUANG J, TOSHEV A, et al. Generation and comprehension of unambiguous object descriptions[C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 11–20. doi: 10.1109/cvpr.2016.9.
    [22] FAN Heng, LIN Liting, YANG Fan, et al. LaSOT: A high-quality benchmark for large-scale single object tracking[C]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 5369–5378. doi: 10.1109/cvpr.2019.00552.
    [23] DEVLIN J, CHANG Mingwei, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, USA, 2019: 4171–4186. doi: 10.18653/v1/N19-1423.
    [24] LIU Ze, LIN Yutong, CAO Yue, et al. Swin transformer: Hierarchical vision transformer using shifted windows[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 9992–10002. doi: 10.1109/iccv48922.2021.00986.
    [25] YANG Li, XU Yan, YUAN Chunfeng, et al. Improving visual grounding with visual-linguistic verification and iterative reasoning[C]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 9489–9498. doi: 10.1109/cvpr52688.2022.00928.
    [26] YAN Bin, PENG Houwen, FU Jianlong, et al. Learning spatio-temporal transformer for visual tracking[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 10428–10437. doi: 10.1109/iccv48922.2021.01028.
    [27] ZHANG Zhipeng, LIU Yihao, WANG Xiao, et al. Learn to match: Automatic matching network design for visual tracking[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 13319–13328. doi: 10.1109/iccv48922.2021.01309.
    [28] WANG Ning, ZHOU Wengang, WANG Jie, et al. Transformer meets tracker: Exploiting temporal context for robust visual tracking[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 1571–1580. doi: 10.1109/cvpr46437.2021.00162.
    [29] CHEN Xin, YAN Bin, ZHU Jiawen, et al. Transformer tracking[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 8122–8131. doi: 10.1109/CVPR46437.2021.00803.
    [30] MAYER C, DANELLJAN M, PAUDEL D P, et al. Learning target candidate association to keep track of what not to track[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 13424–13434. doi: 10.1109/iccv48922.2021.01319.
    [31] LIN Liting, FAN Heng, ZHANG Zhipeng, et al. SwinTrack: A simple and strong baseline for transformer tracking[C]. The 36th Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 16743–16754.
    [32] LIU Daqing, ZHANG Hanwang, ZHA Zhengjun, et al. Learning to assemble neural module tree networks for visual grounding[C]. 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 4672–4681. doi: 10.1109/iccv.2019.00477.
    [33] HUANG Binbin, LIAN Dongze, LUO Weixin, et al. Look before you leap: Learning landmark features for one-stage visual grounding[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 16883–16892. doi: 10.1109/cvpr46437.2021.01661.
    [34] YANG Zhengyuan, CHEN Tianlang, WANG Liwei, et al. Improving one-stage visual grounding by recursive sub-query construction[C]. The 16th European Conference on Computer Vision, Glasgow, UK, 2020: 387–404. doi: 10.1007/978-3-030-58568-6_23.
    [35] DENG Jiajun, YANG Zhengyuan, CHEN Tianlang, et al. TransVG: End-to-end visual grounding with transformers[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 1749–1759. doi: 10.1109/iccv48922.2021.00179.

Publication history
  • Received: 2024-04-09
  • Revised: 2024-07-19
  • Available online: 2024-08-03
  • Published: 2024-11-10
