Citation: LIU Zhongmin, LI Zhenhua, HU Wenjin. Vision-Language Tracking Method Combining Bi-level Routing Perception and Scattered Vision Transformation[J]. Journal of Electronics & Information Technology, 2024, 46(11): 4236–4246. doi: 10.11999/JEIT240257.
 
[1] GUO Mingzhe, ZHANG Zhipeng, JING Liping, et al. Divert more attention to vision-language object tracking[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. doi: 10.1109/TPAMI.2024.3409078.
[2] XU Tingfa, WANG Ying, SHI Guokai, et al. Research progress in fundamental architecture of deep learning-based single object tracking method[J]. Acta Optica Sinica, 2023, 43(15): 1510003. doi: 10.3788/AOS230746. (in Chinese)
[3] WANG Xiao, SHU Xiujun, ZHANG Zhipeng, et al. Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 13758–13768. doi: 10.1109/cvpr46437.2021.01355.
[4] LI Yihao, YU Jun, CAI Zhongpeng, et al. Cross-modal target retrieval for tracking by natural language[C]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, New Orleans, USA, 2022: 4927–4936. doi: 10.1109/cvprw56347.2022.00540.
[5] FENG Qi, ABLAVSKY V, BAI Qinxun, et al. Siamese natural language tracker: Tracking by natural language descriptions with Siamese trackers[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 5847–5856. doi: 10.1109/cvpr46437.2021.00579.
[6] ZHENG Yaozong, ZHONG Bineng, LIANG Qihua, et al. Toward unified token learning for vision-language tracking[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(4): 2125–2135. doi: 10.1109/TCSVT.2023.3301933.
[7] ZHAO Haojie, WANG Xiao, WANG Dong, et al. Transformer vision-language tracking via proxy token guided cross-modal fusion[J]. Pattern Recognition Letters, 2023, 168: 10–16. doi: 10.1016/j.patrec.2023.02.023.
[8] ZHOU Li, ZHOU Zikun, MAO Kaige, et al. Joint visual grounding and tracking with natural language specification[C]. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 23151–23160. doi: 10.1109/cvpr52729.2023.02217.
[9] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. The 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
[10] SONG Zikai, LUO Run, YU Junqing, et al. Compact transformer tracker with correlative masked modeling[C]. The 37th AAAI Conference on Artificial Intelligence, Washington, USA, 2023: 2321–2329. doi: 10.1609/aaai.v37i2.25327.
[11] WANG Yuanyun, ZHANG Wenshuang, LAI Changwang, et al. Adaptive temporal feature modeling for visual tracking via cross-channel learning[J]. Knowledge-Based Systems, 2023, 265: 110380. doi: 10.1016/j.knosys.2023.110380.
[12] ZHAO Moju, OKADA K, and INABA M. TrTr: Visual tracking with transformer[J]. arXiv: 2105.03817, 2021. doi: 10.48550/arXiv.2105.03817.
[13] TANG Chuanming, WANG Xiao, BAI Yuanchao, et al. Learning spatial-frequency transformer for visual object tracking[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2023, 33(9): 5102–5116. doi: 10.1109/tcsvt.2023.3249468.
[14] YE Botao, CHANG Hong, MA Bingpeng, et al. Joint feature learning and relation modeling for tracking: A one-stream framework[C]. The 17th European Conference on Computer Vision, Tel Aviv, Israel, 2022: 341–357. doi: 10.1007/978-3-031-20047-2_20.
[15] LI Zhenyang, TAO Ran, GAVVES E, et al. Tracking by natural language specification[C]. 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 7350–7358. doi: 10.1109/cvpr.2017.777.
[16] YANG Zhengyuan, KUMAR T, CHEN Tianlang, et al. Grounding-tracking-integration[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(9): 3433–3443. doi: 10.1109/tcsvt.2020.3038720.
[17] FENG Qi, ABLAVSKY V, BAI Qinxun, et al. Real-time visual object tracking with natural language description[C]. 2020 IEEE Winter Conference on Applications of Computer Vision, Snowmass, USA, 2020: 689–698. doi: 10.1109/wacv45572.2020.9093425.
[18] ZHANG Xin, SONG Yingze, SONG Tingting, et al. AKConv: Convolutional kernel with arbitrary sampled shapes and arbitrary number of parameters[J]. arXiv: 2311.11587, 2023. doi: 10.48550/arXiv.2311.11587.
[19] SELESNICK I W, BARANIUK R G, and KINGSBURY N C. The dual-tree complex wavelet transform[J]. IEEE Signal Processing Magazine, 2005, 22(6): 123–151. doi: 10.1109/MSP.2005.1550194.
[20] ROGOZHNIKOV A. Einops: Clear and reliable tensor manipulations with Einstein-like notation[C]. The 10th International Conference on Learning Representations, 2022: 1–21.
[21] MAO Junhua, HUANG J, TOSHEV A, et al. Generation and comprehension of unambiguous object descriptions[C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 11–20. doi: 10.1109/cvpr.2016.9.
[22] FAN Heng, LIN Liting, YANG Fan, et al. LaSOT: A high-quality benchmark for large-scale single object tracking[C]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 5369–5378. doi: 10.1109/cvpr.2019.00552.
[23] DEVLIN J, CHANG Mingwei, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, USA, 2019: 4171–4186. doi: 10.18653/v1/N19-1423.
[24] LIU Ze, LIN Yutong, CAO Yue, et al. Swin transformer: Hierarchical vision transformer using shifted windows[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 9992–10002. doi: 10.1109/iccv48922.2021.00986.
[25] YANG Li, XU Yan, YUAN Chunfeng, et al. Improving visual grounding with visual-linguistic verification and iterative reasoning[C]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 9489–9498. doi: 10.1109/cvpr52688.2022.00928.
[26] YAN Bin, PENG Houwen, FU Jianlong, et al. Learning spatio-temporal transformer for visual tracking[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 10428–10437. doi: 10.1109/iccv48922.2021.01028.
[27] ZHANG Zhipeng, LIU Yihao, WANG Xiao, et al. Learn to match: Automatic matching network design for visual tracking[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 13319–13328. doi: 10.1109/iccv48922.2021.01309.
[28] WANG Ning, ZHOU Wengang, WANG Jie, et al. Transformer meets tracker: Exploiting temporal context for robust visual tracking[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 1571–1580. doi: 10.1109/cvpr46437.2021.00162.
[29] CHEN Xin, YAN Bin, ZHU Jiawen, et al. Transformer tracking[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 8122–8131. doi: 10.1109/CVPR46437.2021.00803.
[30] MAYER C, DANELLJAN M, PAUDEL D P, et al. Learning target candidate association to keep track of what not to track[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 13424–13434. doi: 10.1109/iccv48922.2021.01319.
[31] LIN Liting, FAN Heng, ZHANG Zhipeng, et al. SwinTrack: A simple and strong baseline for transformer tracking[C]. The 36th Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 16743–16754.
[32] LIU Daqing, ZHANG Hanwang, ZHA Zhengjun, et al. Learning to assemble neural module tree networks for visual grounding[C]. 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 4672–4681. doi: 10.1109/iccv.2019.00477.
[33] HUANG Binbin, LIAN Dongze, LUO Weixin, et al. Look before you leap: Learning landmark features for one-stage visual grounding[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 16883–16892. doi: 10.1109/cvpr46437.2021.01661.
[34] YANG Zhengyuan, CHEN Tianlang, WANG Liwei, et al. Improving one-stage visual grounding by recursive sub-query construction[C]. The 16th European Conference on Computer Vision, Glasgow, UK, 2020: 387–404. doi: 10.1007/978-3-030-58568-6_23.
[35] DENG Jiajun, YANG Zhengyuan, CHEN Tianlang, et al. TransVG: End-to-end visual grounding with transformers[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 1749–1759. doi: 10.1109/iccv48922.2021.00179.
