Citation: CAI Hua, RAN Yue, FU Qiang, LI Junyan, ZHANG Chenjie, SUN Junxi. Multi-granularity text perception and hierarchical feature interaction method for visual grounding[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250387.
[1] XIAO Linhui, YANG Xiaoshan, LAN Xiangyuan, et al. Towards visual grounding: A survey[EB/OL]. https://arxiv.org/abs/2412.20206, 2024.
[2] LI Yong, WANG Yuanzhi, and CUI Zhen. Decoupled multimodal distilling for emotion recognition[C]. The 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 6631–6640. doi: 10.1109/CVPR52729.2023.00641.
[3] HU Ronghang, ROHRBACH M, ANDREAS J, et al. Modeling relationships in referential expressions with compositional modular networks[C]. The 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 4418–4427. doi: 10.1109/CVPR.2017.470.
[4] YU Licheng, LIN Zhe, SHEN Xiaohui, et al. MAttNet: Modular attention network for referring expression comprehension[C]. The 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 1307–1315. doi: 10.1109/CVPR.2018.00142.
[5] YANG Sibei, LI Guanbin, and YU Yizhou. Dynamic graph attention for referring expression comprehension[C]. The 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 4643–4652. doi: 10.1109/ICCV.2019.00474.
[6] CHEN Long, MA Wenbo, XIAO Jun, et al. Ref-NMS: Breaking proposal bottlenecks in two-stage referring expression grounding[C]. The 35th AAAI Conference on Artificial Intelligence, Virtual Event, 2021: 1036–1044. doi: 10.1609/aaai.v35i2.16188.
[7] PLUMMER B A, KORDAS P, KIAPOUR M H, et al. Conditional image-text embedding networks[C]. The 15th European Conference on Computer Vision, Munich, Germany, 2018: 258–274. doi: 10.1007/978-3-030-01258-8_16.
[8] YU Zhou, YU Jun, XIANG Chenchao, et al. Rethinking diversified and discriminative proposal generation for visual grounding[C]. The 27th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 2018: 1114–1120. doi: 10.24963/ijcai.2018/155.
[9] YANG Zhengyuan, GONG Boqing, WANG Liwei, et al. A fast and accurate one-stage approach to visual grounding[C]. The 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 4682–4692. doi: 10.1109/ICCV.2019.00478.
[10] LIAO Yue, LIU Si, LI Guanbin, et al. A real-time cross-modality correlation filtering method for referring expression comprehension[C]. The 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 10877–10886. doi: 10.1109/CVPR42600.2020.01089.
[11] YANG Zhengyuan, CHEN Tianlang, WANG Liwei, et al. Improving one-stage visual grounding by recursive sub-query construction[C]. The 16th European Conference on Computer Vision, Glasgow, UK, 2020: 387–404. doi: 10.1007/978-3-030-58568-6_23.
[12] HUANG Binbin, LIAN Dongze, LUO Weixin, et al. Look before you leap: Learning landmark features for one-stage visual grounding[C]. The 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 16883–16892. doi: 10.1109/CVPR46437.2021.01661.
[13] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. The 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
[14] DENG Jiajun, YANG Zhengyuan, CHEN Tianlang, et al. TransVG: End-to-end visual grounding with transformers[C]. The 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 1749–1759. doi: 10.1109/ICCV48922.2021.00179.
[15] YANG Li, XU Yan, YUAN Chunfeng, et al. Improving visual grounding with visual-linguistic verification and iterative reasoning[C]. The 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 9489–9498. doi: 10.1109/CVPR52688.2022.00928.
[16] DU Ye, FU Zehua, LIU Qingjie, et al. Visual grounding with transformers[C]. The 2022 IEEE International Conference on Multimedia and Expo, Taipei, China, 2022: 1–6. doi: 10.1109/ICME52920.2022.9859880.
[17] LI Kun, LI Jiaxiu, GUO Dan, et al. Transformer-based visual grounding with cross-modality interaction[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2023, 19(6): 183. doi: 10.1145/3587251.
[18] TANG Wei, LI Liang, LIU Xuejing, et al. Context disentangling and prototype inheriting for robust visual grounding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(5): 3213–3229. doi: 10.1109/TPAMI.2023.3339628.
[19] SU Wei, MIAO Peihan, DOU Huanzhang, et al. ScanFormer: Referring expression comprehension by iteratively scanning[C]. The 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 13449–13458. doi: 10.1109/CVPR52733.2024.01277.
[20] CHEN Wei, CHEN Long, and WU Yu. An efficient and effective transformer decoder-based framework for multi-task visual grounding[C]. The 18th European Conference on Computer Vision, Milan, Italy, 2024: 125–141. doi: 10.1007/978-3-031-72995-9_8.
[21] YAO Haibo, WANG Lipeng, CAI Chengtao, et al. Language conditioned multi-scale visual attention networks for visual grounding[J]. Image and Vision Computing, 2024, 150: 105242. doi: 10.1016/j.imavis.2024.105242.
[22] CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with transformers[C]. The 16th European Conference on Computer Vision, Glasgow, UK, 2020: 213–229. doi: 10.1007/978-3-030-58452-8_13.
[23] DEVLIN J, CHANG Mingwei, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]. The 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, Minneapolis, USA, 2019: 4171–4186. doi: 10.18653/v1/N19-1423.
[24] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. The 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770–778. doi: 10.1109/CVPR.2016.90.
[25] DING Bo, ZHANG Libao, QIN Jian, et al. Zero-shot 3D shape classification based on semantic-enhanced language-image pre-training model[J]. Journal of Electronics & Information Technology, 2024, 46(8): 3314–3323. doi: 10.11999/JEIT231161.
[26] JAIN R, RAI R S, JAIN S, et al. Real time sentiment analysis of natural language using multimedia input[J]. Multimedia Tools and Applications, 2023, 82(26): 41021–41036. doi: 10.1007/s11042-023-15213-3.
[27] HAN Tian, ZHANG Zhu, REN Mingyuan, et al. Text emotion recognition based on XLNet-BiGRU-Att[J]. Electronics, 2023, 12(12): 2704. doi: 10.3390/electronics12122704.
[28] ZHANG Xiangsen, WU Zhongqiang, LIU Ke, et al. Text sentiment classification based on BERT embedding and sliced multi-head self-attention Bi-GRU[J]. Sensors, 2023, 23(3): 1481. doi: 10.3390/s23031481.
[29] YU Jun, ZHANG Donglin, SHU Zhenqiu, et al. Adaptive multi-modal fusion hashing via Hadamard matrix[J]. Applied Intelligence, 2022, 52(15): 17170–17184. doi: 10.1007/s10489-022-03367-w.
[30] YU Licheng, POIRSON P, YANG Shan, et al. Modeling context in referring expressions[C]. The 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 2016: 69–85. doi: 10.1007/978-3-319-46475-6_5.
[31] MAO Junhua, HUANG J, TOSHEV A, et al. Generation and comprehension of unambiguous object descriptions[C]. The 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 11–20. doi: 10.1109/CVPR.2016.9.
[32] KAZEMZADEH S, ORDONEZ V, MATTEN M, et al. ReferItGame: Referring to objects in photographs of natural scenes[C]. The 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 2014: 787–798. doi: 10.3115/v1/D14-1086.
[33] PLUMMER B A, WANG Liwei, CERVANTES C M, et al. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models[C]. The 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 2015: 2641–2649. doi: 10.1109/ICCV.2015.303.