Citation: CAI Hua, RAN Yue, FU Qiang, LI Junyan, ZHANG Chenjie, SUN Junxi. Multi-granularity Text Perception and Hierarchical Feature Interaction Method for Visual Grounding[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250387

Multi-granularity Text Perception and Hierarchical Feature Interaction Method for Visual Grounding

doi: 10.11999/JEIT250387 cstr: 32379.14.JEIT250387
Funds: The Major Program of the National Natural Science Foundation of China (61890963), The Joint Funds of the National Natural Science Foundation of China (U2341226), Jilin Provincial Department of Science and Technology (20240302089GX)
  • Received Date: 2025-05-08
  • Rev Recd Date: 2025-09-12
  • Available Online: 2025-09-18
  • Objective: Visual grounding requires effective use of textual information for accurate target localization. Traditional methods emphasize feature fusion but often neglect the guiding role of text, which limits localization accuracy. To address this limitation, a Multi-granularity Text Perception and Hierarchical Feature Interaction method for Visual Grounding (ThiVG) is proposed. In this method, a hierarchical feature interaction module is progressively incorporated into the image encoder to enhance the semantic representation of image features. A multi-granularity text perception module generates weighted text with spatial and semantic enhancement, and a preliminary Hadamard product-based fusion strategy refines image features for cross-modal fusion. Experimental results show that the proposed method substantially improves localization accuracy and effectively alleviates the performance bottleneck arising from over-reliance on feature fusion modules in conventional approaches.
  • Methods: The proposed method comprises an image–text feature extraction network, a hierarchical feature interaction module, a multi-granularity text perception module, and an image–text cross-modal fusion and target localization network (Fig. 1). The image–text feature extraction network includes image and text branches for extracting their respective features (Fig. 2). In the image branch, text features are incorporated into the image encoder through the hierarchical feature interaction module (Fig. 3), enabling textual information to filter and update image features and thereby strengthening their semantic expressiveness. The multi-granularity text perception module employs three perception mechanisms to fully extract spatial and semantic information from the text (Fig. 4). It generates weighted text, which is preliminarily fused with image features through a Hadamard product-based strategy, providing fine-grained image features for subsequent cross-modal fusion (a minimal sketch of these two components follows this paragraph). The image–text cross-modal fusion module then deeply integrates image and text features using a Transformer encoder (Fig. 5), capturing their complex relationships. Finally, a Multilayer Perceptron (MLP) performs regression to predict the bounding box coordinates of the target. This design not only achieves effective integration of image and text information but also improves accuracy and robustness in visual grounding through hierarchical feature interaction and deep cross-modal fusion, offering a novel approach to complex localization challenges.
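The following is a minimal PyTorch-style sketch (not the authors' code) of the two text-guided components described above: a hierarchical interaction block that lets text filter and update image features inside the image branch, and a preliminary Hadamard-product fusion of adaptively weighted text with image features. The module names, dimensions, single-layer cross-attention, and sigmoid gating are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn


class HierarchicalInteractionBlock(nn.Module):
    """Injects text features into one stage of the image encoder (assumed design)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())  # text-driven filtering
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # Image tokens attend to text tokens; a sigmoid gate controls how much is updated.
        attended, _ = self.cross_attn(img_tokens, txt_tokens, txt_tokens)
        return self.norm(img_tokens + self.gate(attended) * attended)


class PreliminaryHadamardFusion(nn.Module):
    """Weights text tokens, pools them, and fuses with image tokens by Hadamard product."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.token_scorer = nn.Linear(dim, 1)  # adaptive per-token weights (assumed form)
        self.proj = nn.Linear(dim, dim)

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.token_scorer(txt_tokens), dim=1)    # (B, L, 1)
        weighted_txt = (weights * txt_tokens).sum(dim=1, keepdim=True)   # (B, 1, C)
        # Element-wise (Hadamard) product broadcasts the pooled text over all image tokens.
        return img_tokens * torch.sigmoid(self.proj(weighted_txt))


if __name__ == "__main__":
    img = torch.randn(2, 400, 256)   # e.g. a 20x20 feature map flattened to tokens
    txt = torch.randn(2, 20, 256)    # 20 text tokens
    img = HierarchicalInteractionBlock()(img, txt)
    img = PreliminaryHadamardFusion()(img, txt)
    print(img.shape)                 # torch.Size([2, 400, 256])
```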
  • Results and Discussions: Comparison experiments demonstrate that the proposed method achieves substantial accuracy gains across five benchmark visual grounding datasets (Tables 1 and 2), with particularly strong performance on the long-text RefCOCOg dataset. Although the model has more parameters, comparisons of parameter counts and training and inference times indicate that its overall performance still exceeds that of traditional methods (Table 3). Ablation studies verify the contribution of each key module (Table 4). The hierarchical feature interaction module improves the semantic representation of image features by incorporating textual information into the image encoder (Table 5). The multi-granularity text perception module enhances attention to key textual components through perception mechanisms and adaptive weighting (Table 6). By avoiding excessive modification of the text structure, it markedly strengthens the model's capacity to process long text and complex sentences. Experiments on the number of encoder layers in the cross-modal fusion module show that a 6-layer deep fusion encoder effectively filters irrelevant background information (Table 7), yielding a more precise feature representation for the localization regression MLP (a minimal sketch of such a fusion-and-regression head is given after this abstract). Generalization tests and visualization analyses further demonstrate that the proposed method maintains high adaptability and accuracy across diverse and challenging localization scenarios (Figs. 6 and 7).
  • Conclusions: This study proposes a visual grounding algorithm that integrates multi-granularity text perception with hierarchical feature interaction, effectively addressing the under-utilization of textual information and the over-reliance on feature fusion alone in existing approaches. Key innovations include the hierarchical feature interaction module in the image branch, which markedly enhances the semantic representation of image features; the multi-granularity text perception module, which fully exploits textual information to generate weighted text with spatial and semantic enhancement; and a preliminary Hadamard product-based fusion strategy, which provides fine-grained image representations for cross-modal fusion. Experimental results show that the proposed method achieves substantial accuracy improvements on standard visual grounding benchmarks and demonstrates strong adaptability and robustness across diverse and complex localization scenarios. Future work will extend the method to more diverse text inputs and further improve localization performance in challenging visual environments.
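For the fusion-and-regression stage, the following is a minimal sketch under stated assumptions: image and text tokens are concatenated with a learnable localization token, passed through a 6-layer Transformer encoder (matching the layer count found best in Table 7), and an MLP regresses normalized box coordinates from that token. The localization-token design, head counts, dimensions, and box format are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class FusionAndBoxHead(nn.Module):
    """Deep cross-modal fusion with a Transformer encoder plus an MLP box regressor (assumed design)."""

    def __init__(self, dim: int = 256, heads: int = 8, layers: int = 6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)  # 6-layer deep fusion
        self.loc_token = nn.Parameter(torch.zeros(1, 1, dim))           # learnable localization token (assumption)
        self.box_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                     nn.Linear(dim, dim), nn.ReLU(),
                                     nn.Linear(dim, 4))                 # 4 box values, e.g. (cx, cy, w, h)

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        b = img_tokens.size(0)
        tokens = torch.cat([self.loc_token.expand(b, -1, -1), img_tokens, txt_tokens], dim=1)
        fused = self.encoder(tokens)
        # Normalized box coordinates are regressed from the localization token.
        return self.box_mlp(fused[:, 0]).sigmoid()


if __name__ == "__main__":
    boxes = FusionAndBoxHead()(torch.randn(2, 400, 256), torch.randn(2, 20, 256))
    print(boxes.shape)  # torch.Size([2, 4])
```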