Multi-granularity Text Perception and Hierarchical Feature Interaction Method for Visual Grounding

CAI Hua, RAN Yue, FU Qiang, LI Junyan, ZHANG Chenjie, SUN Junxi

Citation: CAI Hua, RAN Yue, FU Qiang, LI Junyan, ZHANG Chenjie, SUN Junxi. Multi-granularity Text Perception and Hierarchical Feature Interaction Method for Visual Grounding[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250387


doi: 10.11999/JEIT250387 cstr: 32379.14.JEIT250387
Funds: The Major Program of the National Natural Science Foundation of China (61890963), the Joint Funds of the National Natural Science Foundation of China (U2341226), and the Jilin Provincial Department of Science and Technology (20240302089GX)
Details
    Author biographies:

    CAI Hua: Male, Professor. His research interests include computer vision and natural language processing

    RAN Yue: Male, Master's student. His research interests include computer vision and vision-language multimodality

    FU Qiang: Male, Professor. His research interests include optical transmission characteristics testing and multi-dimensional imaging detection

    LI Junyan: Male, Master's student. His research interests include computer vision and vision-language multimodality

    ZHANG Chenjie: Female, Associate Professor. Her research interests include machine learning and image processing

    SUN Junxi: Male, Professor. His research interests include AI vision and intelligent perception technology

    Corresponding author:

    CAI Hua, caihua@cust.edu.cn

  • CLC number: TN911.73; TP391.41

  • Abstract: Existing visual grounding methods show clear deficiencies in text-guided object localization and feature fusion: they fail to exploit textual information fully, and their overall performance relies too heavily on the fusion stage that follows feature extraction. To address this problem, this paper proposes a visual grounding method based on multi-granularity text perception and hierarchical feature interaction. A hierarchical feature interaction module is introduced into the image branch, using textual information to enhance text-relevant image features, while a multi-granularity text perception module mines the semantic content of the text to generate spatially and semantically enhanced weighted text. On this basis, a preliminary fusion strategy based on the Hadamard product fuses the weighted text with the image, providing a more refined image representation for cross-modal feature fusion. A Transformer encoder then performs cross-modal feature fusion, and a multilayer perceptron regresses the localization coordinates. Experimental results show that the proposed method achieves significant accuracy gains on five classic visual grounding datasets and resolves the performance bottleneck caused by the over-reliance of traditional methods on the feature fusion module.
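    A minimal PyTorch-style sketch of the Hadamard-product preliminary fusion described above is given below; the projection layer, tensor shapes, and the names `PreliminaryFusion` and `text_proj` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PreliminaryFusion(nn.Module):
    """Hadamard-product fusion of weighted text and image features (illustrative sketch)."""
    def __init__(self, text_dim: int = 768, visual_dim: int = 256):
        super().__init__()
        # Project pooled text features into the visual feature space so the
        # element-wise (Hadamard) product is well defined.
        self.text_proj = nn.Linear(text_dim, visual_dim)

    def forward(self, img_feats: torch.Tensor, weighted_text: torch.Tensor) -> torch.Tensor:
        # img_feats:      (B, N_patches, visual_dim)  text-filtered image features
        # weighted_text:  (B, N_tokens,  text_dim)    adaptively weighted text features
        text_vec = self.text_proj(weighted_text.mean(dim=1))    # (B, visual_dim)
        return img_feats * text_vec.unsqueeze(1)                # broadcast Hadamard product


# Usage with random tensors standing in for real features.
fusion = PreliminaryFusion()
fused = fusion(torch.randn(2, 400, 256), torch.randn(2, 20, 768))
print(fused.shape)  # torch.Size([2, 400, 256])
```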
  • Figure 1  ThiVG network architecture

    Figure 2  Image and text feature extraction network

    Figure 3  Hierarchical feature interaction module

    Figure 4  Multi-granularity text perception module

    Figure 5  Image-text cross-modal fusion and object localization network

    Figure 6  Generalization test results in real-world scenes

    Figure 7  Visualization results of the proposed method

    Algorithm 1  ThiVG

     Input: original image ${\boldsymbol{I}}$, original text ${\boldsymbol{T}}$
     Output: visual grounding box coordinates $(x,y,h,w)$
     (1) ${\mathrm{Function}}\; {\mathrm{ThiVG}}({\boldsymbol{I}}{,}{\boldsymbol{T}})$
     (2)  // image and text feature extraction
     (3)  $ {{\boldsymbol{I}}_{\rm m}} \leftarrow {\boldsymbol{I}},{{\boldsymbol{T}}_{\rm m}} \leftarrow {\boldsymbol{T}} $ // generate the masked image and masked text from the originals
     (4)  $ {{\boldsymbol{F}}_{\rm T}} \leftarrow {\mathrm{BERT}}({\boldsymbol{T}},{{\boldsymbol{T}}_{\rm m}}); $
     (5)  $ {\boldsymbol{Z}},{{\boldsymbol{Z}}_{\rm m}} \leftarrow {\mathrm{ResNet}}({\boldsymbol{I}},{{\boldsymbol{I}}_{\rm m}}); $
     (6)  ${\mathrm{for}}\; n = 1\; {\mathrm{to}}\; 6\; {\mathrm{do}}$
     (7)   ${\boldsymbol{F}}_{\mathrm{I}}^n \leftarrow {\mathrm{Transformer}}({\boldsymbol{Z}},{{\boldsymbol{Z}}_{\rm m}},{{\boldsymbol{F}}_{\rm T}});$
     (8)   ${\boldsymbol{F}}_{{\mathrm{IT}}}^n \leftarrow {\mathrm{HFI}}({\boldsymbol{F}}_{\mathrm{I}}^n,{{\boldsymbol{F}}_{\rm T}});$ // hierarchical feature interaction
     (9)   $ {\boldsymbol{Z}} \leftarrow {\boldsymbol{F}}_{{\mathrm{IT}}}^n; $ // feature update
     (10) ${{\boldsymbol{F}}'_{{\mathrm{IT}}}} \leftarrow {\boldsymbol{Z}};$ // text-filtered image features
     (11) // multi-granularity text perception
     (12) ${\boldsymbol{F}}_{\rm T}^{{\text{bi-gru}}} \leftarrow {\text{Bi-GRU}}({{\boldsymbol{F}}_{\rm T}});$ // text hidden-state features
     (13) ${{\boldsymbol{W}}_{{\mathrm{CDS}}}} \leftarrow {\mathrm{MTP}}({\mathrm{Linear}}({\boldsymbol{F}}_{\rm T}^{{\text{bi-gru}}}));$ // perception weight generation
     (14) ${{\boldsymbol{F}}'_{\rm T}} \leftarrow {\mathrm{ATW}}({{\boldsymbol{W}}_{{\mathrm{CDS}}}},{{\boldsymbol{F}}_{\rm T}});$ // adaptive text weighting
     (15) ${{\boldsymbol{F}}_u} \leftarrow {\mathrm{Hadamard}}({{\boldsymbol{F}}'_{{\mathrm{IT}}}},{{\boldsymbol{F}}'_{\rm T}});$ // Hadamard product (preliminary fusion)
     (16) // image-text cross-modal fusion (deep fusion)
     (17) ${\mathrm{for}}\; n = 1\; {\mathrm{to}}\; 6\; {\mathrm{do}}$
     (18)  ${\boldsymbol{F}}_{\mathrm{u}}^n \leftarrow {\mathrm{Transformer}}({{\boldsymbol{F}}_u},{{\boldsymbol{Z}}_{\rm m}},{{\boldsymbol{F}}_{\rm T}});$
     (19)  ${\mathrm{REG}} \leftarrow {\boldsymbol{F}}_{\mathrm{u}}^n;$ // update the REG token
     (20) Regress the bounding box $(x,y,h,w) \leftarrow {\mathrm{MLP}}({\mathrm{REG}});$
     (21) Compute the total loss ${L_{{\mathrm{total}}}} = {L_{{\mathrm{L1}}}} + {L_{{\mathrm{GIOU}}}};$
     (22) return
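
    Read as a forward pass, Algorithm 1 maps onto a standard PyTorch skeleton. The sketch below mirrors its control flow under stated assumptions: the encoders are stand-in linear layers for ResNet/BERT features, and `hfi_layers`, `bigru`, `weight_head`, `fusion`, and `box_head` are hypothetical module names with illustrative dimensions, not the authors' released code.

```python
import torch
import torch.nn as nn

class ThiVGSketch(nn.Module):
    """Illustrative skeleton following Algorithm 1: feature extraction ->
    hierarchical feature interaction -> multi-granularity text perception ->
    Hadamard pre-fusion -> cross-modal Transformer fusion -> MLP box regression."""

    def __init__(self, d_model: int = 256, n_layers: int = 6):
        super().__init__()
        # Placeholders for the visual and textual encoders (ResNet and BERT in the paper).
        self.visual_encoder = nn.Linear(2048, d_model)   # stands in for ResNet features
        self.text_encoder = nn.Linear(768, d_model)      # stands in for BERT features
        # Hierarchical feature interaction: one text-guided cross-attention block per layer.
        self.hfi_layers = nn.ModuleList(
            [nn.MultiheadAttention(d_model, 8, batch_first=True) for _ in range(n_layers)]
        )
        # Multi-granularity text perception: Bi-GRU hidden states + linear scorer.
        self.bigru = nn.GRU(d_model, d_model // 2, bidirectional=True, batch_first=True)
        self.weight_head = nn.Linear(d_model, 1)
        # Cross-modal fusion encoder and REG token, as in the second loop of the pseudocode.
        enc = nn.TransformerEncoderLayer(d_model, 8, batch_first=True)
        self.fusion = nn.TransformerEncoder(enc, n_layers)
        self.reg_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.box_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                      nn.Linear(d_model, 4))  # (x, y, h, w)

    def forward(self, visual_feats, text_feats):
        # visual_feats: (B, N, 2048) backbone features; text_feats: (B, L, 768) BERT features.
        z = self.visual_encoder(visual_feats)                 # (B, N, d)
        f_t = self.text_encoder(text_feats)                   # (B, L, d)
        # Steps (6)-(9): iteratively let the text guide the image features.
        for attn in self.hfi_layers:
            guided, _ = attn(query=z, key=f_t, value=f_t)
            z = z + guided                                    # feature update
        # Steps (12)-(14): token weights from Bi-GRU hidden states (adaptive text weighting).
        hidden, _ = self.bigru(f_t)
        w = torch.softmax(self.weight_head(hidden), dim=1)    # (B, L, 1)
        f_t_weighted = w * f_t
        # Step (15): Hadamard-product preliminary fusion with a pooled text vector.
        fused = z * f_t_weighted.mean(dim=1, keepdim=True)
        # Steps (17)-(19): deep cross-modal fusion with a prepended REG token.
        tokens = torch.cat(
            [self.reg_token.expand(z.size(0), -1, -1), fused, f_t_weighted], dim=1)
        reg = self.fusion(tokens)[:, 0]                       # updated REG token
        # Step (20): regress normalized box coordinates.
        return self.box_head(reg).sigmoid()


# Smoke test with random features in place of real ResNet/BERT outputs.
model = ThiVGSketch()
boxes = model(torch.randn(2, 400, 2048), torch.randn(2, 20, 768))
print(boxes.shape)  # torch.Size([2, 4])
```

    Training would combine an L1 loss and a GIoU loss on the predicted box, matching step (21) of the pseudocode.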

    Table 1  Comparison experiments on the RefCOCO, RefCOCO+, and RefCOCOg datasets (%)

    Category | Model | Backbone | RefCOCO (val/testA/testB) | RefCOCO+ (val/testA/testB) | RefCOCOg (val-g/val-u/test-u)
    Two-stage | CMN[3] | VGG16 | -/71.03/65.77 | -/54.32/47.76 | 57.47/-/-
    Two-stage | MAttNet[4] | ResNet101 | 76.65/71.14/69.99 | 65.33/71.62/56.02 | -/66.58/67.27
    Two-stage | DGA[5] | VGG16 | -/78.42/65.53 | -/69.07/51.99 | -/-/63.28
    Two-stage | Ref-NMS[6] | ResNet101 | 80.70/84.00/76.04 | 68.25/73.68/59.42 | -/70.55/70.62
    One-stage | FAOA[9] | DarkNet53 | 72.54/74.35/68.50 | 56.81/60.23/49.60 | 56.12/61.33/60.36
    One-stage | RCCF[10] | DLA34 | -/81.06/71.85 | -/70.35/56.32 | -/-/65.73
    One-stage | ReSC-Large[11] | DarkNet53 | 77.63/80.45/72.30 | 63.59/68.36/56.81 | 63.12/67.30/67.20
    One-stage | LBYL-Net[12] | DarkNet53 | 79.63/82.91/74.15 | 68.64/73.38/59.48 | 62.70/-/-
    One-stage Transformer | TransVG[14] | ResNet101 | 81.02/82.72/78.35 | 64.82/70.70/56.94 | 67.02/68.67/67.73
    One-stage Transformer | VLTVG[15] | ResNet101 | 84.77/87.24/80.49 | 74.19/78.93/65.17 | 72.98/76.04/74.18
    One-stage Transformer | VGTR[16] | ResNet101 | 79.30/82.16/74.38 | 64.40/70.85/55.84 | 64.05/66.83/67.28
    One-stage Transformer | CMI[17] | ResNet101 | 81.92/83.40/77.37 | 68.49/72.18/60.30 | 68.39/69.08/69.04
    One-stage Transformer | TransCP[18] | ResNet50 | 84.25/87.38/79.78 | 73.07/78.05/63.35 | 72.60/-/-
    One-stage Transformer | ScanFormer[19] | ViLT | 83.40/85.86/78.81 | 72.96/77.57/62.50 | -/74.10/74.14
    One-stage Transformer | EEVG[20] | ResNet101 | 82.19/85.34/77.18 | 71.35/76.76/60.73 | -/70.18/71.28
    One-stage Transformer | ThiVG (ours) | ResNet50 | 84.78/87.68/80.04 | 73.94/78.55/64.55 | 73.93/76.14/74.88
    One-stage Transformer | ThiVG (ours) | ResNet101 | 85.01/87.96/80.47 | 74.44/79.21/65.27 | 74.41/77.01/75.21
    Note: red, green, and blue numbers mark the top three results; "-" indicates the result is not reported.

    Table 2  Comparison experiments on the ReferItGame and Flickr30k Entities datasets (%)

    Category | Model | Backbone | ReferItGame test | Flickr30k Entities test
    Two-stage | CMN[3] | VGG16 | 28.33 | -
    Two-stage | MAttNet[4] | ResNet101 | 29.04 | -
    Two-stage | CITE[7] | ResNet101 | 35.07 | 61.33
    Two-stage | DDPN[8] | ResNet101 | 63.00 | 73.30
    One-stage | FAOA[9] | DarkNet53 | 60.67 | 68.71
    One-stage | RCCF[10] | DLA34 | 63.79 | -
    One-stage | ReSC-Large[11] | DarkNet53 | 64.60 | 69.28
    One-stage | LBYL-Net[12] | DarkNet53 | 64.47 | -
    One-stage Transformer | TransVG[14] | ResNet101 | 70.73 | 79.10
    One-stage Transformer | VLTVG[15] | ResNet101 | 71.98 | 79.84
    One-stage Transformer | VGTR[16] | ResNet101 | - | 75.32
    One-stage Transformer | CMI[17] | ResNet101 | 71.07 | 79.15
    One-stage Transformer | TransCP[18] | ResNet50 | 72.05 | 80.04
    One-stage Transformer | ScanFormer[19] | ViLT | 68.85 | -
    One-stage Transformer | LMSVA[21] | ResNet50 | 72.98 | -
    One-stage Transformer | ThiVG (ours) | ResNet50 | 73.42 | 80.69
    One-stage Transformer | ThiVG (ours) | ResNet101 | 74.03 | 81.26
    Note: red, green, and blue numbers mark the top three results; "-" indicates the result is not reported.

    Table 3  Comparison of ThiVG with classic models

    Model | Training parameters (M) | Avg. training time (ms) | Inference FLOPs (G) | Avg. inference time (ms)
    TransVG[14] | 149.52 | 183.75 | 41.28 | 31.93
    VLTVG[15] | 151.26 | 174.93 | 38.74 | 30.46
    ThiVG (ours) | 154.89 | 192.56 | 42.73 | 32.57

    Table 4  Ablation results for combinations of the hierarchical feature interaction (HFI) module and the multi-granularity text perception module without/with adaptive text weighting (ATW) (%)

    Model | val | testA | testB
    ThiVG(1) | 80.23 | 83.37 | 76.38
    ThiVG(2) | 83.15 | 86.31 | 78.65
    ThiVG(3) | 81.36 | 84.49 | 77.91
    ThiVG(4) | 82.07 | 85.21 | 78.45
    ThiVG(5) | 84.17 | 87.34 | 79.61
    ThiVG(6) | 84.78 | 87.68 | 80.04

    Table 5  Results for different numbers of hierarchical feature interaction layers (%)

    Layers | val | testA | testB
    0 | 82.07 | 85.21 | 78.45
    1 | 83.89 | 86.88 | 79.25
    2 | 84.17 | 87.02 | 79.53
    4 | 84.52 | 87.41 | 79.90
    6 | 84.78 | 87.68 | 80.04

    Table 6  Experimental results with different perception biases (%)

    Directional-word excitation ($ {\delta _{\mathrm{D}}} $) | Stop-word suppression ($ {\delta _{\mathrm{S}}} $) | Accuracy without ATW (val/testA/testB) | Accuracy with ATW (val/testA/testB)
    0.5 | –0.5 | 84.03/87.21/79.49 | 84.57/87.42/79.78
    1.0 | –1.0 | 84.27/87.41/79.69 | 84.69/87.60/79.95
    1.5 | –1.5 | 84.21/87.38/79.66 | 84.73/87.63/79.97
    2.0 | –2.0 | 84.17/87.34/79.61 | 84.78/87.68/80.04
    2.5 | –2.5 | 83.97/87.16/79.45 | 84.72/87.65/80.03
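
    Table 6 sweeps an additive excitation $ {\delta _{\mathrm{D}}} $ for directional words and a suppression $ {\delta _{\mathrm{S}}} $ for stop words. A minimal sketch of how such additive biases could be folded into per-token weights before a softmax is shown below; the word lists, zero initial scores, and the function name `biased_token_weights` are illustrative assumptions, not the paper's exact multi-granularity text perception module.

```python
import torch

def biased_token_weights(tokens, scores, delta_d=2.0, delta_s=-2.0):
    """Add a directional-word excitation and a stop-word suppression bias to raw
    token scores, then normalize with a softmax to obtain per-token weights."""
    directional = {"left", "right", "top", "bottom", "front", "behind", "above", "below"}
    stopwords = {"the", "a", "an", "of", "is", "that", "which"}
    bias = torch.tensor([
        delta_d if t in directional else (delta_s if t in stopwords else 0.0)
        for t in tokens
    ])
    return torch.softmax(scores + bias, dim=-1)


tokens = ["the", "dog", "on", "the", "left"]
weights = biased_token_weights(tokens, torch.zeros(len(tokens)))
print(weights)  # "left" receives the largest weight, "the" the smallest
```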

    Table 7  Results for different numbers of Transformer encoder layers in the image-text cross-modal fusion module (%)

    Encoder layers | val | testA | testB
    1 | 56.21 | 61.27 | 52.63
    2 | 68.74 | 72.03 | 63.37
    4 | 77.52 | 80.25 | 72.91
    6 | 84.78 | 87.68 | 80.04
  • [1] XIAO Linhui, YANG Xiaoshan, LAN Xiangyuan, et al. Towards visual grounding: A survey[EB/OL]. https://arxiv.org/abs/2412.20206, 2024.
    [2] LI Yong, WANG Yuanzhi, and CUI Zhen. Decoupled multimodal distilling for emotion recognition[C]. The 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 6631–6640. doi: 10.1109/CVPR52729.2023.00641.
    [3] HU Ronghang, ROHRBACH M, ANDREAS J, et al. Modeling relationships in referential expressions with compositional modular networks[C]. The 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 4418–4427. doi: 10.1109/CVPR.2017.470.
    [4] YU Licheng, LIN Zhe, SHEN Xiaohui, et al. MAttNet: Modular attention network for referring expression comprehension[C]. The 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 1307–1315. doi: 10.1109/CVPR.2018.00142.
    [5] YANG Sibei, LI Guanbin, and YU Yizhou. Dynamic graph attention for referring expression comprehension[C]. The 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 4643–4652. doi: 10.1109/ICCV.2019.00474.
    [6] CHEN Long, MA Wenbo, XIAO Jun, et al. Ref-NMS: Breaking proposal bottlenecks in two-stage referring expression grounding[C]. The 35th AAAI Conference on Artificial Intelligence, 2021: 1036–1044. doi: 10.1609/aaai.v35i2.16188.
    [7] PLUMMER B A, KORDAS P, KIAPOUR M H, et al. Conditional image-text embedding networks[C]. The 15th European Conference on Computer Vision, Munich, Germany, 2018: 258–274. doi: 10.1007/978-3-030-01258-8_16.
    [8] YU Zhou, YU Jun, XIANG Chenchao, et al. Rethinking diversified and discriminative proposal generation for visual grounding[C]. The Twenty-Seventh International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 2018: 1114–1120. doi: 10.24963/ijcai.2018/155.
    [9] YANG Zhengyuan, GONG Boqing, WANG Liwei, et al. A fast and accurate one-stage approach to visual grounding[C]. The 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 4682–4692. doi: 10.1109/ICCV.2019.00478.
    [10] LIAO Yue, LIU Si, LI Guanbin, et al. A real-time cross-modality correlation filtering method for referring expression comprehension[C]. The 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 10877–10886. doi: 10.1109/CVPR42600.2020.01089.
    [11] YANG Zhengyuan, CHEN Tianlang, WANG Liwei, et al. Improving one-stage visual grounding by recursive sub-query construction[C]. The 16th European Conference on Computer Vision, Glasgow, UK, 2020: 387–404. doi: 10.1007/978-3-030-58568-6_23.
    [12] HUANG Binbin, LIAN Dongze, LUO Weixin, et al. Look before you leap: Learning landmark features for one-stage visual grounding[C]. The 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 16883–16892. doi: 10.1109/CVPR46437.2021.01661.
    [13] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
    [14] DENG Jiajun, YANG Zhengyuan, CHEN Tianlang, et al. TransVG: End-to-end visual grounding with transformers[C]. The 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 1749–1759. doi: 10.1109/ICCV48922.2021.00179.
    [15] YANG Li, XU Yan, YUAN Chunfeng, et al. Improving visual grounding with visual-linguistic verification and iterative reasoning[C]. The 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 9489–9498. doi: 10.1109/CVPR52688.2022.00928.
    [16] DU Ye, FU Zehua, LIU Qingjie, et al. Visual grounding with transformers[C]. The 2022 IEEE International Conference on Multimedia and Expo, Taipei, China, 2022: 1–6. doi: 10.1109/ICME52920.2022.9859880.
    [17] LI Kun, LI Jiaxiu, GUO Dan, et al. Transformer-based visual grounding with cross-modality interaction[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2023, 19(6): 183. doi: 10.1145/3587251.
    [18] TANG Wei, LI Liang, LIU Xuejing, et al. Context disentangling and prototype inheriting for robust visual grounding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(5): 3213–3229. doi: 10.1109/TPAMI.2023.3339628.
    [19] SU Wei, MIAO Peihan, DOU Huanzhang, et al. ScanFormer: Referring expression comprehension by iteratively scanning[C]. The 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 13449–13458. doi: 10.1109/cvpr52733.2024.01277.
    [20] CHEN Wei, CHEN Long, and WU Yu. An efficient and effective transformer decoder-based framework for multi-task visual grounding[C]. The 18th European Conference on Computer Vision, Milan, Italy, 2024: 125–141. doi: 10.1007/978-3-031-72995-9_8.
    [21] YAO Haibo, WANG Lipeng, CAI Chengtao, et al. Language conditioned multi-scale visual attention networks for visual grounding[J]. Image and Vision Computing, 2024, 150: 105242. doi: 10.1016/j.imavis.2024.105242.
    [22] CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with transformers[C]. The 16th European Conference on Computer Vision, Glasgow, UK, 2020: 213–229. doi: 10.1007/978-3-030-58452-8_13.
    [23] DEVLIN J, CHANG Mingwei, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]. The 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, Minneapolis, USA, 2019: 4171–4186. doi: 10.18653/v1/N19-1423.
    [24] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. The 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770–778. doi: 10.1109/CVPR.2016.90.
    [25] DING Bo, ZHANG Libao, QIN Jian, et al. Zero-shot 3D shape classification based on semantic-enhanced language-image pre-training model[J]. Journal of Electronics & Information Technology, 2024, 46(8): 3314–3323. doi: 10.11999/JEIT231161. (in Chinese)
    [26] JAIN R, RAI R S, JAIN S, et al. Real time sentiment analysis of natural language using multimedia input[J]. Multimedia Tools and Applications, 2023, 82(26): 41021–41036. doi: 10.1007/s11042-023-15213-3.
    [27] HAN Tian, ZHANG Zhu, REN Mingyuan, et al. Text emotion recognition based on XLNet-BiGRU-Att[J]. Electronics, 2023, 12(12): 2704. doi: 10.3390/electronics12122704.
    [28] ZHANG Xiangsen, WU Zhongqiang, LIU Ke, et al. Text sentiment classification based on BERT embedding and sliced multi-head self-attention Bi-GRU[J]. Sensors, 2023, 23(3): 1481. doi: 10.3390/s23031481.
    [29] YU Jun, ZHANG Donglin, SHU Zhenqiu, et al. Adaptive multi-modal fusion hashing via Hadamard matrix[J]. Applied Intelligence, 2022, 52(15): 17170–17184. doi: 10.1007/s10489-022-03367-w.
    [30] YU Licheng, POIRSON P, YANG Shan, et al. Modeling context in referring expressions[C]. The 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 2016: 69–85. doi: 10.1007/978-3-319-46475-6_5.
    [31] MAO Junhua, HUANG J, TOSHEV A, et al. Generation and comprehension of unambiguous object descriptions[C]. The 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 11–20. doi: 10.1109/CVPR.2016.9.
    [32] KAZEMZADEH S, ORDONEZ V, MATTEN M, et al. ReferItGame: Referring to objects in photographs of natural scenes[C]. The 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 2014: 787–798. doi: 10.3115/v1/D14-1086.
    [33] PLUMMER B A, WANG Liwei, CERVANTES C M, et al. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models[C]. The 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 2015: 2641–2649. doi: 10.1109/ICCV.2015.303.
Publication history
  • Received: 2025-05-08
  • Revised: 2025-09-12
  • Published online: 2025-09-18
