Cross-modal Video Moment Retrieval Based on Enhancing Significant Features

YANG Jinfu, LIU Yubin, SONG Lin, YAN Xue

Citation: YANG Jinfu, LIU Yubin, SONG Lin, YAN Xue. Cross-modal Video Moment Retrieval Based on Enhancing Significant Features[J]. Journal of Electronics & Information Technology, 2022, 44(12): 4395-4404. doi: 10.11999/JEIT211101

doi: 10.11999/JEIT211101
Funds: The National Natural Science Foundation of China (61973009)
Details
    About the authors:

    YANG Jinfu: Male, Professor and Ph.D. supervisor. His research interests include pattern recognition, artificial intelligence and its applications

    LIU Yubin: Male, M.S. candidate. His research interests include computer vision and video understanding

    SONG Lin: Female, M.S. candidate. Her research interests include computer vision and deep learning

    YAN Xue: Female, M.S. candidate. Her research interests include computer vision and deep learning

    Corresponding author:

    YANG Jinfu, jfyang@bjut.edu.cn

  • CLC number: TN911.73; TP391

  • Abstract: With the continuous development of video capture devices and technology, the number of videos is growing rapidly, and accurately locating a target video moment within massive video collections is a challenging task. Given a text query, cross-modal video moment retrieval aims to find the video segment in a video corpus that matches the description. Most existing work focuses on matching the query text against candidate video moments while ignoring the contextual information of the video, so feature relationships are insufficiently represented during video understanding. To address this, this paper proposes a cross-modal video moment retrieval method based on enhancing significant features: a temporal adjacent network is constructed to learn the contextual information of the video, and lightweight residual channel attention is then used to highlight the salient features of video moments, improving the network's ability to understand video semantics. Experimental results on the public TACoS and ActivityNet Captions datasets show that the proposed method handles the video moment retrieval task better, outperforming mainstream matching-based methods and methods based on video-text feature relationships.
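The abstract mentions a lightweight residual channel attention module (RCA-W) used to emphasize salient moment features. The paper's exact design is not reproduced on this page, so the following is only a minimal PyTorch-style sketch of what a lightweight residual channel-attention block could look like, assuming an ECA-style [20] 1-D convolutional gate combined with a residual connection; the class name `LightweightResidualChannelAttention` and the kernel size `k` are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LightweightResidualChannelAttention(nn.Module):
    """Illustrative lightweight residual channel attention (not the paper's exact RCA-W).

    Input:  x of shape (B, C, N, N), e.g. a 2-D temporal-adjacent feature map.
    Output: x + x * channel_gate, same shape.
    """
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        # Global average pooling "squeezes" each channel to a single statistic.
        self.pool = nn.AdaptiveAvgPool2d(1)
        # A 1-D convolution across channels keeps the gate lightweight (as in ECA [20]),
        # instead of the two fully connected layers used by SE [17].
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, 1, c)      # (B, 1, C) per-channel statistics
        w = self.sigmoid(self.conv(w))      # (B, 1, C) channel gate in (0, 1)
        w = w.view(b, c, 1, 1)
        return x + x * w                    # residual connection keeps the original features

if __name__ == "__main__":
    feat = torch.randn(2, 512, 16, 16)      # hypothetical moment feature map
    out = LightweightResidualChannelAttention(512)(feat)
    print(out.shape)                        # torch.Size([2, 512, 16, 16])
```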
  • Figure 1  Overall architecture of the model

    Figure 2  Construction of the TAN feature map (n = 5); a code sketch of this construction follows the figure list

    Figure 3  Selecting candidate moments with CIoU

    Figure 4  Comparison between CIoU and IoU (Rank@1)

    Figure 5  Effect of different attention mechanisms on model recall (IoU = 0.5)

    Figure 6  Effect of hyperparameter $k$ on model recall (IoU = 0.5)

    Figure 7  Effect of hyperparameters $\alpha$ and $\beta$ on model recall (Rank@1, IoU = 0.5)

    Figure 8  Selected visualization results on TACoS
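Figures 2-4 concern how the temporal-adjacent feature map is built and how candidate moments are scored against the ground truth. As a rough illustration of this bookkeeping, the sketch below follows the 2D-TAN-style map of [4], in which cell (a, b) of an N x N map pools the clip features from clip a to clip b, and it scores a candidate against a ground-truth interval with plain temporal IoU; the paper's own construction and its CIoU-based selection (Figure 3) may differ, and the function names `build_2d_map` and `temporal_iou` are illustrative.

```python
import torch

def build_2d_map(clip_feats: torch.Tensor) -> torch.Tensor:
    """Build an N x N temporal-adjacent feature map from per-clip features.

    clip_feats: (N, D) features of N consecutive clips.
    Returns:    (D, N, N) map; cell (:, a, b) with a <= b holds the mean-pooled
                features of clips a..b, cells with a > b stay zero (invalid moments).
    """
    n, d = clip_feats.shape
    fmap = torch.zeros(d, n, n)
    for a in range(n):
        for b in range(a, n):
            fmap[:, a, b] = clip_feats[a:b + 1].mean(dim=0)
    return fmap

def temporal_iou(pred: tuple, gt: tuple) -> float:
    """Temporal IoU between two moments given as (start, end) in seconds or clip indices."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

if __name__ == "__main__":
    feats = torch.randn(5, 512)                # n = 5 clips, as in Figure 2
    fmap = build_2d_map(feats)                 # (512, 5, 5)
    print(fmap.shape, temporal_iou((1.0, 3.0), (2.0, 4.0)))  # IoU = 1/3
```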

    Table 1  Recall (%) of SFEN on the TACoS dataset ("–" indicates results not reported)

| Method | Rank@1, IoU=0.1 | Rank@1, IoU=0.3 | Rank@1, IoU=0.5 | Rank@5, IoU=0.1 | Rank@5, IoU=0.3 | Rank@5, IoU=0.5 |
|---|---|---|---|---|---|---|
| LTAN[3] | 20.4 | 15.4 | 9.9 | 45.6 | 31.2 | 20.1 |
| CTRL[1] | 24.3 | 18.3 | 13.3 | 48.7 | 36.7 | 25.4 |
| ACL[2] | 31.6 | 24.2 | 20.0 | 57.9 | 42.2 | 30.7 |
| ABLR[8] | 34.7 | 19.5 | 9.4 | – | – | – |
| DORi[11] | – | 31.8 | 28.7 | – | – | – |
| ACRN[5] | 24.2 | 19.5 | 14.6 | 47.4 | 35.0 | 24.9 |
| CMHN[31] | – | 31.0 | 26.2 | – | 46.0 | 36.7 |
| QSPN[12] | 25.3 | 20.2 | 15.2 | 53.2 | 36.7 | 25.3 |
| IIN-C3D[6] | – | 31.5 | 29.3 | – | 52.7 | 46.1 |
| ExCL[9] | – | 45.5 | 28.0 | – | – | – |
| 2D-TAN[4] | 47.6 | 37.3 | 25.3 | 70.3 | 57.8 | 45.0 |
| SFEN (ours) | 56.2 | 47.3 | 36.1 | 79.8 | 69.5 | 58.1 |
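The numbers in Tables 1 and 2 follow the usual "Rank@n, IoU=m" recall convention for moment retrieval: a query counts as correct when at least one of the model's top-n predicted moments has a temporal IoU of at least m with the ground-truth moment. A short, self-contained sketch of that evaluation (function and variable names are illustrative):

```python
from typing import List, Tuple

def rank_n_iou_m(predictions: List[List[Tuple[float, float]]],
                 ground_truth: List[Tuple[float, float]],
                 n: int, m: float) -> float:
    """Percentage of queries whose top-n predicted moments contain one with IoU >= m."""
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    hits = 0
    for preds, gt in zip(predictions, ground_truth):
        # preds are assumed sorted by model confidence; keep only the top n.
        if any(iou(p, gt) >= m for p in preds[:n]):
            hits += 1
    return 100.0 * hits / max(len(ground_truth), 1)

# Example: 2 queries, two predictions each, evaluated as Rank@1, IoU=0.5.
preds = [[(10.0, 20.0), (30.0, 40.0)], [(5.0, 9.0), (0.0, 4.0)]]
gts = [(12.0, 22.0), (0.0, 4.0)]
print(rank_n_iou_m(preds, gts, n=1, m=0.5))  # 50.0: only the first query's top-1 hits
```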

    Table 2  Recall (%) of SFEN on the ActivityNet Captions dataset

| Method | Rank@1, IoU=0.3 | Rank@1, IoU=0.5 | Rank@1, IoU=0.7 | Rank@5, IoU=0.3 | Rank@5, IoU=0.5 | Rank@5, IoU=0.7 |
|---|---|---|---|---|---|---|
| DORi[11] | 57.9 | 41.3 | 26.4 | – | – | – |
| CTRL[1] | 47.4 | 29.0 | 10.3 | 75.3 | 59.2 | 37.5 |
| ACRN[5] | 49.7 | 31.7 | 11.3 | 76.5 | 60.3 | 38.6 |
| MABAN[10] | – | 42.4 | 24.3 | – | – | – |
| CMHN[31] | 62.5 | 43.5 | 24.0 | 85.4 | 73.4 | 53.2 |
| QSPN[12] | 52.1 | 33.3 | 13.4 | 77.7 | 62.4 | 40.8 |
| TripNet[7] | 48.4 | 32.2 | 13.9 | – | – | – |
| ExCL[9] | 63.0 | 43.6 | 23.6 | – | – | – |
| 2D-TAN[4] | 59.5 | 44.5 | 26.5 | 85.5 | 77.1 | 62.0 |
| SFEN (ours) | 59.9 | 45.6 | 28.7 | 85.5 | 77.3 | 63.0 |

    Table 3  Ablation results of SFEN (Rank@1 recall, %); "×" marks a removed component

| CIoU | RCA-W | $\mathcal{R}$ | TACoS IoU=0.1 | TACoS IoU=0.3 | TACoS IoU=0.5 | TACoS IoU=0.7 | ActivityNet IoU=0.1 | ActivityNet IoU=0.3 | ActivityNet IoU=0.5 | ActivityNet IoU=0.7 |
|---|---|---|---|---|---|---|---|---|---|---|
|   |   |   | 56.2 | 47.3 | 36.1 | 20.3 | 77.1 | 59.9 | 45.6 | 28.7 |
| × |   |   | 55.2 | 44.1 | 33.6 | 20.0 | 76.8 | 56.5 | 41.0 | 27.1 |
|   | × |   | 53.9 | 44.0 | 33.2 | 18.1 | 76.4 | 58.4 | 44.2 | 28.4 |
|   |   | × | 55.9 | 45.7 | 34.1 | 17.7 | 77.4 | 58.7 | 44.1 | 26.3 |

    Table 4  Time complexity and computational cost of SFEN

| Module | Time complexity | FLOPs (G) |
|---|---|---|
| Temporal adjacent network | $O(K^2 \times C_{\text{in}} \times C_{\text{out}} \times N)$ | 4.27 |
| Lightweight residual channel attention | $O(C_{\text{in}} \times N)$ | 0.02 |
| Feature fusion and video moment localization | $O(Z \times K^2 \times C_{\text{in}} \times C_{\text{out}} \times N)$ | 270.66 |
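Table 4 pairs each module's asymptotic complexity with a concrete FLOP count. As a rough illustration of how a term such as $O(K^2 \times C_{\text{in}} \times C_{\text{out}} \times N)$ maps to GFLOPs, the sketch below counts multiply-accumulate operations for a $K \times K$ convolution applied at N positions; the kernel size, channel widths, and position count in the example are hypothetical, not the paper's configuration.

```python
def conv_flops(k: int, c_in: int, c_out: int, n_positions: int) -> float:
    """Approximate multiply-add count of a k x k convolution applied at n_positions positions."""
    return k * k * c_in * c_out * n_positions

# Hypothetical configuration, only to show the order of magnitude.
flops = conv_flops(k=3, c_in=512, c_out=512, n_positions=32 * 32)
print(f"{flops / 1e9:.2f} GFLOPs")  # about 2.42 GFLOPs for this single hypothetical layer
```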

    Table 5  Comparison of different attention modules within SFEN on the TACoS dataset

| Method | Rank@1, IoU=0.1 | Rank@1, IoU=0.3 | Rank@1, IoU=0.5 | Rank@5, IoU=0.1 | Rank@5, IoU=0.3 | Rank@5, IoU=0.5 | Inference time (s) | Model size (M) | FLOPs (G) |
|---|---|---|---|---|---|---|---|---|---|
| baseline[4] | 47.6 | 36.1 | 25.3 | 70.3 | 57.8 | 45.0 | 0.151 | 731.0 | F=274.93 |
| Non-local[21] | 52.5 | 41.1 | 28.9 | 75.2 | 61.5 | 48.7 | 0.178 | 735.3 | F+8.54 |
| SE[17] | 49.2 | 37.5 | 25.4 | 76.2 | 63.3 | 50.1 | 0.550 | 737.4 | F+4.27 |
| RCA[19] | 53.2 | 43.3 | 32.6 | 77.6 | 63.7 | 52.6 | 0.565 | 734.3 | F+8.55 |
| ECA[20] | 53.5 | 42.4 | 30.4 | 74.9 | 63.6 | 50.3 | 0.452 | 731.1 | F+0.05 |
| RCA-W (ours) | 56.2 | 47.3 | 36.1 | 79.8 | 69.5 | 58.1 | 0.153 | 731.1 | F+0.02 |
  • [1] GAO Jiyang, SUN Chen, YANG Zhenheng, et al. TALL: Temporal activity localization via language query[C]. 2017 IEEE International Conference on Computer Vision, Venice, Italy, 2017: 5277–5285.
    [2] GE Runzhou, GAO Jiyang, CHEN Kan, et al. MAC: Mining activity concepts for language-based temporal localization[C]. 2019 IEEE Winter Conference on Applications of Computer Vision, Waikoloa, USA, 2019: 245–253.
    [3] LIU Meng, WANG Xiang, NIE Liqiang, et al. Cross-modal moment localization in videos[C]. The 26th ACM International Conference on Multimedia, Seoul, Korea, 2018: 843–851.
    [4] ZHANG Songyang, PENG Houwen, FU Jianlong, et al. Learning 2D temporal adjacent networks for moment localization with natural language[C]. The 34th AAAI Conference on Artificial Intelligence, New York, USA, 2020: 12870–12877.
    [5] LIU Meng, WANG Xiang, NIE Liqiang, et al. Attentive moment retrieval in videos[C]. The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, USA, 2018: 15–24.
    [6] NING Ke, XIE Lingxi, LIU Jianzhuang, et al. Interaction-integrated network for natural language moment localization[J]. IEEE Transactions on Image Processing, 2021, 30: 2538–2548. doi: 10.1109/TIP.2021.3052086
    [7] HAHN M, KADAV A, REHG J M, et al. Tripping through time: Efficient localization of activities in videos[C]. The 31st British Machine Vision Conference, Manchester, UK, 2020.
    [8] YUAN Yitian, MEI Tao, and ZHU Wenwu. To find where you talk: Temporal sentence localization in video with attention based location regression[C]. The 33rd AAAI Conference on Artificial Intelligence, Honolulu, USA, 2019: 9159–9166.
    [9] GHOSH S, AGARWAL A, PAREKH Z, et al. ExCL: Extractive clip localization using natural language descriptions[C]. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, USA, 2019: 1984–1990.
    [10] SUN Xiaoyang, WANG Hanli, and HE Bin. MABAN: Multi-agent boundary-aware network for natural language moment retrieval[J]. IEEE Transactions on Image Processing, 2021, 30: 5589–5599. doi: 10.1109/TIP.2021.3086591
    [11] RODRIGUEZ-OPAZO C, MARRESE-TAYLOR E, FERNANDO B, et al. DORi: Discovering object relationships for moment localization of a natural language query in a video[C]. 2021 IEEE Winter Conference on Applications of Computer Vision, Waikoloa, USA, 2021.
    [12] XU Huijuan, HE Kun, PLUMMER B A, et al. Multilevel language and vision integration for text-to-clip retrieval[C]. The 33rd AAAI Conference on Artificial Intelligence, Honolulu, USA, 2019: 9062–9069.
    [13] TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]. 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 2015: 4489–4497.
    [14] HOCHREITER S and SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735–1780. doi: 10.1162/neco.1997.9.8.1735
    [15] CHO K, VAN MERRIËNBOER B, GULCEHRE C, et al. Learning phrase representations using RNN encoder–decoder for statistical machine translation[C]. 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 2014: 1724–1734.
    [16] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. The 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
    [17] HU Jie, SHEN Li, and SUN Gang. Squeeze-and-excitation networks[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7132–7141.
    [18] WOO S, PARK J, LEE J, et al. CBAM: Convolutional block attention module[C]. The 15th European Conference on Computer Vision, Munich, Germany, 2018: 3–19.
    [19] ZHANG Yulun, LI Kunpeng, LI Kai, et al. Image super-resolution using very deep residual channel attention networks[C]. The 15th European Conference on Computer Vision, Munich, Germany, 2018: 294–310.
    [20] WANG Qilong, WU Banggu, ZHU Pengfei, et al. ECA-Net: Efficient channel attention for deep convolutional neural networks[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020.
    [21] WANG Xiaolong, GIRSHICK R, GUPTA A, et al. Non-local neural networks[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7794–7803.
    [22] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770–778.
    [23] ZHENG Zhaohui, WANG Ping, LIU Wei, et al. Distance-IoU Loss: Faster and better learning for bounding box regression[C]. The 34th AAAI Conference on Artificial Intelligence, New York, USA, 2020: 12993–13000.
    [24] ZHENG Zhaohui, WANG Ping, REN Dongwei, et al. Enhancing geometric factors in model learning and inference for object detection and instance segmentation[J]. IEEE Transactions on Cybernetics, 2022, 52(8): 8574–8586.
    [25] REZATOFIGHI H, TSOI N, GWAK J, et al. Generalized intersection over union: A metric and a loss for bounding box regression[C]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 658–666.
    [26] REGNERI M, ROHRBACH M, WETZEL D, et al. Grounding action descriptions in videos[J]. Transactions of the Association for Computational Linguistics, 2013, 1: 25–36. doi: 10.1162/tacl_a_00207
    [27] KRISHNA R, HATA K, REN F, et al. Dense-captioning events in videos[C]. 2017 IEEE International Conference on Computer Vision, Venice, Italy, 2017: 706–715.
    [28] ROHRBACH M, REGNERI M, ANDRILUKA M, et al. Script data for attribute-based recognition of composite activities[C]. The 12th European Conference on Computer Vision, Florence, Italy, 2012: 144–157.
    [29] ZHANG Da, DAI Xiyang, WANG Xin, et al. MAN: Moment alignment network for natural language moment retrieval via iterative graph adjustment[C]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019.
    [30] PENNINGTON J, SOCHER R, and MANNING C. GloVe: Global vectors for word representation[C]. 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 2014: 1532–1543.
    [31] HU Yupeng, LIU Meng, SUN Xiaobin, et al. Video moment localization via deep cross-modal hashing[J]. IEEE Transactions on Image Processing, 2021, 30: 4667–4677. doi: 10.1109/TIP.2021.3073867
Publication history
  • Received: 2021-10-09
  • Revised: 2022-03-26
  • Available online: 2022-04-02
  • Issue published: 2022-12-16
