Volume 44, Issue 12, Dec. 2022
YANG Jinfu, LIU Yubin, SONG Lin, YAN Xue. Cross-modal Video Moment Retrieval Based on Enhancing Significant Features[J]. Journal of Electronics & Information Technology, 2022, 44(12): 4395-4404. doi: 10.11999/JEIT211101

Cross-modal Video Moment Retrieval Based on Enhancing Significant Features

doi: 10.11999/JEIT211101
Funds:  The National Natural Science Foundation of China (61973009)
  • Received Date: 2021-10-09
  • Rev Recd Date: 2022-03-26
  • Available Online: 2022-04-02
  • Publish Date: 2022-12-16
  • With the continuous development of video acquisition devices and technology, the number of videos has grown rapidly, and accurately locating target moments in massive video collections is a challenging retrieval task. Cross-modal video moment retrieval aims to find, in a video from the database, the moment that matches a natural-language query. Existing works focus mostly on matching the text against individual candidate moments while ignoring the content of adjacent moments, so the relations among features are insufficiently expressed. This paper proposes a novel moment retrieval network that highlights significant features through residual channel attention and, at the same time, designs a temporal adjacent network to capture the context of adjacent moments. Experimental results show that the proposed method outperforms mainstream methods based on candidate matching and on video–text feature relations.
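The residual channel attention mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's trained model: the two projection matrices stand in for the block's learned fully connected layers and are random placeholders here, so the sketch only shows the data flow — squeeze (pool over time), excitation (bottleneck plus sigmoid gate), and residual re-weighting of channels.

```python
import numpy as np

def residual_channel_attention(x, reduction=4, rng=None):
    """Re-weight feature channels by learned importance, then add the
    input back (residual connection).

    x: array of shape (channels, time) -- one video's moment features.
    The projection matrices w1/w2 are random placeholders for the
    block's learned fully connected layers.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    c, _ = x.shape
    w1 = rng.standard_normal((c // reduction, c)) / np.sqrt(c)
    w2 = rng.standard_normal((c, c // reduction)) / np.sqrt(c // reduction)

    squeeze = x.mean(axis=1)                       # global average pool over time -> (c,)
    hidden = np.maximum(w1 @ squeeze, 0.0)         # bottleneck + ReLU
    scale = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid gate, per-channel weight in (0, 1)
    return x + scale[:, None] * x                  # residual: emphasized channels added back

features = np.ones((8, 5))                         # toy features: 8 channels, 5 time steps
out = residual_channel_attention(features)
```

Because the gate lies in (0, 1) and the input is added back, salient channels are amplified while no channel is suppressed below its original response, which is the "enhancing significant features" behavior the title refers to.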
  • [1]
    GAO Jiyang, SUN Chen, YANG Zhenheng, et al. TALL: Temporal activity localization via language query[C]. 2017 IEEE International Conference on Computer Vision, Venice, Italy, 2017: 5277–5285.
    [2]
    GE Runzhou, GAO Jiyang, CHEN Kan, et al. MAC: Mining activity concepts for language-based temporal localization[C]. 2019 IEEE Winter Conference on Applications of Computer Vision, Waikoloa, USA, 2019: 245–253.
    [3]
    LIU Meng, WANG Xiang, NIE Liqiang, et al. Cross-modal moment localization in videos[C]. The 26th ACM International Conference on Multimedia, Seoul, Korea, 2018: 843–851.
    [4]
    ZHANG Songyang, PENG Houwen, FU Jianlong, et al. Learning 2D temporal adjacent networks for moment localization with natural language[C]. The 34th AAAI Conference on Artificial Intelligence, New York, USA, 2020: 12870–12877.
    [5]
    LIU Meng, WANG Xiang, NIE Liqiang, et al. Attentive moment retrieval in videos[C]. The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, USA, 2018: 15–24.
    [6]
    NING Ke, XIE Lingxi, LIU Jianzhuang, et al. Interaction-integrated network for natural language moment localization[J]. IEEE Transactions on Image Processing, 2021, 30: 2538–2548. doi: 10.1109/TIP.2021.3052086
    [7]
    HAHN M, KADAV A, REHG J M, et al. Tripping through time: Efficient localization of activities in videos[C]. The 31st British Machine Vision Conference, Manchester, UK, 2020.
    [8]
    YUAN Yitian, MEI Tao, and ZHU Wenwu. To find where you talk: Temporal sentence localization in video with attention based location regression[C]. The 33rd AAAI Conference on Artificial Intelligence, Honolulu, USA, 2019: 9159–9166.
    [9]
    GHOSH S, AGARWAL A, PAREKH Z, et al. ExCL: Extractive clip localization using natural language descriptions[C]. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, USA, 2019: 1984–1990.
    [10]
    SUN Xiaoyang, WANG Hanli, and HE Bin. MABAN: Multi-agent boundary-aware network for natural language moment retrieval[J]. IEEE Transactions on Image Processing, 2021, 30: 5589–5599. doi: 10.1109/TIP.2021.3086591
    [11]
    RODRIGUEZ-OPAZO C, MARRESE-TAYLOR E, FERNANDO B, et al. DORi: Discovering object relationships for moment localization of a natural language query in a video[C]. 2021 IEEE Winter Conference on Applications of Computer Vision, Waikoloa, USA, 2021.
    [12]
    XU Huijuan, HE Kun, PLUMMER B A, et al. Multilevel language and vision integration for text-to-clip retrieval[C]. The 33rd AAAI Conference on Artificial Intelligence, Honolulu, USA, 2019: 9062–9069.
    [13]
    TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]. 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 2015: 4489–4497.
    [14]
    HOCHREITER S and SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735–1780. doi: 10.1162/neco.1997.9.8.1735
    [15]
    CHO K, VAN MERRIËNBOER B, GULCEHRE C, et al. Learning phrase representations using RNN encoder–decoder for statistical machine translation[C]. 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 2014: 1724–1734.
    [16]
    VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. The 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
    [17]
    HU Jie, SHEN Li, and SUN Gang. Squeeze-and-excitation networks[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7132–7141.
    [18]
    WOO S, PARK J, LEE J, et al. CBAM: Convolutional block attention module[C]. The 15th European Conference on Computer Vision, Munich, Germany, 2018: 3–19.
    [19]
    ZHANG Yulun, LI Kunpeng, LI Kai, et al. Image super-resolution using very deep residual channel attention networks[C]. The 15th European Conference on Computer Vision, Munich, Germany, 2018: 294–310.
    [20]
    WANG Qilong, WU Banggu, ZHU Pengfei, et al. ECA-Net: Efficient channel attention for deep convolutional neural networks[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020.
    [21]
    WANG Xiaolong, GIRSHICK R, GUPTA A, et al. Non-local neural networks[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7794–7803.
    [22]
    HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770–778.
    [23]
    ZHENG Zhaohui, WANG Ping, LIU Wei, et al. Distance-IoU Loss: Faster and better learning for bounding box regression[C]. The 34th AAAI Conference on Artificial Intelligence, New York, USA, 2020: 12993–13000.
    [24]
    ZHENG Zhaohui, WANG Ping, REN Dongwei, et al. Enhancing geometric factors in model learning and inference for object detection and instance segmentation[J]. IEEE Transactions on Cybernetics, 2022, 52(8): 8574–8586.
    [25]
    REZATOFIGHI H, TSOI N, GWAK J, et al. Generalized intersection over union: A metric and a loss for bounding box regression[C]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 658–666.
    [26]
    REGNERI M, ROHRBACH M, WETZEL D, et al. Grounding action descriptions in videos[J]. Transactions of the Association for Computational Linguistics, 2013, 1: 25–36. doi: 10.1162/tacl_a_00207
    [27]
    KRISHNA R, HATA K, REN F, et al. Dense-captioning events in videos[C]. 2017 IEEE International Conference on Computer Vision, Venice, Italy, 2017: 706–715.
    [28]
    ROHRBACH M, REGNERI M, ANDRILUKA M, et al. Script data for attribute-based recognition of composite activities[C]. The 12th European Conference on Computer Vision, Florence, Italy, 2012: 144–157.
    [29]
    ZHANG Da, DAI Xiyang, WANG Xin, et al. MAN: Moment alignment network for natural language moment retrieval via iterative graph adjustment[C]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019.
    [30]
    PENNINGTON J, SOCHER R, and MANNING C. GloVe: Global vectors for word representation[C]. 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 2014: 1532–1543.
    [31]
    HU Yupeng, LIU Meng, SUN Xiaobin, et al. Video moment localization via deep cross-modal hashing[J]. IEEE Transactions on Image Processing, 2021, 30: 4667–4677. doi: 10.1109/TIP.2021.3073867
  • 加载中

    Figures(8)  / Tables(5)

    Article Metrics

    Article views (808)  PDF downloads (110)