Research on 3D Convolutional Neural Network and Its Application to Video Understanding

BAI Jing (白静)^{1, 2},
YANG Zhanyuan (杨瞻源)^1,
PENG Bin (彭斌)^1,
LI Wenjing (李文静)^1

1. School of Computer Science and Engineering, North Minzu University, Yinchuan 750021, China
2. National Ethnic Affairs Commission Image Graphics Intelligent Processing Laboratory, Yinchuan 750021, China

doi: 10.11999/JEIT220596 cstr: 32379.14.JEIT220596

Funds: The National Natural Science Foundation of China (62162001, 61762003), The Natural Science Foundation of Ningxia Province of China (2022AAC02041), The CAS “Light of West China” Program, The Ningxia Excellent Talent Program, North Minzu University Innovation Project (YCX22194)

Details

Author biographies:
BAI Jing: female, professor, master's supervisor; research interests include machine learning, deep representation learning, and computer vision applications

YANG Zhanyuan: male, master's student; research interests include image processing and computer vision, and deep representation learning

PENG Bin: male, master's student; research interests include image processing and computer vision, and deep representation learning

LI Wenjing: female, master's student; research interests include image processing and computer vision, and deep representation learning

Corresponding author:
YANG Zhanyuan, 1273907064@qq.com

CLC number: TP399
Publication history
- Received: 2022-05-11
- Revised: 2022-11-18
- Published online: 2022-11-21
- Issue date: 2023-06-10

Abstract

Abstract: The 3D Convolutional Neural Network (3D CNN) has been a hot topic in deep learning research in recent years and has achieved notable results in computer vision. Despite years of research and abundant results, a comprehensive and detailed review of the topic is still lacking. This paper therefore reviews 3D CNNs from the following aspects. First, the basic principles and model structure of 3D CNNs are described. Then, improvements to 3D CNNs are summarized from three perspectives: network structure, network internals, and optimization methods. Next, applications of 3D CNNs in the field of video understanding are summarized. Finally, the paper is summarized and future development directions are discussed. The paper provides a systematic review of the latest research progress on 3D CNNs and their applications in video understanding, which is of positive significance to the research and development of 3D CNNs.

Key words:
- Video understanding
- Deep learning
- 3D Convolutional Neural Network (3D CNN)
- Network structure

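Figure 1 Approaches to improving 3D CNN models

Figure 2 Improvements along the network depth direction

Figure 3 Improvements along the network width direction

Figure 4 Improvements to the convolutional layer

Figure 5 Application of 3D CNNs to action recognition tasks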

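Table 1 Commonly used action recognition datasets

Dataset	Classes	Videos	Training set	Test set	Action types
UCF-101^[35]	101	13320	9324	3996	Human-object interaction, body motion, human-human interaction, playing musical instruments, sports
HMDB-51^[36]	51	6766	4736	2030	General/complex facial actions, general/complex body movements, multi-person interaction
Kinetics400^[37]	400	254380	234619	19761	Human-object interaction and human-human interaction
Sports-1M^[38]	487	1133158	793211	339947	Sports videos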

Table 2 Performance comparison of different 3D CNNs on different datasets for the action recognition task (data in the table are taken from the respective papers)

Improvement angle	Year	Network	UCF-101 acc. (%)	HMDB-51 acc. (%)	Kinetics400 acc. (%)	Sports-1M acc. (%)	Params (M)	Speed (VPS)	GFLOPs
Basic structure	2015	C3D	82.3	40.4	–	85.2	33.4	–	–
Residual connection	2017	Res3D	85.8	54.9	–	87.8	33.2	0.9	–
Residual connection	2021	R-M3D	93.2	65.4	–	–	–	–	–
Convolution kernel splitting	2017	P3D	88.6	–	–	87.4	–	<2.0	–
Convolution kernel splitting	2018	R(2+1)D	97.3	78.7	75.4	91.9	33.3	–	–
Convolution kernel splitting	2018	S3D-G	96.8	75.9	76.2	–	11.6	–	71.4
3D inflated convolution	2017	I3D	93.4	66.4	72.6	–	25.0	–	107.9
2D+3D	2018	ARTNet	93.5	67.6	72.4	–	33.4	2.9	20.0
2D+3D	2018	ECO	94.8	72.4	–	–	–	28.2	–
Multi-branch	2022	3D Dual-Stream-SRU	95.3	76.5	–	–	–	–	–
Knowledge distillation	2020	D3D	97.6	80.5	75.9	–	–	–	–
Attention module	2021	EAM+ResNet50	89.8	65.4	–	–	46.3	–	10.1
Attention module	2021	DA+ResNext101	95.8	74.3	–	–	–	–	–

References (48)

[1]	JI Shuiwang, XU Wei, YANG Ming, et al. 3D convolutional neural networks for human action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(1): 221–231. doi: 10.1109/TPAMI.2012.59
[2]	TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]. The IEEE International Conference on Computer Vision, Santiago, Chile, 2015: 4489–4497.
[3]	WANG Pan, QIANG Yan, YANG Xiaotang, et al. Network model for lung nodule segmentation based on double attention 3D-UNet[J]. Computer Engineering, 2021, 47(2): 307–313. doi: 10.19678/j.issn.1000-3428.0057019 (in Chinese)
[4]	YAN Mingjing and SU Xiyou. Hyperspectral image classification based on three-dimensional dilated convolutional residual neural network[J]. Acta Optica Sinica, 2020, 40(16): 1628002. doi: 10.3788/AOS202040.1628002 (in Chinese)
[5]	ALZUBAIDI L, ZHANG Jinglan, HUMAIDI A J, et al. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions[J]. Journal of Big Data, 2021, 8(1): 53. doi: 10.1186/s40537-021-00444-8
[6]	KATTENBORN T, LEITLOFF J, SCHIEFER F, et al. Review on Convolutional Neural Networks (CNN) in vegetation remote sensing[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2021, 173: 24–49. doi: 10.1016/j.isprsjprs.2020.12.010
[7]	HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. The IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770–778.
[8]	WU Peida, CUI Ziguan, GAN Zongliang, et al. Three-dimensional ResNeXt network using feature fusion and label smoothing for hyperspectral image classification[J]. Sensors, 2020, 20(6): 1652. doi: 10.3390/s20061652
[9]	HUANG Gao, LIU Zhuang, VAN DER MAATEN L, et al. Densely connected convolutional networks[C]. IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 2261–2269.
[10]	FENG Yu, YI Benshun, WU Chenyue, et al. Pulmonary nodule recognition based on three-dimensional convolution neural network[J]. Acta Optica Sinica, 2019, 39(6): 0615006. doi: 10.3788/AOS201939.0615006 (in Chinese)
[11]	DUAN Yanting, ZHENG Xiaodong, HU Lianlian, et al. Fault detection based on 3D semi-dense convolutional neural network[J]. Progress in Geophysics, 2019, 34(6): 2256–2261. doi: 10.6038/pg2019CC0367 (in Chinese)
[12]	FENG Yan, ZHANG Tiantian, and WANG Chuanxu. Group activity recognition method based on pseudo 3D residual network and interaction modeling[J]. Acta Electronica Sinica, 2020, 48(7): 1269–1275. doi: 10.3969/j.issn.0372-2112.2020.07.004 (in Chinese)
[13]	ZOLFAGHARI M, SINGH K, and BROX T. ECO: Efficient convolutional network for online video understanding[C]. The 15th European Conference on Computer Vision (ECCV), Munich, Germany, 2018: 713–730.
[14]	LU Changlei, LIU Bin, ZHOU Wenbo, et al. Deepfake video detection using 3D-attentional inception convolutional neural network[C]. 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, USA, 2021: 3572–3576.
[15]	HU Zhengping, DIAO Pengcheng, ZHANG Ruixue, et al. Research on 3D multi-branch aggregated lightweight network video action recognition algorithm[J]. Acta Electronica Sinica, 2020, 48(7): 1261–1268. doi: 10.3969/j.issn.0372-2112.2020.07.003 (in Chinese)
[16]	MOLCHANOV P, YANG Xiaodong, GUPTA S, et al. Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural networks[C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA, 2016: 4207–4215.
[17]	LIU Liangxin, LIN Mianfen, ZHONG Liangquan, et al. Two-stream inflated 3D CNN for abnormal behavior detection[J]. Computer Systems & Applications, 2021, 30(5): 120–127. doi: 10.15888/j.cnki.csa.007912 (in Chinese)
[18]	HAN Yanling, WEI Cong, ZHOU Ruyan, et al. Combining 3D-CNN and squeeze-and-excitation networks for remote sensing sea ice image classification[J]. Mathematical Problems in Engineering, 2020, 2020: 8065396. doi: 10.1155/2020/8065396
[19]	WANG Fei, HU Ronglin, and JIN Ying. Human action recognition based on 3D-CBAM attention mechanism[J]. Journal of Nanjing Normal University: Engineering and Technology Edition, 2021, 21(1): 49–56. doi: 10.3969/j.issn.1672-1292.2021.01.008 (in Chinese)
[20]	XU Xuanang, ZHOU Fugen, and LIU Bo. Automatic bladder segmentation from CT images using deep CNN and 3D fully connected CRF-RNN[J]. International Journal of Computer Assisted Radiology and Surgery, 2018, 13(7): 967–975. doi: 10.1007/s11548-018-1733-7
[21]	XIE Saining, SUN Chen, HUANG J, et al. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification[C]. The 15th European Conference on Computer Vision (ECCV), Munich, Germany, 2018: 318–335.
[22]	WANG Limin, LI Wei, LI Wen, et al. Appearance-and-relation networks for video classification[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 1430–1439.
[23]	LI Jiakun, WANG Tian, ZHOU Yi, et al. Using Gabor filter in 3D convolutional neural networks for human action recognition[C]. 2017 36th Chinese Control Conference (CCC), Dalian, China, 2017: 11139–11144.
[24]	QIU Zhaofan, YAO Ting, and MEI Tao. Learning spatio-temporal representation with pseudo-3D residual networks[C]. The IEEE International Conference on Computer Vision, Venice, Italy, 2017: 5533–5541.
[25]	CARREIRA J and ZISSERMAN A. Quo Vadis, action recognition? A new model and the Kinetics dataset[C]. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, USA, 2017: 4724–4733.
[26]	YING Xinyi, WANG Longguang, WANG Yingqian, et al. Deformable 3D convolution for video super-resolution[J]. IEEE Signal Processing Letters, 2020, 27: 1500–1504. doi: 10.1109/LSP.2020.3013518
[27]	RUAN Hongyang, CHEN Zhilan, CHENG Yingsheng, et al. Detection of pulmonary nodules based on C-3D deformable convolutional neural network model[J]. Laser & Optoelectronics Progress, 2020, 57(4): 041013. doi: 10.3788/LOP57.041013 (in Chinese)
[28]	ZHAO Xin, SHI Delai, and WANG Hongkai. Segmentation of white matter lesions based on 3D full convolutional deep neural network[J]. Computer and Modernization, 2020(10): 44–50. doi: 10.3969/j.issn.1006-2475.2020.10.009 (in Chinese)
[29]	LU Xiaoling, WU Haifeng, ZENG Yu, et al. 3D transfer learning network for classification of Alzheimer's disease[J]. Computer Engineering and Applications, 2021, 57(16): 253–262. doi: 10.3778/j.issn.1002-8331.2005-0141 (in Chinese)
[30]	XIAO Zhiyun, JIANG Jiaxu, and NI Chen. Spectral-spatial classification of hyperspectral image based on self-adaptive deep residual 3D convolutional neural network[J]. Journal of Computer-Aided Design & Computer Graphics, 2019, 31(11): 2017–2029. doi: 10.3724/SP.J.1089.2019.17552 (in Chinese)
[31]	STROUD J C, ROSS D A, SUN Chen, et al. D3D: Distilled 3D networks for video action recognition[C]. 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass, USA, 2020: 614–623.
[32]	SINGH D, KUMAR V, KAUR M, et al. Screening of COVID-19 suspected subjects using multi-crossover genetic algorithm based dense convolutional neural network[J]. IEEE Access, 2021, 9: 142566–142580. doi: 10.1109/ACCESS.2021.3120717
[33]	ZHANG Yuxin, WANG Huan, LUO Yang, et al. Three-dimensional convolutional neural network pruning with regularization-based method[C]. 2019 IEEE International Conference on Image Processing (ICIP), Taipei, China, 2019: 4270–4274.
[34]	SHI Jixi, CHEN Zhihao, and COUTURIER R. Classification of pathological cases of myocardial infarction using convolutional neural network and random forest[C]. 11th International Workshop on Statistical Atlases and Computational Models of the Heart, Lima, Peru, 2021: 406–413.
[35]	SOOMRO K, ZAMIR A R, and SHAH M. UCF101: A dataset of 101 human actions classes from videos in the wild[EB/OL]. https://arxiv.org/abs/1212.0402, 2012.
[36]	KUEHNE H, JHUANG H, GARROTE E, et al. HMDB: A large video database for human motion recognition[C]. 2011 International Conference on Computer Vision, Barcelona, Spain, 2011: 2556–2563.
[37]	KAY W, CARREIRA J, SIMONYAN K, et al. The Kinetics human action video dataset[EB/OL]. https://arxiv.org/abs/1705.06950, 2017.
[38]	KARPATHY A, TODERICI G, SHETTY S, et al. Large-scale video classification with convolutional neural networks[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, USA, 2014: 1725–1732.
[39]	TRAN D, RAY J, SHOU Zheng, et al. ConvNet architecture search for spatiotemporal feature learning[EB/OL]. https://arxiv.org/abs/1708.05038, 2017.
[40]	ZONG Ming, WANG Ruili, CHEN Zhe, et al. Multi-cue based 3D residual network for action recognition[J]. Neural Computing and Applications, 2021, 33(10): 5167–5181. doi: 10.1007/s00521-020-05313-8
[41]	TRAN D, WANG Heng, TORRESANI L, et al. A closer look at spatiotemporal convolutions for action recognition[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 6450–6459.
[42]	ZHAI Jiecheng, YAO Xunxiang, DONG Guangyuan, et al. 3D dual-stream convolutional neural networks with simple recurrent unit network: A new framework for action recognition[C]. 2022 4th International Conference on Communications, Information System and Computer Engineering (CISCE), Shenzhen, China, 2022: 509–515.
[43]	JIANG Guanghao, JIANG Xiaoyan, FANG Zhijun, et al. An efficient attention module for 3D convolutional neural networks in action recognition[J]. Applied Intelligence, 2021, 51(10): 7043–7057. doi: 10.1007/s10489-021-02195-8
[44]	KIM D H, ANVAROV F, LEE J M, et al. Metric-based attention feature learning for video action recognition[J]. IEEE Access, 2021, 9: 39218–39228. doi: 10.1109/ACCESS.2021.3064934
[45]	WANG Xiaolong and GUPTA A. Unsupervised learning of visual representations using videos[C]. 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 2015: 2794–2802.
[46]	YANG Xiangli, SONG Zixing, KING I, et al. A survey on deep semi-supervised learning[EB/OL]. https://arxiv.org/abs/2103.00550, 2021.
[47]	WANG Yaqing, YAO Quanming, KWOK J T, et al. Generalizing from a few examples: A survey on few-shot learning[J]. ACM Computing Surveys, 2021, 53(3): 63. doi: 10.1145/3386252
[48]	HAN Zongyan, FU Zhenyong, CHEN Shuo, et al. Contrastive embedding for generalized zero-shot learning[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 2371–2381.