Research on 3D Convolutional Neural Network and Its Application to Video Understanding

BAI Jing; YANG Zhanyuan; PENG Bin; LI Wenjing

doi:10.11999/JEIT220596

Volume 45 Issue 6

Jun. 2023

BAI Jing, YANG Zhanyuan, PENG Bin, LI Wenjing. Research on 3D Convolutional Neural Network and Its Application to Video Understanding[J]. Journal of Electronics & Information Technology, 2023, 45(6): 2273-2283. doi: 10.11999/JEIT220596

BAI Jing^{1, 2}, YANG Zhanyuan^{1}, PENG Bin^{1}, LI Wenjing^{1}

1. School of Computer Science and Engineering, North Minzu University, Yinchuan 750021, China
2. National Ethnic Affairs Commission Image Graphics Intelligent Processing Laboratory, Yinchuan 750021, China

Funds: The National Natural Science Foundation of China (62162001, 61762003), the Natural Science Foundation of Ningxia Province of China (2022AAC02041), the CAS "Light of West China" Program, the Ningxia Excellent Talent Program, the North Minzu University Innovation Project (YCX22194)

Received Date: 2022-05-11
Revised Date: 2022-11-18

Available Online: 2022-11-21

Publish Date: 2023-06-10

Abstract

The 3D Convolutional Neural Network (3D CNN) has been an active topic in deep learning research in recent years and has achieved strong results in computer vision. Despite years of research and abundant results, a comprehensive and detailed review of the topic is still lacking. This paper reviews the 3D CNN from the following aspects. First, the rationale and model structure of the 3D CNN are introduced. Then, improvements to the 3D CNN are summarized from three perspectives: network structure, internal network components, and optimization methods. After that, applications of the 3D CNN to video understanding are described. Finally, the paper is summarized and future development directions are discussed. By systematically reviewing the latest research progress on 3D CNNs and their applications in video understanding, this paper aims to support further research on and development of 3D CNNs.
Keywords:
- Video understanding
- Deep learning
- 3D Convolutional Neural Network (3D CNN)
- Network structure
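
For readers unfamiliar with the operation the abstract refers to, the following is a minimal sketch of a single-channel 3D convolution over a video clip. It is an illustration only and is not taken from the paper: the function name `conv3d_valid`, the toy clip, and the temporal-difference kernel are all assumptions chosen for clarity, and the code uses plain Python lists rather than any deep learning library. The point it shows is that the kernel slides along the time axis as well as the two spatial axes, so each output value mixes information from several consecutive frames.

```python
def conv3d_valid(clip, kernel):
    """Naive 3D convolution (cross-correlation form), stride 1, no padding.

    clip:   nested list indexed as clip[t][y][x]        (frames, height, width)
    kernel: nested list indexed as kernel[dt][dy][dx]
    Returns a nested list of shape (T-kt+1, H-kh+1, W-kw+1).
    """
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):            # slide along time
        plane = []
        for y in range(H - kh + 1):        # slide along height
            row = []
            for x in range(W - kw + 1):    # slide along width
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Hypothetical toy clip: 4 frames of 5x5 pixels, where every pixel in
# frame t has brightness t (the clip gets uniformly brighter over time).
clip = [[[float(t)] * 5 for _ in range(5)] for t in range(4)]

# Hypothetical 2x2x2 kernel: -1 on the earlier frame, +1 on the later frame,
# so it responds to change between consecutive frames.
kernel = [
    [[-1.0, -1.0], [-1.0, -1.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]

features = conv3d_valid(clip, kernel)
print(len(features), len(features[0]), len(features[0][0]))  # 3 4 4
print(features[0][0][0])  # 4.0
```

A 2D convolution applied frame by frame could not produce this response, because it never sees two frames at once. Practical 3D CNNs add multiple channels, padding, strides, and learned kernels on top of this basic operation.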

