Research on 3D Convolutional Neural Network and Its Application to Video Understanding

BAI Jing, YANG Zhanyuan, PENG Bin, LI Wenjing

Citation: BAI Jing, YANG Zhanyuan, PENG Bin, LI Wenjing. Research on 3D Convolutional Neural Network and Its Application to Video Understanding[J]. Journal of Electronics & Information Technology, 2023, 45(6): 2273-2283. doi: 10.11999/JEIT220596

doi: 10.11999/JEIT220596
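Details
    Author biographies:

    BAI Jing: Female, professor, master's supervisor. Research interests: machine learning, deep representation learning, and computer vision applications

    YANG Zhanyuan: Male, master's student. Research interests: image processing and computer vision, deep representation learning

    PENG Bin: Male, master's student. Research interests: image processing and computer vision, deep representation learning

    LI Wenjing: Female, master's student. Research interests: image processing and computer vision, deep representation learning

    Corresponding author:

    YANG Zhanyuan 1273907064@qq.com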

  • CLC number: TP399

Funds: The National Natural Science Foundation of China (62162001, 61762003), The Natural Science Foundation of Ningxia Province of China (2022AAC02041), The CAS “Light of West China” Program, The Ningxia Excellent Talent Program, North Minzu University Innovation Project (YCX22194)
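  • Abstract: The three-dimensional convolutional neural network (3D CNN) has been a hot topic in deep learning research in recent years and has achieved many successes in computer vision. Although it has been studied for years and has produced abundant results, a comprehensive and detailed survey of the topic is still lacking. This paper therefore reviews it from the following aspects. It first describes the basic principles and model structure of 3D CNNs, then summarizes the improvements made to 3D CNNs in terms of network structure, network internals, and optimization methods, next summarizes the applications of 3D CNNs in video understanding, and finally concludes the paper and discusses future directions. The paper systematically reviews the latest research progress on 3D CNNs and their applications in video understanding, and is of some positive significance for the research and development of 3D CNNs.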
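As a rough illustration of the basic principle the survey starts from, the sketch below slides one kernel over a clip along time, height, and width at once, which is what distinguishes a 3D convolution from a per-frame 2D one. It is a minimal pure-Python sketch under stated assumptions, not the implementation of any network discussed in the paper: the function name `conv3d_valid`, the toy clip, and the averaging kernel are hypothetical, and only a single channel with stride 1 and no padding is handled.

```python
def conv3d_valid(video, kernel):
    """Naive single-channel 3D convolution (cross-correlation form), stride 1, no padding.

    video  : nested list indexed as [frame][row][col]
    kernel : nested list indexed as [frame][row][col]
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    t, h, w = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for i in range(T - t + 1):              # slide along time
        plane = []
        for j in range(H - h + 1):          # slide along height
            row = []
            for k in range(W - w + 1):      # slide along width
                acc = 0.0
                for a in range(t):
                    for b in range(h):
                        for c in range(w):
                            acc += video[i + a][j + b][k + c] * kernel[a][b][c]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Hypothetical toy input: 4 frames of 5x5 pixels, values increasing linearly.
clip = [[[float(f * 25 + y * 5 + x) for x in range(5)] for y in range(5)]
        for f in range(4)]
# Hypothetical 3x3x3 averaging kernel.
kernel = [[[1.0 / 27.0] * 3 for _ in range(3)] for _ in range(3)]

features = conv3d_valid(clip, kernel)
print(len(features), len(features[0]), len(features[0][0]))  # prints: 2 3 3
```

The output shrinks along all three axes (4x5x5 to 2x3x3), so each output value mixes information from three consecutive frames as well as a 3x3 spatial neighbourhood.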
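  • Fig. 1  Ideas for improving 3D CNN network models

    Fig. 2  Improvements along the network depth direction

    Fig. 3  Improvements along the network width direction

    Fig. 4  Improvements to the convolutional layer

    Fig. 5  Application of 3D CNN to the action recognition task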

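    Table 1  Commonly used action recognition datasets

    | Dataset | Classes | Videos | Training set | Test set | Action types |
    |---|---|---|---|---|---|
    | UCF-101[35] | 101 | 13320 | 9324 | 3996 | Human-object interaction, body motion, human-human interaction, playing musical instruments, sports |
    | HMDB-51[36] | 51 | 6766 | 4736 | 2030 | Common/complex facial actions, common/complex body movements, multi-person interaction |
    | Kinetics400[37] | 400 | 254380 | 234619 | 19761 | Human-object interaction and human-human interaction |
    | Sports-1M[38] | 487 | 1133158 | 793211 | 339947 | Sports videos |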

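    Table 2  Performance comparison of different 3D CNNs on different datasets in the action recognition task (data in the table are taken from the relevant papers)

    Columns: Improvement angle | Year | Network | Accuracy (%) on UCF-101, HMDB-51, Kinetics400, Sports-1M | Parameters (M) | Computation speed (VPS/GFLOPs)

    Basic structure
        2015 | C3D | 82.3 40.4 85.2 33.4/
    Residual connection
        2017 | Res3D | 85.8 54.9 87.8 33.2 0.9/
        2021 | R-M3D | 93.2 65.4/
    Kernel factorization
        2017 | P3D | 88.6 87.4 <2.0/
        2018 | R(2+1)D | 97.3 78.7 75.4 91.9 33.3/
        2018 | S3D-G | 96.8 75.9 76.2 11.6 /71.4
    3D inflated convolution
        2017 | I3D | 93.4 66.4 72.6 25.0 /107.9
    2D+3D
        2018 | ARTNet | 93.5 67.6 72.4 33.4 2.9/20.0
        2018 | ECO | 94.8 72.4 28.2/
    Multi-branch
        2022 | 3D Dual-Stream-SRU | 95.3 76.5/
    Knowledge distillation
        2020 | D3D | 97.6 80.5 75.9/
    Attention module
        2021 | EAM+ResNet50 | 89.8 65.4 46.3/10.1
        DA+ResNext101 | 95.8 74.3/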
  • [1] JI Shuiwang, XU Wei, YANG Ming, et al. 3D convolutional neural networks for human action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(1): 221–231. doi: 10.1109/TPAMI.2012.59
    [2] TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]. The IEEE International Conference on Computer Vision, Santiago, Chile, 2015: 4489–4497.
    [3] WANG Pan, QIANG Yan, YANG Xiaotang, et al. Network model for lung nodule segmentation based on double attention 3D-UNet[J]. Computer Engineering, 2021, 47(2): 307–313. doi: 10.19678/j.issn.1000-3428.0057019
    [4] YAN Mingjing and SU Xiyou. Hyperspectral image classification based on three-dimensional dilated convolutional residual neural network[J]. Acta Optica Sinica, 2020, 40(16): 1628002. doi: 10.3788/AOS202040.1628002
    [5] ALZUBAIDI L, ZHANG Jinglan, HUMAIDI A J, et al. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions[J]. Journal of Big Data, 2021, 8(1): 53. doi: 10.1186/s40537-021-00444-8
    [6] KATTENBORN T, LEITLOFF J, SCHIEFER F, et al. Review on Convolutional Neural Networks (CNN) in vegetation remote sensing[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2021, 173: 24–49. doi: 10.1016/j.isprsjprs.2020.12.010
    [7] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. The IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770–778.
    [8] WU Peida, CUI Ziguan, GAN Zongliang, et al. Three-dimensional resNeXt network using feature fusion and label smoothing for hyperspectral image classification[J]. Sensors, 2020, 20(6): 1652. doi: 10.3390/s20061652
    [9] HUANG Gao, LIU Zhuang, VAN DER MAATEN L, et al. Densely connected convolutional networks[C]. IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 2261–2269.
    [10] FENG Yu, YI Benshun, WU Chenyue, et al. Pulmonary nodule recognition based on three-dimensional convolution neural network[J]. Acta Optica Sinica, 2019, 39(6): 0615006. doi: 10.3788/AOS201939.0615006
    [11] DUAN Yanting, ZHENG Xiaodong, HU Lianlian, et al. Fault detection based on 3D semi-dense convolutional neural network[J]. Progress in Geophysics, 2019, 34(6): 2256–2261. doi: 10.6038/pg2019CC0367
    [12] FENG Yan, ZHANG Tiantian, and WANG Chuanxu. Group activity recognition method based on pseudo 3D residual network and interaction modeling[J]. Acta Electronica Sinica, 2020, 48(7): 1269–1275. doi: 10.3969/j.issn.0372-2112.2020.07.004
    [13] ZOLFAGHARI M, SINGH K, and BROX T. ECO: Efficient convolutional network for online video understanding[C]. The 15th European Conference on Computer Vision (ECCV), Munich, Germany, 2018: 713–730.
    [14] LU Changlei, LIU Bin, ZHOU Wenbo, et al. Deepfake video detection using 3D-attentional inception convolutional neural network[C]. 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, USA, 2021: 3572–3576.
    [15] HU Zhengping, DIAO Pengcheng, ZHANG Ruixue, et al. Research on 3D multi-branch aggregated lightweight network video action recognition algorithm[J]. Acta Electronica Sinica, 2020, 48(7): 1261–1268. doi: 10.3969/j.issn.0372-2112.2020.07.003
    [16] MOLCHANOV P, YANG Xiaodong, GUPTA S, et al. Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural networks[C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA, 2016: 4207–4215.
    [17] LIU Liangxin, LIN Mianfen, ZHONG Liangquan, et al. Two-stream inflated 3D CNN for abnormal behavior detection[J]. Computer Systems & Applications, 2021, 30(5): 120–127. doi: 10.15888/j.cnki.csa.007912
    [18] HAN Yanling, WEI Cong, ZHOU Ruyan, et al. Combining 3D-CNN and squeeze-and-excitation networks for remote sensing sea ice image classification[J]. Mathematical Problems in Engineering, 2020, 2020: 8065396. doi: 10.1155/2020/8065396
    [19] WANG Fei, HU Ronglin, and JIN Ying. Human action recognition based on 3D-CBAM attention mechanism[J]. Journal of Nanjing Normal University: Engineering and Technology Edition, 2021, 21(1): 49–56. doi: 10.3969/j.issn.1672-1292.2021.01.008
    [20] XU Xuanang, ZHOU Fugen, and LIU Bo. Automatic bladder segmentation from CT images using deep CNN and 3D fully connected CRF-RNN[J]. International Journal of Computer Assisted Radiology and Surgery, 2018, 13(7): 967–975. doi: 10.1007/s11548-018-1733-7
    [21] XIE Saining, SUN Chen, HUANG J, et al. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification[C]. The 15th European Conference on Computer Vision (ECCV), Munich, Germany, 2018: 318–335.
    [22] WANG Limin, LI Wei, LI Wen, et al. Appearance-and-relation networks for video classification[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 1430–1439.
    [23] LI Jiakun, WANG Tian, ZHOU Yi, et al. Using Gabor filter in 3D convolutional neural networks for human action recognition[C]. 2017 36th Chinese Control Conference (CCC), Dalian, China, 2017: 11139–11144.
    [24] QIU Zhaofan, YAO Ting, and MEI Tao. Learning spatio-temporal representation with pseudo-3D residual networks[C]. The IEEE International Conference on Computer Vision, Venice, Italy, 2017: 5533–5541.
    [25] CARREIRA J and ZISSERMAN A. Quo Vadis, action recognition? A new model and the kinetics dataset[C]. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, USA, 2017: 4724–4733.
    [26] YING Xinyi, WANG Longguang, WANG Yingqian, et al. Deformable 3D convolution for video super-resolution[J]. IEEE Signal Processing Letters, 2020, 27: 1500–1504. doi: 10.1109/LSP.2020.3013518
    [27] RUAN Hongyang, CHEN Zhilan, CHENG Yingsheng, et al. Detection of pulmonary nodules based on C-3D deformable convolutional neural network model[J]. Laser & Optoelectronics Progress, 2020, 57(4): 041013. doi: 10.3788/LOP57.041013
    [28] ZHAO Xin, SHI Delai, and WANG Hongkai. Segmentation of white matter lesions based on 3D full convolutional deep neural network[J]. Computer and Modernization, 2020(10): 44–50. doi: 10.3969/j.issn.1006-2475.2020.10.009
    [29] LU Xiaoling, WU Haifeng, ZENG Yu, et al. 3D transfer learning network for classification of Alzheimer's disease[J]. Computer Engineering and Applications, 2021, 57(16): 253–262. doi: 10.3778/j.issn.1002-8331.2005-0141
    [30] XIAO Zhiyun, JIANG Jiaxu, and NI Chen. Spectral-spatial classification of hyperspectral image based on self-adaptive deep residual 3D convolutional neural network[J]. Journal of Computer-Aided Design & Computer Graphics, 2019, 31(11): 2017–2029. doi: 10.3724/SP.J.1089.2019.17552
    [31] STROUD J C, ROSS D A, SUN Chen, et al. D3D: Distilled 3D networks for video action recognition[C]. 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass, USA, 2020: 614–623.
    [32] SINGH D, KUMAR V, KAUR M, et al. Screening of COVID-19 suspected subjects using multi-crossover genetic algorithm based dense convolutional neural network[J]. IEEE Access, 2021, 9: 142566–142580. doi: 10.1109/ACCESS.2021.3120717
    [33] ZHANG Yuxin, WANG Huan, LUO Yang, et al. Three-dimensional convolutional neural network pruning with regularization-based method[C]. 2019 IEEE International Conference on Image Processing (ICIP), Taipei, China, 2019: 4270–4274.
    [34] SHI Jixi, CHEN Zhihao, and COUTURIER R. Classification of pathological cases of myocardial infarction using convolutional neural network and random forest[C]. 11th International Workshop on Statistical Atlases and Computational Models of the Heart, Lima, Peru, 2021: 406–413.
    [35] SOOMRO K, ZAMIR A R, and SHAH M. UCF101: A dataset of 101 human actions classes from videos in the wild[EB/OL]. https://arxiv.org/abs/1212.0402, 2012.
    [36] KUEHNE H, JHUANG H, GARROTE E, et al. HMDB: A large video database for human motion recognition[C]. 2011 International Conference on Computer Vision, Barcelona, Spain, 2011: 2556–2563.
    [37] KAY W, CARREIRA J, SIMONYAN K, et al. The kinetics human action video dataset[EB/OL]. https://arxiv.org/abs/1705.06950, 2017.
    [38] KARPATHY A, TODERICI G, SHETTY S, et al. Large-scale video classification with convolutional neural networks[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, USA, 2014: 1725–1732.
    [39] TRAN D, RAY J, SHOU Zheng, et al. ConvNet architecture search for spatiotemporal feature learning[EB/OL]. https://arxiv.org/abs/1708.05038, 2017.
    [40] ZONG Ming, WANG Ruili, CHEN Zhe, et al. Multi-cue based 3D residual network for action recognition[J]. Neural Computing and Applications, 2021, 33(10): 5167–5181. doi: 10.1007/s00521-020-05313-8
    [41] TRAN D, WANG Heng, TORRESANI L, et al. A closer look at spatiotemporal convolutions for action recognition[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 6450–6459.
    [42] ZHAI Jiecheng, YAO Xunxiang, DONG Guangyuan, et al. 3D dual-stream convolutional neural networks with simple recurrent unit network: A new framework for action recognition[C]. 2022 4th International Conference on Communications, Information System and Computer Engineering (CISCE), Shenzhen, China, 2022: 509–515.
    [43] JIANG Guanghao, JIANG Xiaoyan, FANG Zhijun, et al. An efficient attention module for 3D convolutional neural networks in action recognition[J]. Applied Intelligence, 2021, 51(10): 7043–7057. doi: 10.1007/s10489-021-02195-8
    [44] KIM D H, ANVAROV F, LEE J M, et al. Metric-based attention feature learning for video action recognition[J]. IEEE Access, 2021, 9: 39218–39228. doi: 10.1109/ACCESS.2021.3064934
    [45] WANG Xiaolong and GUPTA A. Unsupervised learning of visual representations using videos[C]. 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 2015: 2794–2802.
    [46] YANG Xiangli, SONG Zixing, KING I, et al. A survey on deep semi-supervised learning[EB/OL]. https://arxiv.org/abs/2103.00550, 2021.
    [47] WANG Yaqing, YAO Quanming, KWOK J T, et al. Generalizing from a few examples: A survey on few-shot learning[J]. ACM Computing Surveys, 2021, 53(3): 63. doi: 10.1145/3386252
    [48] HAN Zongyan, FU Zhenyong, CHEN Shuo, et al. Contrastive embedding for generalized zero-shot learning[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 2371–2381.
Publication history
  • Received:  2022-05-11
  • Revised:  2022-11-18
  • Available online:  2022-11-21
  • Published:  2023-06-10
