Human Action Recognition Network Based on Improved Channel Attention Mechanism

Ying CHEN, Suming GONG

Citation: Ying CHEN, Suming GONG. Human Action Recognition Network Based on Improved Channel Attention Mechanism[J]. Journal of Electronics & Information Technology, 2021, 43(12): 3538-3545. doi: 10.11999/JEIT200431

doi: 10.11999/JEIT200431
Funds: The National Natural Science Foundation of China (61573168)
Article information
    About the authors:

    Ying CHEN: female, born in 1976, Ph.D., professor; her research interests include information fusion and pattern recognition

    Suming GONG: male, born in 1995, master's student; his research interests include computer vision and pattern recognition

    Corresponding author:

    Ying CHEN, chenying@jiangnan.edu.cn

  • CLC number: TN911.73; TP391.4

  • Abstract: Existing channel attention mechanisms apply global average pooling directly to each channel and therefore discard its local spatial information. To address this problem in the context of human action recognition, this paper proposes two improved channel attention modules: a Spatio-Temporal (ST) interaction module based on matrix operations and a Depthwise Separable convolution (DS) module. The ST module extracts a spatio-temporally weighted information sequence for each channel through convolution and dimension-permutation operations, then obtains each channel's attention weight by convolution. The DS module first uses depthwise separable convolution to capture the local spatial information of each channel, then compresses the channel-wise feature maps so that they have a global receptive field, and finally obtains each channel's attention weight through a convolution operation, completing feature recalibration under the channel attention mechanism. The improved attention modules are inserted into baseline networks and evaluated on the common human action recognition datasets UCF101 and HMDB51, where they improve recognition accuracy.
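
As a concrete illustration of the DS module described in the abstract, the sketch below reconstructs a plausible DS-style channel attention block in PyTorch. It is a minimal sketch based only on the abstract's wording: the class name, 3x3 kernel size, reduction ratio, and the use of global average pooling as the spatial-compression step are assumptions, not the authors' published implementation (Figure 5 shows the actual DS_Block).

```python
import torch
import torch.nn as nn


class DSChannelAttention(nn.Module):
    """Sketch of a DS-style channel attention block (assumed layout).

    Pipeline per the abstract: depthwise separable convolution for local
    spatial information -> compression to a global receptive field ->
    1x1 convolutions for per-channel attention weights -> recalibration.
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Depthwise convolution: one 3x3 filter per channel, so each
        # channel's local spatial structure is encoded independently.
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels, bias=False)
        # Pointwise convolution completes the depthwise separable pair.
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1,
                                   bias=False)
        # Global average pooling compresses the spatial map so each
        # channel descriptor has a global receptive field (assumed step).
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Bottleneck 1x1 convolutions map descriptors to attention
        # weights in (0, 1); the reduction ratio 16 is an assumption.
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.pointwise(self.depthwise(x))   # local spatial encoding
        w = self.fc(self.pool(w))               # (N, C, 1, 1) weights
        return x * w                            # channel recalibration


if __name__ == "__main__":
    feat = torch.randn(2, 64, 56, 56)           # e.g. a ResNet stage output
    print(DSChannelAttention(64)(feat).shape)   # torch.Size([2, 64, 56, 56])
```

Applied to a 4-D feature tensor, the block returns a tensor of the same shape with re-weighted channels, so it can be dropped between the stages of a backbone such as ResNet in the same way the paper inserts its modules.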
  • Figure 1  SE module

    Figure 2  Improved channel attention module

    Figure 3  Schematic diagram of the network modules

    Figure 4  Detailed schematic of the ST module

    Figure 5  Detailed schematic of the DS_Block

    Figure 6  Visualization results of different attention modules

    Table 1  Validation of the attention modules

    | Method   | Backbone   | UCF101 accuracy (%) | HMDB51 accuracy (%) |
    |----------|------------|---------------------|---------------------|
    | TSN[5]   | ResNet-101 | 85.7                | 54.6                |
    | TSN+SE   | ResNet-101 | 86.1                | 55.6                |
    | TSN+DS   | ResNet-101 | 87.2                | 56.4                |
    | TSN+ST   | ResNet-101 | 87.0                | 55.8                |
    | MiCT[16] | ResNet-34  | 69.0                | 40.5                |
    | MiCT+SE  | ResNet-34  | 70.1                | 41.2                |
    | MiCT+DS  | ResNet-34  | 70.8                | 41.8                |
    | MiCT+ST  | ResNet-34  | 70.4                | 41.3                |

    Table 2  Comparison of network parameters

    | Method       | Backbone  | Parameters (M) |
    |--------------|-----------|----------------|
    | MiCT         | ResNet-34 | 26.16          |
    | MiCT+SE      | ResNet-34 | 26.30          |
    | MiCT+DS      | ResNet-34 | 26.31          |
    | MiCT+ST      | ResNet-34 | 30.09          |
    | ResNet-50    | ResNet-50 | 23.71          |
    | ResNet-50+SE | ResNet-50 | 26.20          |
    | ResNet-50+DS | ResNet-50 | 26.22          |
    | ResNet-50+ST | ResNet-50 | 86.61          |

    Table 3  Accuracy and running-time comparison of the attention modules

    | Method       | Accuracy (%) | Average running time (s) |
    |--------------|--------------|--------------------------|
    | ResNet-50    | 85.7         | 0.93                     |
    | ResNet-50+SE | 86.1         | 1.20                     |
    | ResNet-50+DS | 87.2         | 2.20                     |
    | ResNet-50+ST | 87.0         | 3.39                     |

    Table 4  Recognition accuracy of different algorithms on UCF101 and HMDB51 (single-stream input)

    | Method      | Input | Backbone     | Pretraining       | UCF101 (%) | HMDB51 (%) | fps  |
    |-------------|-------|--------------|-------------------|------------|------------|------|
    | C3D[21]     | RGB   | 3D Conv.     | Sports-1M         | 44.0       | 43.9       | 4.2  |
    | TS+LSTM[19] | RGB   | ResNet+LSTM  | ImageNet          | 82.0       | –          | –    |
    | TSN[5]      | RGB   | ResNet-101   | ImageNet          | 85.7       | 54.6       | 8.5  |
    | LTC[22]     | RGB   | ResNet-50    | ImageNet          | 83.0       | 52.8       | –    |
    | TLE[20]     | RGB   | 3D Conv.     | ImageNet          | 86.3       | 63.2       | –    |
    | TLE[20]     | RGB   | BN-Inception | ImageNet          | 86.9       | 63.5       | –    |
    | I3D[18]     | RGB   | BN-Inception | ImageNet+Kinetics | 84.5       | 49.8       | 8.3  |
    | P3D[17]     | RGB   | ResNet-101   | ImageNet+Kinetics | 86.8       | –          | 13.4 |
    | MiCT[16]    | RGB   | ResNet-101   | ImageNet+Kinetics | 86.1       | 62.8       | 4.8  |
    | C-LSTM[8]   | RGB   | ResNet+LSTM  | ImageNet          | 84.96      | 41.3       | 8.0  |
    | TSN+DS      | RGB   | ResNet-101   | ImageNet          | 87.3       | 64.4       | 3.6  |
    | MiCT+DS     | RGB   | ResNet-101   | ImageNet          | 87.0       | 64.2       | 2.1  |

    Table 5  Recognition accuracy of different algorithms on UCF101 and HMDB51 (two-stream input)

    | Method      | Input    | Backbone     | Pretraining       | UCF101 (%) | HMDB51 (%) |
    |-------------|----------|--------------|-------------------|------------|------------|
    | DTPP[23]    | RGB+FLOW | ResNet-101   | ImageNet          | 89.7       | 61.1       |
    | TS+LSTM[19] | RGB+FLOW | ResNet+LSTM  | ImageNet          | 88.1       | –          |
    | LTC[22]     | RGB+FLOW | ResNet-50    | ImageNet          | 91.7       | 64.8       |
    | TLE[20]     | RGB+FLOW | BN-Inception | ImageNet+Kinetics | 95.6       | 70.8       |
    | T3D[24]     | RGB+FLOW | ResNet-50    | ImageNet+Kinetics | 91.7       | 61.1       |
    | I3D[18]     | RGB+FLOW | BN-Inception | ImageNet          | 93.2       | 69.3       |
    | TSM[25]     | RGB+FLOW | ResNet-50    | ImageNet+Kinetics | 94.5       | 70.7       |
    | MiCT-A      | RGB+FLOW | ResNet-101   | ImageNet          | 94.2       | 70.0       |
    | MiCT-B      | RGB+FLOW | ResNet-101   | ImageNet          | 94.6       | 70.9       |
  • [1] IKIZLER-CINBIS N and SCLAROFF S. Object, scene and actions: Combining multiple features for human action recognition[C]. The 11th European Conference on Computer Vision, Heraklion, Greece, 2010: 494–507.
    [2] ZHANG Liang, LU Mengmeng, and JIANG Hua. An improved scheme of visual words description and action recognition using local enhanced distribution information[J]. Journal of Electronics & Information Technology, 2016, 38(3): 549–556. doi: 10.11999/JEIT150410
    [3] KARPATHY A, TODERICI G, SHETTY S, et al. Large-scale video classification with convolutional neural networks[C]. 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, USA, 2014: 1725–1732.
    [4] SIMONYAN K and ZISSERMAN A. Two-stream convolutional networks for action recognition in videos[C]. The 27th International Conference on Neural Information Processing Systems - Volume 1, Montreal, Canada, 2014: 568–576.
    [5] WANG Limin, XIONG Yuanjun, WANG Zhe, et al. Temporal segment networks: Towards good practices for deep action recognition[C]. The 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 2016: 20–36.
    [6] JI Shuiwang, XU Wei, YANG Ming, et al. 3D convolutional neural networks for human action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(1): 221–231. doi: 10.1109/TPAMI.2012.59
    [7] ZHU Yi, LAN Zhenzhong, NEWSAM S, et al. Hidden two-stream convolutional networks for action recognition[C]. The 14th Asian Conference on Computer Vision, Perth, Australia, 2018: 363–378.
    [8] SHARMA S, KIROS R, and SALAKHUTDINOV R. Action recognition using visual attention[C]. The International Conference on Learning Representations 2016, San Juan, Puerto Rico, 2016: 1–11.
    [9] HU Zhengping, DIAO Pengcheng, ZHANG Ruixue, et al. Research on 3D multi-branch aggregated lightweight network video action recognition algorithm[J]. Acta Electronica Sinica, 2020, 48(7): 1261–1268. doi: 10.3969/j.issn.0372-2112.2020.07.003
    [10] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA, 2016: 770–778.
    [11] HU Jie, SHEN Li, and SUN Gang. Squeeze-and-excitation networks[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7132–7141.
    [12] IOFFE S and SZEGEDY C. Batch normalization: Accelerating deep network training by reducing internal covariate shift[EB/OL]. https://arxiv.org/abs/1502.03167v2, 2015.
    [13] WU Yuxin and HE Kaiming. Group normalization[C]. The 15th European Conference on Computer Vision, Munich, Germany, 2018: 3–19.
    [14] LIN T Y, GOYAL P, GIRSHICK R, et al. Focal loss for dense object detection[C]. 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 2017: 2999–3007.
    [15] ZHOU Bo and LI Junfeng. Human action recognition combined with object detection[J]. Acta Automatica Sinica, 2020, 46(9): 1961–1970.
    [16] ZHOU Yizhou, SUN Xiaoyan, ZHA Zhengjun, et al. MiCT: Mixed 3D/2D convolutional tube for human action recognition[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 449–458.
    [17] QIU Zhaofan, YAO Ting, and MEI Tao. Learning spatio-temporal representation with pseudo-3D residual networks[C]. 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 2017: 5534–5542.
    [18] CARREIRA J and ZISSERMAN A. Quo vadis, action recognition? A new model and the kinetics dataset[C]. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, USA, 2017: 4724–4733.
    [19] MA C Y, CHEN M H, KIRA Z, et al. TS-LSTM and temporal-inception: Exploiting spatiotemporal dynamics for activity recognition[J]. Signal Processing: Image Communication, 2019, 71: 76–87. doi: 10.1016/j.image.2018.09.003
    [20] DIBA A, SHARMA V, and VAN GOOL L. Deep temporal linear encoding networks[C]. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, USA, 2017: 1541–1550.
    [21] TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]. 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 2015: 4489–4497.
    [22] VAROL G, LAPTEV I, and SCHMID C. Long-term temporal convolutions for action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(6): 1510–1517. doi: 10.1109/TPAMI.2017.2712608
    [23] ZHU Jiagang, ZHU Zheng, and ZOU Wei. End-to-end video-level representation learning for action recognition[C]. 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 2018: 645–650.
    [24] DIBA A, FAYYAZ M, SHARMA V, et al. Temporal 3D convnets: New architecture and transfer learning for video classification[EB/OL]. https://arxiv.org/abs/1711.08200v1, 2017.
    [25] LIN Ji, GAN Chuang, and HAN Song. TSM: Temporal shift module for efficient video understanding[C]. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, South Korea, 2019: 7082–7092.
Publication history
  • Received: 2020-05-29
  • Revised: 2021-06-03
  • Available online: 2021-08-24
  • Published in issue: 2021-12-21
