
Collaborative Convolutional Transformer Network Based on Skeleton Action Recognition

SHI Yuexiang, ZHU Maoqing

Citation: SHI Yuexiang, ZHU Maoqing. Collaborative Convolutional Transformer Network Based on Skeleton Action Recognition[J]. Journal of Electronics & Information Technology, 2023, 45(4): 1485-1493. doi: 10.11999/JEIT220270


doi: 10.11999/JEIT220270
Funds: The National Natural Science Foundation of China (62172349, 62172350), The General Project of Hunan Province Degree and Postgraduate Education Reform Research (2021JGYB085)
    Author Information:

    SHI Yuexiang: Male, Professor, Master's supervisor. His main research interests include image processing and action recognition

    ZHU Maoqing: Male, Master. His main research interest is action recognition

    Corresponding author:

    ZHU Maoqing, 201921002020@smail.xtu.edu.cn

  • CLC number: TN911.73; TP391.4

  • Abstract: In recent years, skeleton-based human action recognition has attracted wide attention owing to the robustness and generalization ability of skeleton data. Among existing approaches, graph convolutional networks that model the human skeleton as a spatio-temporal graph have achieved remarkable performance. However, graph convolution learns long-term interactions mainly through a series of 3D convolutions; such interactions are biased toward local neighborhoods and are limited by the kernel size, so long-range dependencies cannot be captured effectively. This paper proposes a Collaborative Convolutional Transformer network (Co-ConvT), which introduces the self-attention mechanism of the Transformer to establish long-range dependencies and combines it with Graph Convolutional Networks (GCNs) for action recognition, so that the model can both extract local information through graph convolution and capture rich long-range dependencies through the Transformer. In addition, the self-attention of the Transformer is computed at the pixel level and therefore incurs a very large computational cost; the proposed model reduces this complexity by dividing the whole network into two stages: the first stage uses pure convolution to extract shallow spatial features, and the second stage uses the proposed ConvT blocks to capture high-level semantic information. Furthermore, the linear embedding in the original Transformer is replaced with a convolutional embedding, which enhances local spatial information and thereby allows the positional encoding of the original model to be removed, making the model more lightweight. Experiments on two large-scale authoritative datasets, NTU-RGB+D and Kinetics-Skeleton, show that the model achieves Top-1 accuracies of 88.1% and 36.6%, respectively. The experimental results demonstrate that the performance of the model is greatly improved.
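To make the two-stage design and the embedding change concrete, the following is a minimal sketch in PyTorch of what a ConvT-style block could look like: a convolutional embedding stands in for the original Transformer's linear embedding, multi-head self-attention is applied over all joint-frame tokens without any positional encoding, and the block is intended to operate on feature maps already produced by the first, purely convolutional stage. All module and parameter names here (ConvTBlock, embed_dim, num_heads, the 3×3 embedding kernel) are illustrative assumptions, not the paper's released implementation.

```python
# Minimal illustrative sketch (PyTorch) of a ConvT-style block as described in the abstract.
# Assumptions (not from the paper's released code): module/parameter names such as
# ConvTBlock, embed_dim, num_heads and the 3x3 embedding kernel are chosen for illustration.
import torch
import torch.nn as nn


class ConvTBlock(nn.Module):
    """Self-attention block with convolutional embedding for skeleton feature maps.

    Input:  x of shape (N, C, T, V) -- batch, channels, frames, joints.
    Output: (N, embed_dim, T, V), with long-range joint/frame interactions mixed in.
    """

    def __init__(self, in_channels: int, embed_dim: int = 64, num_heads: int = 4):
        super().__init__()
        # Convolutional embedding: local spatio-temporal context replaces the
        # linear patch embedding + positional encoding of the vanilla Transformer.
        self.conv_embed = nn.Conv2d(in_channels, embed_dim, kernel_size=3, padding=1)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim * 4),
            nn.GELU(),
            nn.Linear(embed_dim * 4, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, _, t, v = x.shape
        tokens = self.conv_embed(x)                  # (N, D, T, V)
        tokens = tokens.flatten(2).transpose(1, 2)   # (N, T*V, D); no positional encoding
        h = self.norm1(tokens)
        attn_out, _ = self.attn(h, h, h)             # global attention over all joint-frame tokens
        tokens = tokens + attn_out                   # residual connection
        tokens = tokens + self.mlp(self.norm2(tokens))
        return tokens.transpose(1, 2).reshape(n, -1, t, v)


# Two-stage layout sketched in the abstract: stage 1 would use (graph) convolutions to get
# shallow spatial features; stage 2 stacks ConvT blocks on those smaller high-level maps,
# which keeps the quadratic cost of self-attention manageable.
x = torch.randn(2, 64, 75, 25)        # e.g. 64-channel features, 75 frames, 25 joints (NTU RGB+D)
y = ConvTBlock(in_channels=64)(x)
print(y.shape)                        # torch.Size([2, 64, 75, 25])
```

In a complete model, several such blocks would be stacked in the second stage on top of the first-stage features (Figures 2 to 4 show the actual layer layout); the sketch is only meant to show how a convolutional embedding can supply the local spatial cues that let the positional encoding be removed.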
  • Figure 1  Illustration of joint connections in the spatial and temporal streams

    Figure 2  Illustration of the Co-ConvT network layers

    Figure 3  Structure of the basic convolutional Transformer block

    Figure 4  Internal framework of the ConvT layer

    Figure 5  Accuracy comparison with the 2s-AGCN model

    Table 1  Performance comparison with other models on the Kinetics-Skeleton dataset (%)

    Model (skeleton stream)   Top-1 accuracy   Top-5 accuracy
    ST-GCN [3]                30.7             52.8
    AS-GCN [18]               34.8             56.5
    2s-AGCN [19]              36.1             58.7
    SAN [28]                  35.1             55.7
    Co-ConvT                  36.6             60.0

    Table 2  Performance comparison with other models on the NTU-60 dataset (%)

    Model              X-Sub accuracy   X-View accuracy
    ST-GCN [3]         81.5             88.3
    DPRL [29]          83.5             89.8
    HCN [30]           86.5             91.1
    SAN [28]           87.2             92.7
    AS-GCN [18]        86.8             94.2
    STA-GCN [17]       87.7             95.0
    1s-Shift-GCN [4]   87.8             95.1
    Co-ConvT           88.1             94.3

    Table 3  Comparison with baseline models in terms of parameter count and accuracy

    Model          Parameters (×10^5)   Top-1 accuracy (%)   Top-5 accuracy (%)
    ST-GCN [3]     31.1                 30.7                 52.8
    2s-AGCN [19]   35.5                 36.1                 58.7
    Co-ConvT       28.7                 36.6                 60.0

    Table 4  Effect of the embedding method and removal of positional encoding on the Kinetics-Skeleton dataset (%)

    Embedding method          Positional encoding   Top-1 accuracy   Top-5 accuracy
    Linear embedding          ×                     35.2             58.1
    Convolutional embedding   √                     35.1             57.8
    Convolutional embedding   ×                     35.4             58.3

    Table 5  Recognition accuracy (%) with different numbers of ConvT layers on the Kinetics-Skeleton dataset

    Number of layers   Top-1   Top-5
    2                  35.2    57.7
    3                  35.5    58.1
    4                  35.6    58.3
    5                  35.4    58.0
    6                  35.1    57.7
  • [1] SHI Yuexiang and ZHOU Yue. Person re-identification based on stepped feature space segmentation and local attention mechanism[J]. Journal of Electronics & Information Technology, 2022, 44(1): 195–202. doi: 10.11999/JEIT201006
    [2] NIEPERT M, AHMED M, and KUTZKOV K. Learning convolutional neural networks for graphs[C]. The 33rd International Conference on International Conference on Machine Learning, New York, USA, 2016: 2014–2023.
    [3] YAN Sijie, XIONG Yuanjun, and LIN Dahua. Spatial temporal graph convolutional networks for skeleton-based action recognition[C]. The Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, New Orleans, USA, 2018: 912.
    [4] CHENG Ke, ZHANG Yifan, HE Xiangyu, et al. Skeleton-based action recognition with shift graph convolutional network[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 180–189.
    [5] LIU Ziyu, ZHANG Hongwen, CHEN Zhenghao, et al. Disentangling and unifying graph convolutions for skeleton-based action recognition[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 140–149.
    [6] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. The 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
    [7] MEINHARDT T, KIRILLOV A, LEAL-TAIXE L, et al. TrackFormer: Multi-object tracking with transformers[J]. arXiv: 2101.02702, 2021.
    [8] SUN Peize, CAO Jinkun, JIANG Yi, et al. TransTrack: Multiple object tracking with transformer[J]. arXiv: 2012.15460, 2020.
    [9] ZHENG Ce, ZHU Sijie, MENDIETA M, et al. 3D human pose estimation with spatial and temporal transformers[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 11636–11645.
    [10] CHU Peng, WANG Jiang, YOU Quanzeng, et al. TransMOT: Spatial-temporal graph transformer for multiple object tracking[J]. arXiv: 2104.00194, 2021.
    [11] FERNANDO B, GAVVES E, ORAMAS J M, et al. Modeling video evolution for action recognition[C]. 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, USA, 2015: 5378–5387.
    [12] LI Shuai, LI Wanqing, COOK C, et al. Independently Recurrent Neural Network (IndRNN): Building a longer and deeper RNN[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 5457–5466.
    [13] LI Chao, ZHONG Qiaoyong, XIE Di, et al. Skeleton-based action recognition with convolutional neural networks[C]. 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, China, 2017: 597–600.
    [14] ZHANG Pengfei, LAN Cuiling, ZENG Wenjun, et al. Semantics-guided neural networks for efficient skeleton-based human action recognition[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 1109–1118.
    [15] ZHANG Xikun, XU Chang, and TAO Dacheng. Context aware graph convolution for skeleton-based action recognition[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 14321–14330.
    [16] ZENG Shengqiang and LI Lin. 2D/3D skeleton action recognition based on posture transformation and posture fusion[J]. Application Research of Computers, 2022, 39(3): 900–905. doi: 10.19734/j.issn.1001-3695.2021.07.0286
    [17] LI Yangzhi, YUAN Jiazheng, and LIU Hongzhe. Human skeleton-based action recognition algorithm based on spatiotemporal attention graph convolutional network model[J]. Journal of Computer Applications, 2021, 41(7): 1915–1921. doi: 10.11772/j.issn.1001-9081.2020091515
    [18] LI Maosen, CHEN Siheng, CHEN Xu, et al. Actional-structural graph convolutional networks for skeleton-based action recognition[C]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 3590–3598.
    [19] SHI Lei, ZHANG Yifan, CHENG Jian, et al. Two-stream adaptive graph convolutional networks for skeleton-based action recognition[C]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 12018–12027.
    [20] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[C/OL]. The 9th International Conference on Learning Representations, 2021.
    [21] TOUVRON H, CORD M, DOUZE M, et al. Training data-efficient image transformers & distillation through attention[C/OL]. The 38th International Conference on Machine Learning, 2021: 10347–10357.
    [22] RAMACHANDRAN P, PARMAR N, VASWANI A, et al. Stand-alone self-attention in vision models[C]. Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, Canada, 2019: 7.
    [23] SHARIR G, NOY A, and ZELNIK-MANOR L. An image is worth 16x16 words, what is a video worth?[J]. arXiv: 2103.13915, 2021.
    [24] PLIZZARI C, CANNICI M, and MATTEUCCI M. Spatial temporal transformer network for skeleton-based action recognition[C]. International Conference on Pattern Recognition, Milano, Italy, 2021: 694–701.
    [25] BA J L, KIROS J R, and HINTON G E. Layer normalization[J]. arXiv: 1607.06450, 2016.
    [26] SHAHROUDY A, LIU Jun, NG T T, et al. NTU RGB+D: A large scale dataset for 3D human activity analysis[C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 1010–1019.
    [27] KAY W, CARREIRA J, SIMONYAN K, et al. The kinetics human action video dataset[J]. arXiv: 1705.06950, 2017.
    [28] CHO S, MAQBOOL M H, LIU Fei, et al. Self-attention network for skeleton-based human action recognition[C]. 2020 IEEE Winter Conference on Applications of Computer Vision, Snowmass, USA, 2020: 624–633.
    [29] TANG Yansong, TIAN Yi, LU Jiwen, et al. Deep progressive reinforcement learning for skeleton-based action recognition[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 5323–5332.
    [30] LI Chao, ZHONG Qiaoyong, XIE Di, et al. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation[C]. The Twenty-Seventh International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 2018: 786–792.
Publication History
  • Received Date:  2022-03-14
  • Revised Date:  2022-07-07
  • Accepted Date:  2022-07-14
  • Available Online:  2022-07-21
  • Issue Published:  2023-04-10
