
A Review on Action Recognition Based on Contrastive Learning

SUN Zhonghua, WU Shuang, JIA Kebin, FENG Jinchao, LIU Pengyu

Citation: SUN Zhonghua, WU Shuang, JIA Kebin, FENG Jinchao, LIU Pengyu. A Review on Action Recognition Based on Contrastive Learning[J]. Journal of Electronics & Information Technology, 2025, 47(8): 2473-2485. doi: 10.11999/JEIT250131


doi: 10.11999/JEIT250131 cstr: 32379.14.JEIT250131
Funds: Beijing Natural Science Foundation (4212001)
Details
    About the authors:

    SUN Zhonghua: Male, Ph.D., Associate Professor, Master's supervisor; research interests: computer vision and machine learning

    WU Shuang: Female, Master's student; research interests: computer vision and machine learning

    JIA Kebin: Male, Ph.D., Professor, Doctoral supervisor; research interests: video information processing and remote sensing image processing

    FENG Jinchao: Male, Ph.D., Professor, Doctoral supervisor; research interests: molecular imaging and medical image processing

    LIU Pengyu: Female, Ph.D., Associate Professor, Doctoral supervisor; research interests: video coding

    Corresponding author:

    SUN Zhonghua, sunzh@bjut.edu.cn

  • CLC number: TN919.81; TP391.4

  • Abstract: Human actions are characterized by a large number of categories and unbalanced intra-class/inter-class variation, which makes action recognition highly dependent on the quantity and quality of data labels and greatly increases the training cost of learning models. Contrastive learning is one of the effective ways to address this problem, and action recognition based on contrastive learning has become a research hotspot in recent years. On this basis, this paper comprehensively reviews the latest progress of contrastive learning in action recognition, dividing the research into three stages: traditional contrastive learning, clustering-based contrastive learning, and contrastive learning without negative samples. For each stage, representative contrastive learning models are first outlined, and the main action recognition methods built on them are then analyzed. In addition, mainstream benchmark datasets are introduced, and the performance of classical methods on these datasets is compared. Finally, the limitations of contrastive learning models in action recognition research and the directions in which they can be extended are discussed.
  • Figure 1  Overview of the SimCLR model

    Figure 2  Overview of the MoCo-v2 model

    Figure 3  Overview of the PCL model

    Figure 4  Overview of the SwAV model

    Figure 5  Overview of the BYOL model

    Figure 6  Overview of the SimSiam model

    Table 1  Contrastive learning setup of each method (modality: S = skeleton, V = video, A = audio)

    Method | Year | Modality | Data augmentation | Encoder
    MS2L | 2020 | S | temporal masking, temporal jigsaw | 1-layer bidirectional GRU
    PCRP | 2021 | S | - | 3-layer unidirectional GRU
    AS-CAL | 2021 | S | rotation, shear, reverse, Gaussian noise, Gaussian blur, joint masking, channel masking; spatial: shear, temporal: crop | LSTM
    CrosSCLR | 2021 | S | videos sampled at different frame rates | ST-GCN
    TCL | 2021 | V | normal augmentation (spatial: shear; temporal: crop); extreme augmentation (spatial: shear, spatial flip, rotation, axis masking; temporal: crop, temporal flip; spatio-temporal: Gaussian noise, Gaussian blur) | ResNet18/ResNet50
    AimCLR | 2022 | S | spatial: pose augmentation, joint jittering; temporal: crop-and-resize | ST-GCN
    CMD | 2022 | S | spatio-temporal: temporal augmentation followed by spatial augmentation, randomly combined | 3-layer bidirectional GRU
    CPM | 2022 | S | spatial: shear; temporal: crop | ST-GCN
    Hi-TRS | 2022 | S | spatial: joint substitution; temporal: sequence permutation | Hi-TRS
    MAC-Learning | 2023 | S | - | GCN
    HiCo | 2023 | S | - | S2S
    X-CAR | 2023 | S | rotation → shear → scaling (with learnable control factors) | residual graph convolutional encoder
    PCM3 | 2023 | S | inter-skeleton: temporal crop-and-resize, shear, joint jittering; intra-skeleton: CutMix[37], ResizeMix[38], Mixup[39] | 3-layer bidirectional GRU
    SDS-CL | 2023 | S | - | DSTA[40]
    ST-Graph CMRL | 2023 | S | temporal subgraphs | ST-GCN
    TimeBalance | 2023 | V | - | ResNet18/ResNet50
    VARD | 2023 | V, A | video: resizing, cropping, flipping, color jittering, scaling, Gaussian blur; audio: noise substitution, noise addition, density-based noise, dropout | C3D/R3D-18
    MoLo | 2023 | V | - | ResNet50
    HaLP | 2023 | S | same as CMD | 3-layer bidirectional GRU
    Han et al. | 2023 | V, A | - | V: R(2+1)D-18, A: ResNet-22
    ST-CL | 2023 | S | temporal: moving-phase cropping; spatial: viewpoint transformation | GCN+TCN
    SkeAttnCLR | 2023 | S | global: same as CrosSCLR; local: same as SkeleMixCLR[41] | ST-GCN
    HYSP | 2023 | S | same as AimCLR | ST-GCN
    CSTCN | 2024 | S | same as CMD | 3-layer bidirectional GRU
    Lin et al. | 2024 | S | group 1 (normal: shear, temporal crop, joint jittering; extreme: flip, rotation, Gaussian noise); group 2 (spatial: CutMix, ResizeMix, Mixup; temporal: shuffling) | 3-layer bidirectional GRU
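    Most of the skeleton augmentations in Table 1 (shear, temporal crop-and-resize, joint jittering) reduce to simple tensor operations on a coordinate array. Below is a minimal NumPy sketch of two of them, assuming sequences stored as (channels C=3, frames T, joints V); the function names and parameter values are illustrative, not taken from any particular paper:

```python
import numpy as np

def shear(seq, beta=0.5):
    """Spatial shear: randomly slant the x/y/z coordinates of every joint.
    seq: array of shape (C=3, T, V) holding 3D joint coordinates."""
    s = np.random.uniform(-beta, beta, size=6)
    # The shear matrix mixes each coordinate axis with the other two.
    shear_mat = np.array([[1.0,  s[0], s[1]],
                          [s[2], 1.0,  s[3]],
                          [s[4], s[5], 1.0 ]])
    return np.einsum('ij,jtv->itv', shear_mat, seq)

def temporal_crop(seq, min_ratio=0.5):
    """Temporal crop-and-resize: keep a random contiguous chunk of frames,
    then resample back to the original length (nearest-neighbour)."""
    C, T, V = seq.shape
    length = np.random.randint(int(T * min_ratio), T + 1)
    start = np.random.randint(0, T - length + 1)
    idx = np.linspace(start, start + length - 1, T).round().astype(int)
    return seq[:, idx, :]

# Two independently augmented "views" of one sequence form a positive pair.
seq = np.random.randn(3, 64, 25)          # e.g. 25 joints as in NTU-RGB+D
view1 = shear(temporal_crop(seq))
view2 = shear(temporal_crop(seq))
```

    The two views are then fed through the encoder in the last column of Table 1 and scored by one of the losses in Table 2.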

    Table 2  Loss functions of the contrastive learning models

    Contrastive learning model | Loss function | Loss type
    SimCLR | $ L_{\mathrm{NT\text{-}Xent}} = -\ln \dfrac{\exp\left(\mathrm{sim}\left(\boldsymbol{z}_i, \boldsymbol{z}_j\right)/\tau\right)}{\displaystyle\sum_{k=1}^{2N} \mathbb{1}_{[k\ne i]} \exp\left(\mathrm{sim}\left(\boldsymbol{z}_i, \boldsymbol{z}_k\right)/\tau\right)} $ | NT-Xent (InfoNCE + cosine similarity)
    MoCo-v2 | $ L_{\mathrm{InfoNCE}} = -\ln \dfrac{\exp\left(\boldsymbol{z}_q \cdot \boldsymbol{z}_k/\tau\right)}{\exp\left(\boldsymbol{z}_q \cdot \boldsymbol{z}_k/\tau\right) + \displaystyle\sum_{i=1}^{M} \exp\left(\boldsymbol{z}_q \cdot \boldsymbol{m}_i/\tau\right)} $ | InfoNCE
    PCL | $ L_{\mathrm{ProtoNCE}} = \dfrac{1}{N} \displaystyle\sum_{n=1}^{N} -\ln \dfrac{\exp\left(\boldsymbol{v}_n \cdot \boldsymbol{c}_k/\varphi_k\right)}{\displaystyle\sum_{i=1}^{K} \exp\left(\boldsymbol{v}_n \cdot \boldsymbol{c}_i/\varphi_i\right)} $ | ProtoNCE (InfoNCE + clustering)
    SwAV | $ \ell\left(\boldsymbol{z}_i, \boldsymbol{q}_j\right) = -\displaystyle\sum_{k} \boldsymbol{q}_j^{(k)} \ln \dfrac{\exp\left(\boldsymbol{z}_i^{\mathrm{T}} \boldsymbol{c}_k/\tau\right)}{\displaystyle\sum_{k'} \exp\left(\boldsymbol{z}_i^{\mathrm{T}} \boldsymbol{c}_{k'}/\tau\right)} $ | InfoNCE + cross-entropy + clustering
    BYOL | $ L_{i,j} \triangleq \left\|\bar{q}\left(\boldsymbol{z}_i\right) - \bar{\boldsymbol{z}}_j\right\|_2^2 = 2 - 2 \cdot \dfrac{\left\langle q\left(\boldsymbol{z}_i\right), \boldsymbol{z}_j \right\rangle}{\left\|q\left(\boldsymbol{z}_i\right)\right\|_2 \cdot \left\|\boldsymbol{z}_j\right\|_2} $ | MSE
    SimSiam | $ L_{\mathrm{SimSiam}} = -\left(\dfrac{1}{2} \dfrac{\boldsymbol{p}_i}{\left\|\boldsymbol{p}_i\right\|_2} \cdot \dfrac{\mathrm{sg}\left(\boldsymbol{z}_j\right)}{\left\|\mathrm{sg}\left(\boldsymbol{z}_j\right)\right\|_2} + \dfrac{1}{2} \dfrac{\boldsymbol{p}_j}{\left\|\boldsymbol{p}_j\right\|_2} \cdot \dfrac{\mathrm{sg}\left(\boldsymbol{z}_i\right)}{\left\|\mathrm{sg}\left(\boldsymbol{z}_i\right)\right\|_2}\right) $ | Cosine similarity
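    As a concrete reference for the first row of Table 2, the following is a minimal PyTorch sketch of the NT-Xent loss, where the other 2N−2 samples in the batch serve as negatives; the batch size, embedding dimension, and temperature are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.1):
    """NT-Xent (SimCLR) loss. z1, z2: (N, D) embeddings of two augmented
    views of the same N samples; each sample's positive is its other view."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D): dot = cosine
    sim = z @ z.t() / tau                               # (2N, 2N) similarities
    sim.fill_diagonal_(float('-inf'))                   # indicator k != i
    N = z1.size(0)
    # Positive index for each row: i <-> i+N (arange rolled by N).
    pos = torch.arange(2 * N, device=z.device).roll(N)
    return F.cross_entropy(sim, pos)                    # mean over 2N anchors

# Example: embeddings from an encoder plus projection head.
z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
loss = nt_xent(z1, z2)
```

    MoCo-v2's InfoNCE differs mainly in replacing the in-batch negatives with a momentum-updated memory queue; the stop-gradient sg(·) in the SimSiam row corresponds to `.detach()` in PyTorch.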

    Table 3  Mainstream action recognition datasets

    Dataset | Year | Modality | #Classes | #Subjects | #Samples | #Views
    HMDB51[46] | 2011 | RGB | 51 | - | 6,766 | -
    UCF101[47] | 2012 | RGB | 101 | - | 13,320 | -
    NW-UCLA[48] | 2014 | RGB, S, D | 10 | 10 | 1,475 | 3
    NTU-RGB+D[49] | 2016 | RGB, S, D, IR | 60 | 40 | 56,880 | 80
    PKUMMD[50] | 2017 | RGB, S, D, IR | 51 | 66 | 1,076 | 3
    Kinetics-400[51] | 2017 | RGB | 400 | - | 306,245 | -
    Something-Something-V2[52] | 2017 | RGB | 174 | - | 220,847 | -
    NTU-RGB+D 120[53] | 2019 | RGB, S, D, IR | 120 | 106 | 114,480 | 155
    Jester[54] | 2019 | RGB | 27 | 1,376 | 148,092 | -
    Mini-Something-V2[55] | 2021 | RGB | 87 | - | 94,000 | -

    Table 4  Performance comparison of contrastive learning based unsupervised action recognition methods (single-stream) (%)

    Method | Year | CL model | NTU-RGB+D xsub | NTU-RGB+D xview | NTU-RGB+D 120 xsub | NTU-RGB+D 120 xset | NW-UCLA | PKUMMD I | PKUMMD II
    MS2L | 2020 | SimCLR | 52.6 | - | - | - | 76.8 | 64.9 | 27.6
    AS-CAL | 2021 | MoCo-v2 | 58.5 | 64.8 | 48.6 | 49.2 | - | - | -
    CrosSCLR | 2021 | MoCo-v2 | 72.9 | 79.9 | - | - | - | - | -
    PCRP | 2021 | PCL | 54.9 | 63.4 | 43.0 | 44.6 | 86.1 | - | -
    AimCLR | 2022 | MoCo-v2 | 74.3 | 79.7 | 63.4 | 63.4 | - | 83.4 | -
    CMD | 2022 | MoCo-v2 | 79.8 | 86.9 | 70.3 | 71.5 | - | - | 43.0
    CPM | 2022 | SimSiam | 78.7 | 84.9 | 68.7 | 69.6 | - | 88.8 | 48.3
    ST-Graph CMRL | 2023 | SimCLR | 74.7 | 82.6 | 69.2 | 68.7 | - | 83.6 | 39.9
    ST-CL | 2023 | SimCLR | 68.1 | 69.4 | 54.2 | 55.6 | 81.2 | - | -
    HYSP | 2023 | BYOL | 78.2 | 82.6 | 61.8 | 64.6 | - | 83.8 | -
    SkeAttnCLR | 2023 | MoCo-v2 | 80.3 | 86.1 | 66.3 | 74.5 | - | 87.3 | 52.9
    HiCo | 2023 | MoCo-v2 | 81.4 | 88.8 | 73.7 | 74.5 | - | 89.4 | 54.7
    Hi-TRS | 2023 | MoCo-v2 | 86.0 | 93.0 | 80.6 | 81.6 | - | - | -
    PCM3 | 2023 | MoCo-v2 | 83.9 | 90.4 | 76.5 | 77.5 | - | - | 51.5
    HaLP | 2023 | MoCo-v2 | 79.7 | 86.8 | 71.1 | 72.2 | - | - | 43.5
    CSTCN | 2024 | SwAV | 83.1 | 88.7 | 72.5 | 77.4 | - | - | 48.0
    Lin et al. | 2024 | MoCo-v2 | 83.9 | 90.3 | 75.7 | 77.2 | - | 89.7 | -

    Table 5  Performance comparison of contrastive learning based unsupervised action recognition methods (multi-stream) (%)

    Method | Year | CL model | NTU-RGB+D xsub | NTU-RGB+D xview | NTU-RGB+D 120 xsub | NTU-RGB+D 120 xset | PKUMMD I | PKUMMD II
    3s-CrosSCLR | 2021 | MoCo-v2 | 77.8 | 83.4 | 67.9 | 66.7 | - | -
    3s-AimCLR | 2022 | MoCo-v2 | 78.9 | 83.8 | 68.2 | 68.8 | 87.8 | 38.5
    3s-CMD | 2022 | MoCo-v2 | 84.1 | 90.9 | 74.7 | 76.1 | - | 52.6
    3s-CPM | 2022 | SimSiam | 83.2 | 87.0 | 73.0 | 74.0 | 90.7 | 51.5
    3s-HYSP | 2023 | BYOL | 79.1 | 85.2 | 64.5 | 67.3 | 88.8 | -
    3s-SkeAttnCLR | 2023 | MoCo-v2 | 82.0 | 86.5 | 77.1 | 80.0 | 89.5 | 55.5
    Hi-TRS-3S | 2023 | MoCo-v2 | 90.0 | 95.7 | 85.3 | 87.4 | - | -
    3s-PCM3 | 2023 | MoCo-v2 | 87.4 | 93.1 | 80.0 | 81.2 | - | 58.2
    3s-CSTCN | 2024 | SwAV | 85.8 | 92.0 | 77.5 | 78.5 | - | 53.9
    3s-Lin et al. | 2024 | MoCo-v2 | 87.0 | 92.9 | 79.4 | 81.2 | 91.7 | -

    Table 6  Performance comparison of contrastive learning based semi-supervised action recognition methods (%)

    Method | Year | CL model | NTU-RGB+D xsub | NTU-RGB+D xview | NW-UCLA | PKUMMD I | PKUMMD II
    MS2L | 2020 | SimCLR | 65.2 | - | - | 70.3 | 26.1
    AS-CAL | 2021 | MoCo-v2 | 52.2 | 57.2 | - | - | -
    3s-CrosSCLR | 2021 | MoCo-v2 | 74.4 | 77.8 | - | 82.9 | 28.6
    3s-AimCLR | 2022 | MoCo-v2 | 78.2 | 81.6 | - | 86.1 | 33.4
    X-CAR | 2022 | BYOL | 76.1 | 78.2 | 68.7 | - | -
    3s-CMD | 2022 | MoCo-v2 | 79.0 | 82.4 | - | - | -
    CPM | 2022 | SimSiam | 73.0 | 77.1 | - | - | -
    CMD | 2022 | MoCo-v2 | 75.4 | 80.2 | - | - | -
    3s-HYSP | 2023 | BYOL | 80.5 | 85.4 | - | 88.7 | -
    MAC-Learning | 2023 | SimCLR | 74.2 | 78.5 | 63.0 | - | -
    3s-SkeAttnCLR | 2023 | MoCo-v2 | 81.5 | 83.8 | - | - | -
    HiCo | 2023 | MoCo-v2 | 73.0 | 78.3 | - | - | -
    Hi-TRS-3S | 2023 | MoCo-v2 | 77.7 | 81.1 | - | - | -
    PCM3 | 2023 | MoCo-v2 | 77.1 | 82.8 | - | - | -
    SDS-CL | 2023 | SimCLR | 77.2 | 83.0 | 67.0 | - | -
    3s-CSTCN | 2024 | SwAV | 84.7 | 87.5 | - | - | -
    3s-Lin et al. | 2024 | MoCo-v2 | 81.7 | 86.0 | - | - | -

    Table 7  Performance comparison of contrastive learning based action recognition methods on the remaining datasets (%)

    Method | Year | CL model | Supervision | UCF101 | Kinetics-400 | Jester | Mini-Something-V2 | HMDB51
    TCL | 2021 | SimCLR | semi-supervised | - | 30.28(5) | 94.64 | 40.7 | -
    TimeBalance | 2023 | SimCLR | semi-supervised | 81.1 | 61.2 | - | - | 54.5(60)
    MoLo | 2023 | SimCLR | 5-shot | 95.5 | 85.6 | - | 56.4 | 77.4
    VARD | 2023 | SimSiam | unsupervised | 72.6 | - | - | - | 34.9
  • [1] CHEN Ting, KORNBLITH S, NOROUZI M, et al. A simple framework for contrastive learning of visual representations[C]. The 37th International Conference on Machine Learning, Vienna, Austria, 2020: 149.
    [2] CHEN Xinlei, FAN Haoqi, GIRSHICK R, et al. Improved baselines with momentum contrastive learning[EB/OL]. https://doi.org/10.48550/arXiv.2003.04297, 2020.
    [3] LIN Lilang, SONG Sijie, YANG Wenhan, et al. MS2L: Multi-task self-supervised learning for skeleton based action recognition[C]. The 28th ACM International Conference on Multimedia, Seattle, USA, 2020: 2490–2498. doi: 10.1145/3394171.3413548.
    [4] SINGH A, CHAKRABORTY O, VARSHNEY A, et al. Semi-supervised action recognition with temporal contrastive learning[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 10384–10394. doi: 10.1109/CVPR46437.2021.01025.
    [5] DAVE I R, RIZVE M N, CHEN C, et al. TimeBalance: Temporally-invariant and temporally-distinctive video representations for semi-supervised action recognition[C]. Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 2341–2352. doi: 10.1109/CVPR52729.2023.00232.
    [6] GAO Xuehao, YANG Yang, ZHANG Yimeng, et al. Efficient spatio-temporal contrastive learning for skeleton-based 3-D action recognition[J]. IEEE Transactions on Multimedia, 2023, 25: 405–417. doi: 10.1109/TMM.2021.3127040.
    [7] XU Binqian, SHU Xiangbo, ZHANG Jiachao, et al. Spatiotemporal decouple-and-squeeze contrastive learning for semisupervised skeleton-based action recognition[J]. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(8): 11035–11048. doi: 10.1109/TNNLS.2023.3247103.
    [8] BIAN Cunling, FENG Wei, MENG Fanbo, et al. Global-local contrastive multiview representation learning for skeleton-based action recognition[J]. Computer Vision and Image Understanding, 2023, 229: 103655. doi: 10.1016/j.cviu.2023.103655.
    [9] WANG Xiang, ZHANG Shiwei, QING Zhiwu, et al. MoLo: Motion-augmented long-short contrastive learning for few-shot action recognition[C]. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 18011–18021. doi: 10.1109/CVPR52729.2023.01727.
    [10] SHU Xiangbo, XU Binqian, ZHANG Liyan, et al. Multi-granularity anchor-contrastive representation learning for semi-supervised skeleton-based action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(6): 7559–7576. doi: 10.1109/TPAMI.2022.3222871.
    [11] WU Zhirong, XIONG Yuanjun, YU S X, et al. Unsupervised feature learning via non-parametric instance discrimination[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 3733–3742. doi: 10.1109/CVPR.2018.00393.
    [12] VAN DEN OORD A, LI Yazhe, and VINYALS O. Representation learning with contrastive predictive coding[EB/OL]. http://arxiv.org/abs/1807.03748, 2018.
    [13] RAO Haocong, XU Shihao, HU Xiping, et al. Augmented skeleton based contrastive action learning with momentum LSTM for unsupervised action recognition[J]. Information Sciences, 2021, 569: 90–109. doi: 10.1016/j.ins.2021.04.023.
    [14] LI Linguo, WANG Minsi, NI Bingbing, et al. 3D human action representation learning via cross-view consistency pursuit[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 4739–4748. doi: 10.1109/CVPR46437.2021.00471.
    [15] HUA Yilei, WU Wenhan, ZHENG Ce, et al. Part aware contrastive learning for self-supervised action recognition[C]. The Thirty-Second International Joint Conference on Artificial Intelligence, Macao, China, 2023: 855–863. doi: 10.24963/ijcai.2023/95.
    [16] GUO Tianyu, LIU Hong, CHEN Zhan, et al. Contrastive learning from extremely augmented skeleton sequences for self-supervised action recognition[C]. The 36th AAAI Conference on Artificial Intelligence, Vancouver, Canada, 2022: 762–770. doi: 10.1609/aaai.v36i1.19957.
    [17] SHAH A, ROY A, SHAH K, et al. HaLP: Hallucinating latent positives for skeleton-based self-supervised learning of actions[C]. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 18846–18856. doi: 10.1109/CVPR52729.2023.01807.
    [18] MAO Yunyao, ZHOU Wengang, LU Zhenbo, et al. CMD: Self-supervised 3D action representation learning with cross-modal mutual distillation[C]. The 17th European Conference on Computer Vision, Tel Aviv, Israel, 2022: 734–752. doi: 10.1007/978-3-031-20062-5_42.
    [19] ZHANG Jiahang, LIN Lilang, and LIU Jiaying. Prompted contrast with masked motion modeling: Towards versatile 3D action representation learning[C]. The 31st ACM International Conference on Multimedia, Ottawa, Canada, 2023: 7175–7183. doi: 10.1145/3581783.3611774.
    [20] LIN Lilang, ZHANG Jiahang, and LIU Jiaying. Mutual information driven equivariant contrastive learning for 3D action representation learning[J]. IEEE Transactions on Image Processing, 2024, 33: 1883–1897. doi: 10.1109/TIP.2024.3372451.
    [21] DONG Jianfeng, SUN Shengkai, LIU Zhonglin, et al. Hierarchical contrast for unsupervised skeleton-based action representation learning[C]. The Thirty-Seventh AAAI Conference on Artificial Intelligence, Washington, USA, 2023: 525–533. doi: 10.1609/aaai.v37i1.25127.
    [22] CHEN Yuxiao, ZHAO Long, YUAN Jianbo, et al. Hierarchically self-supervised transformer for human skeleton representation learning[C]. The 17th European Conference on Computer Vision, Tel Aviv, Israel, 2022: 185–202. doi: 10.1007/978-3-031-19809-0_11.
    [23] LI Junnan, ZHOU Pan, XIONG Caiming, et al. Prototypical contrastive learning of unsupervised representations[C]. The 9th International Conference on Learning Representations, Virtual Event, Austria, 2021.
    [24] DEMPSTER A P, LAIRD N M, and RUBIN D B. Maximum likelihood from incomplete data via the EM algorithm[J]. Journal of the Royal Statistical Society: Series B (Methodological), 1977, 39(1): 1–22. doi: 10.1111/j.2517-6161.1977.tb01600.x.
    [25] XU Shihao, RAO Haocong, HU Xiping, et al. Prototypical contrast and reverse prediction: Unsupervised skeleton based action recognition[J]. IEEE Transactions on Multimedia, 2023, 25: 624–634. doi: 10.1109/TMM.2021.3129616.
    [26] ZHOU Huanyu, LIU Qingjie, and WANG Yunhong. Learning discriminative representations for skeleton based action recognition[C]. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 10608–10617. doi: 10.1109/CVPR52729.2023.01022.
    [27] CARON M, MISRA I, MAIRAL J, et al. Unsupervised learning of visual features by contrasting cluster assignments[C]. The 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2020: 831.
    [28] WANG Mingdao, LI Xueming, CHEN Siqi, et al. Learning representations by contrastive spatio-temporal clustering for skeleton-based action recognition[J]. IEEE Transactions on Multimedia, 2024, 26: 3207–3220. doi: 10.1109/TMM.2023.3307933.
    [29] HAN Haochen, ZHENG Qinghua, LUO Minnan, et al. Noise-tolerant learning for audio-visual action recognition[J]. IEEE Transactions on Multimedia, 2024, 26: 7761–7774. doi: 10.1109/TMM.2024.3371220.
    [30] GRILL J B, STRUB F, ALTCHÉ F, et al. Bootstrap your own latent a new approach to self-supervised learning[C]. Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2020: 1786.
    [31] XU Binqian, SHU Xiangbo, and SONG Yan. X-invariant contrastive augmentation and representation learning for semi-supervised skeleton-based action recognition[J]. IEEE Transactions on Image Processing, 2022, 31: 3852–3867. doi: 10.1109/TIP.2022.3175605.
    [32] FRANCO L, MANDICA P, MUNJAL B, et al. Hyperbolic self-paced learning for self-supervised skeleton-based action representations[C]. The Eleventh International Conference on Learning Representations, Kigali, Rwanda, 2023: 1–5.
    [33] KUMAR M P, PACKER B, and KOLLER D. Self-paced learning for latent variable models[C]. The 24th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2010: 1189–1197.
    [34] CHEN Xinlei and HE Kaiming. Exploring simple siamese representation learning[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 15745–15753. doi: 10.1109/CVPR46437.2021.01549.
    [35] ZHANG Haoyuan, HOU Yonghong, ZHANG Wenjing, et al. Contrastive positive mining for unsupervised 3D action representation learning[C]. Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, 2022: 36–51. doi: 10.1007/978-3-031-19772-7_3.
    [36] LIN Wei, DING Xinghao, HUANG Yue, et al. Self-supervised video-based action recognition with disturbances[J]. IEEE Transactions on Image Processing, 2023, 32: 2493–2507. doi: 10.1109/TIP.2023.3269228.
    [37] YUN S, HAN D, CHUN S, et al. CutMix: Regularization strategy to train strong classifiers with localizable features[C]. 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 2019: 6022–6031. doi: 10.1109/ICCV.2019.00612.
    [38] REN Sucheng, WANG Huiyu, GAO Zhengqi, et al. A simple data mixing prior for improving self-supervised learning[C]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 14575–14584. doi: 10.1109/CVPR52688.2022.01419.
    [39] ZHANG Hongyi, CISSE M, DAUPHIN Y N, et al. mixup: Beyond empirical risk minimization[C]. The 6th International Conference on Learning Representations, Vancouver, Canada, 2018.
    [40] SHI Lei, ZHANG Yifan, CHENG Jian, et al. Decoupled spatial-temporal attention network for skeleton-based action-gesture recognition[C]. Proceedings of the 15th Asian Conference on Computer Vision, Kyoto, Japan, 2020: 38–53. doi: 10.1007/978-3-030-69541-5_3.
    [41] CHEN Zhan, LIU Hong, GUO Tianyu, et al. Contrastive learning from spatio-temporal mixed skeleton sequences for self-supervised skeleton-based action recognition[EB/OL]. https://arxiv.org/abs/2207.03065, 2022.
    [42] LEE I, KIM D, KANG S, et al. Ensemble deep learning for skeleton-based action recognition using temporal sliding LSTM networks[C]. 2017 IEEE International Conference on Computer Vision, Venice, Italy, 2017: 1012–1020. doi: 10.1109/ICCV.2017.115.
    [43] YAN Sijie, XIONG Yuanjun, and LIN Dahua. Spatial temporal graph convolutional networks for skeleton-based action recognition[C]. The 32nd AAAI Conference on Artificial Intelligence, New Orleans, USA, 2018: 912. doi: 10.1609/aaai.v32i1.12328.
    [44] LE-KHAC P H, HEALY G, and SMEATON A F. Contrastive representation learning: A framework and review[J]. IEEE Access, 2020, 8: 193907–193934. doi: 10.1109/ACCESS.2020.3031549.
    [45] ZHANG Chongsheng, CHEN Jie, LI Qilong, et al. Deep contrastive learning: A survey[J]. Acta Automatica Sinica, 2023, 49(1): 15–39. doi: 10.16383/j.aas.c220421.
    [46] KUEHNE H, JHUANG H, GARROTE E, et al. HMDB: A large video database for human motion recognition[C]. 2011 International Conference on Computer Vision, Barcelona, Spain, 2011: 2556–2563. doi: 10.1109/ICCV.2011.6126543.
    [47] SOOMRO K, ZAMIR A R, and SHAH M. UCF101: A dataset of 101 human actions classes from videos in the wild[EB/OL]. https://doi.org/10.48550/arXiv.1212.0402, 2012.
    [48] WANG Jiang, NIE Xiaohan, XIA Yin, et al. Cross-view action modeling, learning, and recognition[C]. 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, USA, 2014: 2649–2656. doi: 10.1109/CVPR.2014.339.
    [49] SHAHROUDY A, LIU Jun, NG T-T, et al. NTU RGB+D: A large scale dataset for 3D human activity analysis[C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 1010–1019. doi: 10.1109/CVPR.2016.115.
    [50] LIU Chunhui, HU Yueyu, LI Yanghao, et al. PKU-MMD: A large scale benchmark for skeleton-based human action understanding[C]. Proceedings of the Workshop on Visual Analysis in Smart and Connected Communities, Mountain View, USA, 2017: 1–8. doi: 10.1145/3132734.3132739.
    [51] KAY W, CARREIRA J, SIMONYAN K, et al. The Kinetics human action video dataset[EB/OL]. http://arxiv.org/abs/1705.06950, 2017.
    [52] GOYAL R, KAHOU S E, MICHALSKI V, et al. The “Something Something” video database for learning and evaluating visual common sense[C]. 2017 IEEE International Conference on Computer Vision, Venice, Italy, 2017: 5843–5851. doi: 10.1109/ICCV.2017.622.
    [53] LIU Jun, SHAHROUDY A, PEREZ M, et al. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(10): 2684–2701. doi: 10.1109/TPAMI.2019.2916873.
    [54] MATERZYNSKA J, BERGER G, BAX I, et al. The Jester dataset: A large-scale video dataset of human gestures[C]. 2019 IEEE/CVF International Conference on Computer Vision Workshop, Seoul, Korea, 2019: 2874–2882. doi: 10.1109/ICCVW.2019.00349.
    [55] CHEN C F R, PANDA R, RAMAKRISHNAN K, et al. Deep analysis of CNN-based spatio-temporal representations for action recognition[C]. Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 6161–6171. doi: 10.1109/CVPR46437.2021.00610.
Publication history
  • Received: 2025-03-05
  • Revised: 2025-06-01
  • Available online: 2025-06-14
  • Published: 2025-08-27
