Citation: LIU Jie, WANG Yue, TIAN Ming. Dynamic Gesture Recognition Network Based on Multiscale Spatiotemporal Feature Fusion[J]. Journal of Electronics & Information Technology, 2023, 45(7): 2614-2622. doi: 10.11999/JEIT220758