Human Activities Recognition Based on Two-stream NonLocal Spatial Temporal Residual Convolution Neural Network

QIAN Huimin, CHEN Shi, HUANGFU Xiaoying

Citation: QIAN Huimin, CHEN Shi, HUANGFU Xiaoying. Human Activities Recognition Based on Two-stream NonLocal Spatial Temporal Residual Convolution Neural Network[J]. Journal of Electronics & Information Technology, 2024, 46(3): 1100-1108. doi: 10.11999/JEIT230168


doi: 10.11999/JEIT230168
Article information
    Author biographies:

    QIAN Huimin: Female. Associate Professor and Master's supervisor. Her research interests include intelligent video surveillance systems, human activity analysis in videos, and deep learning

    CHEN Shi: Male. Master's student. His research interest is human activity analysis in videos

    HUANGFU Xiaoying: Female. Master's student. Her research interest is human activity analysis in videos

    Corresponding author:

    CHEN Shi, 1374400532@qq.com

  • 1) https://github.com/open-mmlab/mmaction
  • CLC number: TN911.73; TP391.41


  • Abstract: The 3D Convolutional Neural Network (3D CNN) and the two-stream Convolutional Neural Network (two-stream CNN) are two widely used architectures for human activity recognition in videos, each with its own strengths. This paper aims to combine the two architectures into a human activity recognition model with low complexity and high recognition accuracy. Specifically, a Two-stream Pruned NonLocal Spatial-Temporal Residual Convolutional Neural Network (TPNLST-ResCNN) based on channel pruning is proposed. The network adopts a two-stream architecture, uses a Spatial-Temporal Residual Convolutional Neural Network (ST-ResCNN) for both the temporal-stream and the spatial-stream subnetworks, and fuses the recognition results of the two subnetworks with a mean-fusion algorithm. Furthermore, to reduce network complexity, a channel pruning scheme for the ST-ResCNN is proposed, which compresses the model while essentially preserving its recognition accuracy. To help the compressed network better learn the long-range spatial-temporal dependencies of human activities in the input video and to improve its recognition accuracy, a non-local module is inserted before the first residual spatial-temporal convolution block of the pruned network. Experimental results show that the proposed model achieves recognition accuracies of 98.33% and 74.63% on the public datasets UCF101 and HMDB51, respectively. Compared with existing methods, the proposed model has fewer parameters and higher recognition accuracy.
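To make the fusion step concrete, here is a minimal PyTorch sketch of mean fusion, assuming each subnetwork ends in per-class logits. The helper name `fuse_two_streams` is hypothetical, and whether the paper averages softmax scores or raw logits is an assumption here.

```python
# Minimal sketch of two-stream mean fusion (hypothetical helper; the exact
# fusion point, logits vs. softmax scores, is assumed, not confirmed).
import torch
import torch.nn.functional as F

def fuse_two_streams(spatial_logits: torch.Tensor,
                     temporal_logits: torch.Tensor) -> torch.Tensor:
    """Average the per-class probabilities of the two subnetworks."""
    spatial_prob = F.softmax(spatial_logits, dim=1)    # (batch, classes)
    temporal_prob = F.softmax(temporal_logits, dim=1)
    return (spatial_prob + temporal_prob) / 2.0

# Example: a batch of 4 clips, 101 classes (UCF101).
fused = fuse_two_streams(torch.randn(4, 101), torch.randn(4, 101))
pred = fused.argmax(dim=1)                             # predicted labels
```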
  • Figure 1  The two-stream nonlocal spatial-temporal residual convolutional neural network

    Figure 2  Channel pruning of a temporal convolution layer

    Figure 3  The pruning scheme

    Figure 4  Network structure of the non-local module (a minimal code sketch follows the figure list)
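Figure 4's module follows the non-local block of Wang et al. [12]. Since the figure itself is not reproduced here, the sketch below shows a standard embedded-Gaussian non-local block for 5D video features; the class name, the halved embedding width, and the plain output projection are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class NonLocalBlock3D(nn.Module):
    """Embedded-Gaussian non-local block (after Wang et al. [12]) for
    (batch, channels, time, height, width) feature maps."""
    def __init__(self, channels: int):
        super().__init__()
        self.inter = channels // 2                       # reduced embedding dim
        self.theta = nn.Conv3d(channels, self.inter, kernel_size=1)
        self.phi = nn.Conv3d(channels, self.inter, kernel_size=1)
        self.g = nn.Conv3d(channels, self.inter, kernel_size=1)
        self.out = nn.Conv3d(self.inter, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = x.shape
        n = t * h * w
        q = self.theta(x).view(b, self.inter, n).transpose(1, 2)  # (b, n, c')
        k = self.phi(x).view(b, self.inter, n)                    # (b, c', n)
        v = self.g(x).view(b, self.inter, n).transpose(1, 2)      # (b, n, c')
        attn = torch.softmax(q @ k, dim=-1)      # pairwise affinity over all
        y = attn @ v                             # positions in space and time
        y = y.transpose(1, 2).reshape(b, self.inter, t, h, w)
        return x + self.out(y)                   # residual connection

# Usage: output shape matches input shape, so the block can be inserted
# before the first residual spatial-temporal convolution block.
y = NonLocalBlock3D(64)(torch.randn(2, 64, 8, 14, 14))
```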

    Table 1  Structures and recognition accuracies of ST-ResCNN at different network depths

    Network model | Layers | Params (M) | Conv2 blocks | Conv3 blocks | Conv4 blocks | Conv5 blocks | Accuracy (%)
    A | 10 | 14.38 | 1 | 1 | 1 | 1 | 57.70
    B | 12 | 15.26 | 1 | 2 | 1 | 1 | 55.65
    C | 12 | 17.92 | 1 | 1 | 2 | 1 | 55.80
    D | 12 | 28.54 | 1 | 1 | 1 | 2 | 55.17

    Table 2  Comparison of fusion results (%)

    Dataset | Spatial stream | Temporal stream | Max fusion | Mean fusion
    UCF101 | 94.60 | 85.67 | 97.70 | 98.00
    HMDB51 | 58.63 | 50.15 | 62.80 | 69.20

    Table 3  Pruning results on UCF101 and HMDB51 (%)

    Dataset | Subnetwork | Pruning threshold | Model compression | Accuracy | Fused accuracy
    UCF101 | Spatial stream | 70 | 41.70 | 92.13 | 96.83
    UCF101 | Temporal stream | 80 | 41.70 | 81.96 | 96.83
    HMDB51 | Spatial stream | 40 | 37.97 | 59.11 | 72.27
    HMDB51 | Temporal stream | 30 | 27.89 | 54.97 | 72.27
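The pruning thresholds in Table 3 suggest a network-slimming-style criterion (ref [14]), in which channels are ranked by the absolute value of their BatchNorm scaling factor and the smallest are removed. The sketch below illustrates that selection step; reading the threshold as a percentile of the scaling factors, and the helper `bn_channel_mask`, are assumptions for illustration only.

```python
# Sketch of network-slimming-style channel selection (ref [14]): rank
# channels by |gamma| of the BatchNorm layer and drop the smallest ones.
import torch
import torch.nn as nn

def bn_channel_mask(bn: nn.BatchNorm3d, percentile: float) -> torch.Tensor:
    """Boolean mask of channels to keep: |gamma| above the percentile cut.

    `percentile` plays the role of the pruning threshold in Table 3
    (e.g. 70 would prune the 70% of channels with the smallest factors;
    this reading of the threshold is an assumption)."""
    gamma = bn.weight.detach().abs()
    threshold = torch.quantile(gamma, percentile / 100.0)
    return gamma > threshold

# Example: a BN layer following a temporal convolution, 64 channels.
bn = nn.BatchNorm3d(64)
bn.weight.data.uniform_(0.0, 1.0)   # stand-in for trained scaling factors
mask = bn_channel_mask(bn, percentile=70.0)
print(int(mask.sum()), "of 64 channels kept")
```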

    Table 4  Comparison of recognition accuracy after increasing the input frame length

    Dataset | Input frames | Accuracy before pruning (%) | Accuracy after pruning (%)
    UCF101 | 8 | 96.88 | 96.83
    UCF101 | 16 | 98.00 | 97.75
    HMDB51 | 8 | 62.10 | 72.27
    HMDB51 | 16 | 69.20 | 73.01

    Table 5  Comparison of the three networks (input frame length 16, mean fusion)

    Network | Dataset | Params (M) | Accuracy (%) | Accuracy change (%)
    ST-ResCNN | HMDB51 | 28.76 | 69.20 | +7.10
    ST-ResCNN | UCF101 | 127.08 | 98.00 | +1.12
    PST-ResCNN | HMDB51 | 19.29 | 73.01 | +0.74
    PST-ResCNN | UCF101 | 70.29 | 97.75 | +0.92
    PNLST-ResCNN | HMDB51 | 20.11 | 74.63 | +1.53
    PNLST-ResCNN | UCF101 | 71.68 | 98.33 | +4.67

    Table 6  Comparison between the proposed algorithm and existing algorithms

    Algorithm | Input | Pretraining dataset | Params (M) | UCF101 accuracy (%) | HMDB51 accuracy (%)
    C3D[2] | RGB | Sports-1M | 61.63 | 82.3 | 56.8
    P3D[5] | RGB | Sports-1M | – | 88.6 | –
    R3D-34[21] | RGB | Kinetics-700 | 63.52 | 88.8 | 59.5
    R(2+1)D-50[21] | RGB | Kinetics-700+Sports-1M | 53.95 | 93.4 | 69.4
    CIDC[11] | RGB | – | 103.00 | 97.9 | 75.2
    ActionCLIP[22] | RGB | Web data | 85.58 | 97.1 | 76.2
    STM(ResNet-50)[23] | RGB | ImageNet+Kinetics | – | 96.2 | 72.2
    TDN(ResNet-50)[11] | RGB | ImageNet+Kinetics | – | 97.4 | 76.3
    R(2+1)D-34[24] | Two-stream | Sports-1M | 127.08 | 95.0 | 72.7
    PNLST-ResCNN-34 (ours) | Two-stream | Kinetics-400 | 71.68 | 98.3 | –
    PNLST-ResCNN-10 (ours) | Two-stream | – | 20.11 | – | 74.6
  • [1] BAI Jing, YANG Zhanyuan, PENG Bin, et al. Research on 3D convolutional neural network and its application to video understanding[J]. Journal of Electronics & Information Technology, 2023, 45(6): 2273–2283. doi: 10.11999/JEIT220596.
    [2] CARREIRA J and ZISSERMAN A. Quo Vadis, action recognition? A new model and the Kinetics dataset[C]. The 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 6299–6308.
    [3] QIU Zhaofan, YAO Ting, and MEI Tao. Learning spatio-temporal representation with pseudo-3D residual networks[C]. The 2017 IEEE International Conference on Computer Vision, Venice, Italy, 2017: 5534–5542.
    [4] WANG Fenhua, ZHANG Qiang, HUANG Chao, et al. Dynamic gesture recognition combining two-stream 3D convolution with attention mechanisms[J]. Journal of Electronics & Information Technology, 2021, 43(5): 1389–1396. doi: 10.11999/JEIT200065.
    [5] PANG Chen, LU Xuequan, and LYU Lei. Skeleton-based action recognition through contrasting two-stream spatial-temporal networks[J]. IEEE Transactions on Multimedia, 2023.
    [6] VARSHNEY N and BAKARIYA B. Deep convolutional neural model for human activities recognition in a sequence of video by combining multiple CNN streams[J]. Multimedia Tools and Applications, 2022, 81(29): 42117–42129. doi: 10.1007/s11042-021-11220-4.
    [7] LI Bing, CUI Wei, WANG Wei, et al. Two-stream convolution augmented transformer for human activity recognition[C]. The 35th AAAI Conference on Artificial Intelligence, 2021: 286–293.
    [8] ILG E, MAYER N, SAIKIA T, et al. FlowNet 2.0: Evolution of optical flow estimation with deep networks[C]. The 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 1647–1655.
    [9] SUN Deqing, YANG Xiaodong, LIU Mingyu, et al. PWC-net: CNNs for optical flow using pyramid, warping, and cost volume[C]. The 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 8934–8943.
    [10] WEI S E, RAMAKRISHNA V, KANADE T, et al. Convolutional pose machines[C]. The 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 4724–4732.
    [11] LI Xinyu, SHUAI Bing, and TIGHE J. Directional temporal modeling for action recognition[C]. The 16th European Conference on Computer Vision, Glasgow, UK, 2020: 275–291.
    [12] WANG Xiaolong, GIRSHICK R, GUPTA A, et al. Non-local neural networks[C]. The 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7794–7803.
    [13] HUANG Min, QIAN Huimin, HAN Yi, et al. R(2+1)D-based two-stream CNN for human activities recognition in videos[C]. The 40th Chinese Control Conference, Shanghai, China, 2021: 7932–7937.
    [14] LIU Zhuang, LI Jianguo, SHEN Zhiqiang, et al. Learning efficient convolutional networks through network slimming[C]. The 2017 IEEE International Conference on Computer Vision, Venice, Italy, 2017: 2755–2763.
    [15] VAROL G, LAPTEV I, and SCHMID C. Long-term temporal convolutions for action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(6): 1510–1517. doi: 10.1109/TPAMI.2017.2712608.
    [16] SOOMRO K, ZAMIR A R, and SHAH M. UCF101: A dataset of 101 human actions classes from videos in the wild[OL]. arXiv preprint arXiv: 1212.0402, 2012.
    [17] KUEHNE H, JHUANG H, GARROTE E, et al. HMDB: A large video database for human motion recognition[C]. The 2011 International Conference on Computer Vision, Barcelona, Spain, 2011: 2556−2563.
    [18] KAY W, CARREIRA J, SIMONYAN K, et al. The Kinetics human action video dataset[OL]. arXiv preprint arXiv: 1705.06950, 2017.
    [19] CARREIRA J, NOLAND E, HILLIER C, et al. A short note on the kinetics-700 human action dataset[OL]. arXiv preprint arXiv: 1907.06987, 2019.
    [20] KARPATHY A, TODERICI G, SHETTY S, et al. Large-scale video classification with convolutional neural networks[C]. The 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, USA, 2014: 1725–1732.
    [21] KATAOKA H, WAKAMIYA T, HARA K, et al. Would mega-scale datasets further enhance spatiotemporal 3D CNNs?[OL]. arXiv preprint arXiv: 2004.04968, 2020.
    [22] WANG Mengmeng, XING Jiazheng, and LIU Yong. ActionCLIP: A new paradigm for video action recognition[OL]. arXiv preprint arXiv: 2109.08472, 2021.
    [23] JIANG Boyuan, WANG Mengmeng, GAN Weihao, et al. STM: SpatioTemporal and motion encoding for action recognition[C]. The 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 2000–2009.
    [24] TRAN D, WANG Heng, TORRESANI L, et al. A closer look at spatiotemporal convolutions for action recognition[C]. The 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 6450–6459.
Publication history
  • Received: 2023-03-16
  • Revised: 2023-07-05
  • Available online: 2023-07-10
  • Published: 2024-03-27
