Skeleton Action Recognition Based on Spatio-temporal Feature Enhanced Graph Convolutional Network
Abstract: To address the problem that existing skeleton-based action recognition methods cannot fully exploit spatio-temporal features, this paper proposes a skeleton action recognition model based on a Spatio-Temporal Feature Enhanced Graph Convolutional Network (STFE-GCN). First, the definition of the adjacency matrix that represents the topological structure of the human body and the structure of the two-stream adaptive graph convolutional network (2s-AGCN) are introduced. Second, a graph attention mechanism is applied in the spatial domain to assign different weight coefficients to neighbor nodes according to their importance, generating an attention coefficient matrix that captures spatial structure features. This matrix is combined with the global adjacency matrix generated by the non-local network to form a new spatial adaptive adjacency matrix, which enhances the extraction of the spatial structure features of the human body. Then, a mixed pooling model is used in the temporal domain to extract key action features and global contextual features, and these are combined with the features extracted by temporal convolution to enhance the extraction of temporal features from action information. Furthermore, an Efficient Channel Attention Network (ECA-Net) is introduced for channel attention enhancement, which helps the model extract the spatio-temporal features of the samples. By combining spatial feature enhancement, temporal feature enhancement, and channel attention, the STFE-GCN model is constructed and trained end to end under a multi-stream network to fully exploit spatio-temporal features. Finally, skeleton action recognition experiments are conducted on two large-scale datasets, NTU-RGB+D and NTU-RGB+D120. With the multi-stream network, the model reaches 96.0% (X-View) and 89.8% (X-Sub) recognition accuracy on NTU-RGB+D. The results show that the model achieves high recognition accuracy and good generalization ability, which further verifies its effectiveness in exploiting spatio-temporal features.
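The spatial attention step described in the abstract can be illustrated with a minimal sketch: for each joint, scores against its graph neighbors are normalized with a softmax so that more important neighbors receive larger weight coefficients. The function name `attention_coefficients`, the dot-product score, and the toy three-joint graph are assumptions made for illustration only; the abstract does not specify the scoring function, and the combination with the non-local global adjacency matrix is not shown.

```python
import math


def attention_coefficients(features, adjacency):
    """Softmax of pairwise scores, restricted to each joint's neighbors.

    features:  list of N feature vectors, one per joint.
    adjacency: N x N 0/1 matrix; adjacency[i][j] == 1 means joint j is a
               neighbor of joint i (self-loops supplied by the caller).
    The dot-product score is an illustrative assumption, not the scoring
    function of the paper.
    """
    n = len(features)
    alpha = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbors = [j for j in range(n) if adjacency[i][j]]
        if not neighbors:
            continue  # isolated joint: leave its row at zero
        scores = [
            sum(a * b for a, b in zip(features[i], features[j]))
            for j in neighbors
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        for j, e in zip(neighbors, exps):
            alpha[i][j] = e / total
    return alpha


# Hypothetical 3-joint chain (0 - 1 - 2) with self-loops.
adjacency = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha = attention_coefficients(features, adjacency)
```

Each row of `alpha` sums to 1 over the neighbors of that joint, and entries for non-neighbors stay at zero, so the matrix can stand in for a fixed normalized adjacency matrix in a graph convolution.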
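Table 1 Comparison of recognition accuracy of the STFE-GCN model with different numbers of layers (%)

| Stream | 7 layers | 8 layers | 9 layers | 10 layers | 11 layers |
|---|---|---|---|---|---|
| Joint stream | 93.6 | 93.9 | 94.0 | 94.4 | 94.1 |
| Bone stream | | | | | |

[The bone-stream row is truncated in the source; it is followed by a single value, 95.4, whose row and column cannot be recovered.]

Table 2 Comparison of recognition accuracy for different temporal convolution kernel sizes (%)

| Stream | 5×1 | 7×1 | 9×1 | 11×1 | 13×1 |
|---|---|---|---|---|---|
| Joint stream | 94.5 | 94.3 | 94.4 | 94.0 | 93.8 |
| Bone stream | | | | | |

[The bone-stream row is truncated in the source; it is followed by a single value, 95.3, whose row and column cannot be recovered.]

Table 3 Recognition accuracy of the ablation experiments on the NTU-RGB+D dataset under X-View (%)

| Model | Joint stream | Bone stream | Two-stream |
|---|---|---|---|
| 2s-AGCN | 93.7 | 93.2 | 95.1 |
| 2s-AGCN + graph attention mechanism | 94.1 | 93.4 | 95.3 |
| 2s-AGCN + mixed pooling | 94.0 | 93.6 | 95.3 |
| 2s-AGCN + ECA-Net | 94.0 | 94.1 | 95.4 |
| STFE-GCN | 94.4 | 94.3 | 95.6 |

Table 4 Recognition accuracy of each stream of the STFE-GCN model on the NTU-RGB+D dataset (%)

| Setting | Joint stream | Bone stream | Joint motion stream | Bone motion stream | Two-stream | Multi-stream |
|---|---|---|---|---|---|---|
| X-View | 94.4 | 94.3 | 92.8 | 93.0 | 95.6 | 96.0 |
| X-Sub | 87.7 | 87.4 | 85.7 | 85.6 | 89.3 | 89.8 |

Table 5 Comparison of recognition accuracy of different models on the NTU-RGB+D dataset (%)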
