Human Action Recognition via Spatio-temporal Dual Network Flow and Visual Attention Fusion

Tianliang LIU; Qingwei QIAO; Junwei WAN; Xiubin DAI; Jiebo LUO

doi:10.11999/JEIT171116

Volume 40 Issue 10

Sep. 2018


Tianliang LIU, Qingwei QIAO, Junwei WAN, Xiubin DAI, Jiebo LUO. Human Action Recognition via Spatio-temporal Dual Network Flow and Visual Attention Fusion[J]. Journal of Electronics & Information Technology, 2018, 40(10): 2395-2401. doi: 10.11999/JEIT171116


1. Jiangsu Provincial Key Laboratory of Image Processing and Image Communication, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
2. Department of Computer Science, University of Rochester, Rochester, NY 14627, USA

Funds: The National Natural Science Foundation of China (61001152, 31200747, 61071091, 61071166, 61172118), The Natural Science Foundation of Jiangsu Province of China (BK2012437), The Natural Science Foundation of NJUPT (NY214037), China Scholarship Council

Received Date: 2017-11-27
Rev Recd Date: 2018-07-26

Available Online: 2018-08-02

Publish Date: 2018-10-01

Abstract

Inspired by the visual perception mechanism of the human brain, this paper proposes an action recognition approach that integrates a dual spatio-temporal network flow with visual attention in a deep learning framework. First, optical flow features capturing body motion are extracted frame by frame from the video using coarse-to-fine Lucas-Kanade flow estimation. Then, a GoogLeNet network, initialized from a pre-trained model and fine-tuned, is applied to convolve layer by layer and aggregate, respectively, the appearance images and the corresponding optical flow features within the selected time window. Next, multi-layer Long Short-Term Memory (LSTM) networks are used to cross-recursively perceive the high-level, salient spatio-temporal semantic feature sequences. Meanwhile, the inter-dependent hidden states are decoded within the given time window, and the attention-salient feature sequence is obtained from the temporal stream together with the visual feature descriptor of the spatial stream and the label probability of each frame. The temporal attention confidence of each frame with respect to the human actions is then computed with the relative entropy measure and fused with the action-category probability distributions produced by the spatial perception stream for the video sequence. Finally, a softmax classifier identifies the category of human action in the given video sequence. Experimental results show that the proposed approach achieves notably higher classification accuracy than the compared methods.
- Human action recognition
- Optical flow
- Spatio-temporal dual network flow
- Visual attention
- Convolutional Neural Network (CNN)
- Long Short-Term Memory (LSTM)
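
The abstract gives no formula for the relative-entropy fusion step, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes that each stream outputs a per-frame class probability distribution, that a frame's attention confidence is higher when the temporal and spatial distributions agree (low relative entropy), and that the confidences weight the spatial-stream distributions before the final decision. The function names `kl_divergence` and `fuse_streams` and the toy numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """Relative entropy D_KL(p || q) between two discrete distributions."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def fuse_streams(temporal_probs, spatial_probs):
    """Fuse per-frame class distributions from the two streams.

    ASSUMPTION (not stated in the abstract): frames whose temporal and
    spatial distributions agree receive a higher attention confidence,
    obtained as a softmax over the negated relative entropies.
    """
    divergences = [
        kl_divergence(p, q) for p, q in zip(temporal_probs, spatial_probs)
    ]
    attention = softmax([-d for d in divergences])
    num_classes = len(spatial_probs[0])
    # Attention-weighted average of the spatial-stream distributions.
    fused = [
        sum(a * q[k] for a, q in zip(attention, spatial_probs))
        for k in range(num_classes)
    ]
    return attention, fused


# Toy data: 3 frames, 3 action classes (made-up numbers for illustration).
temporal = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.6, 0.3, 0.1]]
spatial = [[0.6, 0.3, 0.1], [0.5, 0.3, 0.2], [0.7, 0.2, 0.1]]

attention, fused = fuse_streams(temporal, spatial)
predicted = max(range(len(fused)), key=fused.__getitem__)
```

In this toy run the second frame, where the two streams disagree most, receives the smallest attention weight, so it contributes least to the fused video-level distribution. Because the weights sum to one, the fused vector remains a valid probability distribution.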


