Volume 45 Issue 4
Apr.  2023
Citation: FENG Jiangfan, HE Zhongyu. Objective Visual Attention Estimation Method via Progressive Learning and Multi-scale Enhancement[J]. Journal of Electronics & Information Technology, 2023, 45(4): 1475-1484. doi: 10.11999/JEIT220218

Objective Visual Attention Estimation Method via Progressive Learning and Multi-scale Enhancement

doi: 10.11999/JEIT220218
Funds: The National Natural Science Foundation of China (41971365), the Natural Science Foundation of Chongqing (cstc2020jcyj-msxmX0635)
  • Received Date: 2022-03-02
  • Accepted Date: 2022-09-06
  • Rev Recd Date: 2022-08-26
  • Available Online: 2022-09-08
  • Publish Date: 2023-04-10
  • Abstract: Understanding the attention mechanism of the human visual system has attracted considerable interest from both researchers and industry. Recent studies of attention mechanisms focus mainly on observer patterns; however, many intelligent applications deployed in the real world require objective visual attention detection. Automating tasks such as surveillance or human-robot collaboration requires anticipating and predicting the behavior of objects, and in such contexts gaze and focus can be highly informative about participants' intentions, goals, and upcoming decisions. Here, a progressive mechanism of objective visual attention is developed by combining cognitive mechanisms. The field of view is first treated as a combination of geometric structure and geometric detail. A Hierarchical Self-Attention Module (HSAM) is constructed to capture long-distance dependencies between deep features and to adapt to the diversity of geometric features. Field-of-view direction vectors are then generated, from which the probability distribution of gaze points is obtained. Furthermore, a feature fusion module is designed for structure sharing, fusion, and enhancement of multi-resolution features; its output carries more detailed spatial and global information and therefore captures spatial context more effectively. Experimental results on publicly available and self-built datasets, assessed with several evaluation metrics for objective attention estimation, are in excellent agreement with theoretical predictions.
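To make the two ideas named in the abstract concrete, below is a minimal sketch, not the authors' implementation: self-attention over a deep feature map to capture long-range dependencies, and fusion of multi-resolution features into a gaze-point probability map. It assumes PyTorch; the module names (HSAMBlock, FusionHead), channel sizes, and head counts are illustrative placeholders rather than values from the paper.

# Minimal sketch (assumes PyTorch); HSAMBlock and FusionHead are hypothetical names.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HSAMBlock(nn.Module):
    """Self-attention over a flattened feature map to model long-range dependencies."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> tokens (B, H*W, C)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)            # residual + layer norm
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class FusionHead(nn.Module):
    """Fuse multi-resolution features and predict a gaze-point probability map."""
    def __init__(self, channels: int):
        super().__init__()
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.out = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        # Upsample the coarse (low-resolution) map to the fine map's size, then fuse.
        coarse_up = F.interpolate(coarse, size=fine.shape[-2:], mode="bilinear",
                                  align_corners=False)
        fused = F.relu(self.mix(torch.cat([fine, coarse_up], dim=1)))
        logits = self.out(fused)                          # (B, 1, H, W)
        # Softmax over all spatial positions yields a gaze-point distribution.
        b, _, h, w = logits.shape
        return F.softmax(logits.flatten(1), dim=1).reshape(b, 1, h, w)


if __name__ == "__main__":
    fine = torch.randn(2, 64, 56, 56)                     # higher-resolution features
    coarse = HSAMBlock(64)(torch.randn(2, 64, 28, 28))    # attended low-resolution features
    heatmap = FusionHead(64)(fine, coarse)
    print(heatmap.shape, heatmap.flatten(1).sum(dim=1))   # each map sums to ~1

The sketch only illustrates the flow implied by the abstract (attention on deep features, multi-scale fusion, spatial probability output); the paper's actual progressive learning scheme and gaze-direction generator are not reproduced here.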
  • [1]
    FATHI A, HODGINS J K, and REHG J M. Social interactions: A first-person perspective[C]. 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, USA, 2012: 1226–1233.
    [2]
    MARIN-JIMENEZ M J, ZISSERMAN A, EICHNER M, et al. Detecting people looking at each other in videos[J]. International Journal of Computer Vision, 2014, 106(3): 282–296. doi: 10.1007/s11263-013-0655-7
    [3]
    PARKS D, BORJI A, and ITTI L. Augmented saliency model using automatic 3D head pose detection and learned gaze following in natural scenes[J]. Vision Research, 2015, 116: 113–126. doi: 10.1016/j.visres.2014.10.027
    [4]
    SOO PARK H and SHI Jianbo. Social saliency prediction[C]. The IEEE Conference on Computer Vision and Pattern Recognition, Boston, USA, 2015: 4777–4785.
    [5]
    ZHANG Xucong, SUGANO Y, FRITZ M, et al. Appearance-based gaze estimation in the wild[C]. 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, USA, 2015: 4511–4520.
    [6]
    CHENG Yihua, LU Feng, and ZHANG Xucong. Appearance-based gaze estimation via evaluation-guided asymmetric regression[C]. The 15th European Conference on Computer Vision (ECCV), Munich, Germany, 2018: 105–121.
    [7]
    RECASENS A, KHOSLA A, VONDRICK C, et al. Where are they looking?[C]. The 28th International Conference on Neural Information Processing Systems, Montreal, Canada, 2015: 199–207.
    [8]
    LIAN Dongze, YU Zehao, and GAO Shenghua. Believe it or not, we know what you are looking at![C]. The 14th Asian Conference on Computer Vision, Perth, Australia, 2018: 35–50.
    [9]
    HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770–778.
    [10]
    LIN T Y, DOLLÁR P, GIRSHICK R, et al. Feature pyramid networks for object detection[C]. 2017 IEEE Conference on Computer Vision and Pattern Recognition, Hawaii, USA, 2017: 936–944.
    [11]
    CHONG E, WANG Yongxin, RUIZ N, et al. Detecting attended visual targets in video[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 5395–5405.
    [12]
    SHI Xingjian, CHEN Zhourong, WANG Hao, et al. Convolutional LSTM network: A machine learning approach for precipitation nowcasting[C]. The 28th International Conference on Neural Information Processing Systems, Montreal, Canada, 2015: 802–810.
    [13]
    AUNG A M, RAMAKRISHNAN A, and WHITEHILL J R. Who are they looking at? Automatic eye gaze following for classroom observation video analysis[C]. The 11th International Conference on Educational Data Mining, Buffalo, USA, 2018: 252–258.
    [14]
    CHONG E, RUIZ N, WANG Yongxin, et al. Connecting gaze, scene, and attention: Generalized attention estimation via joint modeling of gaze and scene saliency[C]. The 15th European Conference on Computer Vision, Munich, Germany, 2018: 397–412.
    [15]
    WANG Jingdong, SUN Ke, CHENG Tianheng, et al. Deep high-resolution representation learning for visual recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(10): 3349–3364. doi: 10.1109/TPAMI.2020.2983686
    [16]
    LIU Ze, LIN Yutong, CAO Yue, et al. Swin transformer: Hierarchical vision transformer using shifted windows[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 9992–10002.
    [17]
    DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[C/OL]. The 9th International Conference on Learning Representations, 2021.
    [18]
    RAFFEL C, SHAZEER N, ROBERTS A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer[J]. The Journal of Machine Learning Research, 2020, 21(1): 140.
    [19]
    VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. The 31st International Conference on Neural Information Processing Systems, Los Angeles, USA, 2017: 6000–6010.
    [20]
    QIN Zequn, ZHANG Pengyi, WU Fei, et al. FcaNet: Frequency channel attention networks[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 763–772.
    [21]
    LIU Songtao, HUANG Di, and WANG Yunhong. Learning spatial fusion for single-shot object detection[EB/OL]. https://arxiv.org/abs/1911.09516, 2019.
    [22]
    WANG Guangrun, WANG Keze, and LIN Liang. Adaptively connected neural networks[C]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 1781–1790.
    [23]
    ZHAO Hao, LU Ming, YAO Anbang, et al. Learning to draw sight lines[J]. International Journal of Computer Vision, 2020, 128(5): 1076–1100. doi: 10.1007/s11263-019-01263-4
