Volume 45, Issue 4, April 2023
Citation: FENG Jiangfan, HE Zhongyu. Objective Visual Attention Estimation Method via Progressive Learning and Multi-scale Enhancement[J]. Journal of Electronics & Information Technology, 2023, 45(4): 1475-1484. doi: 10.11999/JEIT220218

Objective Visual Attention Estimation Method via Progressive Learning and Multi-scale Enhancement

doi: 10.11999/JEIT220218
Funds:  The National Natural Science Foundation of China (41971365), The Natural Science Foundation of Chongqing (cstc2020jcyj-msxmX0635)
  • Received Date: 2022-03-02
  • Accepted Date: 2022-09-06
  • Rev Recd Date: 2022-08-26
  • Available Online: 2022-09-08
  • Publish Date: 2023-04-10
  • Abstract: Understanding the attention mechanism of the human visual system has attracted considerable interest from both researchers and industry. Recent studies of attention mechanisms focus mainly on the observer's own gaze patterns. However, many intelligent real-world applications require objective visual attention detection, that is, estimating the attention of the people being observed. Automating tasks such as surveillance or human-robot collaboration requires anticipating and predicting the behavior of subjects in the scene, and in such contexts gaze and focus are highly informative about participants' intentions, goals, and upcoming decisions. Here, a progressive mechanism for objective visual attention estimation is developed by drawing on cognitive mechanisms. The field of view is first treated as a combination of geometric structure and geometric details. A Hierarchical Self-Attention Module (HSAM) is constructed to capture long-distance dependencies between deep features and to adapt to the diversity of geometric features. With the identified generators, field-of-view direction vectors are generated, and the probability distribution of gaze points is obtained. Furthermore, a feature fusion module is designed for the sharing, fusion, and enhancement of multi-resolution features, so that its output contains more detailed spatial and global information and better captures spatial context. Experimental results on publicly available and self-built datasets show strong performance under different evaluation metrics for objective attention estimation.
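  • The abstract outlines two architectural ideas: self-attention over deep features to capture long-distance dependencies, and fusion of multi-resolution features into a single enhanced representation. The sketch below is an illustrative PyTorch reading of those two ideas only, not the authors' released code; the module names (SelfAttention2d, MultiScaleFusion), channel widths, head counts, and toy input sizes are assumptions made for the example.

    # Minimal sketch (assumed implementation, not the paper's code) of:
    # (1) self-attention over a CxHxW feature map to model long-distance dependencies,
    # (2) fusion of multi-resolution features at the finest spatial scale.
    from typing import List

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class SelfAttention2d(nn.Module):
        """Multi-head self-attention applied to a feature map flattened to HW tokens."""
        def __init__(self, channels: int, num_heads: int = 4):
            super().__init__()
            self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(channels)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, c, h, w = x.shape
            tokens = x.flatten(2).transpose(1, 2)            # (B, HW, C)
            attended, _ = self.attn(tokens, tokens, tokens)  # global pairwise dependencies
            tokens = self.norm(tokens + attended)            # residual connection + norm
            return tokens.transpose(1, 2).reshape(b, c, h, w)


    class MultiScaleFusion(nn.Module):
        """Upsample coarser maps to the finest resolution and fuse them with a 1x1 conv."""
        def __init__(self, channels: int, num_scales: int):
            super().__init__()
            self.project = nn.Conv2d(channels * num_scales, channels, kernel_size=1)

        def forward(self, feats: List[torch.Tensor]) -> torch.Tensor:
            target = feats[0].shape[-2:]                     # spatial size of the finest map
            upsampled = [F.interpolate(f, size=target, mode="bilinear", align_corners=False)
                         for f in feats]
            return self.project(torch.cat(upsampled, dim=1))


    if __name__ == "__main__":
        # Toy usage: three feature maps at decreasing resolutions, shared channel width 64.
        feats = [torch.randn(1, 64, s, s) for s in (28, 14, 7)]
        fused = MultiScaleFusion(64, len(feats))(feats)      # -> (1, 64, 28, 28)
        enhanced = SelfAttention2d(64)(fused)                # same shape, globally attended
        print(fused.shape, enhanced.shape)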
