Objective Visual Attention Estimation Method via Progressive Learning and Multi-scale Enhancement

FENG Jiangfan, HE Zhongyu

Citation: FENG Jiangfan, HE Zhongyu. Objective Visual Attention Estimation Method via Progressive Learning and Multi-scale Enhancement[J]. Journal of Electronics & Information Technology, 2023, 45(4): 1475-1484. doi: 10.11999/JEIT220218

doi: 10.11999/JEIT220218 cstr: 32379.14.JEIT220218
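Details
    About the authors:

    FENG Jiangfan: male, professor, doctoral supervisor; research interests: intelligent processing and application of spatial information, computer vision

    HE Zhongyu: male, master's student; research interest: computer vision

    Corresponding author:

    FENG Jiangfan fengjf@cqupt.edu.cn

  • CLC number: TN911.73; TP391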

Funds: The National Natural Science Foundation of China (41971365), The Natural Science Foundation of Chongqing (cstc2020jcyj-msxmX0635)
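  • Abstract: Visual attention mechanisms have attracted wide interest in both academia and industry, but existing work detects attention mainly from the perspective of the scene observer. A growing number of intelligent applications, however, require visual attention to be detected from the perspective of the object, that is, the person inside the scene. For example, detecting the visual attention of a surveillance target helps predict its subsequent behavior, and an intelligent robot must understand the intent of its interaction partner in order to interact effectively. Drawing on the cognitive mechanism of objective visual attention, this paper proposes an objective visual attention estimation method based on progressive learning and multi-scale enhancement. The method treats the object's field of view as a combination of geometric structure and geometric detail, and builds a Hierarchical Self-Attention Module (HSAM) that captures long-range dependencies among deep features and adapts to the diversity of geometric features. It then uses a direction vector and a field-of-view generator to obtain the probability distribution of the gaze point, and builds a feature fusion module that shares, fuses, and enhances the structure of multi-resolution features to better capture spatial context. Finally, a comprehensive loss function is constructed to estimate the correlation among gaze direction, field of view, and focus prediction. Experimental results show that the proposed method outperforms current mainstream methods on the different accuracy metrics for objective visual attention estimation, on both a public dataset and a self-built dataset.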
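The step of turning a direction vector into a probability distribution over gaze points can be illustrated roughly as follows. This is a minimal sketch and not the authors' field-of-view generator: it assumes a cosine-similarity cone around the predicted direction, and the names `fov_heatmap`, `head_xy`, `direction`, and `sharpness` are hypothetical.

```python
import math

def fov_heatmap(head_xy, direction, size=32, sharpness=4.0):
    """Return a size x size grid of probabilities peaked along `direction`.

    head_xy   : (x, y) head position in normalized [0, 1] image coordinates
    direction : (dx, dy) gaze direction; need not be unit length
    sharpness : hypothetical exponent; larger values narrow the cone
    """
    dx, dy = direction
    length = math.hypot(dx, dy)
    if length == 0:
        raise ValueError("direction must be non-zero")
    dx, dy = dx / length, dy / length
    hx, hy = head_xy

    scores = []
    for row in range(size):
        y = (row + 0.5) / size              # pixel-centre y coordinate
        line = []
        for col in range(size):
            x = (col + 0.5) / size          # pixel-centre x coordinate
            vx, vy = x - hx, y - hy         # vector from head to this pixel
            dist = math.hypot(vx, vy)
            cos = (vx * dx + vy * dy) / dist if dist > 0 else 0.0
            # Points behind the head get zero; the rest are sharpened.
            line.append(max(cos, 0.0) ** sharpness)
        scores.append(line)

    total = sum(map(sum, scores))
    if total == 0:
        # Nothing lies in front of the head: fall back to a uniform map.
        uniform = 1.0 / (size * size)
        return [[uniform] * size for _ in range(size)]
    return [[s / total for s in line] for line in scores]

heat = fov_heatmap((0.5, 0.5), (1.0, 0.0), size=16)
```

In this sketch a head at the image centre looking right places all probability mass on the right half of the grid; a learned model would refine such a coarse prior with scene features.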
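  • Figure 1  Overall structure of the model

    Figure 2  Schematic of field-of-view generation

    Figure 3  Adaptive multi-resolution feature fusion module

    Figure 4  Example samples from the AutoGaze dataset

    Figure 5  Comparison of estimation results of different methods on sample images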

    Table 1  Comparison of results for different variants on the GazeFollow and AutoGaze datasets

    Method                         GazeFollow                  AutoGaze
                                   AUC    Dist   Ang (°)       AUC    Dist   Ang (°)
    M1                             0.918  0.135  17.3          0.965  0.086  15.6
    M2                             0.914  0.139  17.3          0.960  0.093  16.0
    M3                             0.916  0.136  16.8          0.964  0.087  14.7
    M4                             0.915  0.137  17.0          0.961  0.091  15.4
    M5                             0.915  0.138  17.1          0.963  0.089  14.4
    M6                             0.906  0.143  17.6          0.960  0.092  16.6
    Proposed method (all modules)  0.922  0.133  16.7          0.969  0.083  13.9

    Table 2  Comparison of results for different models on the GazeFollow dataset

    Method                      AUC    Dist   MinDist  Ang (°)  MinAng (°)
    Random [7]                  0.504  0.484  0.391    69.0     -
    Center [7]                  0.633  0.313  0.230    49.0     -
    Fixed bias [7]              0.674  0.306  0.219    48.0     -
    Recasens et al. [7]         0.878  0.190  0.113    24.0     -
    Chong et al. [14]           0.896  0.187  0.112    -        -
    Zhao et al. [23]            -      0.147  0.082    17.6     -
    Lian et al. [8]             0.906  0.145  0.081    17.6     8.8
    Chong et al. [11]           0.921  0.137  0.077    -        -
    Proposed method (FPN)       0.905  0.146  0.083    17.5     8.5
    Proposed method (ResNet50)  0.915  0.138  0.075    17.1     8.1
    Proposed method             0.922  0.133  0.072    16.7     7.6
    Human                       0.924  0.096  0.040    11.0     -
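For readers unfamiliar with the table columns, the distance and angle metrics can be computed roughly as below. This sketch assumes the common gaze-following convention: Dist is the Euclidean distance between the predicted and annotated gaze points in normalized image coordinates, MinDist is the smallest such distance over several annotations, and Ang is the angle between the head-to-prediction and head-to-annotation vectors. The page itself does not define the metrics, and the function names are hypothetical.

```python
import math

def dist_error(pred, truth):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(pred[0] - truth[0], pred[1] - truth[1])

def min_dist_error(pred, annotations):
    """Smallest distance from the prediction to any annotated point."""
    return min(dist_error(pred, a) for a in annotations)

def angular_error_deg(head, pred, truth):
    """Angle in degrees between the head->pred and head->truth vectors."""
    px, py = pred[0] - head[0], pred[1] - head[1]
    tx, ty = truth[0] - head[0], truth[1] - head[1]
    norm = math.hypot(px, py) * math.hypot(tx, ty)
    if norm == 0:
        raise ValueError("gaze point coincides with head position")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos = max(-1.0, min(1.0, (px * tx + py * ty) / norm))
    return math.degrees(math.acos(cos))
```

Lower values are better for all three; AUC, by contrast, is computed on the predicted heatmap and is better when higher.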

    Table 3  Comparison of results for different models on the AutoGaze dataset

    Method                      AUC    Dist   Ang (°)
    Random [7]                  0.513  0.453  65.8
    Center [7]                  0.575  0.364  50.9
    Fixed bias [7]              0.624  0.334  48.5
    Recasens et al. [7]         0.944  0.113  20.8
    Chong et al. [14]           0.955  0.108  -
    Zhao et al. [23]            -      0.101  16.2
    Lian et al. [8]             0.959  0.095  15.8
    Chong et al. [11]           0.966  0.087  -
    Proposed method (FPN)       0.962  0.092  14.4
    Proposed method (ResNet50)  0.965  0.086  15.8
    Proposed method             0.969  0.083  13.9
    Human                       0.973  0.061  9.0

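    Table 4  Comparison of parameter count, model size, and total training time on the AutoGaze dataset for different models

    Method                      Parameters (M)  Model size (M)  Time (h)
    Lian et al. [8]             51.7            207.2           1.41
    Chong et al. [11]           61.5            246.5           2.23
    Proposed method (FPN)       57.1            228.7           1.47
    Proposed method (ResNet50)  54.3            218.4           1.81
    Proposed method             59.8            240.0           1.89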
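A quick consistency check on Table 4: dividing each reported model size by its parameter count gives a ratio close to 4. That is what one would expect if the sizes are in megabytes and the weights are stored as 32-bit floats, but this reading is an assumption, since the source gives the unit only as "M". The dictionary below copies the two columns from the table; the names `TABLE_4` and `size_per_param` are hypothetical.

```python
# (parameters in millions, reported model size) copied from Table 4.
TABLE_4 = {
    "Lian et al. [8]": (51.7, 207.2),
    "Chong et al. [11]": (61.5, 246.5),
    "Proposed method (FPN)": (57.1, 228.7),
    "Proposed method (ResNet50)": (54.3, 218.4),
    "Proposed method": (59.8, 240.0),
}

def size_per_param(params_m, size_m):
    """Reported size units per million parameters."""
    return size_m / params_m

ratios = {name: size_per_param(*row) for name, row in TABLE_4.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

Every ratio falls between 4.00 and 4.03, so the size column carries essentially the same information as the parameter column.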
  • [1] FATHI A, HODGINS J K, and REHG J M. Social interactions: A first-person perspective[C]. 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, USA, 2012: 1226–1233.
    [2] MARIN-JIMENEZ M J, ZISSERMAN A, EICHNER M, et al. Detecting people looking at each other in videos[J]. International Journal of Computer Vision, 2014, 106(3): 282–296. doi: 10.1007/s11263-013-0655-7
    [3] PARKS D, BORJI A, and ITTI L. Augmented saliency model using automatic 3D head pose detection and learned gaze following in natural scenes[J]. Vision Research, 2015, 116: 113–126. doi: 10.1016/j.visres.2014.10.027
    [4] SOO PARK H and SHI Jianbo. Social saliency prediction[C]. The IEEE Conference on Computer Vision and Pattern Recognition, Boston, USA, 2015: 4777–4785.
    [5] ZHANG Xucong, SUGANO Y, FRITZ M, et al. Appearance-based gaze estimation in the wild[C]. 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, USA, 2015: 4511–4520.
    [6] CHENG Yihua, LU Feng, and ZHANG Xucong. Appearance-based gaze estimation via evaluation-guided asymmetric regression[C]. The 15th European Conference on Computer Vision (ECCV), Munich, Germany, 2018: 105–121.
    [7] RECASENS A, KHOSLA A, VONDRICK C, et al. Where are they looking?[C]. The 28th International Conference on Neural Information Processing Systems, Montreal, Canada, 2015: 199–207.
    [8] LIAN Dongze, YU Zehao, and GAO Shenghua. Believe it or not, we know what you are looking at![C]. The 14th Asian Conference on Computer Vision, Perth, Australia, 2018: 35–50.
    [9] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770–778.
    [10] LIN T Y, DOLLÁR P, GIRSHICK R, et al. Feature pyramid networks for object detection[C]. 2017 IEEE Conference on Computer Vision and Pattern Recognition, Hawaii, USA, 2017: 936–944.
    [11] CHONG E, WANG Yongxin, RUIZ N, et al. Detecting attended visual targets in video[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 5395–5405.
    [12] SHI Xingjian, CHEN Zhourong, WANG Hao, et al. Convolutional LSTM network: A machine learning approach for precipitation nowcasting[C]. The 28th International Conference on Neural Information Processing Systems, Montreal, Canada, 2015: 802–810.
    [13] AUNG A M, RAMAKRISHNAN A, and WHITEHILL J R. Who are they looking at? Automatic eye gaze following for classroom observation video analysis[C]. The 11th International Conference on Educational Data Mining, Buffalo, USA, 2018: 252–258.
    [14] CHONG E, RUIZ N, WANG Yongxin, et al. Connecting gaze, scene, and attention: Generalized attention estimation via joint modeling of gaze and scene saliency[C]. The 15th European Conference on Computer Vision, Munich, Germany, 2018: 397–412.
    [15] WANG Jingdong, SUN Ke, CHENG Tianheng, et al. Deep high-resolution representation learning for visual recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(10): 3349–3364. doi: 10.1109/TPAMI.2020.2983686
    [16] LIU Ze, LIN Yutong, CAO Yue, et al. Swin transformer: Hierarchical vision transformer using shifted windows[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 9992–10002.
    [17] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[C/OL]. The 9th International Conference on Learning Representations, 2021.
    [18] RAFFEL C, SHAZEER N, ROBERTS A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer[J]. The Journal of Machine Learning Research, 2020, 21(1): 140.
    [19] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. The 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
    [20] QIN Zequn, ZHANG Pengyi, WU Fei, et al. FcaNet: Frequency channel attention networks[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 763–772.
    [21] LIU Songtao, HUANG Di, and WANG Yunhong. Learning spatial fusion for single-shot object detection[EB/OL]. https://arxiv.org/abs/1911.09516, 2019.
    [22] WANG Guangrun, WANG Keze, and LIN Liang. Adaptively connected neural networks[C]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 1781–1790.
    [23] ZHAO Hao, LU Ming, YAO Anbang, et al. Learning to draw sight lines[J]. International Journal of Computer Vision, 2020, 128(5): 1076–1100. doi: 10.1007/s11263-019-01263-4
Publication history
  • Received:  2022-03-02
  • Revised:  2022-08-26
  • Accepted:  2022-09-06
  • Published online:  2022-09-08
  • Publication date:  2023-04-10
