Saliency Object Detection Utilizing Adaptive Convolutional Attention and Mask Structure

ZHU Lei, YUAN Jinyao, WANG Wenwu, CAI Xiaoman

Citation: ZHU Lei, YUAN Jinyao, WANG Wenwu, CAI Xiaoman. Saliency Object Detection Utilizing Adaptive Convolutional Attention and Mask Structure[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT240431.

doi: 10.11999/JEIT240431
Article information
    Author biographies:

    ZHU Lei: male, associate professor; his research interests include object detection and recognition, semantic segmentation, and scene parsing

    YUAN Jinyao: female, master's student; her research interests include deep learning and semantic segmentation

    WANG Wenwu: male, associate professor; his main research interests include object detection and recognition, semantic segmentation, and scene parsing

    CAI Xiaoman: female, master's student; her research interests include deep learning and semantic segmentation

    Corresponding author:

    YUAN Jinyao, jyyuan202209@163.com

  • CLC number: TN911.7; TP391

  • Abstract: Salient Object Detection (SOD) aims to mimic the attention and cognition mechanisms of the human visual system to automatically extract salient objects from a scene. Although existing models based on Convolutional Neural Networks (CNNs) or Transformers keep raising the performance bar in this field, two issues have received little attention: (1) Most methods adopt dense pixel-wise prediction to obtain per-pixel saliency values, yet this does not match the scene-parsing mechanism of the human visual system, which analyzes semantic regions as a whole rather than attending to pixel-level information. (2) Strengthening contextual association has attracted wide interest in SOD, but long-range features obtained through a Transformer backbone are not necessarily advantageous: SOD should focus on the center-surround contrast of objects within an appropriate region rather than on global long-range dependencies. To address these issues, this paper proposes a new salient object detection model that integrates CNN-style adaptive attention and masked attention into the network to improve detection performance. A mask-aware decoding module is designed that perceives image features by restricting cross-attention to the predicted mask regions, which helps the network focus on the whole region of a salient object. In addition, a context feature enhancement module based on convolutional attention is designed; unlike a Transformer, which builds long-range relations layer by layer, this module captures only the appropriate contextual associations in the highest-level features, avoiding the introduction of irrelevant global information. Experimental evaluations on four widely used datasets show that the proposed method achieves notable performance gains across different scenes, with good generalization ability and stability.
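
The mask-aware decoding module described above restricts cross-attention to regions predicted as salient by the previous decoder layer, in the spirit of the masked attention of Mask2Former [21]. A minimal PyTorch sketch of that mechanism (the tensor shapes, the 0.5 threshold, and all names here are illustrative assumptions rather than the paper's exact module):

```python
import torch

def masked_cross_attention(queries, features, mask_logits):
    """Cross-attention restricted to a predicted mask region (sketch).

    queries:     (B, Nq, C)  decoder query embeddings
    features:    (B, HW, C)  flattened image features
    mask_logits: (B, Nq, HW) mask predicted by the previous decoder layer
    """
    scale = queries.shape[-1] ** -0.5
    attn = torch.einsum('bqc,bkc->bqk', queries, features) * scale
    # Positions the previous layer predicts as background are excluded,
    # so each query attends only inside its predicted salient region.
    attn = attn.masked_fill(mask_logits.sigmoid() < 0.5, float('-inf'))
    # Guard: a query whose mask is entirely background would yield an
    # all -inf row (NaN after softmax); fall back to full attention there.
    all_masked = torch.isinf(attn).all(dim=-1, keepdim=True)
    attn = torch.where(all_masked, torch.zeros_like(attn), attn)
    attn = attn.softmax(dim=-1)
    return torch.einsum('bqk,bkc->bqc', attn, features)
```
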
  • Fig. 1  Overall network architecture of the proposed method

    Fig. 2  Feature enhancement module based on a convolutional vision Transformer (CTFE)

    Fig. 3  Comparison of the two feature fusion methods

    Fig. 4  Mask-aware vision Transformer module

    Fig. 5  Qualitative results of the proposed method and several other methods

    Fig. 6  Feature visualization results
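
The CTFE module of Fig. 2 enhances only the highest-level encoder features with convolutional attention instead of stacking global self-attention layers. As a rough illustration of that design choice (the kernel size, gating, and residual layout below are assumptions, not the published architecture):

```python
import torch.nn as nn

class ConvContextEnhance(nn.Module):
    """Sketch of a CTFE-like block: a depthwise large-kernel convolution
    plays the role of attention over an appropriate local neighbourhood,
    capturing center-surround contrast without global long-range terms."""

    def __init__(self, channels, kernel_size=7):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size,
                      padding=kernel_size // 2, groups=channels),
            nn.Conv2d(channels, channels, 1),  # pointwise mixing
            nn.Sigmoid(),                      # attention map in [0, 1]
        )
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):  # x: top-level feature map (B, C, H, W)
        return x + self.proj(x * self.attn(x))

# Applied only to the highest (coarsest) encoder stage, e.g.
# f5 = ConvContextEnhance(512)(f5), rather than layer by layer
# as in a Transformer backbone.
```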

    Table 1  Quantitative results of all evaluated methods on the four datasets in terms of MAE↓ and Max F-measure ($F_\beta^{\max}$↑)

    Methods (Year)    Speed (fps)   SOD                ECSSD              DUTS-TE            DUT-OMRON
                                    MAE↓    Fmax↑      MAE↓    Fmax↑      MAE↓    Fmax↑      MAE↓    Fmax↑
    EGNet (2019)      30.5          0.0969  0.8778     0.0374  0.9474     0.0386  0.8880     0.0528  0.8155
    PoolNet (2019)    32.0          0.1000  0.8690     0.0390  0.9440     0.0400  0.8860     0.0560  0.8300
    MINet (2020)      86.1          0.0920  0.8680     0.0342  0.9475     0.0373  0.8833     0.0559  0.8098
    AADFNet (2022)    15.0          0.0903  0.8677     0.0280  0.9543     0.0314  0.8993     0.0488  0.8143
    SACNet (2021)     11.2          0.0934  0.8804     0.0309  0.9512     0.0339  0.8944     0.0523  0.8287
    ICON (2022)       58.5          0.0841  0.8790     0.0318  0.9503     0.0370  0.8917     0.0569  0.8254
    MENet (2023)      45.0          0.0874  0.8780     0.0307  0.9549     0.0281  0.9123     0.0380  0.8337
    VSCode (2024)     39.8          0.0602  0.8817     0.0245  0.9560     0.0262  0.9150     0.0473  0.8315
    Ours              46.0          0.0567  0.8872     0.0230  0.9508     0.0243  0.8966     0.0352  0.8290
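
The two measures reported in Table 1 follow their standard SOD definitions; a minimal NumPy sketch for reference, using the conventional β² = 0.3:

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a [0,1] saliency map and binary GT."""
    return np.abs(pred.astype(float) - gt.astype(float)).mean()

def max_f_measure(pred, gt, beta2=0.3, num_thresholds=255):
    """Max F-measure: sweep binarization thresholds, keep the best F_beta."""
    gt = gt > 0.5
    best = 0.0
    for t in np.linspace(0.0, 1.0, num_thresholds):
        binary = pred >= t
        tp = np.logical_and(binary, gt).sum()
        precision = tp / (binary.sum() + 1e-8)
        recall = tp / (gt.sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, f)
    return best
```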

    Table 2  Quantitative ablation results for the different modules (on the SOD dataset)

    Experiments   Methods                           MAE↓     $F_\beta^{\max}$↑
    a             Baseline                          0.1091   0.8696
    b             Baseline+CTFE                     0.1020   0.8755
    c             Baseline+CTFE+MAT                 0.0567   0.8872
    d             Baseline+CTFE+MAT+Canny Loss      0.0580   0.8853
    e             Baseline+CTFE+MAT+IOU_BCE Loss    0.0567   0.8872
    f             Attention-Fusion                  0.0647   0.8761
    g             Simple-Fusion                     0.0567   0.8872
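
Rows f and g of Table 2 compare an attention-based fusion against the simpler fusion ultimately adopted (the two designs are contrasted in Fig. 3), with the simple variant winning. As a generic sketch of what such a simple fusion typically looks like (concatenate and convolve; this is an assumption, not the paper's exact design):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFusion(nn.Module):
    """Generic simple fusion: upsample the deep feature, concatenate it
    with the shallow one, and mix the result with a 3x3 convolution."""

    def __init__(self, c_low, c_high, c_out):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(c_low + c_high, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, low, high):  # low: (B,Cl,H,W), high: (B,Ch,h,w)
        high = F.interpolate(high, size=low.shape[-2:],
                             mode='bilinear', align_corners=False)
        return self.mix(torch.cat([low, high], dim=1))
```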

    Table 3  Experimental results with different loss weights (on the SOD dataset)

    Weight of loss                                              SOD
    $L_{\text{mask}}$   $L_{\text{rank}}$   $L_{\text{edge}}$   MAE↓     $F_\beta^{\max}$↑
    1                   0.5                 0.5                 0.0600   0.8833
    0.5                 1                   0.5                 0.0589   0.8735
    0.5                 0.5                 1                   0.0735   0.8714
    1                   1                   1                   0.0567   0.8872
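
Table 3 suggests the total objective is a weighted sum of the mask, rank, and edge terms, with equal weights working best, and ablation e of Table 2 appears to pair the mask term with an IoU+BCE loss. A hedged sketch of such a combination (the rank and edge terms are left abstract since their exact forms are not given here):

```python
import torch.nn.functional as F

def iou_bce_loss(pred_logits, gt):
    """IoU+BCE mask loss as in ablation e of Table 2: pixel-wise BCE plus
    a soft-IoU term that scores the salient region as a whole.
    gt is assumed to be a float tensor in [0, 1]."""
    bce = F.binary_cross_entropy_with_logits(pred_logits, gt)
    pred = pred_logits.sigmoid()
    inter = (pred * gt).sum(dim=(-2, -1))
    union = (pred + gt - pred * gt).sum(dim=(-2, -1))
    iou = 1.0 - (inter + 1e-8) / (union + 1e-8)
    return bce + iou.mean()

# Total objective with the best weights from Table 3:
#   loss = 1.0 * L_mask + 1.0 * L_rank + 1.0 * L_edge
```
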
  • [1] ZHOU Huajun, XIE Xiaohua, LAI Jianhuang, et al. Interactive two-stream decoder for accurate and fast saliency detection[C]. Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 9138–9147. doi: 10.1109/CVPR42600.2020.00916.
    [2] LIANG Pengpeng, PANG Yu, LIAO Chunyuan, et al. Adaptive objectness for object tracking[J]. IEEE Signal Processing Letters, 2016, 23(7): 949–953. doi: 10.1109/LSP.2016.2556706.
    [3] RUTISHAUSER U, WALTHER D, KOCH C, et al. Is bottom-up attention useful for object recognition?[C]. Proceedings of 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, USA, 2004: II-II. doi: 10.1109/CVPR.2004.1315142.
    [4] ZHANG Jing, FAN Dengping, DAI Yuchao, et al. RGB-D saliency detection via cascaded mutual information minimization[C]. Proceedings of 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 4318–4327. doi: 10.1109/ICCV48922.2021.00430.
    [5] LI Aixuan, MAO Yuxin, ZHANG Jing, et al. Mutual information regularization for weakly-supervised RGB-D salient object detection[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(1): 397–410. doi: 10.1109/TCSVT.2023.3285249.
    [6] LIAO Guibiao, GAO Wei, LI Ge, et al. Cross-collaborative fusion-encoder network for robust RGB-thermal salient object detection[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(11): 7646–7661. doi: 10.1109/TCSVT.2022.3184840.
    [7] CHEN Yilei, LI Gongyang, AN Ping, et al. Light field salient object detection with sparse views via complementary and discriminative interaction network[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(2): 1070–1085. doi: 10.1109/TCSVT.2023.3290600.
    [8] ITTI L, KOCH C, and NIEBUR E. A model of saliency-based visual attention for rapid scene analysis[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998, 20(11): 1254–1259. doi: 10.1109/34.730558.
    [9] JIANG Huaizu, WANG Jingdong, YUAN Zejian, et al. Salient object detection: A discriminative regional feature integration approach[C]. Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, USA, 2013: 2083–2090. doi: 10.1109/CVPR.2013.271.
    [10] LI Guanbin and YU Yizhou. Visual saliency based on multiscale deep features[C]. Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, USA, 2015: 5455–5463. doi: 10.1109/CVPR.2015.7299184.
    [11] LEE G, TAI Y W, and KIM J. Deep saliency with encoded low level distance map and high level features[C]. Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 660–668. doi: 10.1109/CVPR.2016.78.
    [12] WANG Linzhao, WANG Lijun, LU Huchuan, et al. Salient object detection with recurrent fully convolutional networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(7): 1734–1746. doi: 10.1109/TPAMI.2018.2846598.
    [13] LIU Nian, ZHANG Ni, WAN Kaiyuan, et al. Visual saliency transformer[C]. Proceedings of 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 4702–4712. doi: 10.1109/ICCV48922.2021.00468.
    [14] YUN Yike and LIN Weisi. SelfReformer: Self-refined network with transformer for salient object detection[J]. arXiv: 2205.11283, 2022.
    [15] ZHU Lei, CHEN Jiaxing, HU Xiaowei, et al. Aggregating attentional dilated features for salient object detection[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2020, 30(10): 3358–3371. doi: 10.1109/TCSVT.2019.2941017.
    [16] XIE Enze, WANG Wenhai, YU Zhiding, et al. SegFormer: Simple and efficient design for semantic segmentation with transformers[C]. Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021: 924.
    [17] WANG Libo, LI Rui, ZHANG Ce, et al. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2022, 190: 196–214. doi: 10.1016/j.isprsjprs.2022.06.008.
    [18] ZHOU Daquan, KANG Bingyi, JIN Xiaojie, et al. DeepViT: Towards deeper vision transformer[J]. arXiv: 2103.11886, 2021.
    [19] GAO Shanghua, CHENG Mingming, ZHAO Kai, et al. Res2Net: A new multi-scale backbone architecture[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(2): 652–662. doi: 10.1109/TPAMI.2019.2938758.
    [20] LIN Xian, YAN Zengqiang, DENG Xianbo, et al. ConvFormer: Plug-and-play CNN-style transformers for improving medical image segmentation[C]. Proceedings of the 26th International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, Canada, 2023: 642–651. doi: 10.1007/978-3-031-43901-8_61.
    [21] CHENG Bowen, MISRA I, SCHWING A G, et al. Masked-attention mask transformer for universal image segmentation[C]. Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 1280–1289. doi: 10.1109/CVPR52688.2022.00135.
    [22] ZHAO Jiaxing, LIU Jiangjiang, FAN Dengping, et al. EGNet: Edge guidance network for salient object detection[C]. Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 8778–8787. doi: 10.1109/ICCV.2019.00887.
    [23] LIU Jiangjiang, HOU Qibin, CHENG Mingming, et al. A simple pooling-based design for real-time salient object detection[C]. Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 3912–3921. doi: 10.1109/CVPR.2019.00404.
    [24] PANG Youwei, ZHAO Xiaoqi, ZHANG Lihe, et al. Multi-scale interactive network for salient object detection[C]. Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 9410–9419. doi: 10.1109/CVPR42600.2020.00943.
    [25] HU Xiaowei, FU C W, ZHU Lei, et al. SAC-Net: Spatial attenuation context for salient object detection[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(3): 1079–1090. doi: 10.1109/TCSVT.2020.2995220.
    [26] ZHUGE Mingchen, FAN Dengping, LIU Nian, et al. Salient object detection via integrity learning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(3): 3738–3752. doi: 10.1109/TPAMI.2022.3179526.
    [27] WANG Yi, WANG Ruili, FAN Xin, et al. Pixels, regions, and objects: Multiple enhancement for salient object detection[C]. Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 10031–10040. doi: 10.1109/CVPR52729.2023.00967.
    [28] LUO Ziyang, LIU Nian, ZHAO Wangbo, et al. VSCode: General visual salient and camouflaged object detection with 2D prompt learning[C]. Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 17169–17180. doi: 10.1109/CVPR52733.2024.01625.
Publication history
  • Received: 2024-05-13
  • Revised: 2024-09-18
  • Published online: 2024-09-24
