Citation: ZHU Lei, YUAN Jinyao, WANG Wenwu, CAI Xiaoman. Saliency Object Detection Utilizing Adaptive Convolutional Attention and Mask Structure[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT240431

Saliency Object Detection Utilizing Adaptive Convolutional Attention and Mask Structure

doi: 10.11999/JEIT240431
  • Received Date: 2024-05-13
  • Rev Recd Date: 2024-09-18
  • Available Online: 2024-09-24
  • Objective  Salient Object Detection (SOD) aims to replicate the human visual system's attentional process by identifying visually prominent objects within a scene. Recent advances in Convolutional Neural Networks (CNNs) and Transformer-based models have improved performance; however, several limitations remain: (1) Most existing models rely on pixel-wise dense prediction, which diverges from the human visual system's focus on region-level analysis and can produce inconsistent saliency within semantic regions. (2) Transformers are commonly applied to capture global dependencies, which may not suit SOD, since the task prioritizes center-surround contrast in local areas rather than global long-range correlations. This study proposes an SOD model that integrates CNN-style adaptive attention and mask-aware mechanisms to enhance contextual feature representation and overall performance.
  • Methods  The proposed model comprises a feature extraction backbone, contextual enhancement modules, and a mask-aware decoding structure. A CNN backbone, specifically Res2Net, extracts multi-scale features from input images. These features are processed hierarchically to preserve both spatial detail and semantic richness, and a top-down feature pyramid pathway further strengthens the multi-scale representations. High-level features are refined through specialized modules to improve saliency prediction. Central to the architecture is the ConvoluTional attention-based contextual Feature Enhancement (CTFE) module. Using adaptive convolutional attention, this module captures meaningful contextual associations without relying on the global dependencies used by Transformer-based methods; it models center-surround contrast within relevant regions and avoids unnecessary computational overhead (a minimal sketch of such a block is given below). Features refined by the CTFE module are integrated with lower-level features through the Feature Fusion Module (FFM). Two fusion strategies, Attention-Fusion and Simple-Fusion, were evaluated to identify the more effective way of merging hierarchical features (both are sketched below). Decoding is handled by the Mask-Aware Transformer (MAT) module, which predicts salient regions by restricting attention to mask-defined areas (see the masked-attention sketch below). This strategy keeps the decoder focused on regions relevant to saliency, improving semantic consistency while reducing noise from irrelevant background. Because the MAT module produces both masks and object confidence scores, it is well suited to complex scenes. Multiple loss functions guide training: a mask loss, computed with Dice loss, aligns predicted masks with the ground truth; a ranking loss prioritizes the most significant salient regions; and an edge loss sharpens boundaries so that salient objects are clearly separated from the background (the Dice and edge terms are sketched below). These objectives are optimized jointly with the Adam optimizer and a dynamically adjusted learning rate.
  • Results and Discussions  Experiments were conducted with the PyTorch framework on an RTX 3090 GPU, with training configurations tuned for SOD datasets. The input resolution was set to 384×384 pixels, and data augmentation such as horizontal flipping and random cropping was applied. The learning rate was initialized at 6e-6 and adjusted dynamically, with the Adam optimizer minimizing the combined loss (an example configuration is sketched below). Evaluations were performed on four widely used datasets: SOD, DUTS-TE, DUT-OMRON, and ECSSD. The proposed model performs strongly on all of them, with clear improvements in Mean Absolute Error (MAE) and maximum F-measure (implementations of both metrics are sketched below). On the DUTS-TE dataset, for instance, it achieves an MAE of 0.023 and a maximum F-measure of 0.9508, exceeding competing methods such as MENet and VSCode. Visual comparisons show that the proposed method generates saliency maps closely aligned with the ground truth, handling challenging cases with fine structures, multiple objects, and complex backgrounds, whereas other methods often include irrelevant regions or miss object details. Ablation experiments confirm the effectiveness of the key components: adding the CTFE module reduces MAE from 0.109 to 0.102, and the Simple-Fusion strategy outperforms Attention-Fusion with a lower MAE and a higher maximum F-measure. Combining IoU and BCE-based edge loss further sharpens boundaries and outperforms a Canny-based edge loss. Heatmaps illustrate the contributions of the CTFE and MAT modules in emphasizing salient regions while preserving semantic consistency: the CTFE accentuates center-surround contrast, while the MAT captures global object-level semantics. These visualizations highlight the model's ability to focus on critical areas while suppressing background noise.
  • Conclusions  This study presents a novel SOD framework that integrates CNN-style adaptive attention with mask-aware decoding. The proposed model addresses the limitations of existing approaches by enhancing semantic consistency and contextual representation while avoiding excessive reliance on global long-range dependencies. Comprehensive evaluations demonstrate its robustness, generalization capability, and significant performance gains across multiple benchmarks. Future work will further optimize the architecture and extend it to multimodal SOD tasks, including RGB-D and RGB-T saliency detection.
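The CTFE module is described above only at the level of adaptive convolutional attention over local center-surround context. The following is a minimal PyTorch sketch of one plausible block of this kind, using a depthwise large-kernel convolution as the attention generator; the class name, kernel size, and layer layout are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class ConvAttentionBlock(nn.Module):
    """Sketch of a CNN-style adaptive attention block (hypothetical CTFE variant).

    A depthwise large-kernel convolution produces a spatial response that
    re-weights the input features, emphasizing center-surround contrast in a
    local neighborhood instead of global self-attention.
    """
    def __init__(self, channels: int, kernel_size: int = 7):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels)
        self.proj_in = nn.Conv2d(channels, channels, kernel_size=1)
        # Depthwise convolution with a large kernel approximates the local
        # "surround" context of every pixel.
        self.dw_conv = nn.Conv2d(channels, channels, kernel_size,
                                 padding=kernel_size // 2, groups=channels)
        self.proj_out = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x
        x = self.norm(x)
        attn = self.dw_conv(self.proj_in(x))      # local contextual response
        x = self.proj_out(attn * x)               # re-weight features (attention)
        return identity + x                       # residual connection

if __name__ == "__main__":
    feats = torch.randn(2, 256, 12, 12)           # high-level backbone features
    print(ConvAttentionBlock(256)(feats).shape)   # torch.Size([2, 256, 12, 12])
```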
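The abstract compares two FFM variants, Attention-Fusion and Simple-Fusion, and finds the simpler one more effective. Below is a hedged sketch of what such a pair of fusion blocks could look like (upsample-add-refine versus channel-gated fusion); the concrete operations are assumptions rather than the published design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFusion(nn.Module):
    """Upsample the high-level map, add it to the low-level map, then refine."""
    def __init__(self, channels: int):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

    def forward(self, low, high):
        high = F.interpolate(high, size=low.shape[-2:], mode="bilinear",
                             align_corners=False)
        return self.refine(low + high)

class AttentionFusion(nn.Module):
    """Gate the low-level map with a channel-attention vector from the high-level map."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.refine = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, low, high):
        high = F.interpolate(high, size=low.shape[-2:], mode="bilinear",
                             align_corners=False)
        return self.refine(low * self.gate(high) + high)
```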
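The MAT decoder restricts attention to mask-defined areas, in the spirit of the masked attention of reference [21]. The sketch below shows the core idea for a single cross-attention step; the tensor shapes, the 0.5 threshold, and the empty-mask fallback are assumptions.

```python
import torch

def masked_cross_attention(queries, keys, values, mask_logits, threshold=0.5):
    """Cross-attention restricted to mask-defined areas (Mask2Former-style sketch).

    queries:     (B, Nq, C)  object queries
    keys/values: (B, HW, C)  flattened image features
    mask_logits: (B, Nq, HW) mask prediction from the previous decoder layer
    Positions predicted as background are suppressed, so each query attends
    only to its currently predicted salient region.
    """
    scale = queries.shape[-1] ** -0.5
    attn = torch.einsum("bqc,bkc->bqk", queries, keys) * scale
    keep = mask_logits.sigmoid() >= threshold            # foreground positions
    # If a query's predicted mask is empty, fall back to full attention (avoids NaNs).
    keep = keep | ~keep.any(dim=-1, keepdim=True)
    attn = attn.masked_fill(~keep, float("-inf")).softmax(dim=-1)
    return torch.einsum("bqk,bkc->bqc", attn, values)

if __name__ == "__main__":
    q = torch.randn(1, 8, 64)        # 8 object queries
    kv = torch.randn(1, 144, 64)     # 12x12 feature map, flattened
    m = torch.randn(1, 8, 144)       # mask logits from the previous layer
    print(masked_cross_attention(q, kv, kv, m).shape)    # torch.Size([1, 8, 64])
```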
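Training combines a Dice-based mask loss, a ranking loss, and an edge loss, with the ablation favoring an IoU plus BCE-based edge term over a Canny-based one. The sketch below covers only the Dice, IoU, and boundary-weighted BCE terms; the ranking loss, the exact edge-extraction method, and the loss weights are not specified in the abstract and are treated as assumptions.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred_logits, target, eps=1.0):
    """Dice loss between predicted mask logits and a binary ground-truth mask."""
    pred = pred_logits.sigmoid().flatten(1)
    target = target.flatten(1)
    inter = (pred * target).sum(-1)
    return (1 - (2 * inter + eps) / (pred.sum(-1) + target.sum(-1) + eps)).mean()

def iou_loss(pred_logits, target, eps=1.0):
    """Soft IoU loss over the whole mask."""
    pred = pred_logits.sigmoid().flatten(1)
    target = target.flatten(1)
    inter = (pred * target).sum(-1)
    union = (pred + target - pred * target).sum(-1)
    return (1 - (inter + eps) / (union + eps)).mean()

def edge_loss(pred_logits, target):
    """BCE weighted toward boundary pixels; edges come from the GT mask via a
    max-pooling (morphological-gradient) trick, which is an assumption here."""
    edge = F.max_pool2d(target, 3, stride=1, padding=1) - target   # outer boundary
    weight = 1.0 + 4.0 * edge                                      # emphasize boundaries
    bce = F.binary_cross_entropy_with_logits(pred_logits, target, reduction="none")
    return (weight * bce).sum() / weight.sum()

if __name__ == "__main__":
    pred = torch.randn(2, 1, 384, 384)                   # predicted mask logits
    gt = (torch.rand(2, 1, 384, 384) > 0.5).float()      # binary ground truth
    print(dice_loss(pred, gt) + iou_loss(pred, gt) + edge_loss(pred, gt))
```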
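The reported training setup is 384×384 inputs, horizontal flipping and random cropping, and Adam with an initial learning rate of 6e-6 that is adjusted dynamically. A minimal configuration sketch follows; the intermediate resize size, the augmentation ordering, the placeholder network, and the absence of a scheduler are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Placeholder standing in for the full model (backbone + CTFE + FFM + MAT).
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)

# 384x384 inputs with horizontal flipping and random cropping, as reported.
# In practice the same geometric augmentations must also be applied to the
# ground-truth masks; that pairing is omitted from this sketch.
train_transform = transforms.Compose([
    transforms.Resize((416, 416)),          # assumed intermediate size before cropping
    transforms.RandomCrop(384),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])

# Adam with the reported initial learning rate of 6e-6; the abstract states the
# rate is adjusted dynamically during training (schedule unspecified).
optimizer = torch.optim.Adam(model.parameters(), lr=6e-6)
```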
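Results are reported as MAE and maximum F-measure. The sketch below implements the standard per-image definitions, with the conventional β² = 0.3 and a threshold sweep; in benchmark code these quantities are normally averaged over a dataset before the maximum is taken.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map and the GT, both in [0, 1]."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def max_f_measure(pred, gt, beta2=0.3, num_thresholds=255):
    """Maximum F-measure over a sweep of binarization thresholds
    (beta^2 = 0.3 is the conventional choice in the SOD literature)."""
    gt = gt > 0.5
    best = 0.0
    for t in np.linspace(0.0, 1.0, num_thresholds):
        binary = pred >= t
        tp = np.logical_and(binary, gt).sum()
        precision = tp / (binary.sum() + 1e-8)
        recall = tp / (gt.sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, f)
    return best
```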
    [1] ZHOU Huajun, XIE Xiaohua, LAI Jianhuang, et al. Interactive two-stream decoder for accurate and fast saliency detection[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 9138–9147. doi: 10.1109/CVPR42600.2020.00916.
    [2] LIANG Pengpeng, PANG Yu, LIAO Chunyuan, et al. Adaptive objectness for object tracking[J]. IEEE Signal Processing Letters, 2016, 23(7): 949–953. doi: 10.1109/LSP.2016.2556706.
    [3] RUTISHAUSER U, WALTHER D, KOCH C, et al. Is bottom-up attention useful for object recognition?[C]. 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, USA, 2004: II-II. doi: 10.1109/CVPR.2004.1315142.
    [4] ZHANG Jing, FAN Dengping, DAI Yuchao, et al. RGB-D saliency detection via cascaded mutual information minimization[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 4318–4327. doi: 10.1109/ICCV48922.2021.00430.
    [5] LI Aixuan, MAO Yuxin, ZHANG Jing, et al. Mutual information regularization for weakly-supervised RGB-D salient object detection[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(1): 397–410. doi: 10.1109/TCSVT.2023.3285249.
    [6] LIAO Guibiao, GAO Wei, LI Ge, et al. Cross-collaborative fusion-encoder network for robust RGB-thermal salient object detection[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(11): 7646–7661. doi: 10.1109/TCSVT.2022.3184840.
    [7] CHEN Yilei, LI Gongyang, AN Ping, et al. Light field salient object detection with sparse views via complementary and discriminative interaction network[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(2): 1070–1085. doi: 10.1109/TCSVT.2023.3290600.
    [8] ITTI L, KOCH C, and NIEBUR E. A model of saliency-based visual attention for rapid scene analysis[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998, 20(11): 1254–1259. doi: 10.1109/34.730558.
    [9] JIANG Huaizu, WANG Jingdong, YUAN Zejian, et al. Salient object detection: A discriminative regional feature integration approach[C]. 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, USA, 2013: 2083–2090. doi: 10.1109/CVPR.2013.271.
    [10] LI Guanbin and YU Yizhou. Visual saliency based on multiscale deep features[C]. 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, USA, 2015: 5455–5463. doi: 10.1109/CVPR.2015.7299184.
    [11] LEE G, TAI Y W, and KIM J. Deep saliency with encoded low level distance map and high level features[C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 660–668. doi: 10.1109/CVPR.2016.78.
    [12] WANG Linzhao, WANG Lijun, LU Huchuan, et al. Salient object detection with recurrent fully convolutional networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(7): 1734–1746. doi: 10.1109/TPAMI.2018.2846598.
    [13] LIU Nian, ZHANG Ni, WAN Kaiyuan, et al. Visual saliency transformer[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 4702–4712. doi: 10.1109/ICCV48922.2021.00468.
    [14] YUN Yike and LIN Weisi. SelfReformer: Self-refined network with transformer for salient object detection[J]. arXiv: 2205.11283, 2022.
    [15] ZHU Lei, CHEN Jiaxing, HU Xiaowei, et al. Aggregating attentional dilated features for salient object detection[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2020, 30(10): 3358–3371. doi: 10.1109/TCSVT.2019.2941017.
    [16] XIE Enze, WANG Wenhai, YU Zhiding, et al. SegFormer: Simple and efficient design for semantic segmentation with transformers[C]. The 35th International Conference on Neural Information Processing Systems, 2021: 924.
    [17] WANG Libo, LI Rui, ZHANG Ce, et al. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2022, 190: 196–214. doi: 10.1016/j.isprsjprs.2022.06.008.
    [18] ZHOU Daquan, KANG Bingyi, JIN Xiaojie, et al. DeepViT: Towards deeper vision transformer[J]. arXiv: 2103.11886, 2021.
    [19] GAO Shanghua, CHENG Mingming, ZHAO Kai, et al. Res2Net: A new multi-scale backbone architecture[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(2): 652–662. doi: 10.1109/TPAMI.2019.2938758.
    [20] LIN Xian, YAN Zengqiang, DENG Xianbo, et al. ConvFormer: Plug-and-play CNN-style transformers for improving medical image segmentation[C]. The 26th International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, Canada, 2023: 642–651. doi: 10.1007/978-3-031-43901-8_61.
    [21] CHENG Bowen, MISRA I, SCHWING A G, et al. Masked-attention mask transformer for universal image segmentation[C]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 1280–1289. doi: 10.1109/CVPR52688.2022.00135.
    [22] ZHAO Jiaxing, LIU Jiangjiang, FAN Dengping, et al. EGNet: Edge guidance network for salient object detection[C]. 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 8778–8787. doi: 10.1109/ICCV.2019.00887.
    [23] LIU Jiangjiang, HOU Qibin, CHENG Mingming, et al. A simple pooling-based design for real-time salient object detection[C]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 3912–3921. doi: 10.1109/CVPR.2019.00404.
    [24] PANG Youwei, ZHAO Xiaoqi, ZHANG Lihe, et al. Multi-scale interactive network for salient object detection[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 9410–9419. doi: 10.1109/CVPR42600.2020.00943.
    [25] HU Xiaowei, FU C, ZHU Lei, et al. SAC-Net: Spatial attenuation context for salient object detection[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(3): 1079–1090. doi: 10.1109/TCSVT.2020.2995220.
    [26] ZHUGE Mingchen, FAN Dengping, LIU Nian, et al. Salient object detection via integrity learning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(3): 3738–3752. doi: 10.1109/TPAMI.2022.3179526.
    [27] WANG Yi, WANG Ruili, FAN Xin, et al. Pixels, regions, and objects: Multiple enhancement for salient object detection[C]. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 10031–10040. doi: 10.1109/CVPR52729.2023.00967.
    [28] LUO Ziyang, LIU Nian, ZHAO Wangbo, et al. VSCode: General visual salient and camouflaged object detection with 2D prompt learning[C]. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 17169–17180. doi: 10.1109/CVPR52733.2024.01625.