Volume 47, Issue 8, August 2025
Citation: ZHANG Mei, JIN Ye, ZHU Jinhui, HE Lin. FSG: Feature-level Semantic-aware Guidance for multi-modal Image Fusion Algorithm[J]. Journal of Electronics & Information Technology, 2025, 47(8): 2909-2918. doi: 10.11999/JEIT250042

FSG: Feature-level Semantic-aware Guidance for multi-modal Image Fusion Algorithm

doi: 10.11999/JEIT250042 cstr: 32379.14.JEIT250042
Funds:  The National Natural Science Foundation of China (62071184)
  • Received Date: 2025-01-16
  • Rev Recd Date: 2025-05-14
  • Available Online: 2025-05-22
  • Publish Date: 2025-08-27
Objective  Multimodal vision techniques offer greater advantages than unimodal ones in autonomous driving scenarios. Fused images from multiple modalities enhance the salient radiation information of targets while preserving background texture and detail. Furthermore, such fused images improve the performance of downstream visual tasks, such as semantic segmentation, compared with visible-light images alone, thereby enhancing the decision accuracy of automated driving systems. However, most existing fusion algorithms prioritize visual quality and standard evaluation metrics, often overlooking the requirements of downstream tasks. Although some approaches attempt to integrate task-specific guidance, they are constrained by weak interaction between semantic priors and the fusion process, and fail to address cross-modal feature variability. To address these limitations, this study proposes a multimodal image fusion algorithm, termed Feature-level Semantic-aware Guidance (FSG), which leverages feature-level semantic information from a segmentation network to guide the fusion process. The proposed method aims to enhance the utility of fused images in advanced vision tasks by strengthening the alignment between semantic understanding and feature integration.

Methods  The proposed algorithm adopts a parallel fusion framework that integrates a fusion network and a segmentation network. Feature-level semantic prior knowledge from the segmentation network guides the fusion process, aiming to enrich the semantic content of the fused image and improve performance in downstream visual tasks. The overall architecture comprises a fusion network, a segmentation network, and a feature interaction mechanism connecting the two. Infrared and visible images serve as inputs to the fusion network, whereas only visible images, which are rich in texture and detail, are used as inputs to the segmentation network. The fusion network uses a dual-branch structure for modality-specific feature extraction, with each branch containing two Adaptive Gabor convolution Residual (AGR) modules. A Multimodal Spatial Attention Fusion (MSAF) module is incorporated to effectively integrate features from the different modalities. In the reconstruction phase, semantic features from the segmentation network are combined with image features from the fusion network via a Dual Feature Interaction (DFI) module, enhancing semantic representation before generating the final fused image. (An illustrative sketch of this parallel framework is given after the abstract.)

Results and Discussions  This study includes fusion experiments and joint segmentation experiments. For the fusion experiments, the proposed method is compared with seven state-of-the-art algorithms (DenseFuse, DIDFuse, U2Fusion, TarDal, SeAFusion, DIVFusion, and CDDFuse) across three datasets: MFNet, M3FD, and RoadScene. Both subjective and objective evaluations are conducted. For subjective evaluation, the fused images generated by each method are visually compared. For objective evaluation, six metrics are employed: Mutual Information (MI), Visual Information Fidelity (VIF), Average Gradient (AG), Sum of Correlation Differences (SCD), Structural Similarity Index Measure (SSIM), and the gradient-based similarity measurement QAB/F. The results show that the proposed method performs consistently well across all datasets, effectively preserves complementary information from infrared and visible images, and achieves superior scores on all evaluation metrics. In the joint segmentation experiments, comparisons are made on the MFNet dataset. Subjective evaluation is presented through semantic segmentation visualizations, and objective evaluation uses the Intersection over Union (IoU) and mean IoU (mIoU) metrics (sketches of the standard MI and mIoU computations also follow the abstract). The segmentation results produced by the proposed method more closely resemble the ground-truth labels and achieve the highest or second-highest IoU scores across all classes. Overall, the proposed method not only yields improved visual fusion results but also demonstrates clear advantages in downstream segmentation performance.

Conclusions  This study proposes an FSG strategy for multimodal image fusion networks, designed to fully leverage semantic information to improve the utility of fused images in downstream visual tasks. The method accounts for the variability among heterogeneous features and integrates the segmentation and fusion networks into a unified framework. By incorporating feature-level semantic information, the approach enhances the quality of the fused images and strengthens their performance in segmentation tasks. The proposed DFI module serves as a bridge between the segmentation and fusion networks, enabling effective interaction and selection of semantic and image features. This reduces the influence of feature variability and enriches the semantic content of the fusion results. In addition, the proposed MSAF module promotes the complementarity and integration of features from the infrared and visible modalities while mitigating the disparity between them. Experimental results demonstrate that the proposed method not only achieves superior visual fusion quality but also outperforms existing methods in joint segmentation performance.
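The abstract describes the FSG architecture only at a high level. The following is a minimal, illustrative PyTorch sketch of the described parallel fusion/segmentation layout, not the authors' implementation: the AGR modules are stood in for by plain residual blocks, MSAF and DFI are reduced to a simple spatial-attention weighting and a gated feature mix, and the segmentation branch is a toy encoder rather than the actual backbone. All class names, channel widths, and the nine-class output are assumptions.

# --- Illustrative sketch (not the paper's code): parallel fusion/segmentation framework ---
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Plain residual conv block, standing in for the paper's AGR module."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return torch.relu(x + self.body(x))

class MSAF(nn.Module):
    """Simplified multimodal spatial-attention fusion: per-pixel softmax weights."""
    def __init__(self, ch):
        super().__init__()
        self.att = nn.Conv2d(2 * ch, 2, 3, padding=1)
    def forward(self, f_ir, f_vis):
        w = torch.softmax(self.att(torch.cat([f_ir, f_vis], dim=1)), dim=1)
        return w[:, :1] * f_ir + w[:, 1:] * f_vis

class DFI(nn.Module):
    """Simplified dual feature interaction: gate semantic features, mix with image features."""
    def __init__(self, ch):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * ch, ch, 1), nn.Sigmoid())
        self.proj = nn.Conv2d(2 * ch, ch, 1)
    def forward(self, f_img, f_sem):
        g = self.gate(torch.cat([f_img, f_sem], dim=1))
        return self.proj(torch.cat([f_img, g * f_sem], dim=1))

class FSGSketch(nn.Module):
    """Dual-branch fusion network plus a toy segmentation branch, coupled through DFI."""
    def __init__(self, ch=32, n_classes=9):
        super().__init__()
        self.enc_ir  = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1),
                                     ResidualBlock(ch), ResidualBlock(ch))
        self.enc_vis = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1),
                                     ResidualBlock(ch), ResidualBlock(ch))
        self.msaf = MSAF(ch)
        self.seg_enc  = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), ResidualBlock(ch))
        self.seg_head = nn.Conv2d(ch, n_classes, 1)   # auxiliary segmentation output
        self.dfi = DFI(ch)
        self.decoder = nn.Sequential(ResidualBlock(ch), nn.Conv2d(ch, 1, 3, padding=1))
    def forward(self, ir, vis):
        fused_feat = self.msaf(self.enc_ir(ir), self.enc_vis(vis))   # cross-modal fusion
        sem_feat   = self.seg_enc(vis)                               # feature-level semantic prior
        seg_logits = self.seg_head(sem_feat)
        fused_img  = torch.sigmoid(self.decoder(self.dfi(fused_feat, sem_feat)))
        return fused_img, seg_logits

if __name__ == "__main__":
    ir, vis = torch.rand(1, 1, 128, 128), torch.rand(1, 3, 128, 128)
    fused, logits = FSGSketch()(ir, vis)
    print(fused.shape, logits.shape)   # (1, 1, 128, 128) and (1, 9, 128, 128)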
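Among the six objective fusion metrics, Mutual Information (MI) [15] is conceptually the simplest; a short NumPy sketch of its standard histogram-based form is given below for reference. This is the common formulation, not the evaluation code used in the paper, and the 256-bin setting is an assumption.

# --- Illustrative sketch: histogram-based Mutual Information (MI) fusion metric ---
import numpy as np

def mutual_information(img_a, img_b, bins=256):
    """MI between two images from their joint grey-level histogram (in bits)."""
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)          # marginal of img_a
    p_y = p_xy.sum(axis=0, keepdims=True)          # marginal of img_b
    nz = p_xy > 0                                  # ignore empty histogram cells
    return float((p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz])).sum())

if __name__ == "__main__":
    ir, vis, fused = (np.random.rand(256, 256) for _ in range(3))
    # The fusion MI metric is commonly reported as MI(fused, ir) + MI(fused, vis).
    print(mutual_information(fused, ir) + mutual_information(fused, vis))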
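The joint segmentation evaluation relies on the standard per-class IoU and mIoU definitions. A minimal NumPy sketch of that computation is shown below; it follows the usual confusion-matrix formulation rather than any code released with the paper, and the nine-class label set is assumed from the MFNet dataset.

# --- Illustrative sketch: per-class IoU and mIoU from label maps ---
import numpy as np

def confusion_matrix(pred, gt, n_classes):
    """Accumulate an n_classes x n_classes confusion matrix (rows: ground truth)."""
    valid = (gt >= 0) & (gt < n_classes)
    idx = n_classes * gt[valid].astype(int) + pred[valid].astype(int)
    return np.bincount(idx, minlength=n_classes ** 2).reshape(n_classes, n_classes)

def iou_per_class(conf):
    """IoU_c = TP_c / (TP_c + FP_c + FN_c) for every class c."""
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    return tp / np.maximum(tp + fp + fn, 1)        # guard against empty classes

if __name__ == "__main__":
    n_classes = 9                                  # MFNet-style label set (assumed)
    gt   = np.random.randint(0, n_classes, size=(480, 640))
    pred = np.random.randint(0, n_classes, size=(480, 640))
    iou = iou_per_class(confusion_matrix(pred, gt, n_classes))
    print("per-class IoU:", np.round(iou, 3), "mIoU:", float(iou.mean()))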
References
[1] LI Hui and WU Xiaojun. DenseFuse: A fusion approach to infrared and visible images[J]. IEEE Transactions on Image Processing, 2019, 28(5): 2614–2623. doi: 10.1109/TIP.2018.2887342.
[2] ZHAO Zixiang, XU Shuang, ZHANG Chunxia, et al. DIDFuse: Deep image decomposition for infrared and visible image fusion[C]. The Twenty-Ninth International Joint Conference on Artificial Intelligence, Yokohama, Japan, 2020: 970–976. doi: 10.24963/ijcai.2020/135.
[3] XU Han, MA Jiayi, JIANG Junjun, et al. U2Fusion: A unified unsupervised image fusion network[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(1): 502–518. doi: 10.1109/TPAMI.2020.3012548.
[4] TANG Linfeng, XIANG Xinyu, ZHANG Hao, et al. DIVFusion: Darkness-free infrared and visible image fusion[J]. Information Fusion, 2023, 91: 477–493. doi: 10.1016/j.inffus.2022.10.034.
[5] ZHAO Zixiang, BAI Haowen, ZHANG Jiangshe, et al. CDDFuse: Correlation-driven dual-branch feature decomposition for multi-modality image fusion[C]. The 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 5906–5916. doi: 10.1109/CVPR52729.2023.00572.
[6] YANG Shen, TIAN Lifan, LIANG Jiaming, et al. Infrared and visible image fusion based on improved dual path generation adversarial network[J]. Journal of Electronics & Information Technology, 2023, 45(8): 3012–3021. doi: 10.11999/JEIT220819.
[7] LIU Xiangzeng, GAO Haojie, MIAO Qiguang, et al. MFST: Multi-modal feature self-adaptive transformer for infrared and visible image fusion[J]. Remote Sensing, 2022, 14(13): 3233. doi: 10.3390/rs14133233.
[8] LIU Qiao, PI Jiatian, GAO Peng, et al. STFNet: Self-supervised transformer for infrared and visible image fusion[J]. IEEE Transactions on Emerging Topics in Computational Intelligence, 2024, 8(2): 1513–1526. doi: 10.1109/TETCI.2024.3352490.
[9] LIU Jinyuan, FAN Xin, HUANG Zhanbo, et al. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection[C]. The 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 5792–5801. doi: 10.1109/CVPR52688.2022.00571.
[10] TANG Linfeng, YUAN Jiteng, and MA Jiayi. Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network[J]. Information Fusion, 2022, 82: 28–42. doi: 10.1016/j.inffus.2021.12.004.
[11] SUN Ke, XIAO Bin, LIU Dong, et al. Deep high-resolution representation learning for human pose estimation[C]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, USA, 2019: 5686–5696. doi: 10.1109/CVPR.2019.00584.
[12] CORDTS M, OMRAN M, RAMOS S, et al. The Cityscapes dataset for semantic urban scene understanding[C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA, 2016: 3213–3223. doi: 10.1109/CVPR.2016.350.
[13] HA Qishen, WATANABE K, KARASAWA T, et al. MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes[C]. 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, Canada, 2017: 5108–5115. doi: 10.1109/IROS.2017.8206396.
[14] XU Han, MA Jiayi, LE Zhuliang, et al. FusionDN: A unified densely connected network for image fusion[C]. The 34th AAAI Conference on Artificial Intelligence, New York, USA, 2020: 12484–12491. doi: 10.1609/aaai.v34i07.6936.
[15] QU Guihong, ZHANG Dali, and YAN Pingfan. Information measure for performance of image fusion[J]. Electronics Letters, 2002, 38(7): 313–315. doi: 10.1049/el:20020212.
[16] HAN Yu, CAI Yunze, CAO Yin, et al. A new image fusion performance metric based on visual information fidelity[J]. Information Fusion, 2013, 14(2): 127–135. doi: 10.1016/j.inffus.2011.08.002.
[17] CUI Guangmang, FENG Huajun, XU Zhihai, et al. Detail preserved fusion of visible and infrared images using regional saliency extraction and multi-scale image decomposition[J]. Optics Communications, 2015, 341: 199–209. doi: 10.1016/j.optcom.2014.12.032.
[18] ASLANTAS V and BENDES E. A new image quality metric for image fusion: The sum of the correlations of differences[J]. AEU - International Journal of Electronics and Communications, 2015, 69(12): 1890–1896. doi: 10.1016/j.aeue.2015.09.004.
[19] WANG Zhou, BOVIK A C, SHEIKH H R, et al. Image quality assessment: From error visibility to structural similarity[J]. IEEE Transactions on Image Processing, 2004, 13(4): 600–612. doi: 10.1109/TIP.2003.819861.
[20] BONDŽULIĆ B and PETROVIĆ V. Objective image fusion performance measures[J]. Vojnotehnički Glasnik, 2008, 56(2): 181–193. doi: 10.5937/vojtehg0802181B.