Semantic-guided Unified Multi-scale Deep Unrolling Network for Pansharpening
Abstract: Existing deep learning-based remote sensing image fusion methods are typically trained on a single specific dataset, which limits their generalization ability and makes them difficult to apply in practical multi-satellite scenarios. To address this, this paper proposes a Semantic-guided Unified Multi-scale Deep Unrolling Network (SUM-DUN). The network is designed from the optimization formulation of the classical fusion problem and adopts a 3D architecture to accommodate multispectral (MS) inputs with different numbers of bands. By introducing a hierarchical multi-scale feature processing mechanism, SUM-DUN effectively extracts and fuses features at different levels. More importantly, to achieve unified, all-in-one fusion, a multimodal large language model is introduced to generate general semantic textual prompts from the input low-resolution multispectral (LRMS) and panchromatic (PAN) images, dynamically guiding the network to adaptively select optimal feature propagation strategies. Experiments on multiple satellites show that the proposed method yields significant improvements in both subjective visual quality and objective evaluation metrics across several datasets.
Objective  With the rapid advancement of satellite imaging technologies, the demand for high-resolution multispectral remote sensing imagery has grown substantially across a wide range of applications. Because of the wide variety of satellite platforms, there is a significant domain shift across datasets collected from different satellites. As a result, most existing deep learning (DL)-based pansharpening methods are trained individually for each satellite dataset and consequently exhibit limited generalization capability across satellites. To address these limitations, this study proposes a Semantic-guided Unified Multi-scale Deep Unrolling Network (SUM-DUN), which is designed based on classical optimization theory and adopts a 3D multi-scale deep unfolding architecture for integrated feature extraction and fusion. Leveraging multimodal large language models (MLLMs), the proposed method derives semantic textual prompts from the input images, which direct the model to adaptively adjust its feature representations and thereby enhance fusion quality. The proposed method aims to achieve unified remote sensing image fusion through a tailored network architecture and prompt-guided mechanisms, thereby providing reliable support for high-level image interpretation tasks.

Methods  Following the Maximum A Posteriori (MAP) estimation principle, the optimization process for high-resolution multispectral (HRMS) image recovery is unfolded into the proposed SUM-DUN (Fig. 1). Each iteration stage of SUM-DUN consists of two main modules: a Gradient Descent Module (GDM) and a Semantic-guided Proximal Mapping Network (SPMN), which approximate the operations in Eq. (5) and Eq. (6), respectively. The GDM performs a gradient descent update based on the current feature estimate and the degradation model. The SPMN, implemented with a Transformer-based architecture as illustrated in Fig. 2(b), incorporates semantic textual prompts generated from the input image pair by MLLMs.
These prompts guide the network to adaptively select appropriate feature propagation strategies for the current image pair, helping suppress noise and mitigate discrepancies across different satellite sensors. Moreover, leveraging upsampling and downsampling operations, the network transmits MS and PAN features between iterative stages, thereby progressively preserving and enhancing multi-scale spatial and spectral information throughout the unfolding process.

Results and Discussions  To demonstrate the effectiveness of the proposed method, we compare it against seven representative baselines, including two traditional methods (BDSD and PRACS) and five DL-based methods (AWFLN, FusionMamba, PanMamba, WFANet, and TMDiff). For the reduced-resolution evaluation, where ground-truth HRMS images are available, we adopt several widely used reference-based metrics, including the Spectral Angle Mapper (SAM), Spatial Correlation Coefficient (SCC), Peak Signal-to-Noise Ratio (PSNR), Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS), Averaged Universal Image Quality Index (QAVE), and the Universal Image Quality Index for 4-band and 8-band images ($Q_4$/$Q_8$). These metrics jointly evaluate spectral fidelity, spatial consistency, and overall image quality. For the full-resolution evaluation, where ground-truth HRMS images are unavailable, we rely on no-reference quality indices. Specifically, we employ the Hybrid Quality with No Reference (HQNR) metric, along with its spectral distortion component $D_\lambda$ and spatial distortion component $D_S$, to assess fusion quality in real-world scenarios. Quantitative evaluations on the GF-1, QB, WV-2, and WV-4 test datasets demonstrate that the proposed method consistently achieves either the best or second-best performance across all metrics, under both reduced-resolution and full-resolution settings (Tables 2 and 3).
These results indicate that the proposed method simultaneously preserves spectral fidelity and spatial consistency, while maintaining robust performance across different satellites and remaining effective in more challenging scenarios. The ablation studies validate the effectiveness of the 3D architecture, the multi-scale network design, and the spatial-channel prompt guidance mechanism, as removing or altering any of these components leads to varying degrees of performance degradation (Tables 4 and 5).

Conclusions  This study proposes a semantic-guided unified multi-scale deep unfolding method for pansharpening, which leverages semantic prompts generated by an MLLM to facilitate efficient and unified fusion of images from different satellites. The proposed approach is built upon a deep unfolding framework and employs a 3D convolutional architecture to accommodate varying numbers of spectral bands across satellite datasets. A multi-scale network design is further incorporated to extract spatial and spectral features at different levels, thereby enhancing the fusion capability. In addition, a semantic prompt integration module is introduced to adaptively route spatial and channel features based on the extracted semantic information, enabling more effective feature propagation and improving both spatial detail reconstruction and spectral consistency. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance in terms of both visual quality and quantitative evaluation metrics.
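The alternating scheme described in the Methods section, a data-fidelity gradient step followed by a proximal mapping, unfolded for a fixed number of stages, can be sketched in a few lines. The following is a toy NumPy illustration, not the paper's implementation: a generic linear operator `D` stands in for the spatial-spectral degradation model, soft-thresholding stands in for the learned SPMN, and the step size `eta` and threshold `tau` are illustrative values.

```python
import numpy as np

def gradient_step(x, y, D, eta):
    """Data-fidelity update x - eta * D^T (D x - y), the role played by the GDM."""
    return x - eta * D.T @ (D @ x - y)

def soft_threshold(x, tau):
    """Stand-in proximal operator for a sparsity prior, the role played by the SPMN."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def unrolled_recover(y, D, stages=8, eta=0.2, tau=0.001):
    """Unfold 'stages' iterations of gradient step + proximal mapping."""
    x = D.T @ y  # coarse initial estimate, analogous to the upsampled LRMS image
    for _ in range(stages):
        x = gradient_step(x, y, D, eta)   # Eq. (5) analogue
        x = soft_threshold(x, tau)        # Eq. (6) analogue
    return x
```

In SUM-DUN each stage replaces the fixed soft-thresholding with a learned, prompt-conditioned network, so the "prior" itself adapts to the input satellite pair.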
Table 1  Datasets used in the experiments

| Satellite | MS resolution (m) | PAN resolution (m) | MS size | PAN size | Training | Validation | Reduced-res. test | Full-res. test |
|---|---|---|---|---|---|---|---|---|
| GF-1 | 0.8 | 2.0 | 32×32×4 | 128×128 | 1386 | 154 | 100 | 64 |
| QB | 2.44 | 0.61 | 32×32×4 | 128×128 | 1710 | 190 | 100 | 64 |
| WV-4 | 1.2 | 0.3 | 32×32×8 | 128×128 | 1710 | 190 | 100 | 64 |
| WV-2 | 2.0 | 0.5 | 32×32×4 | 128×128 | 1710 | 190 | 100 | 64 |

Table 2  Quantitative comparison on the GF-1 test dataset ($Q_4$ through PSNR: reduced resolution; $D_\lambda$, $D_S$, HQNR: full resolution)

| Method | $Q_4$ | QAVE | SAM (rad) | ERGAS | SCC | PSNR (dB) | $D_\lambda$ | $D_S$ | HQNR |
|---|---|---|---|---|---|---|---|---|---|
| BDSD | 0.7526 | 0.7874 | 1.9305 | 1.5777 | 0.9446 | 39.6160 | 0.0415 | 0.0468 | 0.9139 |
| PRACS | 0.7314 | 0.7622 | 1.8491 | 1.5341 | 0.9377 | 39.4899 | 0.0632 | 0.1679 | 0.7810 |
| AWFLN | 0.9292 | 0.9394 | 0.6171 | 0.6377 | 0.9909 | 49.7031 | 0.0145 | 0.1600 | 0.8280 |
| FusionMamba | 0.9199 | 0.9368 | 0.6155 | 0.6418 | 0.9918 | 49.9123 | 0.0180 | 0.1605 | 0.8246 |
| PanMamba | 0.9502 | 0.9572 | 0.4932 | 0.5391 | 0.9940 | 51.4618 | 0.0156 | 0.1632 | 0.8239 |
| WFANet | 0.9443 | 0.9550 | 0.4729 | 0.5290 | 0.9947 | 51.9493 | 0.0115 | 0.1599 | 0.8307 |
| TMDiff | 0.9350 | 0.9410 | 0.5481 | 0.6313 | 0.9924 | 50.4383 | 0.0305 | 0.1939 | 0.7818 |
| Proposed | 0.9539 | 0.9614 | 0.4366 | 0.5104 | 0.9953 | 52.5321 | 0.0103 | 0.0816 | 0.9089 |

Table 3  Quantitative comparison on the WV-2 test dataset ($Q_8$ through PSNR: reduced resolution; $D_\lambda$, $D_S$, HQNR: full resolution)

| Method | $Q_8$ | QAVE | SAM (rad) | ERGAS | SCC | PSNR (dB) | $D_\lambda$ | $D_S$ | HQNR |
|---|---|---|---|---|---|---|---|---|---|
| BDSD | 0.6914 | 0.7031 | 4.9777 | 4.3408 | 0.9434 | 36.7286 | 0.0525 | 0.1465 | 0.8078 |
| PRACS | 0.6624 | 0.6677 | 4.7022 | 4.8873 | 0.9121 | 35.9620 | 0.0147 | 0.1145 | 0.8727 |
| AWFLN | 0.9133 | 0.9143 | 0.8205 | 0.6002 | 0.9914 | 49.1352 | 0.0191 | 0.0933 | 0.8893 |
| FusionMamba | 0.9082 | 0.9120 | 0.8584 | 0.6179 | 0.9917 | 49.0684 | 0.0215 | 0.0635 | 0.9164 |
| PanMamba | 0.7613 | 0.7606 | 2.7455 | 2.5952 | 0.9839 | 41.0233 | 0.0533 | 0.0897 | 0.8621 |
| WFANet | 0.7608 | 0.7599 | 2.7611 | 2.6042 | 0.9838 | 40.9669 | 0.0457 | 0.0821 | 0.8762 |
| TMDiff | 0.7561 | 0.7548 | 2.7895 | 2.6382 | 0.9831 | 40.8775 | 0.0567 | 0.0533 | 0.8931 |
| Proposed | 0.7675 | 0.7673 | 2.6959 | 2.4934 | 0.9855 | 41.3410 | 0.0154 | 0.0288 | 0.9562 |

Table 4  PSNR results (dB) of the network-architecture ablation study

| No. | Degradation operator | Proximal network | Multi-scale | GF-1 | QB | WV-2 | WV-4 |
|---|---|---|---|---|---|---|---|
| Ⅰ | 2D | 2D | √ | 52.4677 | 50.0497 | 40.9120 | 43.5865 |
| Ⅱ | 3D | 2D | √ | 52.5387 | 50.0891 | 40.9077 | 43.7566 |
| Ⅲ | 3D | 3D | √ | 52.3679 | 50.1103 | 41.2707 | 43.9447 |
| Ⅳ | 3D | 3D | × | 51.7431 | 50.0722 | 41.2601 | 43.7500 |

Table 5  PSNR results (dB) of the prompt-guidance mechanism ablation study

| No. | Method | GF-1 | QB | WV-2 | WV-4 |
|---|---|---|---|---|---|
| Ⅰ | Channel | 52.3885 | 50.1519 | 41.2963 | 44.0624 |
| Ⅱ | Spatial | 52.4803 | 50.1578 | 41.3295 | 44.1117 |
| Ⅲ | Spatial-channel | 52.5321 | 50.2629 | 41.3410 | 44.2020 |
| Ⅳ | Cross-attention | 51.3079 | 49.7873 | 41.0228 | 43.5290 |
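Among the reduced-resolution metrics reported in the tables above, the Spectral Angle Mapper is the most self-contained to state precisely: it is the mean per-pixel angle between the spectral vectors of the fused and reference images. The sketch below is a minimal NumPy illustration of that conventional definition (in radians), not the evaluation code used for these tables; implementations differ in averaging and unit conventions.

```python
import numpy as np

def sam(ref, fus, eps=1e-12):
    """Spectral Angle Mapper in radians, averaged over all pixels.

    ref, fus: arrays of shape (H, W, B) holding the reference HRMS image
    and the fused image, with B spectral bands.
    """
    ref = ref.reshape(-1, ref.shape[-1]).astype(np.float64)
    fus = fus.reshape(-1, fus.shape[-1]).astype(np.float64)
    dot = np.sum(ref * fus, axis=1)
    norms = np.linalg.norm(ref, axis=1) * np.linalg.norm(fus, axis=1)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)  # guard against rounding
    return float(np.mean(np.arccos(cos)))
```

Because SAM compares directions of spectral vectors, it is invariant to per-pixel intensity scaling and isolates spectral distortion from brightness errors.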
References
[1] THOMAS C, RANCHIN T, WALD L, et al. Synthesis of multispectral images to high spatial resolution: A critical review of fusion methods based on remote sensing physics[J]. IEEE Transactions on Geoscience and Remote Sensing, 2008, 46(5): 1301–1312. doi: 10.1109/TGRS.2007.912448.
[2] JIN Jing and WANG Feng. A distributed multi-satellite collaborative framework for remote sensing scene classification[J]. Journal of Electronics & Information Technology, 2025, 47(12): 4677–4688. doi: 10.11999/JEIT250866.
[3] WEN Hongli, HU Qinghao, HUANG Liwei, et al. Few-shot remote sensing image classification based on parameter-efficient vision transformer and multimodal guidance[J]. Journal of Electronics & Information Technology, 2025, 47(12): 4689–4703. doi: 10.11999/JEIT250996.
[4] HAN Wenqi, JIANG Wen, GENG Jie, et al. PATC: Prototype alignment and topology-consistent pseudo-supervision for multimodal semi-supervised semantic segmentation of remote sensing images[J]. Journal of Electronics & Information Technology, 2025, 47(12): 4714–4727. doi: 10.11999/JEIT251115.
[5] ZENG Delu, HU Yuwen, HUANG Yue, et al. Pan-sharpening with structural consistency and ℓ1/2 gradient prior[J]. Remote Sensing Letters, 2016, 7(12): 1170–1179. doi: 10.1080/2150704X.2016.1222098.
[6] WU Zhongcheng, HUANG Tingzhu, DENG Liangjian, et al. VO+Net: An adaptive approach using variational optimization and deep learning for panchromatic sharpening[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5401016. doi: 10.1109/TGRS.2021.3066425.
[7] LU Hangyuan, YANG Yong, HUANG Shuying, et al. AWFLN: An adaptive weighted feature learning network for pansharpening[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5400815. doi: 10.1109/TGRS.2023.3241643.
[8] XIE Xinyu, CUI Yawen, TAN Tao, et al. FusionMamba: Dynamic feature enhancement for multimodal image fusion with Mamba[J]. Visual Intelligence, 2024, 2(1): 37. doi: 10.1007/s44267-024-00072-9.
[9] HE Xuanhua, CAO Ke, ZHANG Jie, et al. Pan-Mamba: Effective pan-sharpening with state space model[J]. Information Fusion, 2025, 115: 102779. doi: 10.1016/j.inffus.2024.102779.
[10] HUANG Jie, HUANG Rui, XU Jinghao, et al. Wavelet-assisted multi-frequency attention network for pansharpening[C]. Proceedings of the 39th AAAI Conference on Artificial Intelligence, Philadelphia, USA, 2025: 3662–3670. doi: 10.1609/aaai.v39i4.32381.
[11] JIA Menglin, TANG Luming, CHEN B C, et al. Visual prompt tuning[C]. 17th European Conference on Computer Vision, Tel Aviv, Israel, 2022: 709–727. doi: 10.1007/978-3-031-19827-4_41.
[12] NIE Xing, NI Bolin, CHANG Jianlong, et al. Pro-tuning: Unified prompt tuning for vision tasks[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(6): 4653–4667. doi: 10.1109/TCSVT.2023.3327605.
[13] CUI Yuning, ZAMIR S W, KHAN S, et al. AdaIR: Adaptive all-in-one image restoration via frequency mining and modulation[C]. The Thirteenth International Conference on Learning Representations, Singapore, 2025.
[14] ZENG Haijin, WANG Xiangming, CHEN Yongyong, et al. Vision-language gradient descent-driven all-in-one deep unfolding networks[C]. Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2025: 7524–7533. doi: 10.1109/CVPR52734.2025.00705.
[15] YANG Zhiwen, CHEN Haowei, QIAN Ziniu, et al. All-in-one medical image restoration via task-adaptive routing[C]. 27th International Conference on Medical Image Computing and Computer Assisted Intervention, Marrakesh, Morocco, 2024: 67–77. doi: 10.1007/978-3-031-72104-5_7.
[16] XING Yinghui, QU Litao, ZHANG Shizhou, et al. Empower generalizability for pansharpening through text-modulated diffusion model[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 5633812. doi: 10.1109/TGRS.2024.3434431.
[17] LI Xueheng, HE Xuanhua, CAO Ke, et al. Exploring text-guided information fusion through chain-of-reasoning for pansharpening[J]. IEEE Transactions on Geoscience and Remote Sensing, 2025, 63: 5407314. doi: 10.1109/TGRS.2025.3604447.
[18] FANG Shijie and GAN Hongping. SSUN-net: Spatial-spectral prior-aware unfolding network for pan-sharpening[C]. Proceedings of the 39th AAAI Conference on Artificial Intelligence, Philadelphia, USA, 2025: 2897–2905. doi: 10.1609/aaai.v39i3.32296.
[19] YAN Qiuhai, JIANG Aiwen, CHEN Kang, et al. Textual prompt guided image restoration[J]. Engineering Applications of Artificial Intelligence, 2025, 155: 110981. doi: 10.1016/j.engappai.2025.110981.
[20] CONDE M V, GEIGLE G, and TIMOFTE R. InstructIR: High-quality image restoration following human instructions[C]. 18th European Conference on Computer Vision, Milan, Italy, 2024: 1–21. doi: 10.1007/978-3-031-72764-1_1.
[21] ZENG Aohan, XU Bin, WANG Bowen, et al. ChatGLM: A family of large language models from GLM-130B to GLM-4 all tools[EB/OL]. https://arxiv.org/abs/2406.12793, 2024.
[22] XIAO Shitao, LIU Zheng, ZHANG Peitian, et al. C-pack: Packed resources for general Chinese embeddings[C]. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, USA, 2024: 641–649. doi: 10.1145/3626772.3657878.
[23] MENG Xiangchao, XIONG Yiming, SHAO Feng, et al. A large-scale benchmark data set for evaluating pansharpening performance: Overview and implementation[J]. IEEE Geoscience and Remote Sensing Magazine, 2021, 9(1): 18–52. doi: 10.1109/MGRS.2020.2976696.
[24] WALD L, RANCHIN T, and MANGOLINI M. Fusion of satellite images of different spatial resolutions: Assessing the quality of resulting images[J]. Photogrammetric Engineering and Remote Sensing, 1997, 63(6): 691–699.
[25] GARZELLI A, NENCINI F, and CAPOBIANCO L. Optimal MMSE pan sharpening of very high resolution multispectral images[J]. IEEE Transactions on Geoscience and Remote Sensing, 2008, 46(1): 228–236. doi: 10.1109/TGRS.2007.907604.
[26] CHOI J, YU Kiyun, and KIM Y. A new adaptive component-substitution-based satellite image fusion by using partial replacement[J]. IEEE Transactions on Geoscience and Remote Sensing, 2011, 49(1): 295–309. doi: 10.1109/TGRS.2010.2051674.
[27] VIVONE G, ALPARONE L, CHANUSSOT J, et al. A critical comparison among pansharpening algorithms[J]. IEEE Transactions on Geoscience and Remote Sensing, 2015, 53(5): 2565–2586. doi: 10.1109/TGRS.2014.2361734.
[28] ZHOU Jie, CIVCO D L, and SILANDER J A. A wavelet transform method to merge Landsat TM and SPOT panchromatic data[J]. International Journal of Remote Sensing, 1998, 19(4): 743–757. doi: 10.1080/014311698215973.
[29] WALD L. Data Fusion. Definitions and Architectures - Fusion of Images of Different Spatial Resolutions[M]. Paris, France: Presses de l'École des Mines de Paris, 2002: 165–189.
[30] WANG Zhou and BOVIK A C. A universal image quality index[J]. IEEE Signal Processing Letters, 2002, 9(3): 81–84. doi: 10.1109/97.995823.
[31] GARZELLI A and NENCINI F. Hypercomplex quality assessment of multi/hyperspectral images[J]. IEEE Geoscience and Remote Sensing Letters, 2009, 6(4): 662–665. doi: 10.1109/LGRS.2009.2022650.
[32] ARIENZO A, VIVONE G, GARZELLI A, et al. Full-resolution quality assessment of pansharpening: Theoretical and hands-on approaches[J]. IEEE Geoscience and Remote Sensing Magazine, 2022, 10(3): 168–201. doi: 10.1109/MGRS.2022.3170092.