Semantic-guided Unified Multi-scale Deep Unrolling Network for Pansharpening
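Abstract: Existing deep learning-based remote sensing image fusion methods usually rely on training with a specific dataset, which limits their generalization and makes them hard to apply in practical multi-satellite scenarios. To address this, this paper proposes a Semantic-guided Unified Multi-scale Deep Unrolling Network (SUM-DUN). The network is designed from the optimization-based solution of the classical fusion problem and adopts a 3D architecture to accommodate multispectral (MS) image inputs with different numbers of bands. By introducing a hierarchical multi-scale feature processing mechanism, SUM-DUN effectively extracts and fuses features at different levels. More importantly, to achieve unified fusion, a multimodal large language model is introduced to generate general semantic text prompts from the input low-resolution multispectral (LRMS) and panchromatic (PAN) images, and these prompts dynamically guide the network to adaptively select the optimal feature propagation strategy. Multi-satellite experiments show that the proposed method significantly improves both subjective visual quality and objective evaluation metrics on multiple datasets.

Extended Abstract: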
Objective: With the rapid development of satellite imaging technology, demand has increased for high-resolution multispectral remote sensing images in a wide range of applications. However, satellite platforms differ in sensor parameters and imaging conditions, which leads to clear domain shifts among datasets acquired by different satellites. Most existing Deep Learning (DL)-based pansharpening methods are therefore trained separately on individual satellite datasets and have limited cross-satellite generalization. To address this limitation, this study proposes a Semantic-guided Unified Multi-scale Deep Unrolling Network (SUM-DUN). SUM-DUN is designed based on classical optimization theory and adopts a three-dimensional (3D) multi-scale deep unrolling architecture for unified feature extraction and fusion. Multimodal Large Language Models (MLLMs) are used to generate semantic text prompts from the input images. These prompts guide the model to adaptively adjust feature representations and improve fusion quality. The proposed method aims to support unified remote sensing image fusion through a tailored network architecture and a prompt-guided mechanism, thereby providing reliable data for high-level image interpretation tasks.

Methods: Following the Maximum A Posteriori (MAP) estimation principle, the optimization process for High-Resolution Multispectral (HRMS) image recovery is unfolded into the proposed SUM-DUN (Fig. 1). Each iterative stage of SUM-DUN contains two main modules: a Gradient Descent Module (GDM) and a Semantic-guided Proximal Mapping Network (SPMN). These modules approximate the operations in Eq. (5) and Eq. (6), respectively. GDM performs a gradient descent update based on the current feature estimate and the degradation model. SPMN is implemented using a Transformer-based architecture, as shown in Fig. 2(b), and incorporates semantic text prompts generated from each input image pair by MLLMs.
These prompts guide the network to select suitable feature propagation strategies for the current image pair. This process helps suppress noise and reduce discrepancies among different satellite sensors. Through upsampling and downsampling operations, the network also transmits multispectral (MS) and panchromatic (PAN) features across iterative stages. Thus, multi-scale spatial and spectral information is progressively preserved and enhanced during the deep unrolling process.

Results and Discussions: The proposed method is compared with seven representative baselines: two traditional methods (BDSD and PRACS) and five DL-based methods (AWFLN, FusionMamba, PanMamba, WFANet, and TMDiff). In the reduced-resolution evaluation, ground-truth HRMS images are available, so several widely used reference-based metrics are adopted: Spectral Angle Mapper (SAM), Spatial Correlation Coefficient (SCC), Peak Signal-to-Noise Ratio (PSNR), Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS), Averaged Universal Image Quality Index (QAVE), and the Universal Image Quality Index for 4-band and 8-band images (Q4/Q8). These metrics jointly assess spectral fidelity, spatial consistency, and overall image quality. In the full-resolution evaluation, ground-truth HRMS images are unavailable, so no-reference quality indices are used instead. Specifically, Hybrid Quality with No Reference (HQNR), its spectral distortion component ($D_\lambda$), and its spatial distortion component ($D_S$) are used to assess fusion quality in real-world scenarios. Quantitative results on the GF-1, QB, WV-2, and WV-4 test datasets show that the proposed method consistently achieves the best or second-best performance across all metrics under both reduced-resolution and full-resolution settings (Tables 2 and 3).
These results indicate that the proposed method can preserve spectral fidelity and spatial consistency while maintaining robust performance across different satellites and challenging imaging conditions. Ablation studies further validate the effectiveness of the 3D architecture, the multi-scale network design, and the spatial-channel prompt guidance mechanism. Removing or modifying any of these components degrades performance to varying degrees (Tables 4 and 5).

Conclusions: This study proposes a semantic-guided unified multi-scale deep unrolling method for pansharpening. The method uses semantic prompts generated by an MLLM to support efficient and unified fusion of images from different satellites. The proposed approach is built on a deep unrolling framework and uses a 3D convolutional architecture to process satellite datasets with different numbers of spectral bands. A multi-scale network design is further used to extract spatial and spectral features at different levels, thereby improving fusion performance. In addition, a Semantic Prompt Integration Module (SPIM) is designed to adaptively route spatial and channel features based on semantic information. SPIM enables more effective feature propagation and improves both spatial detail reconstruction and spectral consistency. Extensive experiments show that the proposed method achieves state-of-the-art performance in visual quality and quantitative evaluation.
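The alternation between a gradient descent step and a proximal step that a deep unrolling network unfolds can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: signals are 1-D lists per band, the degradation operator is block averaging (`degrade`), the spectral response is a plain band average (`to_pan`), and a soft-thresholding function (`proximal_step`) stands in for the learned, prompt-guided proximal network. Eq. (5) and Eq. (6) of the paper are not reproduced here.

```python
import math

def degrade(x, r):
    """Block-average every band by factor r (toy stand-in for blur + downsampling)."""
    return [[sum(band[i:i + r]) / r for i in range(0, len(band), r)] for band in x]

def degrade_adjoint(y, r):
    """Adjoint of block averaging: spread each low-res sample over its r-sample block."""
    return [[v / r for v in band for _ in range(r)] for band in y]

def to_pan(x):
    """Band average as a toy spectral response mapping the MS bands to one PAN band."""
    return [sum(col) / len(x) for col in zip(*x)]

def to_pan_adjoint(p, c):
    """Adjoint of the band average: copy the PAN residual to all c bands, scaled by 1/c."""
    return [[v / c for v in p] for _ in range(c)]

def data_fidelity(x, lrms, pan, r, lam):
    """0.5 * ||D x - LRMS||^2 + 0.5 * lam * ||R x - PAN||^2 for the toy operators."""
    e_ms = sum((a - b) ** 2 for ba, bb in zip(degrade(x, r), lrms) for a, b in zip(ba, bb))
    e_pan = sum((a - b) ** 2 for a, b in zip(to_pan(x), pan))
    return 0.5 * e_ms + 0.5 * lam * e_pan

def gradient_step(x, lrms, pan, r, lam, eta):
    """One gradient descent update on the data-fidelity term."""
    res_ms = [[a - b for a, b in zip(ba, bb)] for ba, bb in zip(degrade(x, r), lrms)]
    res_pan = [a - b for a, b in zip(to_pan(x), pan)]
    g_ms = degrade_adjoint(res_ms, r)
    g_pan = to_pan_adjoint(res_pan, len(x))
    return [[xv - eta * (gm + lam * gp) for xv, gm, gp in zip(bx, bm, bp)]
            for bx, bm, bp in zip(x, g_ms, g_pan)]

def proximal_step(x, tau):
    """Soft-thresholding (proximal operator of the l1 norm), a hand-crafted
    placeholder for the learned proximal network."""
    return [[math.copysign(max(abs(v) - tau, 0.0), v) for v in band] for band in x]

def unroll(lrms, pan, r=2, stages=5, lam=1.0, eta=1.0, tau=0.01):
    """Run a fixed number of stages, each a gradient step followed by a proximal step."""
    x = [[v for v in band for _ in range(r)] for band in lrms]  # nearest-neighbour init
    for _ in range(stages):
        x = gradient_step(x, lrms, pan, r, lam, eta)
        x = proximal_step(x, tau)
    return x
```

With `tau=0.0` the proximal step is the identity, so the loop reduces to plain gradient descent on the data-fidelity term. In an unrolled network, each stage would instead carry its own learnable parameters, and the hand-crafted prior would be replaced by a trained module.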
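Table 1. Datasets used in the experiments

| Satellite | MS resolution (m) | PAN resolution (m) | MS image size | PAN image size | Training samples | Validation samples | Reduced-resolution test samples | Full-resolution test samples |
|---|---|---|---|---|---|---|---|---|
| GF-1 | 0.8 | 2.0 | 32×32×4 | 128×128 | 1386 | 154 | 100 | 64 |
| QB | 2.44 | 0.61 | 32×32×4 | 128×128 | 1710 | 190 | 100 | 64 |
| WV-4 | 1.2 | 0.3 | 32×32×4 | 128×128 | 1710 | 190 | 100 | 64 |
| WV-2 | 2.0 | 0.5 | 32×32×8 | 128×128 | 1710 | 190 | 100 | 64 |

Table 2. Quantitative comparison on the GF-1 test dataset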
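Reduced-resolution metrics: $Q_4$, QAVE, SAM, ERGAS, SCC, PSNR. Full-resolution metrics: $D_\lambda$, $D_S$, HQNR.

| Method | $Q_4$ | QAVE | SAM (rad) | ERGAS | SCC | PSNR (dB) | $D_\lambda$ | $D_S$ | HQNR |
|---|---|---|---|---|---|---|---|---|---|
| BDSD | 0.7526 | 0.7874 | 1.9305 | 1.5777 | 0.9446 | 39.6160 | 0.0415 | 0.0468 | 0.9139 |
| PRACS | 0.7314 | 0.7622 | 1.8491 | 1.5341 | 0.9377 | 39.4899 | 0.0632 | 0.1679 | 0.7810 |
| AWFLN | 0.9292 | 0.9394 | 0.6171 | 0.6377 | 0.9909 | 49.7031 | 0.0145 | 0.1600 | 0.8280 |
| FusionMamba | 0.9199 | 0.9368 | 0.6155 | 0.6418 | 0.9918 | 49.9123 | 0.0180 | 0.1605 | 0.8246 |
| PanMamba | 0.9502 | 0.9572 | 0.4932 | 0.5391 | 0.9940 | 51.4618 | 0.0156 | 0.1632 | 0.8239 |
| WFANet | 0.9443 | 0.9550 | 0.4729 | 0.5290 | 0.9947 | 51.9493 | 0.0115 | 0.1599 | 0.8307 |
| TMDiff | 0.9350 | 0.9410 | 0.5481 | 0.6313 | 0.9924 | 50.4383 | 0.0305 | 0.1939 | 0.7818 |
| Proposed | 0.9539 | 0.9614 | 0.4366 | 0.5104 | 0.9953 | 52.5321 | 0.0103 | 0.0816 | 0.9089 |

Table 3. Quantitative comparison on the WV-2 test dataset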
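Reduced-resolution metrics: $Q_8$, QAVE, SAM, ERGAS, SCC, PSNR. Full-resolution metrics: $D_\lambda$, $D_S$, HQNR.

| Method | $Q_8$ | QAVE | SAM (rad) | ERGAS | SCC | PSNR (dB) | $D_\lambda$ | $D_S$ | HQNR |
|---|---|---|---|---|---|---|---|---|---|
| BDSD | 0.6914 | 0.7031 | 4.9777 | 4.3408 | 0.9434 | 36.7286 | 0.0525 | 0.1465 | 0.8078 |
| PRACS | 0.6624 | 0.6677 | 4.7022 | 4.8873 | 0.9121 | 35.9620 | 0.0147 | 0.1145 | 0.8727 |
| AWFLN | 0.9133 | 0.9143 | 0.8205 | 0.6002 | 0.9914 | 49.1352 | 0.0191 | 0.0933 | 0.8893 |
| FusionMamba | 0.9082 | 0.9120 | 0.8584 | 0.6179 | 0.9917 | 49.0684 | 0.0215 | 0.0635 | 0.9164 |
| PanMamba | 0.7613 | 0.7606 | 2.7455 | 2.5952 | 0.9839 | 41.0233 | 0.0533 | 0.0897 | 0.8621 |
| WFANet | 0.7608 | 0.7599 | 2.7611 | 2.6042 | 0.9838 | 40.9669 | 0.0457 | 0.0821 | 0.8762 |
| TMDiff | 0.7561 | 0.7548 | 2.7895 | 2.6382 | 0.9831 | 40.8775 | 0.0567 | 0.0533 | 0.8931 |
| Proposed | 0.7675 | 0.7673 | 2.6959 | 2.4934 | 0.9855 | 41.3410 | 0.0154 | 0.0288 | 0.9562 |

Table 4. PSNR results (dB) of the network architecture ablation study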
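| No. | Degradation operator | Proximal network | Multi-scale architecture | GF-1 | QB | WV-2 | WV-4 |
|---|---|---|---|---|---|---|---|
| I | 2D | 2D | Yes | 52.4677 | 50.0497 | 40.9120 | 43.5865 |
| II | 3D | 2D | Yes | 52.5387 | 50.0891 | 40.9077 | 43.7566 |
| III | 3D | 3D | Yes | 52.3679 | 50.1103 | 41.2707 | 43.9447 |
| IV | 3D | 3D | No | 51.7431 | 50.0722 | 41.2601 | 43.7500 |

Table 5. PSNR results (dB) of the prompt guidance mechanism ablation study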
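| No. | Method | GF-1 | QB | WV-2 | WV-4 |
|---|---|---|---|---|---|
| I | Channel | 52.3885 | 50.1519 | 41.2963 | 44.0624 |
| II | Spatial | 52.4803 | 50.1578 | 41.3295 | 44.1117 |
| III | Spatial-channel | 52.5321 | 50.2629 | 41.3410 | 44.2020 |
| IV | Cross-attention | 51.3079 | 49.7873 | 41.0228 | 43.5290 |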
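Several of the metrics reported in the tables have compact textbook definitions, sketched below in plain Python. This is an illustrative sketch and not the evaluation code used for the paper. The image layout (a list of bands, each a flat list of pixel values), the `peak` value, the default resolution `ratio` of 4, and the unit exponents in `hqnr` are all assumptions. SCC, QAVE, and Q4/Q8 are omitted for brevity.

```python
import math

def psnr(ref, est, peak=1.0):
    """Peak signal-to-noise ratio in dB over all bands: 10 * log10(peak^2 / MSE)."""
    diffs = [(a - b) ** 2 for rb, eb in zip(ref, est) for a, b in zip(rb, eb)]
    mse = sum(diffs) / len(diffs)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

def sam(ref, est):
    """Mean spectral angle (radians) between per-pixel spectral vectors."""
    angles = []
    for u, v in zip(zip(*ref), zip(*est)):  # iterate over pixels, not bands
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        if norm > 0:  # skip all-zero spectra, whose angle is undefined
            angles.append(math.acos(max(-1.0, min(1.0, dot / norm))))
    return sum(angles) / len(angles) if angles else 0.0

def ergas(ref, est, ratio=4):
    """ERGAS = (100 / ratio) * sqrt(mean over bands of MSE_k / mean_k^2).
    `ratio` is the MS-to-PAN pixel-size ratio; band means are assumed nonzero."""
    terms = []
    for rb, eb in zip(ref, est):
        mse = sum((a - b) ** 2 for a, b in zip(rb, eb)) / len(rb)
        mu = sum(rb) / len(rb)
        terms.append(mse / mu ** 2)
    return 100.0 / ratio * math.sqrt(sum(terms) / len(terms))

def hqnr(d_lambda, d_s, alpha=1.0, beta=1.0):
    """Combine spectral and spatial distortion: (1 - D_lambda)^alpha * (1 - D_S)^beta."""
    return (1.0 - d_lambda) ** alpha * (1.0 - d_s) ** beta
```

As a quick consistency check, `hqnr(0.0103, 0.0816)` evaluates to about 0.9089, which agrees with the HQNR reported for the proposed method on GF-1. Exact agreement is not guaranteed in general, because a mean of per-image products can differ from the product of the means.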