Semantic-guided Unified Multi-scale Deep Unrolling Network for Pansharpening
Abstract: Existing deep learning-based remote sensing image fusion methods are typically trained on a single specific dataset, which limits their generalization ability and makes them difficult to apply in practical multi-satellite scenarios. To address this, this paper proposes a Semantic-guided Unified Multi-scale Deep Unrolling Network (SUM-DUN). The network is designed from the optimization formulation of the classical fusion problem and adopts a 3D architecture to accommodate multispectral (MS) inputs with different numbers of bands. By introducing a hierarchical multi-scale feature processing mechanism, SUM-DUN effectively extracts and fuses features at different levels. More importantly, to achieve unified, all-in-one fusion, a multimodal large language model is introduced to generate general semantic textual prompts from the input low-resolution multispectral (LRMS) and panchromatic (PAN) images, dynamically guiding the network to adaptively select optimal feature propagation strategies. Experiments on multiple satellites show that the proposed method yields significant improvements in both subjective visual quality and objective evaluation metrics across several datasets.
Objective  With the rapid advancement of satellite imaging technologies, the demand for high-resolution multispectral remote sensing imagery has grown substantially across a wide range of applications. Because of the wide variety of satellite platforms, there is a significant domain shift across datasets collected from different satellites. As a result, most existing deep learning (DL)-based pansharpening methods are trained individually for each satellite dataset and consequently exhibit limited generalization capability across satellites. To address these limitations, this study proposes a Semantic-guided Unified Multi-scale Deep Unrolling Network (SUM-DUN), which is designed based on classical optimization theory and adopts a 3D multi-scale deep unfolding architecture for integrated feature extraction and fusion. Leveraging multimodal large language models (MLLMs), the proposed method derives semantic textual prompts from the input images, which direct the model to adaptively adjust its feature representations and thereby enhance fusion quality. The proposed method aims to achieve unified remote sensing image fusion through a tailored network architecture and prompt-guided mechanisms, thereby providing reliable support for high-level image interpretation tasks.

Methods  Following the Maximum A Posteriori (MAP) estimation principle, the optimization process for high-resolution multispectral (HRMS) image recovery is unfolded into the proposed SUM-DUN (Fig. 1). Each iteration stage of SUM-DUN consists of two main modules: a Gradient Descent Module (GDM) and a Semantic-guided Proximal Mapping Network (SPMN), which approximate the operations in Eq. (5) and Eq. (6), respectively. The GDM performs a gradient descent update based on the current feature estimate and the degradation model. The SPMN, implemented with a Transformer-based architecture as illustrated in Fig. 2(b), incorporates semantic textual prompts generated from the input image pair by MLLMs.
These prompts guide the network to adaptively select appropriate feature propagation strategies for the current image pair, helping suppress noise and mitigate discrepancies across different satellite sensors. Moreover, leveraging upsampling and downsampling operations, the network transmits MS and PAN features between iterative stages, thereby progressively preserving and enhancing multi-scale spatial and spectral information throughout the unfolding process.

Results and Discussions  To demonstrate the effectiveness of the proposed method, we compare it against seven representative baselines, including two traditional methods (BDSD and PRACS) and five DL-based methods (AWFLN, FusionMamba, PanMamba, WFANet, and TMDiff). For the reduced-resolution evaluation, where ground-truth HRMS images are available, we adopt several widely used reference-based metrics, including the Spectral Angle Mapper (SAM), Spatial Correlation Coefficient (SCC), Peak Signal-to-Noise Ratio (PSNR), Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS), Averaged Universal Image Quality Index (QAVE), and the Universal Image Quality Index for 4-band and 8-band images ($Q_4$/$Q_8$). These metrics jointly evaluate spectral fidelity, spatial consistency, and overall image quality. For the full-resolution evaluation, where ground-truth HRMS images are unavailable, we rely on no-reference quality indices. Specifically, we employ the Hybrid Quality with No Reference (HQNR) metric, along with its spectral distortion component $D_\lambda$ and spatial distortion component $D_S$, to assess fusion quality in real-world scenarios. Quantitative evaluations on the GF-1, QB, WV-2, and WV-4 test datasets demonstrate that the proposed method consistently achieves either the best or second-best performance across all metrics, under both reduced-resolution and full-resolution settings (Tables 2 and 3).
These results indicate that the proposed method simultaneously preserves spectral fidelity and spatial consistency, while maintaining robust performance across different satellites and remaining effective in more challenging scenarios. The ablation studies validate the effectiveness of the 3D architecture, the multi-scale network design, and the spatial-channel prompt guidance mechanism, as removing or altering any of these components leads to varying degrees of performance degradation (Tables 4 and 5).

Conclusions  This study proposes a semantic-guided unified multi-scale deep unfolding method for pansharpening, which leverages semantic prompts generated by an MLLM to facilitate efficient and unified fusion of images from different satellites. The proposed approach is built upon a deep unfolding framework and employs a 3D convolutional architecture to accommodate varying numbers of spectral bands across satellite datasets. A multi-scale network design is further incorporated to extract spatial and spectral features at different levels, thereby enhancing the fusion capability. In addition, a semantic prompt integration module is introduced to adaptively route spatial and channel features based on the extracted semantic information, enabling more effective feature propagation and improving both spatial detail reconstruction and spectral consistency. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance in terms of both visual quality and quantitative evaluation metrics.
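The alternating scheme described in the Methods section, a data-fidelity gradient step followed by a proximal mapping, unfolded for a fixed number of stages, can be sketched in a few lines. The following is a toy NumPy illustration, not the paper's implementation: a generic linear operator `D` stands in for the spatial-spectral degradation model, soft-thresholding stands in for the learned SPMN, and the step size `eta` and threshold `tau` are illustrative values.

```python
import numpy as np

def gradient_step(x, y, D, eta):
    """Data-fidelity update x - eta * D^T (D x - y), the role played by the GDM."""
    return x - eta * D.T @ (D @ x - y)

def soft_threshold(x, tau):
    """Stand-in proximal operator for a sparsity prior, the role played by the SPMN."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def unrolled_recover(y, D, stages=8, eta=0.2, tau=0.001):
    """Unfold 'stages' iterations of gradient step + proximal mapping."""
    x = D.T @ y  # coarse initial estimate, analogous to the upsampled LRMS image
    for _ in range(stages):
        x = gradient_step(x, y, D, eta)   # Eq. (5) analogue
        x = soft_threshold(x, tau)        # Eq. (6) analogue
    return x
```

In SUM-DUN each stage replaces the fixed soft-thresholding with a learned, prompt-conditioned network, so the "prior" itself adapts to the input satellite pair.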
Table 1  Datasets used in the experiments

| Satellite | MS resolution (m) | PAN resolution (m) | MS size | PAN size | Training | Validation | Reduced-res. test | Full-res. test |
|---|---|---|---|---|---|---|---|---|
| GF-1 | 0.8 | 2.0 | 32×32×4 | 128×128 | 1386 | 154 | 100 | 64 |
| QB | 2.44 | 0.61 | 32×32×4 | 128×128 | 1710 | 190 | 100 | 64 |
| WV-4 | 1.2 | 0.3 | 32×32×8 | 128×128 | 1710 | 190 | 100 | 64 |
| WV-2 | 2.0 | 0.5 | 32×32×4 | 128×128 | 1710 | 190 | 100 | 64 |

Table 2  Quantitative comparison on the GF-1 test dataset ($Q_4$ through PSNR: reduced resolution; $D_\lambda$, $D_S$, HQNR: full resolution)

| Method | $Q_4$ | QAVE | SAM (rad) | ERGAS | SCC | PSNR (dB) | $D_\lambda$ | $D_S$ | HQNR |
|---|---|---|---|---|---|---|---|---|---|
| BDSD | 0.7526 | 0.7874 | 1.9305 | 1.5777 | 0.9446 | 39.6160 | 0.0415 | 0.0468 | 0.9139 |
| PRACS | 0.7314 | 0.7622 | 1.8491 | 1.5341 | 0.9377 | 39.4899 | 0.0632 | 0.1679 | 0.7810 |
| AWFLN | 0.9292 | 0.9394 | 0.6171 | 0.6377 | 0.9909 | 49.7031 | 0.0145 | 0.1600 | 0.8280 |
| FusionMamba | 0.9199 | 0.9368 | 0.6155 | 0.6418 | 0.9918 | 49.9123 | 0.0180 | 0.1605 | 0.8246 |
| PanMamba | 0.9502 | 0.9572 | 0.4932 | 0.5391 | 0.9940 | 51.4618 | 0.0156 | 0.1632 | 0.8239 |
| WFANet | 0.9443 | 0.9550 | 0.4729 | 0.5290 | 0.9947 | 51.9493 | 0.0115 | 0.1599 | 0.8307 |
| TMDiff | 0.9350 | 0.9410 | 0.5481 | 0.6313 | 0.9924 | 50.4383 | 0.0305 | 0.1939 | 0.7818 |
| Proposed | 0.9539 | 0.9614 | 0.4366 | 0.5104 | 0.9953 | 52.5321 | 0.0103 | 0.0816 | 0.9089 |

Table 3  Quantitative comparison on the WV-2 test dataset ($Q_8$ through PSNR: reduced resolution; $D_\lambda$, $D_S$, HQNR: full resolution)

| Method | $Q_8$ | QAVE | SAM (rad) | ERGAS | SCC | PSNR (dB) | $D_\lambda$ | $D_S$ | HQNR |
|---|---|---|---|---|---|---|---|---|---|
| BDSD | 0.6914 | 0.7031 | 4.9777 | 4.3408 | 0.9434 | 36.7286 | 0.0525 | 0.1465 | 0.8078 |
| PRACS | 0.6624 | 0.6677 | 4.7022 | 4.8873 | 0.9121 | 35.9620 | 0.0147 | 0.1145 | 0.8727 |
| AWFLN | 0.9133 | 0.9143 | 0.8205 | 0.6002 | 0.9914 | 49.1352 | 0.0191 | 0.0933 | 0.8893 |
| FusionMamba | 0.9082 | 0.9120 | 0.8584 | 0.6179 | 0.9917 | 49.0684 | 0.0215 | 0.0635 | 0.9164 |
| PanMamba | 0.7613 | 0.7606 | 2.7455 | 2.5952 | 0.9839 | 41.0233 | 0.0533 | 0.0897 | 0.8621 |
| WFANet | 0.7608 | 0.7599 | 2.7611 | 2.6042 | 0.9838 | 40.9669 | 0.0457 | 0.0821 | 0.8762 |
| TMDiff | 0.7561 | 0.7548 | 2.7895 | 2.6382 | 0.9831 | 40.8775 | 0.0567 | 0.0533 | 0.8931 |
| Proposed | 0.7675 | 0.7673 | 2.6959 | 2.4934 | 0.9855 | 41.3410 | 0.0154 | 0.0288 | 0.9562 |

Table 4  PSNR results (dB) of the network-architecture ablation study

| No. | Degradation operator | Proximal network | Multi-scale | GF-1 | QB | WV-2 | WV-4 |
|---|---|---|---|---|---|---|---|
| Ⅰ | 2D | 2D | √ | 52.4677 | 50.0497 | 40.9120 | 43.5865 |
| Ⅱ | 3D | 2D | √ | 52.5387 | 50.0891 | 40.9077 | 43.7566 |
| Ⅲ | 3D | 3D | √ | 52.3679 | 50.1103 | 41.2707 | 43.9447 |
| Ⅳ | 3D | 3D | × | 51.7431 | 50.0722 | 41.2601 | 43.7500 |

Table 5  PSNR results (dB) of the prompt-guidance mechanism ablation study

| No. | Method | GF-1 | QB | WV-2 | WV-4 |
|---|---|---|---|---|---|
| Ⅰ | Channel | 52.3885 | 50.1519 | 41.2963 | 44.0624 |
| Ⅱ | Spatial | 52.4803 | 50.1578 | 41.3295 | 44.1117 |
| Ⅲ | Spatial-channel | 52.5321 | 50.2629 | 41.3410 | 44.2020 |
| Ⅳ | Cross-attention | 51.3079 | 49.7873 | 41.0228 | 43.5290 |
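Among the reduced-resolution metrics reported in the tables above, the Spectral Angle Mapper is the most self-contained to state precisely: it is the mean per-pixel angle between the spectral vectors of the fused and reference images. The sketch below is a minimal NumPy illustration of that conventional definition (in radians), not the evaluation code used for these tables; implementations differ in averaging and unit conventions.

```python
import numpy as np

def sam(ref, fus, eps=1e-12):
    """Spectral Angle Mapper in radians, averaged over all pixels.

    ref, fus: arrays of shape (H, W, B) holding the reference HRMS image
    and the fused image, with B spectral bands.
    """
    ref = ref.reshape(-1, ref.shape[-1]).astype(np.float64)
    fus = fus.reshape(-1, fus.shape[-1]).astype(np.float64)
    dot = np.sum(ref * fus, axis=1)
    norms = np.linalg.norm(ref, axis=1) * np.linalg.norm(fus, axis=1)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)  # guard against rounding
    return float(np.mean(np.arccos(cos)))
```

Because SAM compares directions of spectral vectors, it is invariant to per-pixel intensity scaling and isolates spectral distortion from brightness errors.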
References
[1] THOMAS C, RANCHIN T, WALD L, et al. Synthesis of multispectral images to high spatial resolution: A critical review of fusion methods based on remote sensing physics[J]. IEEE Transactions on Geoscience and Remote Sensing, 2008, 46(5): 1301–1312. doi: 10.1109/TGRS.2007.912448.
[2] JIN Jing and WANG Feng. A distributed multi-satellite collaborative framework for remote sensing scene classification[J]. Journal of Electronics & Information Technology, 2025, 47(12): 4677–4688. doi: 10.11999/JEIT250866.
[3] WEN Hongli, HU Qinghao, HUANG Liwei, et al. Few-shot remote sensing image classification based on parameter-efficient vision transformer and multimodal guidance[J]. Journal of Electronics & Information Technology, 2025, 47(12): 4689–4703. doi: 10.11999/JEIT250996.
[4] HAN Wenqi, JIANG Wen, GENG Jie, et al. PATC: Prototype alignment and topology-consistent pseudo-supervision for multimodal semi-supervised semantic segmentation of remote sensing images[J]. Journal of Electronics & Information Technology, 2025, 47(12): 4714–4727. doi: 10.11999/JEIT251115.
[5] ZENG Delu, HU Yuwen, HUANG Yue, et al. Pan-sharpening with structural consistency and ℓ1/2 gradient prior[J]. Remote Sensing Letters, 2016, 7(12): 1170–1179. doi: 10.1080/2150704X.2016.1222098.
[6] WU Zhongcheng, HUANG Tingzhu, DENG Liangjian, et al. VO+Net: An adaptive approach using variational optimization and deep learning for panchromatic sharpening[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5401016. doi: 10.1109/TGRS.2021.3066425.
[7] LU Hangyuan, YANG Yong, HUANG Shuying, et al. AWFLN: An adaptive weighted feature learning network for pansharpening[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5400815. doi: 10.1109/TGRS.2023.3241643.
[8] XIE Xinyu, CUI Yawen, TAN Tao, et al. FusionMamba: Dynamic feature enhancement for multimodal image fusion with Mamba[J]. Visual Intelligence, 2024, 2(1): 37. doi: 10.1007/s44267-024-00072-9.
[9] HE Xuanhua, CAO Ke, ZHANG Jie, et al. Pan-Mamba: Effective pan-sharpening with state space model[J]. Information Fusion, 2025, 115: 102779. doi: 10.1016/j.inffus.2024.102779.
[10] HUANG Jie, HUANG Rui, XU Jinghao, et al. Wavelet-assisted multi-frequency attention network for pansharpening[C]. Proceedings of the 39th AAAI Conference on Artificial Intelligence, Philadelphia, USA, 2025: 3662–3670. doi: 10.1609/aaai.v39i4.32381.
[11] JIA Menglin, TANG Luming, CHEN B C, et al. Visual prompt tuning[C]. 17th European Conference on Computer Vision, Tel Aviv, Israel, 2022: 709–727. doi: 10.1007/978-3-031-19827-4_41.
[12] NIE Xing, NI Bolin, CHANG Jianlong, et al. Pro-tuning: Unified prompt tuning for vision tasks[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(6): 4653–4667. doi: 10.1109/TCSVT.2023.3327605.
[13] CUI Yuning, ZAMIR S W, KHAN S, et al. AdaIR: Adaptive all-in-one image restoration via frequency mining and modulation[C]. The Thirteenth International Conference on Learning Representations, Singapore, 2025.
[14] ZENG Haijin, WANG Xiangming, CHEN Yongyong, et al. Vision-language gradient descent-driven all-in-one deep unfolding networks[C]. Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2025: 7524–7533. doi: 10.1109/CVPR52734.2025.00705.
[15] YANG Zhiwen, CHEN Haowei, QIAN Ziniu, et al. All-in-one medical image restoration via task-adaptive routing[C]. 27th International Conference on Medical Image Computing and Computer Assisted Intervention, Marrakesh, Morocco, 2024: 67–77. doi: 10.1007/978-3-031-72104-5_7.
[16] XING Yinghui, QU Litao, ZHANG Shizhou, et al. Empower generalizability for pansharpening through text-modulated diffusion model[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 5633812. doi: 10.1109/TGRS.2024.3434431.
[17] LI Xueheng, HE Xuanhua, CAO Ke, et al. Exploring text-guided information fusion through chain-of-reasoning for pansharpening[J]. IEEE Transactions on Geoscience and Remote Sensing, 2025, 63: 5407314. doi: 10.1109/TGRS.2025.3604447.
[18] FANG Shijie and GAN Hongping. SSUN-net: Spatial-spectral prior-aware unfolding network for pan-sharpening[C]. Proceedings of the 39th AAAI Conference on Artificial Intelligence, Philadelphia, USA, 2025: 2897–2905. doi: 10.1609/aaai.v39i3.32296.
[19] YAN Qiuhai, JIANG Aiwen, CHEN Kang, et al. Textual prompt guided image restoration[J]. Engineering Applications of Artificial Intelligence, 2025, 155: 110981. doi: 10.1016/j.engappai.2025.110981.
[20] CONDE M V, GEIGLE G, and TIMOFTE R. InstructIR: High-quality image restoration following human instructions[C]. 18th European Conference on Computer Vision, Milan, Italy, 2024: 1–21. doi: 10.1007/978-3-031-72764-1_1.
[21] ZENG Aohan, XU Bin, WANG Bowen, et al. ChatGLM: A family of large language models from GLM-130B to GLM-4 all tools[EB/OL]. https://arxiv.org/abs/2406.12793, 2024.
[22] XIAO Shitao, LIU Zheng, ZHANG Peitian, et al. C-pack: Packed resources for general Chinese embeddings[C]. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, USA, 2024: 641–649. doi: 10.1145/3626772.3657878.
[23] MENG Xiangchao, XIONG Yiming, SHAO Feng, et al. A large-scale benchmark data set for evaluating pansharpening performance: Overview and implementation[J]. IEEE Geoscience and Remote Sensing Magazine, 2021, 9(1): 18–52. doi: 10.1109/MGRS.2020.2976696.
[24] WALD L, RANCHIN T, and MANGOLINI M. Fusion of satellite images of different spatial resolutions: Assessing the quality of resulting images[J]. Photogrammetric Engineering and Remote Sensing, 1997, 63(6): 691–699.
[25] GARZELLI A, NENCINI F, and CAPOBIANCO L. Optimal MMSE pan sharpening of very high resolution multispectral images[J]. IEEE Transactions on Geoscience and Remote Sensing, 2008, 46(1): 228–236. doi: 10.1109/TGRS.2007.907604.
[26] CHOI J, YU Kiyun, and KIM Y. A new adaptive component-substitution-based satellite image fusion by using partial replacement[J]. IEEE Transactions on Geoscience and Remote Sensing, 2011, 49(1): 295–309. doi: 10.1109/TGRS.2010.2051674.
[27] VIVONE G, ALPARONE L, CHANUSSOT J, et al. A critical comparison among pansharpening algorithms[J]. IEEE Transactions on Geoscience and Remote Sensing, 2015, 53(5): 2565–2586. doi: 10.1109/TGRS.2014.2361734.
[28] ZHOU Jie, CIVCO D L, and SILANDER J A. A wavelet transform method to merge Landsat TM and SPOT panchromatic data[J]. International Journal of Remote Sensing, 1998, 19(4): 743–757. doi: 10.1080/014311698215973.
[29] WALD L. Data Fusion. Definitions and Architectures - Fusion of Images of Different Spatial Resolutions[M]. Paris, France: Presses de l'École des Mines de Paris, 2002: 165–189.
[30] WANG Zhou and BOVIK A C. A universal image quality index[J]. IEEE Signal Processing Letters, 2002, 9(3): 81–84. doi: 10.1109/97.995823.
[31] GARZELLI A and NENCINI F. Hypercomplex quality assessment of multi/hyperspectral images[J]. IEEE Geoscience and Remote Sensing Letters, 2009, 6(4): 662–665. doi: 10.1109/LGRS.2009.2022650.
[32] ARIENZO A, VIVONE G, GARZELLI A, et al. Full-resolution quality assessment of pansharpening: Theoretical and hands-on approaches[J]. IEEE Geoscience and Remote Sensing Magazine, 2022, 10(3): 168–201. doi: 10.1109/MGRS.2022.3170092.