Remote Sensing Land-Cover Classification Combining Multi-Modal and Multi-Scale Fusion with Mamba

XIE Wen, ZHU Chaotao, WANG Jin, MA Xiaomeng

Citation: XIE Wen, ZHU Chaotao, WANG Jin, MA Xiaomeng. Remote Sensing Land-Cover Classification Combining Multi-Modal and Multi-Scale Fusion with Mamba[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT251303


doi: 10.11999/JEIT251303 cstr: 32379.14.JEIT251303
Funds: The National Natural Science Foundation of China (61901365, 62071379), the Natural Science Basic Research Plan in Shaanxi Province of China (Program No. 2025JC-YBQN-936), the Scientific Research Program Funded by the Education Department of Shaanxi Provincial Government (Program No. 25JP175), the Youth Innovation Team of Shaanxi Universities, and the New Star Team of Xi'an University of Posts and Telecommunications (xyt2016-01)
Details
    About the authors:

    XIE Wen: female, Associate Professor; research interests include remote sensing image processing, deep learning, and machine learning

    ZHU Chaotao: male, Master's student; research interests include deep learning and remote sensing image classification

    WANG Jin: female, Associate Research Fellow; research interests include satellite navigation signal processing, integrated navigation-communication positioning, edge computing, and high-precision time synchronization

    MA Xiaomeng: female, Senior Engineer; research interests include radar countermeasure system design and signal processing

    Corresponding author:

    XIE Wen, xiewen@xupt.edu.cn

  • CLC number: TN911.73; TP751.1

  • Abstract: The rapid development of remote sensing imaging technology has brought massive and diverse data to land-cover classification, and how to exploit the complementarity of multi-modal data to improve classification performance has become a research hotspot. In recent years, the Mamba model has been applied successfully in image processing owing to its distinctive architecture and strong global modeling capability; in particular, multi-scale visual Mamba models cope well with complex spatial distributions, matching the large scale differences and varied orientations of remote sensing land covers. To fully exploit the advantages of the Mamba model in extracting and fusing remote sensing features, this paper proposes a Mamba-based multi-modal and multi-scale fusion model for remote sensing land-cover classification (M3RS). First, the model employs a multi-scale spatial encoder to extract features from Light Detection and Ranging (LiDAR) images and Synthetic Aperture Radar (SAR) images, and, in view of the distinctive data structure of HyperSpectral Images (HSI), a multi-scale spatial-spectral encoder is proposed to extract their complex spatial-spectral features. Next, a multi-modal feature fusion module combining a cross-Mamba and a channel-concatenation Mamba is proposed: the cross-Mamba fuses multi-modal spatial features efficiently by exchanging state-space parameters, while the channel-concatenation Mamba fuses multi-modal features thoroughly by constructing four channel scanning schemes. Finally, the model adopts an improved multi-scale feature fusion module to fuse multi-scale features layer by layer and to extract highly discriminative evidence for classification, effectively improving the accuracy of remote sensing land-cover classification. Classification experiments on the Muufl, Houston2013, and Augsburg datasets verify the effectiveness of the proposed M3RS model.
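As background to the abstract, the selective state-space scan that Mamba-style blocks build on (reference [26]) can be sketched in a few lines. The NumPy snippet below is an illustrative simplification of the discretized recurrence with input-dependent step sizes and projections, not the M3RS implementation; the function and array names are assumptions for this sketch.

```python
import numpy as np

def selective_scan(x, dt, A, B, C):
    """Sequential selective SSM scan over a length-L sequence.

    x  : (L, D) input sequence with D channels
    dt : (L, D) input-dependent step sizes (the "selective" part)
    A  : (D, N) diagonal state matrix per channel, N hidden states
    B  : (L, N) input-dependent input projection
    C  : (L, N) input-dependent output projection
    """
    L, D = x.shape
    h = np.zeros((D, A.shape[1]))                  # hidden state per channel
    y = np.empty((L, D))
    for t in range(L):
        A_bar = np.exp(dt[t][:, None] * A)         # zero-order-hold: exp(dt*A)
        B_bar = dt[t][:, None] * B[t][None, :]     # Euler step for the input term
        h = A_bar * h + B_bar * x[t][:, None]      # linear recurrence in t
        y[t] = (h * C[t][None, :]).sum(axis=1)     # per-channel readout
    return y

# Toy run: with A = 0 the state simply accumulates the inputs,
# so each output grows linearly with t.
L, D, N = 3, 1, 2
y = selective_scan(np.ones((L, D)), np.ones((L, D)),
                   np.zeros((D, N)), np.ones((L, N)), np.ones((L, N)))
# y = [[2.], [4.], [6.]]
```

In practice Mamba evaluates this recurrence with a hardware-efficient parallel scan and learns dt, B, and C from the input; the loop above only exposes the recurrence itself.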
  • Figure 1  The Mamba model

    Figure 2  Flowchart of M3RS

    Figure 3  Spatial Mamba

    Figure 4  Spectral Mamba

    Figure 5  Cross SS2D

    Figure 6  Channel-concatenation SS1D

    Figure 7  Multi-scale feature fusion module

    Figure 8  Visualization of classification results on the Muufl dataset

    Figure 9  Comparison of Mamba and Transformer architectures on the Houston2013 dataset

    Table 1  Comparison of classification results on the Muufl dataset (%)

| Class (train/test) | CCRNet | MFT | ExVit | HCT | M2FNet | Cross-HL | MSFMamba | M3RS |
|---|---|---|---|---|---|---|---|---|
| 1: Trees (150/23096) | 89.97 | 87.90 | 89.59 | 92.20 | 90.92 | 91.39 | 88.51 | 92.02 |
| 2: Grass (150/4120) | 79.30 | 75.27 | 79.42 | 76.07 | 72.84 | 85.70 | 82.14 | 87.26 |
| 3: Mixed ground surface (150/6732) | 80.50 | 76.00 | 79.63 | 84.51 | 82.13 | 84.21 | 80.21 | 85.01 |
| 4: Dirt and sand (150/1676) | 94.09 | 94.45 | 95.29 | 96.78 | 96.30 | 96.96 | 94.57 | 97.85 |
| 5: Road (150/6537) | 87.33 | 78.75 | 76.89 | 89.02 | 85.07 | 90.12 | 87.30 | 93.21 |
| 6: Water (150/316) | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| 7: Building shadow (150/2083) | 87.47 | 91.55 | 89.34 | 88.57 | 92.89 | 91.21 | 90.64 | 92.17 |
| 8: Buildings (150/6090) | 96.31 | 95.14 | 93.35 | 97.11 | 94.98 | 95.53 | 95.50 | 97.34 |
| 9: Sidewalk (150/1235) | 77.17 | 60.32 | 66.64 | 76.44 | 73.28 | 82.02 | 72.06 | 85.26 |
| 10: Yellow curb (150/33) | 96.97 | 93.94 | 100.00 | 90.91 | 100.00 | 100.00 | 93.94 | 96.97 |
| 11: Cloth panels (150/119) | 97.48 | 99.16 | 99.16 | 99.16 | 97.48 | 99.16 | 99.16 | 99.16 |
| OA | 88.12 | 84.86 | 86.06 | 89.79 | 88.00 | 90.36 | 87.59 | 91.61 |
| AA | 89.69 | 86.59 | 88.12 | 90.07 | 89.63 | 92.39 | 89.46 | 93.30 |
| Kappa | 84.47 | 80.34 | 81.82 | 86.57 | 84.27 | 87.33 | 83.81 | 88.96 |

    Table 2  Comparison of classification results on the Houston2013 dataset (%)

| Class (train/test) | CCRNet | MFT | ExVit | HCT | M2FNet | Cross-HL | MSFMamba | M3RS |
|---|---|---|---|---|---|---|---|---|
| 1: Healthy grass (198/1053) | 72.74 | 76.54 | 79.39 | 75.50 | 82.28 | 76.54 | 80.06 | 76.92 |
| 2: Stressed grass (190/1064) | 83.93 | 93.33 | 77.91 | 95.96 | 87.41 | 85.15 | 98.59 | 96.15 |
| 3: Synthetic grass (192/505) | 91.88 | 98.02 | 97.82 | 98.22 | 95.25 | 97.82 | 96.63 | 88.32 |
| 4: Trees (188/1056) | 89.30 | 89.96 | 87.97 | 81.72 | 92.90 | 88.92 | 93.28 | 89.68 |
| 5: Soil (186/1056) | 100.00 | 99.91 | 99.62 | 100.00 | 98.01 | 100.00 | 100.00 | 100.00 |
| 6: Water (182/143) | 95.80 | 95.80 | 95.80 | 95.80 | 95.80 | 95.80 | 100.00 | 100.00 |
| 7: Residential (196/1072) | 72.48 | 82.65 | 87.03 | 71.83 | 78.92 | 76.77 | 85.63 | 73.13 |
| 8: Commercial (191/1053) | 84.43 | 77.40 | 96.68 | 93.54 | 92.21 | 74.55 | 93.45 | 94.40 |
| 9: Road (193/1059) | 84.32 | 88.20 | 79.89 | 89.52 | 81.49 | 77.43 | 78.00 | 83.95 |
| 10: Highway (193/1059) | 63.71 | 70.85 | 65.54 | 66.51 | 78.09 | 68.53 | 59.65 | 96.81 |
| 11: Railway (181/1054) | 99.15 | 93.74 | 95.64 | 96.39 | 97.91 | 96.11 | 83.11 | 96.02 |
| 12: Parking lot 1 (192/1041) | 97.50 | 98.17 | 98.56 | 98.75 | 94.43 | 100.00 | 95.00 | 99.52 |
| 13: Parking lot 2 (184/285) | 70.88 | 76.14 | 76.84 | 84.56 | 82.81 | 72.28 | 82.46 | 80.70 |
| 14: Tennis court (181/247) | 100.00 | 100.00 | 100.00 | 100.00 | 99.19 | 100.00 | 100.00 | 100.00 |
| 15: Running track (187/473) | 99.79 | 99.37 | 99.15 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| OA | 85.75 | 88.13 | 87.91 | 88.26 | 89.28 | 85.73 | 87.97 | 90.95 |
| AA | 87.06 | 89.34 | 89.19 | 89.89 | 90.46 | 87.33 | 89.72 | 91.71 |
| Kappa | 84.52 | 87.10 | 86.87 | 87.25 | 88.36 | 84.49 | 86.94 | 90.17 |

    Table 3  Comparison of classification results on the Augsburg dataset (%)

| Class (train/test) | CCRNet | MFT | ExVit | HCT | M2FNet | Cross-HL | MSFMamba | M3RS |
|---|---|---|---|---|---|---|---|---|
| 1: Forest (146/13361) | 93.51 | 88.90 | 93.65 | 96.16 | 94.78 | 92.44 | 94.93 | 95.39 |
| 2: Residential area (264/30065) | 99.04 | 97.43 | 95.92 | 99.24 | 97.38 | 97.95 | 99.49 | 99.01 |
| 3: Industrial area (21/3830) | 66.11 | 33.99 | 40.26 | 38.22 | 21.57 | 61.85 | 3.66 | 69.43 |
| 4: Low plants (248/26609) | 92.37 | 87.50 | 91.68 | 93.21 | 91.09 | 89.97 | 94.51 | 92.82 |
| 5: Allotment (52/523) | 61.95 | 51.05 | 48.95 | 64.63 | 37.48 | 53.73 | 30.59 | 58.51 |
| 6: Commercial area (7/1638) | 9.52 | 12.76 | 14.22 | 5.19 | 1.89 | 3.72 | 0.18 | 7.63 |
| 7: Water (23/1507) | 48.97 | 37.62 | 17.39 | 47.91 | 11.75 | 51.09 | 13.01 | 48.64 |
| OA | 91.05 | 86.15 | 87.75 | 90.41 | 86.94 | 89.28 | 88.02 | 91.62 |
| AA | 67.35 | 58.47 | 57.44 | 63.51 | 50.85 | 64.39 | 48.05 | 67.35 |
| Kappa | 87.17 | 80.05 | 82.26 | 86.04 | 80.75 | 84.68 | 82.04 | 87.94 |

    Table 4  Ablation experiments on the Houston2013 dataset (%)

| Module | OA | AA | Kappa |
|---|---|---|---|
| Spatial Mamba only | 86.87 | 89.00 | 85.73 |
| + Spectral Mamba | 89.07 | 90.75 | 88.13 |
| + Cross-Mamba | 89.28 | 90.68 | 88.36 |
| + Channel-concatenation Mamba | 90.14 | 91.47 | 89.29 |
| + Multi-scale feature fusion module | 90.95 | 91.71 | 90.17 |

    Table 5  Hyperparameter experiments on Spatial Mamba depth and feature dimensions on the Houston2013 dataset (%)

| VSSBlock layers | Feature dimensions | OA | AA | Kappa |
|---|---|---|---|---|
| (2,2,9) | (64,128,256) | 90.95 | 91.71 | 90.17 |
| (2,2,9,2) | (64,128,256,512) | 89.52 | 91.00 | 88.62 |
| (2,2,27) | (64,128,256) | 89.88 | 91.35 | 89.01 |
| (2,2,9) | (96,192,384) | 87.29 | 88.85 | 86.20 |
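All five tables report Overall Accuracy (OA), Average Accuracy (AA), and the kappa coefficient. For reference, the standard definitions of these three metrics, computed from a confusion matrix, can be sketched as follows (this is the usual formulation, not code from the paper):

```python
import numpy as np

def classification_metrics(cm):
    """OA, AA, and Cohen's kappa (all in %) from a confusion matrix.

    cm[i, j] counts samples of true class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    oa = np.trace(cm) / total                    # fraction classified correctly
    aa = (np.diag(cm) / cm.sum(axis=1)).mean()   # mean of per-class accuracies
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total**2  # chance agreement
    kappa = (oa - pe) / (1.0 - pe)               # agreement beyond chance
    return 100 * oa, 100 * aa, 100 * kappa

# Toy 2-class example.
oa, aa, kappa = classification_metrics([[50, 10], [10, 30]])
# oa = 80.00, aa ≈ 79.17, kappa ≈ 58.33
```

AA averages per-class recall, which is why it drops sharply on Augsburg (Table 3), where small classes such as Commercial area are frequently misclassified even when OA stays high.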
  • [1] LI Shutao, LI Congyu, and KANG Xudong. Development status and future prospects of multi-source remote sensing image fusion[J]. National Remote Sensing Bulletin, 2021, 25(1): 148–166. doi: 10.11834/jrs.20210259.
    [2] HANG Renlong, LI Zhu, GHAMISI P, et al. Classification of hyperspectral and LiDAR data using coupled CNNs[J]. IEEE Transactions on Geoscience and Remote Sensing, 2020, 58(7): 4939–4950. doi: 10.1109/TGRS.2020.2969024.
    [3] REN Bo, HUA Chaoyue, HOU Biao, et al. PDCNet: A Polarimetric data-enhanced contrastive learning network for PolSAR land cover classification[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2025, 18: 10010–10025. doi: 10.1109/JSTARS.2025.3557252.
    [4] REN Bo, WANG Zhao, GE Hanyuan, et al. Incremental land cover classification via soft label and subregion distillation[J]. IEEE Transactions on Geoscience and Remote Sensing, 2025, 63: 5647322. doi: 10.1109/TGRS.2025.3615670.
    [5] LI Shutao, SONG Weiwei, FANG Leyuan, et al. Deep learning for hyperspectral image classification: An overview[J]. IEEE Transactions on Geoscience and Remote Sensing, 2019, 57(9): 6690–6709. doi: 10.1109/TGRS.2019.2907932.
    [6] MA Xianping, ZHANG Xiaokang, and PUN M Q. RS3Mamba: Visual state space model for remote sensing image semantic segmentation[J]. IEEE Geoscience and Remote Sensing Letters, 2024, 21: 6011405. doi: 10.1109/LGRS.2024.3414293.
    [7] LIU Xiaomin, YU Mengjun, QIAO Zhenzhuang, et al. Scale adaptive fusion network for multimodal remote sensing data classification[J]. Journal of Electronics & Information Technology, 2024, 46(9): 3693–3702. doi: 10.11999/JEIT240178.
    [8] LIAO Diling, LAI Tao, HUANG Haifeng, et al. LightMamba: A lightweight Mamba network for the joint classification of HSI and LiDAR data[J]. Journal of Electronics & Information Technology, 2025, 47(12): 4937–4947. doi: 10.11999/JEIT250981.
    [9] LAPARRA V, MALO J, and CAMPS-VALLS G. Dimensionality reduction via regression in hyperspectral imagery[J]. IEEE Journal of Selected Topics in Signal Processing, 2015, 9(6): 1026–1036. doi: 10.1109/JSTSP.2015.2417833.
    [10] MELGANI F and BRUZZONE L. Support vector machines for classification of hyperspectral remote-sensing images[C]. IEEE International Geoscience and Remote Sensing Symposium, Toronto, Canada, 2002: 506–508. doi: 10.1109/IGARSS.2002.1025088.
    [11] ZHOU Hao, LUO Fulin, ZHUANG Huiping, et al. Attention multihop graph and multiscale convolutional fusion network for hyperspectral image classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5508614. doi: 10.1109/TGRS.2023.3265879.
    [12] ZHAO Linying and JI Shunping. CNN, RNN, or VIT? An evaluation of different deep learning architectures for spatio-temporal representation of sentinel time series[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2023, 16: 44–56. doi: 10.1109/JSTARS.2022.3219816.
    [13] LU Ting, DING Kexin, FU Wei, et al. Coupled adversarial learning for fusion classification of hyperspectral and LiDAR data[J]. Information Fusion, 2023, 93: 118–131. doi: 10.1016/j.inffus.2022.12.020.
    [14] XU Xiaodong, LI Wei, RAN Qiong, et al. Multisource remote sensing data classification based on convolutional neural network[J]. IEEE Transactions on Geoscience and Remote Sensing, 2018, 56(2): 937–949. doi: 10.1109/TGRS.2017.2756851.
    [15] WANG Jinzhe, ZHANG Junping, GUO Qingle, et al. Fusion of hyperspectral and LiDAR data based on dual-branch convolutional neural network[C]. Proceedings of the 2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 2019: 3388–3391. doi: 10.1109/IGARSS.2019.8899332.
    [16] WU Xin, HONG Danfeng, and CHANUSSOT J. Convolutional neural networks for multimodal remote sensing data classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5517010. doi: 10.1109/TGRS.2021.3124913.
    [17] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
    [18] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16 × 16 words: Transformers for image recognition at scale[C]. Proceedings of the 9th International Conference on Learning Representations, 2021.
    [19] XUE Zhixiang, TAN Xiong, YU Xuchu, et al. Deep hierarchical vision transformer for hyperspectral and LiDAR data classification[J]. IEEE Transactions on Image Processing, 2022, 31: 3095–3110. doi: 10.1109/TIP.2022.3162964.
    [20] ROY S K, DERIA A, HONG Danfeng, et al. Multimodal fusion transformer for remote sensing image classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5515620. doi: 10.1109/TGRS.2023.3286826.
    [21] YAO Jing, ZHANG Bing, LI Chenyu, et al. Extended Vision Transformer (ExViT) for land use and land cover classification: A multimodal deep learning framework[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5514415. doi: 10.1109/TGRS.2023.3284671.
    [22] ZHAO Guangrui, YE Qiaolin, SUN Le, et al. Joint classification of hyperspectral and LiDAR data using a hierarchical CNN and transformer[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5500716. doi: 10.1109/TGRS.2022.3232498.
    [23] ROY S K, SUKUL A, JAMALI A, et al. Cross hyperspectral and LiDAR attention transformer: An extended self-attention for land use and land cover classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 5512815. doi: 10.1109/TGRS.2024.3374324.
    [24] SUN Le, WANG Xinyu, ZHENG Yuhui, et al. Multiscale 3-D–2-D mixed CNN and lightweight attention-free transformer for hyperspectral and LiDAR classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 2100116. doi: 10.1109/TGRS.2024.3367374.
    [25] SMITH J T H, WARRINGTON A, and LINDERMAN S W. Simplified state space layers for sequence modeling[C]. Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, 2023: 1–13.
    [26] GU A and DAO T. Mamba: Linear-time sequence modeling with selective state spaces[EB/OL]. https://arxiv.org/abs/2312.00752, 2024.
    [27] ZHU Lianghui, LIAO Bencheng, ZHANG Qian, et al. Vision mamba: Efficient visual representation learning with bidirectional state space model[C]. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 2024.
    [28] LIU Yue, TIAN Yunjie, ZHAO Yuzhong, et al. VMamba: Visual state space model[C]. Proceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2024: 3273.
    [29] CHEN Keyan, CHEN Bowen, LIU Chenyang, et al. RSMamba: Remote sensing image classification with state space model[J]. IEEE Geoscience and Remote Sensing Letters, 2024, 21: 8002605. doi: 10.1109/LGRS.2024.3407111.
    [30] LIAO Diling, WANG Qingsong, LAI Tao, et al. Joint classification of hyperspectral and LiDAR data based on mamba[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 5530915. doi: 10.1109/TGRS.2024.3459709.
    [31] GAO Feng, JIN Xuepeng, ZHOU Xiaowei, et al. MSFMamba: Multiscale feature fusion state space model for multisource remote sensing image classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2025, 63: 5504116. doi: 10.1109/TGRS.2025.3535622.
    [32] DIAO Wenhui, GONG Shuo, XIN Linlin, et al. A model pre-training method with self-supervised strategies for multimodal remote sensing data[J]. Journal of Electronics & Information Technology, 2025, 47(6): 1658–1668. doi: 10.11999/JEIT241016.
Figures(9) / Tables(5)
Publication history
  • Revised: 2026-04-17
  • Accepted: 2026-04-17
  • Available online: 2026-05-03
