Remote Sensing Land-cover Classification Combining Multi-modal and Multi-scale Fusion with Mamba

XIE Wen; ZHU Chaotao; WANG Jin; MA Xiaomeng

doi:10.11999/JEIT251303

XIE Wen, ZHU Chaotao, WANG Jin, MA Xiaomeng. Remote Sensing Land-cover Classification Combining Multi-modal and Multi-scale Fusion with Mamba[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT251303

cstr: 32379.14.JEIT251303

1. School of Communications and Information Engineering, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
2. School of Electronic Engineering, Xi’an University of Electronic Science and Technology, Xi’an 710071, China

Funds: The National Natural Science Foundation of China (61901365, 62071379), The Natural Science Basic Research Plan in Shaanxi Province of China (2025JC-YBQN-936), The Scientific Research Program Funded by Education Department of Shaanxi Provincial Government (25JP175), The Youth Innovation Team of Shaanxi Universities, The New Star Team of Xi’an University of Posts and Telecommunications (xyt2016-01)

Received Date: 2025-12-08
Accepted Date: 2026-04-17
Revision Received Date: 2026-04-17

Available Online: 2026-05-03

Abstract

Objective The rapid development of remote sensing imaging has generated large-scale and diverse data for remote sensing land-cover classification. In recent years, Mamba-based models have been successfully applied to image processing because of their distinctive architectures and strong global modeling capability. Among them, multi-scale vision Mamba models are well suited to complex spatial distributions. This property matches remote sensing scenes, in which ground objects often have large scale variations and complex orientations. To fully use the advantages of Mamba in feature extraction and fusion for remote sensing data, a Mamba-based Multi-modal and Multi-scale fusion model for Remote Sensing land-cover classification (M3RS) is proposed.

Methods M3RS mainly contains three stages for feature extraction and fusion. First, a Multi-Scale Spatial Encoder based on Spatial Mamba is used to extract features from Light Detection And Ranging (LiDAR) images and Synthetic Aperture Radar (SAR) images. Considering the unique data structure of HyperSpectral Image (HSI), a Multi-Scale Spatio-Spectral Encoder is proposed to extract complex spatio-spectral features by using Spatial Mamba and Spectral Mamba. Next, a Multi-Modal Feature Fusion Module, consisting of the proposed Cross-Mamba and Channel-Concatenated Mamba, is designed to fuse multi-modal features. Cross-Mamba efficiently fuses multi-modal spatial features through the interaction of State Space Model (SSM) parameters from different modalities. Channel-Concatenated Mamba further fuses multi-modal features by constructing four channel scanning strategies. Finally, an improved Multi-Scale Feature Fusion Module is adopted to fuse multi-scale features layer by layer. This design provides highly discriminative features for classification and improves the accuracy of remote sensing land-cover classification.
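
The parameter interaction described for Cross-Mamba can be pictured with a toy selective state-space scan. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say which SSM parameters are exchanged, so the code assumes (by analogy with cross-attention) that one modality supplies the step size and input projection while the other supplies the output projection C. The function names `selective_scan` and `cross_scan`, the single-channel diagonal state matrix, and all numeric values are hypothetical.

```python
import math


def selective_scan(x, delta, A, B, C):
    """Diagonal selective SSM over a single-channel sequence.

    x, delta : lists of length T (input tokens and per-token step sizes)
    A        : list of length N (diagonal state matrix, negative = decay)
    B, C     : T lists of length N (per-token input/output projections)

    Recurrence (zero-order hold on A, Euler step on B, as in Mamba):
        h_t = exp(delta_t * A) * h_{t-1} + delta_t * B_t * x_t
        y_t = sum_k C_t[k] * h_t[k]
    """
    n = len(A)
    h = [0.0] * n
    y = []
    for t in range(len(x)):
        h = [
            math.exp(delta[t] * A[k]) * h[k] + delta[t] * B[t][k] * x[t]
            for k in range(n)
        ]
        y.append(sum(C[t][k] * h[k] for k in range(n)))
    return y


def cross_scan(x_a, params_a, params_b, A):
    """ASSUMED cross-modal variant: the state is driven by modality a
    (its tokens, delta and B) but read out with modality b's C."""
    return selective_scan(x_a, params_a["delta"], A, params_a["B"], params_b["C"])


# Hypothetical toy inputs: three tokens, state size two.
A = [-1.0, -0.5]
hsi_tokens = [1.0, 0.5, -0.2]
hsi_params = {
    "delta": [0.1, 0.2, 0.1],
    "B": [[1.0, 0.5], [0.8, 0.3], [0.2, 0.9]],
    "C": [[0.5, 0.5], [0.4, 0.6], [0.7, 0.1]],
}
lidar_params = {
    "delta": [0.1, 0.1, 0.1],
    "B": [[0.3, 0.3], [0.6, 0.2], [0.1, 0.4]],
    "C": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
}

# HSI sequence read out through LiDAR-derived output projections.
fused = cross_scan(hsi_tokens, hsi_params, lidar_params, A)
print(fused)
```

In a real Mamba block the step size and the projections B and C are produced from the input tokens by learned linear layers; here they are passed in as fixed lists to keep the recurrence itself visible.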
Results and Discussions Comparative experiments are conducted on three publicly available multi-modal remote sensing land-cover classification datasets. The proposed model is compared with seven mainstream models. The results show that M3RS achieves the best Overall Accuracy (OA), Average Accuracy (AA), and Kappa coefficient among all compared methods. On the Muufl dataset, the OA of M3RS is 3.49%, 3.80%, and 4.02% higher than those of representative Convolutional Neural Network (CNN)-, Transformer-, and Mamba-based models, respectively (Table 1, Fig. 8). On the Houston2013 and Augsburg datasets, the OA of M3RS exceeds those of all compared algorithms by an average of 3.37% and 3.11%, respectively (Tables 2 and 3). These results indicate that integrating a multi-modal and multi-scale architecture with Mamba improves the accuracy of remote sensing land-cover classification. In addition, the ablation experiment verifies the contribution of each proposed module to classification performance (Table 4). Spectral Mamba provides a clear accuracy gain, and the fusion modules further improve the overall performance to different degrees. The hyperparameter experiment also provides a useful configuration for multi-scale remote sensing image fusion (Table 5). Compared with a Transformer model using the same multi-scale architecture, M3RS achieves higher classification accuracy, reduces the parameter count by 37.4%, and shortens the training time by 10.7%. These results show that Mamba improves both accuracy and efficiency in this framework (Fig. 9).

Conclusions M3RS uses Mamba to fuse multi-modal and multi-scale features, thereby improving remote sensing land-cover classification. The heterogeneous encoders in M3RS address differences among multi-modal data and provide richer complementary information for fusion and classification. Cross-Mamba and Channel-Concatenated Mamba account for both the similarities and differences between Mamba and Transformer. They achieve efficient multi-modal spatial feature interaction and comprehensive multi-modal feature fusion, respectively, forming a hierarchical fusion strategy. The multi-scale architecture also alleviates the difficulty caused by complex spatial distributions of remote sensing land covers. The proposed Multi-Scale Feature Fusion Module, composed of Spatial Mamba and channel attention, integrates multi-scale features and provides a reliable basis for subsequent classification. Future work will further optimize the model by exploring the principles of Mamba and refining feature alignment in cross-attention-based multi-modal interaction, thereby improving the reliability of feature fusion.
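
The three evaluation measures named in the results (OA, AA, and the Kappa coefficient) are standard quantities derived from a confusion matrix. As a minimal sketch of how they are usually computed, the following uses a made-up three-class confusion matrix; the function name `classification_metrics` and the toy counts are hypothetical and are not taken from the paper's experiments.

```python
def classification_metrics(confusion):
    """Return (OA, AA, kappa) for a square confusion matrix.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))

    # Overall Accuracy: fraction of all samples classified correctly.
    oa = correct / total

    # Average Accuracy: mean per-class recall; empty classes are skipped.
    row_sums = [sum(row) for row in confusion]
    recalls = [
        confusion[i][i] / row_sums[i] for i in range(n_classes) if row_sums[i] > 0
    ]
    aa = sum(recalls) / len(recalls)

    # Cohen's kappa: observed agreement corrected for chance agreement p_e.
    col_sums = [
        sum(confusion[i][j] for i in range(n_classes)) for j in range(n_classes)
    ]
    p_e = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    kappa = (oa - p_e) / (1.0 - p_e)
    return oa, aa, kappa


# Hypothetical confusion matrix (rows: true class, columns: predicted class).
toy_confusion = [
    [50, 2, 3],
    [4, 40, 6],
    [1, 2, 42],
]
oa, aa, kappa = classification_metrics(toy_confusion)
print(f"OA={oa:.4f}  AA={aa:.4f}  Kappa={kappa:.4f}")
```

OA weights every sample equally, so large classes dominate it, whereas AA weights every class equally. Reporting both alongside Kappa is common on land-cover datasets where class sizes differ widely.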
Keywords:
- Remote sensing land-cover classification
- Mamba
- Multi-modal feature fusion
- Multi-scale feature fusion
