SONG Wanying, LIU Yuchen, WANG Jie, WANG Anyi. Entropy-driven Adaptive Fusion Network for Scene Classification of High-Resolution Remote Sensing Images[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT251147

Entropy-driven Adaptive Fusion Network for Scene Classification of High-Resolution Remote Sensing Images

doi: 10.11999/JEIT251147 cstr: 32379.14.JEIT251147
Funds:  The National Natural Science Foundation of China (61901358), the Natural Science Basic Research Plan in Shaanxi Province of China (2025JC-YBMS-701), the Outstanding Youth Science Fund of Xi’an University of Science and Technology (2020YQ3-09), and the China Postdoctoral Science Foundation (2020M673347)
  • Received Date: 2025-11-01
  • Accepted Date: 2026-03-03
  • Rev Recd Date: 2026-02-25
  • Available Online: 2026-03-15
  •   Objective  Remote sensing image scene classification aims to assign semantic labels to aerial or satellite images. With the rapid development of Earth observation technologies, high-resolution remote sensing images provide abundant detail but also present major challenges, including complex spatial structures, large scale variations, high intra-class variance, and strong inter-class similarity. Traditional Convolutional Neural Networks (CNNs) have achieved notable success in local spatial modeling, but their fixed receptive fields prevent them from adequately capturing long-range dependencies. To address this limitation, CNN-Transformer hybrid architectures have been proposed to balance local detail and global semantics. However, these models usually adopt simple concatenation for multi-scale feature fusion, which introduces redundancy and reduces discriminability. In addition, although the Swin Transformer uses window-based self-attention to capture contextual information, it still shows clear limitations on complex high-resolution images. Specifically, long-range dependency modeling across windows is constrained by the fixed window size, and the extraction of fine-grained local features is limited because deep networks tend to overlook crucial fine-texture information in low- and mid-level features. Moreover, existing multi-level feature fusion strategies lack semantic guidance and therefore readily introduce background noise. Hence, a network that balances global contextual modeling with local discriminability while enabling adaptive fusion is still needed.  Methods  To address limited cross-window interaction and the absence of semantic guidance in multi-level feature fusion, an Entropy-driven Adaptive Fusion Swin Transformer (E-AF-ST) network is proposed. 
The architecture uses a lightweight Swin-Tiny backbone and incorporates two key modules: the Attention-guided region Selection and feature Optimization module (ASO) and the Entropy-driven Gated Fusion Module (EGF) (Fig. 1). The ASO module addresses weak cross-window interaction and insufficient fine-grained feature extraction in the Swin Transformer through three consecutive stages (Fig. 2a). First, cross-window sparse attention is computed to remove physical window boundaries. By enlarging the patch partition size, sparse attention is applied to the entire image sequence, allowing global contextual correlations across the whole image to be captured. Second, dynamic region selection is performed. On the basis of pixel-level entropy measurement, a multilayer perceptron maps entropy features to attention scores, and a Top-k masking strategy dynamically selects the most informative discriminative regions. Third, recursive feature optimization is performed. Multi-head self-attention and layer normalization are applied at the local scale to progressively enhance boundaries and microstructural information. The EGF module then integrates the Swin Transformer output features, the globally enhanced contextual features, and the locally optimized features to reduce semantic discrepancies (Fig. 2b). First, energy normalization is performed using the Frobenius norm to obtain a probabilistic energy distribution. Next, an entropy-driven gated fusion mechanism calculates the Shannon entropy for each branch. A learnable soft-normalization gating function then maps the entropy information to normalized fusion weights, automatically reducing the weight of branches with high entropy caused by cluttered backgrounds. Finally, the fused representations undergo lightweight recursive optimization using depthwise separable convolutions and GELU activation functions with residual connections to suppress redundant information. 
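The two entropy-driven mechanisms described above can be illustrated with a minimal NumPy sketch: Top-k masking of high-entropy regions (as in the ASO module) and entropy-gated fusion weighting (as in the EGF module). This is a simplified illustration, not the authors' implementation: the MLP score head is replaced by direct entropy ranking, and the learnable soft-normalization gate by a fixed-temperature softmax over negative entropies; all function names are hypothetical.

```python
import numpy as np

def shannon_entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a nonnegative vector, normalized to a distribution."""
    p = p / (p.sum() + eps)
    return float(-(p * np.log(p + eps)).sum())

def topk_region_mask(patch_entropies, k):
    """Boolean mask selecting the k most informative patches by entropy.
    (The paper maps entropies to attention scores with an MLP; here the
    raw entropies are ranked directly for illustration.)"""
    idx = np.argsort(patch_entropies)[::-1][:k]
    mask = np.zeros_like(patch_entropies, dtype=bool)
    mask[idx] = True
    return mask

def entropy_gated_weights(branch_features, tau=1.0, eps=1e-12):
    """Normalized fusion weights that down-weight high-entropy branches.
    Each feature map is energy-normalized by its squared Frobenius norm to
    obtain a probabilistic energy distribution; a softmax over the negative
    Shannon entropies (temperature tau stands in for the learnable gate)
    then yields weights that shrink for cluttered, high-entropy branches."""
    entropies = []
    for f in branch_features:
        energy = (f ** 2) / (np.linalg.norm(f) ** 2 + eps)  # sums to ~1
        entropies.append(shannon_entropy(energy.ravel()))
    logits = -np.array(entropies) / tau
    logits -= logits.max()  # numerical stability
    w = np.exp(logits)
    return w / w.sum()
```

For example, fusing a sharply peaked (low-entropy) feature map with a uniform (high-entropy) one assigns the larger weight to the peaked branch, matching the intended behavior of suppressing branches dominated by cluttered backgrounds.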
The forward propagation process is systematically summarized in Algorithm 1.  Results and Discussions  To validate the discriminative capability of the proposed network, extensive experiments were conducted on two widely used public datasets, AID and NWPU-RESISC45. The proposed E-AF-ST network shows superior classification performance compared with existing advanced methods (Table 1). On the AID dataset, the model achieves state-of-the-art overall accuracies of 95.56% and 97.21% at training ratios of 20% and 50%, respectively. On the challenging NWPU-RESISC45 dataset, it achieves the highest accuracies of 92.45% and 94.59% at training ratios of 10% and 20%, respectively. The confusion matrices show that the recognition accuracy of most categories exceeds 95% (Fig. 7), and the misclassification proportions for classes with complex backgrounds are significantly lower than those of the baseline model (Fig. 8). Visual analysis based on Grad-CAM further confirms the advantages of the E-AF-ST network in global contextual modeling and critical region selection. Compared with the Swin-Tiny baseline, the proposed network demonstrates more precise semantic focus (Fig. 10). In “airport” and “port” scenes, background noise is effectively suppressed and key targets are accurately highlighted. In structurally complex scenes such as “viaducts” and “railway stations”, extension directions and texture characteristics are comprehensively captured. Ablation experiments confirm that the cross-window sparse attention in the ASO module and the dynamic weight allocation in the EGF module are highly complementary. Furthermore, this performance gain is achieved with only a minimal increase in model complexity, with a total of 30.45M parameters and 4.72G FLOPs.  
Conclusions  An E-AF-ST network is proposed to address insufficient extraction of local discriminative information, cross-scale feature inconsistency, and semantic redundancy in high-resolution remote sensing image scene classification. With information entropy used as a guiding metric, the ASO module enables precise selection and recursive optimization of discriminative regions, whereas the EGF module achieves adaptive and redundancy-reduced integration of multi-source features. Experimental and visual results show that the proposed method effectively reduces interference from complex backgrounds and outperforms existing mainstream CNN-Transformer hybrid architectures. This study provides a new theoretical perspective and technical route for multi-scale target perception and feature semantic alignment.
  • [1]
    李大湘, 南艺璇, 刘颖. 面向遥感图像场景分类的双知识蒸馏模型[J]. 电子与信息学报, 2023, 45(10): 3558–3567. doi: 10.11999/JEIT221017.

    LI Daxiang, NAN Yixuan, and LIU Ying. A double knowledge distillation model for remote sensing image scene classification[J]. Journal of Electronics & Information Technology, 2023, 45(10): 3558–3567. doi: 10.11999/JEIT221017.
    [2]
    吴倩倩, 倪康, 郑志忠. 基于双阶段高阶Transformer的遥感图像场景分类[J]. 遥感学报, 2025, 29(3): 792–807. doi: 10.11834/jrs.20233332.

    WU Qianqian, NI Kang, and ZHENG Zhizhong. Remote sensing image scene classification on the basis of a two-stage high-order Transformer[J]. National Remote Sensing Bulletin, 2025, 29(3): 792–807. doi: 10.11834/jrs.20233332.
    [3]
    CHEN Jianlai, XIONG Rongqi, YU Hanwen, et al. Microwave photonic synthetic aperture radar: Systems, experiments, and imaging processing[J]. IEEE Geoscience and Remote Sensing Magazine, 2025, 13(2): 314–328. doi: 10.1109/MGRS.2024.3444777.
    [4]
    尹文昕, 于海琛, 刁文辉, 等. 遥感场景理解中视觉Transformer的参数高效微调[J]. 电子与信息学报, 2024, 46(9): 3731–3738. doi: 10.11999/JEIT240218.

    YIN Wenxin, YU Haichen, DIAO Wenhui, et al. Parameter efficient fine-tuning of vision transformers for remote sensing scene understanding[J]. Journal of Electronics & Information Technology, 2024, 46(9): 3731–3738. doi: 10.11999/JEIT240218.
    [5]
    CHENG Gong, HAN Junwei, and LU Xiaoqiang. Remote sensing image scene classification: Benchmark and state of the art[J]. Proceedings of the IEEE, 2017, 105(10): 1865–1883. doi: 10.1109/JPROC.2017.2675998.
    [6]
    HOU Yan’e, YANG Kang, DANG Lanxue, et al. Contextual spatial-channel attention network for remote sensing scene classification[J]. IEEE Geoscience and Remote Sensing Letters, 2023, 20: 6008805. doi: 10.1109/LGRS.2023.3304645.
    [7]
    SHI Jiacheng, LIU Wei, SHAN Haoyu, et al. Remote sensing scene classification based on multibranch fusion attention network[J]. IEEE Geoscience and Remote Sensing Letters, 2023, 20: 3001505. doi: 10.1109/LGRS.2023.3262407.
    [8]
    PAN Wenwen, SUN Xiaofei, WANG Yilun, et al. Enhanced photovoltaic panel defect detection via adaptive complementary fusion in YOLO-ACF[J]. Scientific Reports, 2024, 14(1): 26425. doi: 10.1038/s41598-024-75772-9.
    [9]
    徐从安, 吕亚飞, 张筱晗, 等. 基于双重注意力机制的遥感图像场景分类特征表示方法[J]. 电子与信息学报, 2021, 43(3): 683–691. doi: 10.11999/JEIT200568.

    XU Congan, LÜ Yafei, ZHANG Xiaohan, et al. A discriminative feature representation method based on dual attention mechanism for remote sensing image scene classification[J]. Journal of Electronics & Information Technology, 2021, 43(3): 683–691. doi: 10.11999/JEIT200568.
    [10]
    SONG Jiayin, FAN Yiming, SONG Wenlong, et al. SwinHCST: A deep learning network architecture for scene classification of remote sensing images based on improved CNN and transformer[J]. International Journal of Remote Sensing, 2023, 44(23): 7439–7463. doi: 10.1080/01431161.2023.2285739.
    [11]
    HUANG Xinyan, LIU Fang, CUI Yuanhao, et al. Faster and better: A lightweight transformer network for remote sensing scene classification[J]. Remote Sensing, 2023, 15(14): 3645. doi: 10.3390/rs15143645.
    [12]
    DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[C]. The 9th International Conference on Learning Representations (ICLR), 2021: 1–21.
    [13]
    LIU Ze, LIN Yutong, CAO Yue, et al. Swin transformer: Hierarchical vision transformer using shifted windows[C]. The 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, Canada, 2021: 10012–10022. doi: 10.1109/ICCV48922.2021.00986.
    [14]
    JANNAT F E and WILLIS A R. Improving classification of remotely sensed images with the Swin transformer[C]. SoutheastCon 2022, Mobile, USA, 2022: 611–618. doi: 10.1109/SoutheastCon48659.2022.9764016.
    [15]
    CHANG Jing, HE Xiaohui, SONG Dingjun, et al. A multi-scale attention network for building extraction from high-resolution remote sensing images[J]. Scientific Reports, 2025, 15(1): 24938. doi: 10.1038/s41598-025-09086-9.
    [16]
    YE Zhipin, LIU Yingqian, JING Teng, et al. A high-resolution network with strip attention for retinal vessel segmentation[J]. Sensors, 2023, 23(21): 8899. doi: 10.3390/s23218899.
    [17]
    XIA Guisong, HU Jingwen, HU Fan, et al. AID: A benchmark data set for performance evaluation of aerial scene classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2017, 55(7): 3965–3981. doi: 10.1109/TGRS.2017.2685945.
    [18]
    ZHAO Zhicheng, LI Jiaqi, LUO Ze, et al. Remote sensing image scene classification based on an enhanced attention module[J]. IEEE Geoscience and Remote Sensing Letters, 2021, 18(11): 1926–1930. doi: 10.1109/LGRS.2020.3011405.
    [19]
    WANG Junjie, LI Wei, ZHANG Mengmeng, et al. Remote-sensing scene classification via multistage self-guided separation network[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5615312. doi: 10.1109/TGRS.2023.3295797.
    [20]
    YANG Yuqun, TANG Xu, CHEUNG Y M, et al. SAGN: Semantic-aware graph network for remote sensing scene classification[J]. IEEE Transactions on Image Processing, 2023, 32: 1011–1025. doi: 10.1109/TIP.2023.3238310.
    [21]
    TANG Xu, LI Mingteng, MA Jingjing, et al. EMTCAL: Efficient multiscale transformer and cross-level attention learning for remote sensing scene classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5626915. doi: 10.1109/TGRS.2022.3194505.
    [22]
    TANG Xu, MA Qinshuo, ZHANG Xiangrong, et al. Attention consistent network for remote sensing scene classification[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2021, 14: 2030–2045. doi: 10.1109/JSTARS.2021.3051569.