SONG Wanying, LIU Yuchen, WANG Jie, WANG Anyi. Construction and Scene Classification Research of Entropy-Driven Adaptive Fusion Networks for High-Resolution Remote Sensing Images[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT251147

Construction and Scene Classification Research of Entropy-Driven Adaptive Fusion Networks for High-Resolution Remote Sensing Images

doi: 10.11999/JEIT251147 cstr: 32379.14.JEIT251147
Funds:  Natural Science Foundation of China (61901358); Natural Science Basic Research Plan in Shaanxi Province of China (2025JC-YBMS-701); Outstanding Youth Science Fund of Xi’an University of Science and Technology (2020YQ3-09); China Postdoctoral Science Foundation (2020M673347)
  • Accepted Date: 2026-03-03
  • Rev Recd Date: 2026-03-03
  • Available Online: 2026-03-15
  •   Objective  Remote sensing image scene classification aims to assign semantic labels to aerial or satellite imagery. With the rapid development of earth observation technologies, high-resolution remote sensing images contain abundant detail but also pose significant challenges, including complex spatial structures, large scale variations, high intra-class variance, and strong inter-class similarity. Traditional Convolutional Neural Networks (CNNs) achieve notable success in local spatial modeling but struggle to model long-range dependencies adequately due to their fixed receptive fields. To overcome this, CNN-Transformer hybrid architectures have been proposed to balance local details and global semantics. However, such models typically employ simple concatenation when fusing multi-scale features, introducing redundancy and weakening discriminability. Furthermore, although the Swin Transformer uses window-based self-attention to capture contextual information, it exhibits clear limitations when processing complex high-resolution images. Specifically, cross-window long-range dependency modeling is restricted by the fixed window size. The extraction of fine-grained local features is also limited, as deep networks tend to ignore crucial fine-texture cues from low- and mid-level features. Moreover, existing multi-level feature fusion strategies lack semantic guidance and easily introduce background noise. Therefore, constructing a network that balances global contextual modeling with local discriminability while realizing adaptive fusion remains a critical problem.  Methods  To address the limitations of cross-window interaction and the lack of semantic guidance during multi-level feature fusion, an Entropy-driven Adaptive Fusion Swin Transformer Network (E-AF-ST) is proposed. 
The architecture utilizes a lightweight Swin-Tiny backbone and embeds two key innovative modules: the Attention-guided Region Selection and Feature Optimization Module (ASO) and the Entropy-driven Gated Fusion Module (EGF) (Fig. 1). The ASO module resolves the weak cross-window interaction and insufficient fine-grained feature extraction of the Swin Transformer through three consecutive stages (Fig. 2a). First, a cross-window sparse attention computation eliminates physical window boundaries. By expanding the patch partition size, sparse attention is applied across the entire image sequence, capturing global contextual correlations spanning the whole image. Second, dynamic region selection is executed. Based on a pixel-level entropy measurement, a Multilayer Perceptron maps entropy features into attention scores, and a Top-K masking strategy dynamically screens the most informative discriminative regions. Third, recursive feature optimization applies multi-head self-attention and layer normalization at the local scale to progressively enhance boundaries and micro-structural information. Subsequently, the EGF module integrates the Swin Transformer output features, the globally enhanced context features, and the locally optimized features to mitigate semantic discrepancies (Fig. 2b). Initially, energy normalization is conducted using the Frobenius norm to obtain a probability-normalized energy distribution. Then, an entropy-driven gated fusion mechanism computes the Shannon entropy for each branch. A learnable soft-normalization gating function maps the entropy information into normalized fusion weights, automatically reducing the weight of branches exhibiting high entropy due to cluttered backgrounds. Finally, the fused representations undergo lightweight recursive optimization utilizing depth-wise separable convolutions and GELU activation functions with residual connections to suppress redundant information. 
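The entropy-driven steps above (Top-K region screening and entropy-gated branch fusion) can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the paper's implementation: the histogram-based pixel entropy, the fixed exponential gate standing in for the learnable soft-normalization gating, and all function names are assumptions for illustration.

```python
import numpy as np

def pixel_entropy(patch, bins=16):
    """Shannon entropy of a patch's intensity histogram -- a simple
    stand-in for the paper's pixel-level entropy measurement."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def top_k_region_mask(entropies, k):
    """Top-K masking: keep the k most informative regions, drop the rest."""
    idx = np.argsort(entropies)[::-1][:k]      # indices of the k highest scores
    mask = np.zeros_like(entropies, dtype=bool)
    mask[idx] = True
    return mask

def entropy_gated_fusion(branches, temperature=1.0):
    """Fuse branch feature maps with weights that shrink as a branch's
    energy-distribution entropy grows (high entropy ~ cluttered background).
    Energy normalization -> Shannon entropy -> soft-normalized gate."""
    entropies = []
    for f in branches:
        energy = f**2 / max((f**2).sum(), 1e-12)  # normalized energy distribution
        p = energy[energy > 0]
        entropies.append(-(p * np.log(p)).sum())
    entropies = np.array(entropies)
    w = np.exp(-entropies / temperature)          # fixed gate; learnable in the paper
    w = w / w.sum()                               # normalized fusion weights
    fused = sum(wi * fi for wi, fi in zip(w, branches))
    return fused, w
```

A branch whose energy concentrates on a few salient responses (low entropy) receives a larger fusion weight than one with a near-uniform, cluttered response, which mirrors the down-weighting behavior described above.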
The forward propagation process is summarized in Algorithm 1.  Results and Discussions  To validate the discriminative capability of the proposed network, extensive experiments were conducted on two widely adopted public datasets: the AID dataset and the NWPU-RESISC45 dataset. The proposed E-AF-ST network demonstrates superior classification performance compared with existing advanced methods (Table 1). On the AID dataset, the model achieves state-of-the-art overall accuracies of 95.56% and 97.21% under 20% and 50% training ratios, respectively. On the challenging NWPU-RESISC45 dataset, it achieves the highest accuracies of 92.45% and 94.59% under 10% and 20% training ratios, respectively. The confusion matrices reveal that the recognition accuracy for most categories exceeds 95% (Fig. 7), and misclassification proportions in classes with complex backgrounds are significantly lower than those of the baseline model (Fig. 8). Visual analysis using Grad-CAM validates the advantages of the E-AF-ST network in global contextual modeling and critical-region screening. Compared with the Swin-Tiny baseline, the proposed network demonstrates precise semantic focusing (Fig. 10). In "airport" and "port" scenes, the model successfully suppresses background noise and accurately highlights key targets. In structurally complex scenes such as "viaducts" and "railway stations", it comprehensively captures extension directions and textures. Ablation experiments confirm that the cross-window sparse attention in the ASO module and the dynamic weight allocation in the EGF module are highly complementary. Furthermore, the E-AF-ST network achieves this performance gain with a minimal parameter increase, totaling only 30.45M parameters and 4.72G FLOPs.  
Conclusions  This paper proposes an Entropy-driven Adaptive Fusion Swin Transformer Network (E-AF-ST) to tackle insufficient local discriminative information extraction, cross-scale feature inconsistency, and semantic redundancy in high-resolution remote sensing image scene classification. By introducing information entropy as a guiding metric, the ASO module achieves precise screening and recursive optimization of discriminative regions, while the EGF module realizes adaptive, redundancy-free integration of multi-source features. Experimental and visual results demonstrate that the proposed method effectively overcomes complex background interference, outperforming existing mainstream CNN and Transformer hybrid architectures. This work provides a novel theoretical perspective and technical pathway for addressing multi-scale target perception and feature semantic alignment.
