Entropy-Driven Adaptive Fusion Networks for High-Resolution Remote Sensing Image Scene Classification
Abstract: High-resolution remote sensing image scene classification faces significant challenges from complex backgrounds, diverse imaging conditions, and large intra-class variation. Traditional Convolutional Neural Network (CNN) methods are limited in global context modeling, and the Swin Transformer still falls short in cross-window feature interaction, fine-grained local feature extraction, and adaptive fusion of multi-level features. To address these problems, this paper proposes an entropy-driven adaptive fusion network for high-resolution remote sensing image scene classification. The main contributions are as follows: (1) An Attention-guided Region Selection and Feature Optimization module (ASO) is designed, which strengthens global modeling and screens key regions through cross-window sparse attention, and reinforces local feature representation through recursive optimization, improving both cross-window interaction and the discriminability of fine-grained local features. (2) An Entropy-driven Gated Fusion module (EGF) is constructed, which adaptively fuses the Swin features, the global context, and the optimized local features via an entropy-guided gating mechanism, overcoming the redundancy introduced by naive fusion of multi-level features. (3) Experiments on the public AID and NWPU-RESISC45 datasets show that the proposed method outperforms a range of state-of-the-art methods in classification accuracy and exhibits good robustness and generalization.
Keywords:
- Remote sensing image scene classification /
- Entropy /
- Feature fusion /
- Swin Transformer network
Abstract: Objective Remote sensing image scene classification aims to assign semantic labels to aerial or satellite imagery. With the rapid development of earth observation technologies, high-resolution remote sensing images contain abundant details but also present significant challenges, including complex spatial structures, large scale variations, high intra-class variance, and strong inter-class similarity. Traditional Convolutional Neural Networks (CNNs) achieve notable success in local spatial modeling but struggle to adequately model long-range dependencies due to fixed receptive fields. To overcome this, CNN-Transformer hybrid architectures have been proposed to balance local details and global semantics. However, such models typically employ simple concatenation when fusing multi-scale features, introducing redundancy and weakening discriminability. Furthermore, while the Swin Transformer utilizes window-based self-attention to capture contextual information, it exhibits notable limitations when processing complex high-resolution images. Specifically, cross-window long-range dependency modeling is restricted by the fixed window size. The extraction of fine-grained local features is also limited, as deep networks tend to ignore crucial fine-texture cues from low- and mid-level features. Moreover, existing multi-level feature fusion strategies lack semantic guidance and easily introduce background noise. Therefore, constructing a network that balances global contextual modeling with local discriminability while realizing adaptive fusion remains a critical problem. Methods To address the limitations of cross-window interaction and the lack of semantic guidance during multi-level feature fusion, an Entropy-driven Adaptive Fusion Swin Transformer Network (E-AF-ST) is proposed.
The architecture utilizes a lightweight Swin-Tiny backbone and embeds two key innovative modules: the Attention-guided Region Selection and Feature Optimization Module (ASO) and the Entropy-driven Gated Fusion Module (EGF) (Fig. 1). The ASO module resolves the weak cross-window interaction and insufficient fine-grained feature extraction of the Swin Transformer through three consecutive stages (Fig. 2a). First, a cross-window sparse attention computation eliminates physical window boundaries. By expanding the patch partition size, sparse attention is applied across the entire image sequence, capturing global contextual correlations spanning the whole image. Second, dynamic region selection is executed. Based on a pixel-level entropy measurement, a Multilayer Perceptron maps entropy features into attention scores, and a Top-K masking strategy dynamically screens the most informative discriminative regions. Third, recursive feature optimization applies multi-head self-attention and layer normalization at the local scale to progressively enhance boundaries and micro-structural information. Subsequently, the EGF module integrates the Swin Transformer output features, the globally enhanced context features, and the locally optimized features to mitigate semantic discrepancies (Fig. 2b). Initially, energy normalization is conducted using the Frobenius norm to obtain a normalized energy distribution. Then, an entropy-driven gated fusion mechanism computes the Shannon entropy for each branch. A learnable soft-normalization gating function maps the entropy information into normalized fusion weights, automatically reducing the weight of branches exhibiting high entropy due to cluttered backgrounds. Finally, the fused representations undergo lightweight recursive optimization utilizing depth-wise separable convolutions and GELU activation functions with residual connections to suppress redundant information.
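The entropy-guided Top-K region selection at the heart of the ASO module can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the learnable MLP+Sigmoid scoring head is replaced by a fixed sigmoid stand-in, and all names (`entropy_topk_mask`, `tokens`) are hypothetical.

```python
import numpy as np

def entropy_topk_mask(features, k):
    """Entropy-guided Top-K region selection (illustrative sketch).

    features: (N, C) array of N token/region features.
    Returns a binary mask M keeping the k highest-scoring regions,
    so that F_sel = F_global * M, as in Algorithm 1 (lines 5-7).
    """
    # Per-region Shannon entropy over softmax-normalized channels.
    p = np.exp(features - features.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    h = -(p * np.log(p + 1e-12)).sum(axis=1)          # (N,)
    # Stand-in for Sigmoid(MLP(entropy)): squash entropy to (0, 1) scores.
    alpha = 1.0 / (1.0 + np.exp(-h))
    # Hard Top-K mask over the scores.
    mask = np.zeros_like(alpha)
    mask[np.argsort(alpha)[-k:]] = 1.0
    return mask

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))        # 8 regions, 16 channels
m = entropy_topk_mask(tokens, k=3)
selected = tokens * m[:, None]           # unselected regions are zeroed out
```

In the actual module the score head is trained end-to-end, so the entropy-to-score mapping is learned rather than fixed as here.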
The forward propagation process is systematically summarized in Algorithm 1. Results and Discussions To validate the discriminative capability of the proposed network, extensive experimental evaluations were conducted on two widely adopted public datasets: the AID dataset and the NWPU-RESISC45 dataset. The proposed E-AF-ST network demonstrates superior classification performance compared to existing advanced methods (Table 1). On the AID dataset, the model achieves state-of-the-art overall accuracies of 95.56% and 97.21% under 20% and 50% training ratios. On the challenging NWPU-RESISC45 dataset, it achieves the highest accuracies of 92.45% and 94.59% under 10% and 20% training ratios. The confusion matrices reveal that the recognition accuracy for most categories exceeds 95% (Fig. 7), and misclassification proportions in classes with complex backgrounds are significantly lower than those of the baseline model (Fig. 8). Visual analysis using Grad-CAM validates the advantages of the E-AF-ST network in global contextual modeling and critical region screening. Compared to the Swin-Tiny baseline, the proposed network demonstrates precise semantic focusing capabilities (Fig. 10). In "airport" and "port" scenes, the model successfully suppresses background noise, accurately highlighting key targets. In structurally complex scenes such as "viaducts" and "railway stations", it comprehensively captures their extension directions and textures. Ablation experiments confirm that the cross-window sparse attention in the ASO module and the dynamic weight allocation in the EGF module are highly complementary.
Furthermore, the E-AF-ST network achieves this performance enhancement with a minimal parameter increase, totaling only 30.45M parameters and 4.72G FLOPs. Conclusions This paper proposes an Entropy-driven Adaptive Fusion Swin Transformer Network (E-AF-ST) to tackle insufficient local discriminative information extraction, cross-scale feature inconsistency, and semantic redundancy in high-resolution remote sensing image scene classification. By introducing information entropy as a guiding metric, the ASO module achieves precise screening and recursive optimization of discriminative regions, while the EGF module realizes adaptive, redundancy-free integration of multi-source features. Experimental and visual results demonstrate that the proposed method effectively overcomes complex background interference, outperforming existing mainstream CNN and Transformer hybrid architectures. This work provides a novel theoretical perspective and technical pathway for addressing multi-scale target perception and feature semantic alignment.
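The entropy-driven gating that the EGF module uses to weight its three branches (Algorithm 1, lines 14-16) can be sketched as follows. This is a minimal NumPy illustration, not the trained module: the learnable gate is reduced to a fixed scalar `alpha`, and the names (`egf_weights`, `branches`) are hypothetical.

```python
import numpy as np

def egf_weights(branches, alpha=1.0):
    """Entropy-driven gated fusion weights (illustrative sketch).

    branches: list of 2-D feature maps. Each map is energy-normalized by
    its squared Frobenius norm to get a probability distribution, the
    Shannon entropy of that distribution is computed, and a softmax over
    alpha*(1 - H) converts entropies into fusion weights: cluttered
    (high-entropy) branches receive smaller weights.
    """
    entropies = []
    for f in branches:
        e = f.astype(float) ** 2
        p = e / (np.linalg.norm(f) ** 2 + 1e-12)   # energy distribution, sums to ~1
        h = -(p * np.log(p + 1e-12)).sum()
        entropies.append(h / np.log(f.size))       # normalize to [0, 1]
    s = alpha * (1.0 - np.array(entropies))
    w = np.exp(s - s.max())
    return w / w.sum()                             # softmax -> fusion weights

rng = np.random.default_rng(1)
flat = rng.normal(size=(4, 4))                 # cluttered: energy spread out
peaked = np.zeros((4, 4)); peaked[0, 0] = 1.0  # focused: energy on one cell
w = egf_weights([peaked, flat])
# The focused (low-entropy) branch receives the larger fusion weight.
```

The fused feature is then the weight-sum of the branches, F_mix = w1*F_swin + w2*F_global + w3*F_local, as in Algorithm 1 line 17.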
Algorithm 1  Forward propagation of the E-AF-ST network

Input: high-resolution remote sensing image $I \in \mathbb{R}^{H\times W\times C}$
Parameters: number of ASO recursions $T$, number of EGF recursions $K$
Output: scene classification probability distribution $P$
1: // Stage 1: multi-level feature extraction
2: $F_{\mathrm{swin}} \leftarrow \text{SwinTiny}(I)$
3: // Stage 2: region selection and feature optimization (ASO)
4: $F_{\mathrm{global}} \leftarrow \text{SparseAttention}(F_{\mathrm{swin}})$
5: $\alpha \leftarrow \text{Sigmoid}(\text{MLP}(\text{Entropy}(F_{\mathrm{global}})))$
6: $M \leftarrow \text{TopK}(\alpha)$
7: $F_{\mathrm{sel}} \leftarrow F_{\mathrm{global}} \odot M$
8: $S_0 \leftarrow F_{\mathrm{sel}}$
9: For $t = 1$ to $T$ do
10:   $S_t \leftarrow \text{LayerNorm}(S_{t-1}) + \text{MHSA}(S_{t-1})$
11: End For
12: $F_{\mathrm{local}} \leftarrow \text{Reshape}(S_T)$
13: // Stage 3: entropy-driven gated fusion (EGF)
14: $P_{\mathrm{swin}}, P_{\mathrm{global}}, P_{\mathrm{local}} \leftarrow \text{EnergyNorm}(F_{\mathrm{swin}}, F_{\mathrm{global}}, F_{\mathrm{local}})$
15: $H_s, H_g, H_l \leftarrow \text{CalcEntropy}(P_{\mathrm{swin}}, P_{\mathrm{global}}, P_{\mathrm{local}})$
16: $w \leftarrow \text{Softmax}(\alpha \cdot [1-H_s, 1-H_g, 1-H_l])$
17: $F_{\mathrm{mix}} \leftarrow w_1 F_{\mathrm{swin}} + w_2 F_{\mathrm{global}} + w_3 F_{\mathrm{local}}$
18: $F_{\mathrm{ref}}^{0} \leftarrow F_{\mathrm{mix}}$
19: For $k = 1$ to $K$ do
20:   $F_{\mathrm{ref}}^{k} \leftarrow \text{LayerNorm}(\text{Conv}(\text{GELU}(F_{\mathrm{ref}}^{k-1}))) + F_{\mathrm{ref}}^{k-1}$
21: End For
22: $F_{\mathrm{fused}} \leftarrow F_{\mathrm{ref}}^{K} + F_{\mathrm{mix}}$
23: // Stage 4: classification output
24: $y \leftarrow \text{GlobalAvgPool}(F_{\mathrm{fused}})$
25: $P \leftarrow \text{Softmax}(\text{Classifier}(y))$
26: Return $P$

Table 1  Ablation OA (%) under different training ratios
                 AID                      NWPU-RESISC45
Method       20% train    50% train    10% train    20% train
Baseline     94.56        96.92        90.84        93.18
+ASO         95.12        97.05        91.52        93.97
+EGF         94.78        96.89        91.20        93.65
E-AF-ST      95.56        97.21        92.45        94.59

Table 2  Classification accuracy (%) of different methods on the AID and NWPU-RESISC45 datasets
Method               AID 20% train   AID 50% train   NWPU 10% train   NWPU 20% train   Params (M)   FLOPs (G)
Swin-Tiny[14]        94.56±0.14      96.92±0.12      90.84±0.09       93.18±0.15       29           4.5
ResNet101+EAM[22]    94.26±0.11      97.06±0.19      91.91±0.22       94.29±0.09       -            -
MGS-Net[23]          95.46±0.21      97.18±0.16      92.40±0.16       94.57±0.12       -            -
SAGN[24]             95.17±0.12      96.77±0.18      91.73±0.18       93.49±0.10       -            -
CSCA-Net[6]          94.67±0.20      96.83±0.14      91.27±0.11       93.72±0.10       -            -
MBAF-Net[7]          93.98±0.15      96.93±0.16      91.61±0.14       94.01±0.08       24.48        4.51
EMTCAL[23]           94.69±0.14      96.41±0.23      91.63±0.19       93.65±0.12       -            -
AC-Net[24]           93.33±0.29      95.38±0.29      91.09±0.13       92.42±0.16       -            -
E-AF-ST (ours)       95.56±0.19      97.21±0.16      92.45±0.15       94.59±0.11       30.45        4.72
[1] LI Daxiang, NAN Yixuan, and LIU Ying. A double knowledge distillation model for remote sensing image scene classification[J]. Journal of Electronics & Information Technology, 2023, 45(10): 3558–3567. doi: 10.11999/JEIT221017.
[2] WU Qianqian, NI Kang, and ZHENG Zhizhong. Remote sensing image scene classification on the basis of a two-stage high-order Transformer[J]. National Remote Sensing Bulletin, 2025, 29(3): 792–807. doi: 10.11834/jrs.20233332.
[3] CHEN Jianlai, XIONG Rongqi, YU Hanwen, et al. Microwave photonic synthetic aperture radar: Systems, experiments, and imaging processing[J]. IEEE Geoscience and Remote Sensing Magazine, 2025, 13(2): 314–328. doi: 10.1109/MGRS.2024.3444777.
[4] YIN Wenxin, YU Haichen, DIAO Wenhui, et al. Parameter efficient fine-tuning of vision transformers for remote sensing scene understanding[J]. Journal of Electronics & Information Technology, 2024, 46(9): 3731–3738. doi: 10.11999/JEIT240218.
[5] CHENG Gong, HAN Junwei, and LU Xiaoqiang. Remote sensing image scene classification: Benchmark and state of the art[J]. Proceedings of the IEEE, 2017, 105(10): 1865–1883. doi: 10.1109/JPROC.2017.2675998.
[6] HOU Yan'e, YANG Kang, DANG Lanxue, et al. Contextual spatial-channel attention network for remote sensing scene classification[J]. IEEE Geoscience and Remote Sensing Letters, 2023, 20: 6008805. doi: 10.1109/LGRS.2023.3304645.
[7] SHI Jiacheng, LIU Wei, SHAN Haoyu, et al. Remote sensing scene classification based on multibranch fusion attention network[J]. IEEE Geoscience and Remote Sensing Letters, 2023, 20: 3001505. doi: 10.1109/LGRS.2023.3262407.
[8] PAN Wenwen, SUN Xiaofei, WANG Yilun, et al. Enhanced photovoltaic panel defect detection via adaptive complementary fusion in YOLO-ACF[J]. Scientific Reports, 2024, 14(1): 26425. doi: 10.1038/s41598-024-75772-9.
[9] XU Congan, LÜ Yafei, ZHANG Xiaohan, et al. A discriminative feature representation method based on dual attention mechanism for remote sensing image scene classification[J]. Journal of Electronics & Information Technology, 2021, 43(3): 683–691. doi: 10.11999/JEIT200568.
[10] SONG Jiayin, FAN Yiming, SONG Wenlong, et al. SwinHCST: A deep learning network architecture for scene classification of remote sensing images based on improved CNN and transformer[J]. International Journal of Remote Sensing, 2023, 44(23): 7439–7463. doi: 10.1080/01431161.2023.2285739.
[11] HUANG Xinyan, LIU Fang, CUI Yuanhao, et al. Faster and better: A lightweight transformer network for remote sensing scene classification[J]. Remote Sensing, 2023, 15(14): 3645. doi: 10.3390/rs15143645.
[12] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[C]. Proceedings of the 9th International Conference on Learning Representations (ICLR), 2021: 1–21.
[13] LIU Ze, LIN Yutong, CAO Yue, et al. Swin transformer: Hierarchical vision transformer using shifted windows[C]. Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, Canada, 2021: 10012–10022. doi: 10.1109/ICCV48922.2021.00986.
[14] JANNAT F E and WILLIS A R. Improving classification of remotely sensed images with the Swin transformer[C]. SoutheastCon 2022, Mobile, USA, 2022: 611–618. doi: 10.1109/SoutheastCon48659.2022.9764016.
[15] CHANG Jing, HE Xiaohui, SONG Dingjun, et al. A multi-scale attention network for building extraction from high-resolution remote sensing images[J]. Scientific Reports, 2025, 15(1): 24938. doi: 10.1038/s41598-025-09086-9.
[16] YE Zhipin, LIU Yingqian, JING Teng, et al. A high-resolution network with strip attention for retinal vessel segmentation[J]. Sensors, 2023, 23(21): 8899. doi: 10.3390/s23218899.
[17] YU Shihai, ZHANG Xu, and SONG Huihui. Sparse mix-attention transformer for multispectral image and hyperspectral image fusion[J]. Remote Sensing, 2024, 16(1): 144. doi: 10.3390/rs16010144.
[18] XIA Guisong, HU Jingwen, HU Fan, et al. AID: A benchmark data set for performance evaluation of aerial scene classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2017, 55(7): 3965–3981. doi: 10.1109/TGRS.2017.2685945.
[19] CHEN Jianlai, XIONG Rongqi, JIANG Nan, et al. High phase-preserving autofocus imaging for squinted airborne synthetic aperture radar[J]. IEEE Transactions on Geoscience and Remote Sensing, 2025, 63: 5215315. doi: 10.1109/TGRS.2025.3587539.
[20] ZHAO Zhicheng, LI Jiaqi, LUO Ze, et al. Remote sensing image scene classification based on an enhanced attention module[J]. IEEE Geoscience and Remote Sensing Letters, 2021, 18(11): 1926–1930. doi: 10.1109/LGRS.2020.3011405.
[21] WANG Junjie, LI Wei, ZHANG Mengmeng, et al. Remote-sensing scene classification via multistage self-guided separation network[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5615312. doi: 10.1109/TGRS.2023.3295797.
[22] YANG Yuqun, TANG Xu, CHEUNG Y M, et al. SAGN: Semantic-aware graph network for remote sensing scene classification[J]. IEEE Transactions on Image Processing, 2023, 32: 1011–1025. doi: 10.1109/TIP.2023.3238310.
[23] TANG Xu, LI Mingteng, MA Jingjing, et al. EMTCAL: Efficient multiscale transformer and cross-level attention learning for remote sensing scene classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5626915. doi: 10.1109/TGRS.2022.3194505.
[24] TANG Xu, MA Qinshuo, ZHANG Xiangrong, et al. Attention consistent network for remote sensing scene classification[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2021, 14: 2030–2045. doi: 10.1109/JSTARS.2021.3051569.