Parameter Efficient Fine-tuning of Vision Transformers for Remote Sensing Scene Understanding

YIN Wenxin; YU Haichen; DIAO Wenhui; SUN Xian; FU Kun

doi:10.11999/JEIT240218

Volume 46 Issue 9

Sep. 2024

Citation:

YIN Wenxin, YU Haichen, DIAO Wenhui, SUN Xian, FU Kun. Parameter Efficient Fine-tuning of Vision Transformers for Remote Sensing Scene Understanding[J]. Journal of Electronics & Information Technology, 2024, 46(9): 3731-3738. doi: 10.11999/JEIT240218

YIN Wenxin^{1, 3},
YU Haichen^{1, 2, 3},
DIAO Wenhui^{1, 3},
SUN Xian^{1, 2, 3},
FU Kun^{1, 3}

1. Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
2. School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
3. Key Laboratory of Network Information System Technology (NIST), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China

Funds: The National Key R&D Program of China (2022ZD0118401)

Received Date: 2024-03-29
Rev Recd Date: 2024-07-17

Available Online: 2024-08-02

Publish Date: 2024-09-26

Abstract

With the rapid development of deep learning and computer vision, fine-tuning pre-trained models for remote sensing tasks often requires substantial computational resources. To reduce memory requirements and training costs, this paper proposes the Multi-Fusion Adapter (MuFA), a method for fine-tuning remote sensing models. MuFA introduces a fusion module that combines bottleneck modules with different down-sampling rates and connects them in parallel with the original vision Transformer. During training, the parameters of the original vision Transformer are frozen, and only the MuFA modules and the classification head are fine-tuned. Experimental results show that MuFA outperforms other parameter-efficient fine-tuning methods on the UCM and NWPU-RESISC45 remote sensing scene classification datasets. MuFA thus maintains model performance while reducing resource overhead, which makes it promising for a range of remote sensing applications.
Keywords:
- Remote sensing
- Scene classification
- Parameter efficient
- Deep learning
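
To give a rough sense of why freezing the backbone and training only parallel bottleneck branches is cheap, the sketch below counts trainable parameters for a hypothetical configuration. It is not the authors' implementation. The backbone size (about 86M parameters, width 768, 12 blocks, roughly a ViT-B/16), the down-sampling rates (8, 16, 32), one fused adapter per block, and fusion by plain summation with no extra parameters are all assumptions made for illustration; the names `BottleneckSpec`, `fused_adapter_params`, and `trainable_fraction` are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class BottleneckSpec:
    """One bottleneck branch: linear down-projection, then linear up-projection."""
    dim: int        # token width of the frozen backbone (assumed)
    reduction: int  # down-sampling rate; hidden width = dim // reduction

    @property
    def hidden(self) -> int:
        return self.dim // self.reduction

    def num_params(self) -> int:
        h = self.hidden
        down = self.dim * h + h          # weights + bias of the down-projection
        up = h * self.dim + self.dim     # weights + bias of the up-projection
        return down + up


def fused_adapter_params(dim: int, reductions: Sequence[int]) -> int:
    """Parameters of several bottleneck branches with different rates.

    Assumes the branches are fused by summation, so fusion adds no parameters.
    """
    return sum(BottleneckSpec(dim, r).num_params() for r in reductions)


def trainable_fraction(
    backbone_params: int,
    dim: int,
    depth: int,
    reductions: Sequence[int],
    num_classes: int,
) -> Tuple[int, float]:
    """Return (trainable parameter count, share of all parameters).

    Trainable = one fused adapter per block + a linear classification head;
    the backbone itself is frozen and contributes nothing to the count.
    """
    adapters = depth * fused_adapter_params(dim, reductions)
    head = dim * num_classes + num_classes
    trainable = adapters + head
    return trainable, trainable / (backbone_params + trainable)


if __name__ == "__main__":
    # Assumed ViT-B/16-like backbone; 45 classes as in NWPU-RESISC45.
    count, share = trainable_fraction(
        backbone_params=86_000_000,
        dim=768,
        depth=12,
        reductions=(8, 16, 32),
        num_classes=45,
    )
    print(f"trainable parameters: {count:,} ({share:.2%} of the model)")
```

Under these assumptions only a few percent of the model's parameters receive gradients. The actual share for MuFA depends on the backbone, the rates, and the fusion design reported in the full text.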

References

[1]	WANG Peijin, YAN Zhiyuan, RONG Xue’e, et al. Review of multimodal data processing techniques with limited data[J]. Journal of Image and Graphics, 2022, 27(10): 2803–2834. doi: 10.11834/jig.220049.
[2]	BI Hanbo, FENG Yingchao, YAN Zhiyuan, et al. Not just learning from others but relying on yourself: A new perspective on few-shot segmentation in remote sensing[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5623621. doi: 10.1109/TGRS.2023.3326292.
[3]	WANG Peijin, SUN Xian, DIAO Wenhui, et al. FMSSD: Feature-merged single-shot detection for multiscale objects in large-scale remote sensing imagery[J]. IEEE Transactions on Geoscience and Remote Sensing, 2020, 58(5): 3377–3390. doi: 10.1109/TGRS.2019.2954328.
[4]	KRIZHEVSKY A, SUTSKEVER I, and HINTON G E. ImageNet classification with deep convolutional neural networks[C]. Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, USA, 2012: 1097–1105.
[5]	HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. The IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770–778. doi: 10.1109/CVPR.2016.90.
[6]	DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[C]. 9th International Conference on Learning Representations, Austria, 2021.
[7]	LIU Ze, LIN Yutong, CAO Yue, et al. Swin transformer: Hierarchical vision transformer using shifted windows[C]. The IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 9992–10002. doi: 10.1109/ICCV48922.2021.00986.
[8]	DENG Jia, DONG Wei, SOCHER R, et al. ImageNet: A large-scale hierarchical image database[C]. 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, USA, 2009: 248–255. doi: 10.1109/CVPR.2009.5206848.
[9]	BEN ZAKEN E, GOLDBERG Y, and RAVFOGEL S. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models[C]. The 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Dublin, Ireland, 2022: 1–9. doi: 10.18653/v1/2022.acl-short.1.
[10]	HOULSBY N, GIURGIU A, JASTRZEBSKI S, et al. Parameter-efficient transfer learning for NLP[C]. 36th International Conference on Machine Learning, Long Beach, USA, 2019: 2790–2799.
[11]	HU E J, SHEN Yelong, WALLIS P, et al. LoRA: Low-rank adaptation of large language models[C]. 10th International Conference on Learning Representations, 2022.
[12]	YIN Dongshuo, YANG Yiran, WANG Zhechao, et al. 1% vs 100%: Parameter-efficient low rank adapter for dense predictions[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 20116–20126. doi: 10.1109/CVPR52729.2023.01926.
[13]	CHEN Shoufa, GE Chongjian, TONG Zhan, et al. AdaptFormer: Adapting vision transformers for scalable visual recognition[C]. The 36th International Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 1212.
[14]	JIE Shibo and DENG Zhihong. Convolutional bypasses are better vision transformer adapters[EB/OL]. https://arxiv.org/abs/2207.07039, 2022.
[15]	SIMONYAN K and ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[C]. 3rd International Conference on Learning Representations, San Diego, USA, 2015.
[16]	VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. The 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
[17]	LESTER B, AL-RFOU R, and CONSTANT N. The power of scale for parameter-efficient prompt tuning[C]. The 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 2021: 3045–3059. doi: 10.18653/v1/2021.emnlp-main.243.
[18]	JIA Menglin, TANG Luming, CHEN B C, et al. Visual prompt tuning[C]. 17th European Conference on Computer Vision, Tel Aviv, Israel, 2022: 709–727. doi: 10.1007/978-3-031-19827-4_41.
[19]	YOO S, KIM E, JUNG D, et al. Improving visual prompt tuning for self-supervised vision transformers[C]. 40th International Conference on Machine Learning, Honolulu, USA, 2023: 40075–40092.
[20]	TU Chenghao, MAI Zheda, and CHAO Weilun. Visual query tuning: Towards effective usage of intermediate representations for parameter and memory efficient transfer learning[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 7725–7735. doi: 10.1109/CVPR52729.2023.00746.
[21]	HAN Cheng, WANG Qifan, CUI Yiming, et al. E²VPT: An effective and efficient approach for visual prompt tuning[C]. 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, 2023: 17445–17456. doi: 10.1109/ICCV51070.2023.01604.
[22]	HE Xuehai, LI Chunyuan, ZHANG Pengchuan, et al. Parameter-efficient model adaptation for vision transformers[C]. Proceedings of the 37th AAAI Conference on Artificial Intelligence, Washington, USA, 2023: 817–825. doi: 10.1609/aaai.v37i1.25160.
[23]	ZHANG Yuanhan, ZHOU Kaiyang, and LIU Ziwei. Neural prompt search[EB/OL]. https://arxiv.org/abs/2206.04673, 2022.
[24]	QI Wang, RUAN Yuping, ZUO Yuan, et al. Parameter-efficient tuning on layer normalization for pre-trained language models[EB/OL]. https://arxiv.org/abs/2211.08682, 2022.
[25]	YANG Yi and NEWSAM S. Bag-of-visual-words and spatial extensions for land-use classification[C]. The 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, USA, 2010: 270–279. doi: 10.1145/1869790.1869829.
[26]	CHENG Gong, HAN Junwei, and LU Xiaoqiang. Remote sensing image scene classification: Benchmark and state of the art[J]. Proceedings of the IEEE, 2017, 105(10): 1865–1883. doi: 10.1109/JPROC.2017.2675998.
[27]	SUN Xian, WANG Peijin, LU Wanxuan, et al. RingMo: A remote sensing foundation model with masked image modeling[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5612822. doi: 10.1109/TGRS.2022.3194732.
[28]	GUO Xin, LAO Jiangwei, DANG Bo, et al. SkySense: A multi-modal remote sensing foundation model towards universal interpretation for earth observation imagery[EB/OL]. https://arxiv.org/abs/2312.10115, 2023.