Citation: WANG Peijin, HU Huiyang, FENG Yingchao, DIAO Wenhui, SUN Xian. A Large-Scale Multimodal Instruction Dataset for Remote Sensing Agents[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250818

A Large-Scale Multimodal Instruction Dataset for Remote Sensing Agents

doi: 10.11999/JEIT250818 cstr: 32379.14.JEIT250818
Funds:  The Science and Disruptive Technology Program, AIRCAS (2025-AIRCAS-SDTP-04)
  • Received Date: 2025-08-29
  • Accepted Date: 2026-01-12
  • Rev Recd Date: 2026-01-11
  • Available Online: 2026-01-27
Objective  The rapid advancement of remote sensing (RS) technology has fundamentally reshaped the scope of Earth observation research, driving a paradigm shift from static image analysis toward intelligent, goal-oriented cognitive decision-making. Modern RS applications increasingly require systems that can autonomously perceive complex scenes, reason over heterogeneous information sources, decompose high-level objectives into executable subtasks, and make informed decisions under uncertainty. This evolution motivates the concept of remote sensing agents, which extend beyond conventional perception models to encompass reasoning, planning, and interaction capabilities. Despite this growing demand, existing RS datasets remain largely task-centric and fragmented, typically designed for single-purpose supervised learning such as object detection or land-cover classification. These datasets rarely support multimodal reasoning, instruction following, or multi-step decision-making, all of which are essential for agentic workflows. Furthermore, current RS vision-language datasets often suffer from limited scale, narrow modality coverage, and simplistic text annotations, with insufficient inclusion of non-optical data such as Synthetic Aperture Radar (SAR) and infrared imagery. They also lack explicit instruction-driven interactions that mirror real-world human–agent collaboration. To address these limitations, this study constructs a large-scale multimodal image–text instruction dataset explicitly designed for RS agents. The primary objective is to establish a unified data foundation that supports the entire cognitive chain of perception, reasoning, planning, and decision-making. By enabling models to learn from structured instructions across diverse modalities and task types, the dataset aims to facilitate the development, training, and evaluation of next-generation RS foundation models with genuine agentic capabilities.

Methods  The dataset is built with a systematic and extensible framework that integrates multi-source RS imagery with complex, instruction-oriented textual supervision. First, a unified input–output paradigm is defined to ensure compatibility across heterogeneous RS tasks and model architectures. This paradigm explicitly formalizes the interaction between visual inputs and language instructions, allowing models to process not only image pixels and text descriptions but also structured spatial coordinates, region-level references, and action-oriented outputs. A standardized instruction schema encodes task objectives, constraints, and expected responses in a consistent format; it is flexible enough to support diverse task types while remaining sufficiently structured for scalable data generation and automatic validation. The overall methodology comprises three key stages. (1) Data Collection and Integration: multimodal RS imagery is aggregated from multiple authoritative sources, covering optical, SAR, and infrared modalities with diverse spatial resolutions, scene types, and geographic distributions. (2) Instruction Generation: a hybrid strategy combines rule-based templates with Large Language Model (LLM)-assisted refinement; template-based generation ensures task completeness and structural consistency, while LLM-based rewriting enhances linguistic diversity, naturalness, and instruction complexity. (3) Task Categorization and Organization: the dataset is organized into nine core task categories, spanning low-level perception, mid-level reasoning, and high-level decision-making, with a total of 21 sub-datasets. To ensure high data quality and reliability, a rigorous validation pipeline is implemented, comprising automated syntax and format checking, cross-modal consistency verification, and manual auditing of representative samples to confirm semantic alignment between visual content and textual instructions. Illustrative sketches of the instruction schema, the hybrid generation step, and the automated validation checks are given after this abstract.

Results and Discussions  The resulting dataset comprises over 2 million multimodal instruction samples, making it one of the largest and most comprehensive instruction datasets in the RS domain. The integration of optical, SAR, and infrared data enables robust cross-modal learning and supports reasoning across heterogeneous sensing mechanisms. Compared with existing RS datasets, the proposed dataset places greater emphasis on instruction diversity, task compositionality, and agent-oriented interaction rather than isolated perception objectives. Extensive baseline experiments are conducted with several state-of-the-art Multimodal Large Language Models (MLLMs) and RS-specific foundation models. The results demonstrate that the dataset effectively supports evaluation across the full spectrum of agentic capabilities, from visual grounding and reasoning to high-level decision-making. At the same time, the experiments reveal persistent challenges posed by RS data, such as extreme scale variations, dense object distributions, and long-range spatial dependencies. These findings highlight important research directions for improving multimodal reasoning and planning in complex RS environments.

Conclusions  This paper presents a pioneering large-scale multimodal image–text instruction dataset tailored for remote sensing agents. By systematically organizing information across nine core task categories and 21 sub-datasets, it provides a unified and extensible benchmark for agent-centric RS research. The main contributions are: (1) the establishment of a unified multimodal instruction paradigm for RS agents; (2) the construction of a 2-million-sample dataset covering optical, SAR, and infrared modalities; (3) empirical validation of the dataset’s effectiveness in supporting end-to-end agentic workflows from perception to decision-making; and (4) the provision of a comprehensive evaluation benchmark through baseline experiments across all task categories. Future work will extend the dataset to temporal and video-based RS scenarios, incorporate dynamic decision-making processes, and further enhance the reasoning and planning capabilities of RS agents in real-world, time-varying environments.
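To make the unified input–output paradigm concrete, the following Python sketch shows what a single instruction sample could look like. Every field name, type, and value here is an illustrative assumption for readers, not the dataset's published schema.

```python
# A minimal sketch of one multimodal instruction sample under the unified
# input-output paradigm described above. All names are assumptions.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RegionRef:
    """A region-level reference expressed as a normalized bounding box."""
    label: str
    bbox: List[float]  # [x_min, y_min, x_max, y_max], normalized to [0, 1]

@dataclass
class InstructionSample:
    """One image-text instruction pair with optional structured outputs."""
    image_path: str            # path or URI of the source image
    modality: str              # "optical" | "sar" | "infrared"
    task_type: str             # e.g. "visual_grounding", "scene_classification"
    instruction: str           # natural-language task objective and constraints
    response: str              # expected textual answer or action description
    regions: List[RegionRef] = field(default_factory=list)  # region references
    source_dataset: Optional[str] = None                    # imagery provenance

sample = InstructionSample(
    image_path="images/sar/scene_000123.png",
    modality="sar",
    task_type="visual_grounding",
    instruction="Locate all ships moored along the eastern quay and return their bounding boxes.",
    response="Two ships are moored along the eastern quay.",
    regions=[RegionRef("ship", [0.62, 0.31, 0.68, 0.44]),
             RegionRef("ship", [0.64, 0.52, 0.71, 0.66])],
    source_dataset="public_sar_detection_benchmark",
)
```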
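The hybrid instruction-generation stage can likewise be sketched as a two-step routine: rule-based templates produce structurally complete drafts, and an LLM rewriting pass (stubbed below) adds linguistic diversity and complexity. The template contents and function names are hypothetical placeholders, not the authors' actual pipeline.

```python
# A hedged sketch of hybrid instruction generation: templates guarantee
# structural consistency; an LLM pass (stubbed here) diversifies the wording.
import random

TEMPLATES = {
    "object_counting": [
        "How many {category} instances appear in this {modality} image?",
        "Count every {category} visible in the scene and report the total.",
    ],
    "scene_classification": [
        "Which land-cover class best describes this {modality} image?",
    ],
}

def generate_from_template(task_type: str, category: str, modality: str) -> str:
    """Fill a task template to obtain a structurally valid draft instruction."""
    template = random.choice(TEMPLATES[task_type])
    return template.format(category=category, modality=modality)

def llm_rewrite(instruction: str) -> str:
    """Placeholder for the LLM-assisted refinement step.

    In the real pipeline this would call a large language model to paraphrase
    the templated instruction while preserving its task semantics. Here it is
    a no-op stub so the sketch stays self-contained.
    """
    return instruction

if __name__ == "__main__":
    draft = generate_from_template("object_counting", "storage tank", "optical")
    print(llm_rewrite(draft))
```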
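Finally, the automated portion of the validation pipeline can be illustrated with simple format and region-consistency checks of the kind described above. Samples are treated as plain dictionaries here, and the specific rules are assumptions for demonstration rather than the exact implementation.

```python
# A minimal sketch of automated validation: per-sample format checks plus a
# region-consistency rule (coordinates must be normalized and well ordered).
from typing import Dict, List, Tuple

ALLOWED_MODALITIES = {"optical", "sar", "infrared"}
REQUIRED_FIELDS = ("image_path", "modality", "task_type", "instruction", "response")

def check_format(sample: Dict) -> List[str]:
    """Return a list of format errors for one instruction sample."""
    errors = [f"missing or empty field: {key}"
              for key in REQUIRED_FIELDS if not sample.get(key)]
    if sample.get("modality") not in ALLOWED_MODALITIES:
        errors.append(f"unknown modality: {sample.get('modality')!r}")
    return errors

def check_regions(sample: Dict) -> List[str]:
    """Verify that every region reference uses normalized, ordered coordinates."""
    errors = []
    for i, region in enumerate(sample.get("regions", [])):
        x0, y0, x1, y1 = region["bbox"]
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            errors.append(f"region {i} has out-of-range or inverted bbox: {region['bbox']}")
    return errors

def validate(samples: List[Dict]) -> Tuple[List[Dict], List[Tuple[Dict, List[str]]]]:
    """Split samples into accepted ones and rejected ones with their error lists."""
    accepted, rejected = [], []
    for sample in samples:
        errors = check_format(sample) + check_regions(sample)
        if errors:
            rejected.append((sample, errors))
        else:
            accepted.append(sample)
    return accepted, rejected
```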