
A Large-Scale Multimodal Instruction Dataset for Remote Sensing Agents

WANG Peijin, HU Huiyang, FENG Yingchao, DIAO Wenhui, SUN Xian

Citation: WANG Peijin, HU Huiyang, FENG Yingchao, DIAO Wenhui, SUN Xian. A Large-Scale Multimodal Instruction Dataset for Remote Sensing Agents[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250818


doi: 10.11999/JEIT250818 cstr: 32379.14.JEIT250818
Funds: The Science and Disruptive Technology Program, AIRCAS (2025-AIRCAS-SDTP-04)
Article information
    Author biographies:

    WANG Peijin: Female, Assistant Researcher; research interest: intelligent interpretation of remote sensing images

    HU Huiyang: Female, Ph.D. candidate; research interest: intelligent interpretation of remote sensing images

    FENG Yingchao: Male, Assistant Researcher; research interest: intelligent interpretation of remote sensing images

    DIAO Wenhui: Male, Associate Researcher; research interest: intelligent interpretation of remote sensing images

    SUN Xian: Male, Researcher; research interests: computer vision and remote sensing image understanding

    Corresponding author:

    HU Huiyang, huhuiyang22@mails.ucas.ac.cn


A Large-Scale Multimodal Instruction Dataset for Remote Sensing Agents

  • Abstract: As remote sensing applications evolve from static image analysis toward intelligent cognitive decision-making, building a data system that fuses multi-task, multimodal information has become a key prerequisite for advancing remote sensing foundation models. Targeting the perception and cognition needs of remote sensing agents, this paper proposes and constructs a multimodal remote sensing dataset for multi-task image-text instructions, systematically organizing images, textual instructions, spatial coordinates, and behavior trajectories to uniformly support training and evaluation across multi-stage task pipelines. The dataset covers 9 core task types, including relation reasoning, instruction decomposition, task scheduling, grounded captioning, and multimodal perception, comprising 21 sub-datasets across three remote sensing modalities (optical, SAR, and infrared), with more than 2,000,000 samples in total. During data construction, a standardized instruction format tailored to the characteristics of remote sensing imagery is designed, and a unified input-output paradigm is proposed to ensure interoperability and transferability across tasks. An automated data generation and conversion pipeline is also designed to improve the efficiency and consistency of multimodal sample generation. In addition, baseline evaluation results on remote sensing foundation models are reported, validating the dataset's practical value for multi-task generalization learning. The dataset can broadly serve the construction and evaluation of multimodal foundation models in remote sensing, and is particularly suited to developing agent models with a unified perception-cognition-decision loop, offering strong research and engineering application prospects.
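The unified input-output paradigm described in the abstract can be illustrated with a minimal sketch of what one instruction sample might look like. The field names and values below are illustrative assumptions for exposition, not the dataset's actual schema:

```python
import json

# Hypothetical multimodal instruction sample; field names are illustrative
# assumptions, not the published schema of this dataset.
sample = {
    "image": "optical/DIOR_00123.jpg",      # image path (optical / SAR / infrared)
    "modality": "optical",
    "task": "relation_reasoning",           # one of the 9 core task types
    "instruction": "What is the relation between the ship and the harbor?",
    "answer": "dock-at",
    "coordinates": [[312, 145, 480, 290]],  # optional boxes [x1, y1, x2, y2]
}

# Serializing to JSON shows one plausible on-disk representation.
print(json.dumps(sample, indent=2))
```

Keeping every task in one such envelope is what makes samples interchangeable across tasks: a scheduler or decomposer only has to vary the `task` and `instruction` fields.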
  • Figure 1 Example question-answer pair for the relation reasoning task

    Figure 2 Example question-answer pair for the relation detection task

    Figure 3 Example question-answer pair for the instruction decomposition task

    Figure 4 Example question-answer pair for the task scheduling task

    Figure 5 Example question-answer pair for the grounded captioning task

    Figure 6 Example question-answer pair for the multimodal perception task

    Table 1 Overall dataset statistics

    Task type | Modality | Dataset | Samples
    Task scheduling | Optical | CityNav[19] | 32637
    Instruction decomposition | Optical | ReCon1M-DEC | 27821
    Relation reasoning | Optical | ReCon1M-REL | 125000
    Relation detection | Optical | ReCon1M-DET | 120097
    Grounded captioning | Optical | DIOR-GC | 22921
    Grounded captioning | Optical | DOTA-GC | 48866
    Multimodal perception (object detection) | Optical | DIOR[20] | 23463
    Multimodal perception (object detection) | SAR | SARDet-100K[21] | 116598
    Multimodal perception (object detection) | Infrared | IR-DET | 56353
    Multimodal perception (image captioning) | Optical | DIOR-CAP | 92875
    Multimodal perception (image captioning) | Optical | DOTA-CAP | 307150
    Multimodal perception (image captioning) | SAR | SAR-CAP | 582990
    Multimodal perception (image captioning) | Infrared | IR-CAP | 281765
    Multimodal perception (image classification) | Optical | AID[22] | 10000
    Multimodal perception (image classification) | Optical | NWPU-RESISC45[23] | 31500
    Multimodal perception (image classification) | SAR | SAR-CLA | 116597
    Multimodal perception (image classification) | Infrared | IR-CLA | 56353
    Multimodal perception (object counting) | Optical | DIOR-COUNT | 35204
    Multimodal perception (object counting) | Optical | DOTA-COUNT | 78432
    Multimodal perception (object counting) | SAR | SAR-COUNT | 117803
    Multimodal perception (object counting) | Infrared | IR-COUNT | 107565
    Total | | | 2391990
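The per-dataset sample counts in Table 1 can be checked against the reported overall total; summing the 21 sub-dataset sizes reproduces 2,391,990 exactly:

```python
# Per-dataset sample counts transcribed from Table 1.
sizes = {
    "CityNav": 32637, "ReCon1M-DEC": 27821, "ReCon1M-REL": 125000,
    "ReCon1M-DET": 120097, "DIOR-GC": 22921, "DOTA-GC": 48866,
    "DIOR": 23463, "SARDet-100K": 116598, "IR-DET": 56353,
    "DIOR-CAP": 92875, "DOTA-CAP": 307150, "SAR-CAP": 582990,
    "IR-CAP": 281765, "AID": 10000, "NWPU-RESISC45": 31500,
    "SAR-CLA": 116597, "IR-CLA": 56353, "DIOR-COUNT": 35204,
    "DOTA-COUNT": 78432, "SAR-COUNT": 117803, "IR-COUNT": 107565,
}

total = sum(sizes.values())
print(total)  # 2391990, matching the "Total" row of Table 1
```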

    Table 2 Per-relation sample distribution in the relation reasoning dataset

    Relation Train Test | Relation Train Test | Relation Train Test | Relation Train Test
    above695dig10link-to16621pull51
    adjacent-to2799289dock-alone-at41load90sail-by204
    adjoint-with31dock-at10213manage354sail-on922123
    around30drive-at-the-different-lane14820moor-at3414383separate311
    belong-to913112drive-at-the-same-lane14417move-away-from80serve1737174
    block21drive-on4931538park-alone-at143slow-down24134
    border729emit10park-next-to305443389supplement33450
    close-to249922657enter60parked-at154321735supply17926
    command30equipped-with668pass-under30support797
    connect18615exit-from51pile-up-at66083support-the-construction-of829
    contain10614hoist9310placed-on292taxi-on22622
    converge141inside7558854power1399tow103
    cooperate-with43047is-parallel-to3516402prepared-for8114transport486
    cross1224130is-symmetric-with41provide-access-to105841247ventilate442
    cultivate100lie-under131provide-shuttle-service-to61

    Table 3 Per-relation sample distribution in the relation detection dataset

    Relation Train Test | Relation Train Test | Relation Train Test | Relation Train Test
    above737dock-alone-at50manage295sail-by223
    adjacent-to2575297dock-at807moor-at3224344sail-on63678
    adjoint-with40drive-at-the-different-lane11215move-away-from50separate394
    around42drive-at-the-same-lane15418park-alone-at100serve1655186
    belong-to76866drive-on4819545park-next-to298573249slow-down24534
    block10enter80parked-at153081729supplement32137
    border527equipped-with599pass-under30supply20716
    close-to239422713exit-from50pile-up-at59076support616
    command20hoist1157placed-on250support-the-construction-of111
    connect16916inside7556860power999taxi-on15513
    contain1117is-parallel-to3088345prepared-for6411tow70
    converge92is-symmetric-with20provide-access-to99891136transport282
    cooperate-with38439lie-under144provide-shuttle-service-to50ventilate302
    cross1151118link-to16814pull21

    Table 4 Per-relation sample distribution in the instruction decomposition dataset

    Relation Train Test | Relation Train Test | Relation Train Test | Relation Train Test
    above19548dig327link-to7824pull10
    adjacent-to79602014dock-alone-at225load7021sail-by9511
    adjoint-with204dock-at17234manage10820sail-on4750890
    around105drive-at-the-different-lane21048moor-at107402751separate144
    belong-to39571108drive-at-the-same-lane4218move-away-from353serve2623681
    block110drive-on91482119park-alone-at418slow-down417103
    border31997emit124park-next-to7192617232supplement475124
    close-to408409980enter304parked-at356238641supply403115
    command349equipped-with8921pass-under288support9438
    connect618170exit-from60pile-up-at2070514support-the-construction-of619
    contain44125hoist26666placed-on13920taxi-on1175320
    converge175inside125663607power21247tow4810
    cooperate-with998230is-parallel-to94282578prepared-for13931transport19542
    cross1534364is-symmetric-with44provide-access-to123623009ventilate8325
    cultivate9728lie-under429provide-shuttle-service-to4434

    Table 5 Grounded captioning dataset statistics

    Split | Sentences | Words | Avg. sentence length
    DIOR-GC train | 11381 | 151804 | 13.34
    DIOR-GC test | 11540 | 164248 | 14.23
    DOTA-GC train | 37617 | 932049 | 24.78
    DOTA-GC test | 11249 | 239331 | 21.28
    Total | 71787 | 1487432 | 18.41
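The average-sentence-length column of Table 5 is simply words divided by sentences; recomputing it from the counts above reproduces every per-split value (the "Total" row averages the four per-split figures rather than dividing the totals):

```python
# (sentences, words) per split, transcribed from Table 5.
splits = {
    "DIOR-GC train": (11381, 151804),
    "DIOR-GC test":  (11540, 164248),
    "DOTA-GC train": (37617, 932049),
    "DOTA-GC test":  (11249, 239331),
}

# Average sentence length = words / sentences, rounded to two decimals.
for name, (sentences, words) in splits.items():
    print(name, round(words / sentences, 2))
```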

    Table 6 Image captioning statistics in multimodal perception

    Modality | Split | Sentences | Words | Avg. sentence length
    SAR | SAR-CAP train | 524925 | 4950300 | 9.43
    SAR | SAR-CAP test | 58065 | 550216 | 9.48
    SAR | Total | 582990 | 5500516 | 9.46
    Infrared | IR-CAP train | 132410 | 1330173 | 10.05
    Infrared | IR-CAP test | 14735 | 148973 | 10.11
    Infrared | Total | 147145 | 1479146 | 10.08
    Optical | DIOR-CAP train | 11725 | 157232 | 13.41
    Optical | DIOR-CAP test | 57700 | 738907 | 12.81
    Optical | DOTA-CAP train | 83635 | 1486176 | 17.77
    Optical | DOTA-CAP test | 56245 | 1090289 | 19.38
    Optical | Total | 209305 | 3472604 | 15.84

    Table 7 Image classification statistics in multimodal perception

    Modality | Class | Train | Test | Train/test ratio
    SAR | Aircraft | 3037 | 16835 | 0.1804
    SAR | Aircraft and tank | 3 | 4 | 0.7500
    SAR | Bridge | 1697 | 16168 | 0.1050
    SAR | Bridge and harbor | 16 | 131 | 0.1221
    SAR | Bridge and ship | 13 | 230 | 0.0565
    SAR | Bridge and tank | 33 | 329 | 0.1003
    SAR | Bridge, harbor and tank | 3 | 25 | 0.1200
    SAR | Car | 103 | 941 | 0.1095
    SAR | Harbor | 138 | 1255 | 0.1100
    SAR | Harbor and tank | 9 | 82 | 0.1098
    SAR | Ship | 6470 | 67211 | 0.0963
    SAR | Ship and tank | 13 | 287 | 0.0453
    SAR | Tank | 77 | 1487 | 0.0518
    SAR | Total | 11612 | 104985 | 0.1106
    Infrared | Ship | 1593 | 14354 | 0.1110
    Infrared | Street | 4036 | 36370 | 0.1110
    Infrared | Total | 5629 | 50724 | 0.1110
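The last column of Table 7 is the train count divided by the test count; a quick recomputation for a few SAR classes confirms the reported ratios:

```python
# (train, test) counts for selected SAR classes, transcribed from Table 7.
counts = {
    "Aircraft": (3037, 16835),
    "Ship":     (6470, 67211),
    "Tank":     (77, 1487),
}

# Train/test ratio, rounded to four decimals as in the table.
for cls, (train, test) in counts.items():
    print(cls, round(train / test, 4))
```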

    Table 8 Relation reasoning results on the ReCon1M-REL dataset

    Model type | Method | F1-Score↑
    General vision-language model (zero-shot) | MiniGPT-v2[7] | 0.00
    General vision-language model (zero-shot) | DeepSeek-VL2[36] | 0.30
    Remote sensing vision-language model (fine-tuned) | RingMo-Agent[34] | 90.23
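The F1-Score reported in Tables 8 and 9 is the standard harmonic mean of precision and recall. A minimal sketch, with illustrative counts that are not taken from the paper:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2 * P * R / (P + R), from true/false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative example: 80 correct relations, 20 spurious, 20 missed.
print(round(f1_score(tp=80, fp=20, fn=20), 4))  # 0.8
```

The zero scores of the general models in Table 8 correspond to predictions that never match any ground-truth relation label.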

    Table 9 Instruction decomposition results on the ReCon1M-DEC dataset

    Model type | Method | mAP50↑ | F1-Score↑
    General vision-language model (fine-tuned) | MiniGPT-v2[7] | 11.50 | 15.19
    General vision-language model (fine-tuned) | DeepSeek-VL2[36] | 19.80 | 10.32
    Remote sensing vision-language model (fine-tuned) | RingMo-Agent[34] | 24.20 | 32.85

    Table 10 Task scheduling results on the CityNav dataset (test set)

    Model type | Method | NE↓ | SR↑ | OSR↑ | SPL↑
    Specialist model (fine-tuned) | Seq2Seq[37] | 245.30 | 1.50 | 8.34 | 1.30
    Specialist model (fine-tuned) | CMA[38] | 252.60 | 0.82 | 9.70 | 0.79
    Specialist model (fine-tuned) | AerialVLN + GSM[19] | 85.10 | 6.72 | 18.21 | 5.16
    Remote sensing vision-language model (fine-tuned) | RingMo-Agent[34] | 149.60 | 4.74 | 18.94 | 4.17
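Table 10 uses the standard vision-and-language navigation metrics: NE (navigation error), SR (success rate), OSR (oracle success rate), and SPL (success weighted by path length). A minimal sketch of SR and SPL under their usual definitions; the success threshold and episode data below are illustrative, not from the paper:

```python
def success_rate(errors, threshold):
    """Fraction of episodes whose final navigation error is within the threshold."""
    return sum(e <= threshold for e in errors) / len(errors)

def spl(successes, shortest, taken):
    """SPL = mean over episodes of S_i * l_i / max(p_i, l_i),
    where l_i is the shortest-path length and p_i the path length taken."""
    return sum(
        s * l / max(p, l) for s, l, p in zip(successes, shortest, taken)
    ) / len(successes)

# Toy episodes: final error, shortest path length, and path length taken (meters).
errors = [3.0, 12.0, 4.5, 20.0]
shortest = [50.0, 60.0, 40.0, 80.0]
taken = [55.0, 90.0, 40.0, 100.0]
succ = [e <= 5.0 for e in errors]  # illustrative 5 m success threshold

print(round(success_rate(errors, 5.0), 3))  # 0.5
print(round(spl(succ, shortest, taken), 3))  # 0.477
```

SPL penalizes successful episodes that take detours, which is why it is always at most SR.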

    Table 11 Grounded captioning results on the DIOR-GC and DOTA-GC datasets

    Model type | Method | BLEU-1↑ | BLEU-2↑ | BLEU-3↑ | BLEU-4↑ | METEOR↑ | ROUGE-L↑ | CIDEr↑ | mAP50 (DIOR-GC)↑ | mAP50 (DOTA-GC)↑
    Remote sensing vision-language model (fine-tuned) | RingMoGPT[35] baseline | 67.5 | 54.2 | 43.6 | 34.8 | 28.7 | 58.6 | 92.7 | 44.9 | 35.6
    Remote sensing vision-language model (fine-tuned) | RingMoGPT[35] | 68.5 | 55.1 | 44.2 | 35.7 | 29.4 | 57.3 | 94.1 | 65.3 | 46.5

    Table 12 Multimodal perception results on the SAR-CAP and IR-CAP datasets

    Model type | Method | SAR-CAP: BLEU-1↑ BLEU-2↑ BLEU-3↑ BLEU-4↑ METEOR↑ ROUGE-L↑ | IR-CAP: BLEU-1↑ BLEU-2↑ BLEU-3↑ BLEU-4↑ METEOR↑ ROUGE-L↑
    General vision-language model (zero-shot) | MiniGPT-v2[7] | 7.00 3.64 1.59 0.60 7.67 9.25 | 5.65 3.35 1.92 0.98 7.62 8.67
    General vision-language model (zero-shot) | DeepSeek-VL2[36] | 12.52 5.88 1.88 0.60 10.65 14.10 | 13.95 7.57 3.47 1.43 12.48 15.12
    Remote sensing vision-language model (fine-tuned) | RingMo-Agent[34] | 55.93 44.49 33.57 23.94 25.06 51.12 | 56.84 40.45 29.17 21.50 26.15 43.13

    Table 13 Multimodal perception results on the DIOR-CAP and DOTA-CAP datasets

    Model type | Method | METEOR↑ | ROUGE-L↑ | CIDEr↑
    Remote sensing vision-language model (fine-tuned) | RingMoGPT[35] baseline | 28.80 | 55.40 | 72.50
    Remote sensing vision-language model (fine-tuned) | RingMoGPT[35] | 30.40 | 60.10 | 99.20

    Table 14 Multimodal perception results on the IR-DET dataset

    Model type | Method | mAP50↑
    General vision-language model (zero-shot) | MiniGPT-v2[7] | 0
    General vision-language model (zero-shot) | DeepSeek-VL2[36] | 0
    Remote sensing vision-language model (fine-tuned) | RingMo-Agent[34] | 59.88

    Table 15 Multimodal perception results on the SAR-CLA and IR-CLA datasets

    Model type | Method | SAR-CLA accuracy | IR-CLA accuracy
    General vision-language model (zero-shot) | MiniGPT-v2[7] | 5.87 | 60.52
    General vision-language model (zero-shot) | DeepSeek-VL2[36] | 4.40 | 45.15
    Remote sensing vision-language model (fine-tuned) | RingMo-Agent[34] | 92.67 | 99.45
  • [1] SUN Xian, WANG Peijin, LU Wanxuan, et al. RingMo: A remote sensing foundation model with masked image modeling[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5612822. doi: 10.1109/TGRS.2022.3194732.
    [2] HU Huiyang, WANG Peijin, BI Hanbo, et al. RS-vHeat: Heat conduction guided efficient remote sensing foundation model[C]. The IEEE/CVF International Conference on Computer Vision, Honolulu, USA, 2025: 9876–9887.
    [3] CHANG Hao, WANG Peijin, DIAO Wenhui, et al. Remote sensing change detection with bitemporal and differential feature interactive perception[J]. IEEE Transactions on Image Processing, 2024, 33: 4543–4555. doi: 10.1109/TIP.2024.3424335.
    [4] SHI Qian, HE Da, LIU Zhengyu, et al. Globe230k: A benchmark dense-pixel annotation dataset for global land cover mapping[J]. Journal of Remote Sensing, 2023, 3: 0078. doi: 10.34133/remotesensing.0078.
    [5] HU Fengming, XU Feng, WANG R, et al. Conceptual study and performance analysis of tandem multi-antenna spaceborne SAR interferometry[J]. Journal of Remote Sensing, 2024, 4: 0137. doi: 10.34133/remotesensing.0137.
    [6] MEI Shaohui, LIAN Jiawei, WANG Xiaofei, et al. A comprehensive study on the robustness of deep learning-based image classification and object detection in remote sensing: Surveying and benchmarking[J]. Journal of Remote Sensing, 2024, 4: 0219. doi: 10.34133/remotesensing.0219.
    [7] CHEN Jun, ZHU Deyao, SHEN Xiaoqian, et al. MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning[J]. arXiv: 2310.09478, 2023. doi: 10.48550/arXiv.2310.09478.
    [8] PENG Zhiliang, WANG Wenhui, DONG Li, et al. Kosmos-2: Grounding multimodal large language models to the world[J]. arXiv: 2306.14824, 2023. doi: 10.48550/arXiv.2306.14824.
    [9] CHEN Keqin, ZHANG Zhao, ZENG Weili, et al. Shikra: Unleashing multimodal LLM's referential dialogue magic[J]. arXiv: 2306.15195, 2023. doi: 10.48550/arXiv.2306.15195.
    [10] LIU Chenyang, ZHAO Rui, CHEN Hao, et al. Remote sensing image change captioning with dual-branch transformers: A new method and a large scale dataset[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5633520. doi: 10.1109/TGRS.2022.3218921.
    [11] LOBRY S, MARCOS D, MURRAY J, et al. RSVQA: Visual question answering for remote sensing data[J]. IEEE Transactions on Geoscience and Remote Sensing, 2020, 58(12): 8555–8566. doi: 10.1109/TGRS.2020.2988782.
    [12] CHENG Qimin, HUANG Haiyan, XU Yuan, et al. NWPU-captions dataset and MLCA-net for remote sensing image captioning[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5629419. doi: 10.1109/TGRS.2022.3201474.
    [13] LU Xiaoqiang, WANG Binqiang, ZHENG Xiangtao, et al. Exploring models and data for remote sensing image caption generation[J]. IEEE Transactions on Geoscience and Remote Sensing, 2018, 56(4): 2183–2195. doi: 10.1109/TGRS.2017.2776321.
    [14] QU Bo, LI Xuelong, TAO Dacheng, et al. Deep semantic understanding of high resolution remote sensing image[C]. 2016 International Conference on Computer, Information and Telecommunication Systems (CITS), Kunming, China, 2016: 1–5. doi: 10.1109/CITS.2016.7546397.
    [15] LI Kun, VOSSELMAN G, and YANG M Y. HRVQA: A visual question answering benchmark for high-resolution aerial images[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2024, 214: 65–81. doi: 10.1016/j.isprsjprs.2024.06.002.
    [16] HU Yuan, YUAN Jianlong, WEN Congcong, et al. RSGPT: A remote sensing vision language model and benchmark[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2025, 224: 272–286. doi: 10.1016/j.isprsjprs.2025.03.028.
    [17] LUO Junwei, PANG Zhen, ZHANG Yongjun, et al. SkySenseGPT: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding[J]. arXiv: 2406.10100, 2024. doi: 10.48550/arXiv.2406.10100.
    [18] LI Xiang, DING Jian, and MOHAMED E. VRSBench: A versatile vision-language benchmark dataset for remote sensing image understanding[C]. The 38th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2024: 106.
    [19] LEE J, MIYANISHI T, KURITA S, et al. CityNav: A large-scale dataset for real-world aerial navigation[C]. The IEEE/CVF International Conference on Computer Vision, Honolulu, USA, 2025: 5912–5922.
    [20] LI Ke, WAN Gang, CHENG Gong, et al. Object detection in optical remote sensing images: A survey and a new benchmark[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2020, 159: 296–307. doi: 10.1016/j.isprsjprs.2019.11.023.
    [21] LI Yuxuan, LI Xiang, LI Weijie, et al. SARDet-100K: Towards open-source benchmark and toolkit for large-scale SAR object detection[C]. The 38th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2024: 4079.
    [22] XIA Guisong, HU Jingwen, HU Fan, et al. AID: A benchmark data set for performance evaluation of aerial scene classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2017, 55(7): 3965–3981. doi: 10.1109/TGRS.2017.2685945.
    [23] CHENG Gong, HAN Junwei, and LU Xiaoqiang. Remote sensing image scene classification: Benchmark and state of the art[J]. Proceedings of the IEEE, 2017, 105(10): 1865–1883. doi: 10.1109/JPROC.2017.2675998.
    [24] YAN Qiwei, DENG Chubo, LIU Chenglong, et al. ReCon1M: A large-scale benchmark dataset for relation comprehension in remote sensing imagery[J]. IEEE Transactions on Geoscience and Remote Sensing, 2025, 63: 4507022. doi: 10.1109/TGRS.2025.3589986.
    [25] XIA Guisong, BAI Xiang, DING Jian, et al. DOTA: A large-scale dataset for object detection in aerial images[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 3974–3983. doi: 10.1109/cvpr.2018.00418.
    [26] LIU Haotian, LI Chunyuan, LI Yuheng, et al. Improved baselines with visual instruction tuning[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 26286–26296. doi: 10.1109/CVPR52733.2024.02484.
    [27] SUO Jiashun, WANG Tianyi, ZHANG Xingzhou, et al. HIT-UAV: A high-altitude infrared thermal dataset for Unmanned Aerial Vehicle-based object detection[J]. Scientific Data, 2023, 10(1): 227. doi: 10.1038/s41597-023-02066-6.
    [28] LI Chunliu and WANG Shuigen. Infrared maritime ship dataset[EB/OL]. https://openai.raytrontek.com/apply/Sea_shipping.html, 2021.
    [29] LIU Qing, XU Zhaofei, JIN Ronglu, et al. Infrared security dataset[EB/OL]. https://openai.raytrontek.com/apply/Infrared_security.html, 2021.
    [30] LIU Qing, XU Zhaofei, and WANG Shuigen. Infrared aerial person and vehicle detection dataset[EB/OL]. http://openai.raytrontek.com/apply/Aerial_mancar.html, 2021.
    [31] LI Gangqiang, WANG Jiansheng, and WANG Shuigen. Dual-spectrum vehicle-mounted scene dataset[EB/OL]. http://openai.raytrontek.com/apply/Double_light_vehicle.html, 2021.
    [32] LI Yongfu, ZHAO Xian, LIU Zhaojun, et al. Object detection dataset of far-sea (10–12 km) ships[EB/OL]. http://www.core.sdu.edu.cn/info/1133/2174.htm, 2020.
    [33] GRATTAFIORI A, DUBEY A, JAUHRI A, et al. The Llama 3 herd of models[J]. arXiv: 2407.21783, 2024. doi: 10.48550/arXiv.2407.21783.
    [34] HU Huiyang, WANG Peijin, FENG Yingchao, et al. RingMo-Agent: A unified remote sensing foundation model for multi-platform and multi-modal reasoning[J]. arXiv: 2507.20776, 2025. doi: 10.48550/arXiv.2507.20776.
    [35] WANG Peijin, HU Huiyang, TONG Boyuan, et al. RingMoGPT: A unified remote sensing foundation model for vision, language, and grounded tasks[J]. IEEE Transactions on Geoscience and Remote Sensing, 2025, 63: 5611320. doi: 10.1109/TGRS.2024.3510833.
    [36] WU Zhiyu, CHEN Xiaokang, PAN Zizheng, et al. DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding[J]. arXiv: 2412.10302, 2024. doi: 10.48550/arXiv.2412.10302.
    [37] ANDERSON P, WU Q, TENEY D, et al. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 3674–3683. doi: 10.1109/CVPR.2018.00387.
    [38] LIU Shuobo, ZHANG Hongsheng, QI Yuankai, et al. AerialVLN: Vision-and-language navigation for UAVs[C]. The IEEE/CVF International Conference on Computer Vision, Paris, France, 2023: 15338–15348. doi: 10.1109/ICCV51070.2023.01411.
Figures (6) / Tables (15)
Publication history
  • Received: 2025-08-29
  • Revised: 2026-01-11
  • Accepted: 2026-01-12
  • Published online: 2026-01-27
