Aerial Spatio-Temporal Image Generation via Latent Diffusion Model
-
Abstract: Aerial Earth observation plays a key role in environmental monitoring, disaster early warning, urban planning, and related fields. However, constrained by practical factors such as flight-platform endurance and mission-window timeliness, the acquired aerial images often fail to fully characterize the long-term evolution of the Earth's surface. Although pre-trained diffusion models have shown great potential in image generation, their application to the aerial domain still faces severe challenges. On the one hand, aerial observation is susceptible to weather and flight conditions, so high-quality long-term annotated samples are difficult to obtain and cannot adequately support model training. On the other hand, aerial platforms fly at flexible, variable altitudes and ground objects span a wide range of scales, so textual semantics are difficult to match precisely with multi-scale visual features, which leads to semantic-visual alignment bias. To address these challenges, this paper proposes a training-free framework for Aerial Spatio-Temporal Image Generation (ASTIG). The framework first uses a vision-language model to construct descriptions of temporal surface evolution in aerial scenes and introduces a dynamic semantic decomposition process that converts complex temporal descriptions into frame-level visual prompts, providing fine-grained semantic guidance for the inference stage. It then proposes a novel linguistic binding strategy that establishes explicit associations between key ground objects in the text and their corresponding visual attributes within the cross-attention mechanism of the diffusion model, strengthening the model's ability to capture semantic features. Finally, a temporal anchor attention mechanism is integrated, which uses dual-reference-frame constraints to keep the subject and background temporally consistent across frames, effectively suppressing inter-frame content drift. Qualitative and quantitative results show that the proposed method outperforms baseline models in both temporal continuity and semantic fidelity, offering a new paradigm for aerial temporal image generation.
Extended Abstract:
Objective: Aerial Earth observation plays a pivotal role in environmental monitoring, disaster warning, and urban planning. However, constrained by flight-platform endurance, mission-window timeliness, and other operational limitations, the acquired aerial imagery often fails to fully characterize the long-term evolutionary processes of the Earth's surface. Although pre-trained diffusion models have demonstrated considerable potential in image generation, their application in the aerial domain remains challenging due to the scarcity of high-quality temporal annotation data and the semantic–visual misalignment arising from variable observation scales. To address these challenges, this paper proposes ASTIG, a training-free framework for Aerial Spatio-Temporal Image Generation. By leveraging the generative priors of pre-trained latent diffusion models and large language models, ASTIG establishes a novel paradigm for semantically controllable aerial temporal image generation.
Methods: ASTIG comprises three synergistically designed components. First, a dynamic semantic decomposition process is introduced to automatically parse complex aerial scene-evolution descriptions into frame-level visual prompts, compensating for the lack of temporal semantic annotations in existing aerial image-text datasets. Second, a linguistic binding strategy is proposed to establish explicit associations between key ground objects and their corresponding visual attributes within the cross-attention mechanism of the diffusion model, thereby enhancing the semantic response precision of generated imagery. Third, a temporal anchor attention mechanism is integrated, which employs dual reference frames to enforce subject stability and background consistency across the generated temporal images, effectively suppressing inter-frame temporal drift under training-free conditions.
Results and Discussions: ASTIG and baseline models are evaluated on 7,236 high-quality aerial spatio-temporal descriptions across six automated metrics: subject consistency, background consistency, temporal flickering, motion smoothness, aesthetic quality, and imaging quality. Quantitative results (Tables 1 and 2) demonstrate the superiority of ASTIG in temporal image generation, with improvements of 3.91% and 4.57% in subject consistency and temporal flickering over the frame-prompt baseline, respectively. Qualitative comparisons (Fig. 4) further highlight its robust capability in modeling long-term geospatial evolutionary imagery. Ablation studies validate the individual effectiveness of the linguistic binding strategy and the temporal anchor attention mechanism (Table 3 and Fig. 5). Sensitivity analyses of the intervention steps (Table 4 and Fig. 6) and the binding strength (Table 5 and Fig. 7) systematically explore the optimal parameter configuration. Furthermore, extensibility experiments under satellite perspectives (Figs. 8 and 9) reveal the potential of ASTIG to generalize beyond aerial platforms to broader Earth observation scenarios.
Conclusions: This paper proposes ASTIG, a training-free framework for aerial spatio-temporal image generation that addresses the scarcity of high-quality long-term temporal data and semantic–visual misalignment.
By leveraging the generative priors of pre-trained latent diffusion models and large language models, ASTIG integrates a dynamic semantic decomposition process, a linguistic binding strategy, and a temporal anchor attention mechanism to jointly address temporal semantic construction, semantic response precision, and inter-frame consistency. Experimental results demonstrate that ASTIG outperforms existing baseline methods across multiple automated evaluation metrics, offering a novel paradigm for aerial spatio-temporal image generation. As a training-free method, the performance of ASTIG is inherently bounded by the prior knowledge of the backbone model. Future work will explore geometric correction and nadir-view prior constraints to better align generated results with the physical properties of satellite imagery.
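The three components are described only at a high level above; the minimal sketches below illustrate one way each idea could be realized on top of an off-the-shelf latent diffusion pipeline. They are reconstructions from the abstract, not the authors' implementation; every function name, model choice, and scaling constant is an assumption. First, dynamic semantic decomposition can be read as a single LLM call that expands one scene-evolution description into per-frame prompts:

```python
# Hypothetical sketch of dynamic semantic decomposition: an LLM expands one
# aerial scene-evolution description into frame-level visual prompts.
# The prompt template and model name are placeholders, not the paper's setup.
from openai import OpenAI

client = OpenAI()

def decompose_description(description: str, num_frames: int = 8) -> list[str]:
    instruction = (
        f"Decompose this aerial scene-evolution description into {num_frames} "
        f"chronologically ordered frame-level prompts, one per line, each "
        f"describing a single static aerial image:\n{description}"
    )
    reply = client.chat.completions.create(
        model="gpt-4o",  # any instruction-following LLM would serve here
        messages=[{"role": "user", "content": instruction}],
    )
    lines = [l.strip() for l in reply.choices[0].message.content.splitlines()]
    return [l for l in lines if l][:num_frames]
```

The linguistic binding strategy associates key ground objects with their visual attributes inside cross-attention; a plausible reading, in the spirit of the attention-map alignment of [29], is to nudge the latent at selected denoising steps so that each attribute token's attention map agrees with its bound noun's map, with $\eta$ (cf. Table 5) controlling the strength of the nudge:

```python
# Hypothetical sketch of linguistic binding via attention-map alignment.
import torch

def binding_loss(attn_maps: torch.Tensor, pairs: list[tuple[int, int]]) -> torch.Tensor:
    """attn_maps: (num_tokens, H, W) cross-attention maps for the text tokens.
    pairs: (noun_idx, attribute_idx) bindings extracted from the prompt."""
    loss = torch.zeros(())
    for noun, attr in pairs:
        a, b = attn_maps[noun].flatten(), attn_maps[attr].flatten()
        loss = loss + (1 - torch.cosine_similarity(a, b, dim=0))
    return loss

def apply_binding(latent, get_attn_maps, pairs, eta=50.0):
    """One gradient step pulling attribute attention toward its bound noun.
    get_attn_maps(latent) -> (num_tokens, H, W); an assumed hook into the UNet."""
    latent = latent.detach().requires_grad_(True)
    grad = torch.autograd.grad(binding_loss(get_attn_maps(latent), pairs), latent)[0]
    return (latent - eta * 1e-3 * grad).detach()  # 1e-3 scale is an arbitrary choice
```

Finally, "dual reference frames", together with the Att_first / Att_former ablations in Table 3, suggests that each frame's self-attention keys and values are drawn from the first frame (global anchor) and the immediately preceding frame (local anchor), similar in spirit to the cross-frame attention of Text2Video-Zero [5]. A sketch, with shapes and equal anchor weighting assumed:

```python
# Hypothetical sketch of temporal anchor attention with dual reference frames.
import torch
import torch.nn.functional as F

def temporal_anchor_attention(q, k, v, frame_idx, k_cache, v_cache):
    """q, k, v: (tokens, dim) self-attention projections of the current frame.
    k_cache / v_cache: K/V tensors of the frames generated so far."""
    if frame_idx == 0:
        k_ref, v_ref = k, v  # the first frame can only attend to itself
    else:
        # dual anchors: first frame (index 0) and former frame (index -1)
        k_ref = torch.cat([k_cache[0], k_cache[-1]], dim=0)
        v_ref = torch.cat([v_cache[0], v_cache[-1]], dim=0)
    attn = F.softmax(q @ k_ref.T / q.shape[-1] ** 0.5, dim=-1)
    k_cache.append(k)
    v_cache.append(v)  # current frame becomes the "former" anchor for the next one
    return attn @ v_ref
```
-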
Table 1 Quantitative comparison between the proposed method and baseline methods under the single-prompt setting
Method                           SC     BC     TF     MS     IQ     AQ
LaVie [19]                       94.87  96.31  95.12  96.89  63.15  50.23
AnimateLCM [34]                  96.13  97.68  94.07  95.94  53.48  52.87
CogVideoX [20]                   95.58  97.48  98.54  98.76  49.57  48.83
VideoElevator (LaVie) [35]       92.34  95.18  93.21  93.45  65.53  60.18
VideoElevator (AnimateLCM) [35]  93.96  96.23  92.43  93.57  57.14  58.23
Text2Video-Zero [5]              89.55  93.89  85.03  87.64  74.82  51.23
Free-Bloom_s [22]                96.68  97.13  93.89  94.21  77.19  61.36
ASTIG_s                          97.01  97.34  96.47  97.24  78.03  61.72

Table 2 Quantitative comparison between the proposed method and baseline methods under the frame-prompt setting
Method      SC     BC     TF     MS     IQ     AQ
Free-Bloom  92.47  94.39  92.14  93.38  76.58  62.37
ASTIG       96.38  95.87  96.71  96.02  77.03  62.80

Table 3 Quantitative ablation results for each module
Method          SC     BC     TF     MS     IQ     AQ
ASTIG           96.02  95.87  96.71  96.02  77.03  62.80
w/o LB          95.53  95.67  95.84  95.12  75.21  57.58
w/o TAA         72.15  85.18  86.84  88.67  76.82  56.23
w/o Att_former  96.38  95.83  96.38  95.74  76.98  61.36
w/o Att_first   89.21  91.58  94.72  96.53  75.98  59.78

Table 4 Quantitative ablation results for the number of intervention steps $N_{\mathrm{inv}}$
$N_{\mathrm{inv}}$  SC     BC     TF     MS     IQ     AQ
1                   96.38  95.87  96.71  96.02  77.03  62.80
2                   96.78  97.56  96.41  97.26  76.86  59.65
3                   97.41  97.86  96.81  97.56  75.66  58.77
4                   97.84  97.91  97.31  97.89  74.64  56.46

Table 5 Quantitative ablation results for the binding strength $\eta$
$\eta$  SC     BC     TF     MS     IQ     AQ
0       96.43  95.47  95.42  96.81  76.03  60.09
30      96.58  95.28  96.54  97.33  76.92  61.93
50      96.38  95.87  96.71  96.02  77.54  62.80
70      96.52  96.39  95.56  97.67  77.31  61.75
100     95.97  96.28  95.43  97.02  76.35  61.86
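For context on how scores of this kind are produced, the sketch below computes a simple frame-consistency score as the mean cosine similarity of adjacent-frame embeddings, scaled to 0–100. VBench [33], which defines the SC/BC/TF/MS metrics above, uses DINO and CLIP features with additional normalization; this CLIP-only version is a simplified stand-in, not the benchmark's exact procedure.

```python
# Simplified frame-consistency metric in the spirit of the SC/BC columns above.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def frame_consistency(frames) -> float:
    """frames: list of PIL images from one generated temporal sequence."""
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize embeddings
    sims = (feats[:-1] * feats[1:]).sum(dim=-1)       # cos sim of adjacent frames
    return float(sims.clamp(min=0).mean()) * 100      # scale to a 0-100 score
```
-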
[1] MOHSAN S A H, OTHMAN N Q H, LI Yanlong, et al. Unmanned aerial vehicles (UAVs): Practical aspects, applications, open challenges, security issues, and future trends[J]. Intelligent Service Robotics, 2023, 16(1): 109–137. doi: 10.1007/s11370-022-00452-4.
[2] LIU Yidan, YUE Jun, XIA Shaobo, et al. Diffusion models meet remote sensing: Principles, methods, and perspectives[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 4708322. doi: 10.1109/TGRS.2024.3464685.
[3] ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 10674–10685. doi: 10.1109/CVPR52688.2022.01042.
[4] ZHAO Hong and LI Wengai. Text-to-image generation model based on diffusion Wasserstein generative adversarial networks[J]. Journal of Electronics & Information Technology, 2023, 45(12): 4371–4381. doi: 10.11999/JEIT221400.
[5] KHACHATRYAN L, MOVSISYAN A, TADEVOSYAN V, et al. Text2Video-Zero: Text-to-image diffusion models are zero-shot video generators[C]. IEEE/CVF International Conference on Computer Vision, Paris, France, 2023: 15908–15918. doi: 10.1109/ICCV51070.2023.01462.
[6] WU J Z, GE Yixiao, WANG Xintao, et al. Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation[C]. IEEE/CVF International Conference on Computer Vision, Paris, France, 2023: 7589–7599. doi: 10.1109/ICCV51070.2023.00701.
[7] KINGMA D P and WELLING M. Auto-encoding variational Bayes[C]. The 2nd International Conference on Learning Representations, Banff, Canada, 2014.
[8] GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks[J]. Communications of the ACM, 2020, 63(11): 139–144. doi: 10.1145/3422622.
[9] SONG Miao, CHEN Zhiqiang, WANG Peisong, et al. DetDiffRS: A detail-enhanced diffusion model for remote sensing image super-resolution[J]. Journal of Electronics & Information Technology, 2025, 47(12): 4763–4778. doi: 10.11999/JEIT250995.
[10] BEJIGA M B, MELGANI F, and VASCOTTO A. Retro-remote sensing: Generating images from ancient texts[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019, 12(3): 950–960. doi: 10.1109/JSTARS.2019.2895693.
[11] CHRISTIE G, FENDLEY N, WILSON J, et al. Functional map of the world[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 6172–6180. doi: 10.1109/CVPR.2018.00646.
[12] VAN ETTEN A, LINDENBAUM D, and BACASTOW T M. SpaceNet: A remote sensing dataset and challenge series[J]. arXiv preprint arXiv:1807.01232, 2018. doi: 10.48550/arXiv.1807.01232.
[13] LIU Chenyang, CHEN Keyan, ZHAO Rui, et al. Text2Earth: Unlocking text-driven remote sensing image generation with a global-scale dataset and a foundation model[J]. IEEE Geoscience and Remote Sensing Magazine, 2025, 13(3): 238–259. doi: 10.1109/MGRS.2025.3560455.
[14] TANG Datao, CAO Xiangyong, WU Xuan, et al. AeroGen: Enhancing remote sensing object detection with diffusion-driven data generation[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2025: 3614–3624. doi: 10.1109/CVPR52734.2025.00342.
[15] DENG Zijun, HE Xiangteng, and PENG Yuxin. Text-to-video generation: Research status, progress and challenges[J]. Journal of Electronics & Information Technology, 2024, 46(5): 1632–1644. doi: 10.11999/JEIT240074.
[16] QASIM I, HORSCH A, and PRASAD D. Dense video captioning: A survey of techniques, datasets and evaluation protocols[J]. ACM Computing Surveys, 2025, 57(6): 154. doi: 10.1145/3712059.
[17] HO J, SALIMANS T, GRITSENKO A, et al. Video diffusion models[C]. Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 628.
[18] BLATTMANN A, ROMBACH R, LING Huan, et al. Align your latents: High-resolution video synthesis with latent diffusion models[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 22563–22575. doi: 10.1109/CVPR52729.2023.02161.
[19] WANG Yaohui, CHEN Xinyuan, MA Xin, et al. LaVie: High-quality video generation with cascaded latent diffusion models[J]. International Journal of Computer Vision, 2025, 133(5): 3059–3078. doi: 10.1007/s11263-024-02295-1.
[20] YANG Zhuoyi, TENG Jiayan, ZHENG Wendi, et al. CogVideoX: Text-to-video diffusion models with an expert transformer[C]. The 13th International Conference on Learning Representations, Singapore, 2025.
[21] GUO Yuwei, YANG Ceyuan, RAO Anyi, et al. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning[C]. The 12th International Conference on Learning Representations, Vienna, Austria, 2024.
[22] HUANG Hanzhuo, FENG Yufan, SHI Cheng, et al. Free-Bloom: Zero-shot text-to-video generator with LLM director and LDM animator[C]. Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, 2023: 1138.
[23] ZHU Hanxin, HE Tianyu, TANG Anni, et al. Compositional 3D-aware video generation with LLM director[C]. Proceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2024: 4184.
[24] MA Yong, HAN Huasong, SHEN Shiyuan, et al. T2CV-zero: Zero-shot character video generation via text-to-motion model[C]. International Joint Conference on Neural Networks, Rome, Italy, 2025: 1–8. doi: 10.1109/IJCNN64981.2025.11229098.
[25] LI Zhen, LI Chuanhao, MAO Xiaofeng, et al. Sekai: A video dataset towards world exploration[J]. arXiv preprint arXiv:2506.15675, 2025. doi: 10.48550/arXiv.2506.15675.
[26] WEN Longyin, DU Dawei, ZHU Pengfei, et al. Detection, tracking, and counting meets drones in crowds: A benchmark[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 7808–7817. doi: 10.1109/CVPR46437.2021.00772.
[27] MOU Lichao, HUA Yuansheng, JIN Pu, et al. ERA: A data set and deep learning benchmark for event recognition in aerial videos[J]. IEEE Geoscience and Remote Sensing Magazine, 2020, 8(4): 125–133. doi: 10.1109/MGRS.2020.3005751.
[28] WEI J, WANG Xuezhi, SCHUURMANS D, et al. Chain-of-thought prompting elicits reasoning in large language models[C]. Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 1800.
[29] RASSIN R, HIRSCH E, GLICKMAN D, et al. Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment[C]. Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, 2023: 157.
[30] SONG Jiaming, MENG Chenlin, and ERMON S. Denoising diffusion implicit models[C]. The 9th International Conference on Learning Representations, Virtual Event, 2021.
[31] BAI Shuai, CAI Yuxuan, CHEN Ruizhe, et al. Qwen3-VL technical report[J]. arXiv preprint arXiv:2511.21631, 2025. doi: 10.48550/arXiv.2511.21631.
[32] Gemini Team, Google. Gemini: A family of highly capable multimodal models[J]. arXiv preprint arXiv:2312.11805, 2025. doi: 10.48550/arXiv.2312.11805.
[33] HUANG Ziqi, HE Yinan, YU Jiashuo, et al. VBench: Comprehensive benchmark suite for video generative models[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 21807–21818. doi: 10.1109/CVPR52733.2024.02060.
[34] WANG Fuyun, HUANG Zhaoyang, BIAN Weikang, et al. AnimateLCM: Computation-efficient personalized style video generation without personalized video data[C]. SIGGRAPH Asia 2024 Technical Communications, Tokyo, Japan, 2024: 23. doi: 10.1145/3681758.3698013.
[35] ZHANG Yabo, WEI Yuxiang, LIN Xianhui, et al. VideoElevator: Elevating video generation quality with versatile text-to-image diffusion models[C]. Proceedings of the 39th AAAI Conference on Artificial Intelligence, Philadelphia, USA, 2025: 10266–10274. doi: 10.1609/aaai.v39i10.33114.
-