Aerial Spatio-Temporal Image Generation via Latent Diffusion Models
-
摘要: 航空对地观测在环境监测、灾害预警和城市规划等领域发挥着关键作用。然而,受限于飞行平台续航能力、任务窗口时效性等现实约束,所获取的航空图像往往难以完整刻画地表的长期演化过程。尽管预训练扩散模型在图像生成领域展现出巨大潜力,但其在航空领域的应用仍面临严峻挑战。一方面,航空观测易受天气及飞行条件制约,高质量的长时序标注样本获取困难,难以支撑模型的有效训练;另一方面,航空平台飞行高度灵活多变,地物尺度跨度大,使得文本描述语义难以与多尺度视觉特征精确匹配,进而引发语义-视觉对齐偏差问题。为应对上述挑战,该文提出一个用于航空时序图像生成的免训练框架(ASTIG)。该框架首先利用视觉语言模型构建航空时序地表演变描述,并引入动态语义分解过程将复杂的时序描述转化为帧级视觉提示,为推理阶段提供细粒度的语义引导。随后,创新性地提出一种语言绑定策略,在扩散模型的交叉注意力机制中建立文本中关键地物与对应视觉属性的显式关联,从而增强模型对语义特征的捕捉能力。最后,集成时序锚点注意力机制,利用双参考帧约束确保主体与背景在帧间的时序一致性,有效抑制帧间的时序内容偏移问题。定性和定量实验结果表明,所提方法在时序连续性及语义保真度上均优于基线模型,为航空时序图像生成提供了新范式。Abstract:
Objective Aerial Earth observation plays a pivotal role in environmental monitoring, disaster warning, and urban planning. However, constraints such as flight-platform endurance and mission-window timeliness often prevent acquired aerial imagery from fully characterizing the long-term evolution of the Earth's surface. Although pre-trained latent diffusion models have shown strong potential for image generation, their application in aerial scenarios remains challenging because of the scarcity of high-quality temporal annotation data and semantic-visual misalignment caused by variable observation scales. To address these challenges, this paper proposes ASTIG, a training-free framework for Aerial Spatio-Temporal Image Generation. By leveraging the generative priors of pre-trained latent diffusion models and Large Language Models (LLMs), ASTIG provides a new paradigm for semantically controllable aerial spatio-temporal image generation. Methods ASTIG consists of three coordinated components. First, a dynamic semantic decomposition process is proposed to parse complex descriptions of aerial scene evolution into frame-level visual prompts, thereby compensating for the lack of temporal semantic annotations in existing aerial image-text datasets. Second, a Linguistic Binding (LB) strategy is proposed to establish explicit associations between key ground objects and their corresponding visual attributes within the cross-attention mechanism of the diffusion model, thereby improving the semantic response precision of the generated images. Third, a Temporal Anchor Attention (TAA) mechanism is incorporated. It uses dual reference frames to maintain subject stability and background consistency across the generated spatio-temporal image sequence, thus suppressing inter-frame temporal drift under training-free conditions. Results and Discussions ASTIG and the baseline methods are evaluated on 7,236 high-quality aerial spatio-temporal descriptions using six automated metrics, including subject consistency, background consistency, temporal flickering, motion smoothness, aesthetic quality, and imaging quality. Quantitative results ( Tables 1 and2 ) show that ASTIG outperforms the baseline methods in spatio-temporal image generation, with improvements of 3.91% in subject consistency and 4.57% in motion smoothness over the frame-prompt baseline. Qualitative comparisons (Fig. 4 ) further show its strong ability to model long-term surface evolution in aerial imagery. Ablation studies validate the individual effectiveness of the LB strategy and the TAA mechanism (Table 3 andFig. 5 ). Sensitivity analyses of the intervention steps (Table 4 andFig. 6 ) and binding strength (Table 5 andFig. 7 ) further identify suitable parameter settings. Extension experiments from satellite perspectives (Figs. 8 and9 ) also show that ASTIG has the potential to generalize beyond aerial platforms to broader Earth observation scenarios.Conclusions This paper proposes ASTIG, a training-free framework for aerial spatio-temporal image generation that addresses the scarcity of high-quality long-term temporal data and semantic-visual misalignment. By leveraging the generative priors of pre-trained latent diffusion models and LLMs, ASTIG integrates a dynamic semantic decomposition process, an LB strategy, and a TAA mechanism to improve temporal semantic construction, semantic response precision, and inter-frame consistency. Experimental results show that ASTIG outperforms existing baseline methods across multiple automated evaluation metrics, providing a new paradigm for aerial spatio-temporal image generation. As a training-free method, ASTIG is still limited by the prior knowledge of the backbone model. Future work will examine geometric correction and nadir-view prior constraints to better align the generated results with the physical properties of satellite imagery. -
表 1 单提示设置下本文方法与基线方法的定量对比结果(%)
方法 SC BC TF MS IQ AQ LaVie[19] 94.87 96.31 95.12 96.89 63.15 50.23 AnimateLCM[34] 96.13 97.68 94.07 95.94 53.48 52.87 CogVideoX[20] 95.58 97.48 98.54 98.76 49.57 48.83 Videoelevator (LaVie) [35] 92.34 95.18 93.21 93.45 65.53 60.18 Videoelevator
(AnimateLCM) [35]93.96 96.23 92.43 93.57 57.14 58.23 Text2Video-Zero[5] 89.55 93.89 85.03 87.64 74.82 51.23 Free-Blooms[22] 96.68 97.13 93.89 94.21 77.19 61.36 ASTIGs 97.01 97.34 96.47 97.24 78.03 61.72 表 2 帧提示设置下本文方法与基线方法的定量对比结果(%)
方法 SC BC TF MS IQ AQ Free-Bloom 92.47 94.39 92.14 93.38 76.58 62.37 ASTIG 96.38 95.87 96.71 96.02 77.03 62.80 表 3 各模块消融定量实验结果(%)
方法 SC BC TF MS IQ AQ ASTIG 96.02 95.87 96.71 96.02 77.03 62.80 w/o LB 95.53 95.67 95.84 95.12 75.21 57.58 w/o TAA 72.15 85.18 86.84 88.67 76.82 56.23 w/o $ \text{At}{\mathrm{t}}_{\text{former}} $ 96.38 95.83 96.38 95.74 76.98 61.36 w/o $ {\text{Att}}_{\text{first}} $ 89.21 91.58 94.72 96.53 75.98 59.78 表 4 干预步数$ N\mathrm{_{inv}} $ 的消融定量实验结果(%)
$ N\mathrm{_{inv}} $ SC BC TF MS IQ AQ 1 96.38 95.87 96.71 96.02 77.03 62.80 2 96.78 97.56 96.41 97.26 76.86 59.65 3 97.41 97.86 96.81 97.56 75.66 58.77 4 97.84 97.91 97.31 97.89 74.64 56.46 表 5 绑定强度 $ \eta $的消融定量实验结果(%)
SC BC TF MS IQ AQ 0 96.43 95.47 95.42 96.81 76.03 60.09 30 96.58 95.28 96.54 97.33 76.92 61.93 50 96.38 95.87 96.71 96.02 77.54 62.80 70 96.52 96.39 95.56 97.67 77.31 61.75 100 95.97 96.28 95.43 97.02 76.35 61.86 -
[1] MOHSAN S A H, OTHMAN N Q H, LI Yanlong, et al. Unmanned aerial vehicles (UAVs): Practical aspects, applications, open challenges, security issues, and future trends[J]. Intelligent Service Robotics, 2023, 16(1): 109–137. doi: 10.1007/s11370-022-00452-4. [2] LIU Yidan, YUE Jun, XIA Shaobo, et al. Diffusion models meet remote sensing: Principles, methods, and perspectives[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 4708322. doi: 10.1109/TGRS.2024.3464685. [3] ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 10674–10685. doi: 10.1109/CVPR52688.2022.01042. [4] 赵宏, 李文改. 基于扩散生成对抗网络的文本生成图像模型研究[J]. 电子与信息学报, 2023, 45(12): 4371–4381. doi: 10.11999/JEIT221400.ZHAO Hong and LI Wengai. Text-to-image generation model based on diffusion wasserstein generative adversarial networks[J]. Journal of Electronics & Information Technology, 2023, 45(12): 4371–4381. doi: 10.11999/JEIT221400. [5] KHACHATRYAN L, MOVSISYAN A, TADEVOSYAN V, et al. Text2Video-zero: Text-to-image diffusion models are zero-shot video generators[C]. IEEE/CVF International Conference on Computer Vision, Paris, France, 2023: 15908–15918. doi: 10.1109/ICCV51070.2023.01462. [6] WU J Z, GE Yixiao, WANG Xintao, et al. Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation[C]. IEEE/CVF International Conference on Computer Vision, Paris, France, 2023: 7589–7599. doi: 10.1109/ICCV51070.2023.00701. [7] KINGMA D P and WELLING M. Auto-encoding variational Bayes[C]. The 2nd International Conference on Learning Representations, Banff, Canada, 2014. [8] GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks[J]. Communications of the ACM, 2020, 63(11): 139–144. doi: 10.1145/3422622. [9] 宋淼, 陈志强, 王培松, 等. DetDiffRS: 面向细节优化的遥感图像超分辨率扩散模型[J]. 电子与信息学报, 2025, 47(12): 4763–4778. doi: 10.11999/JEIT250995.SONG Miao, CHEN Zhiqiang, WANG Peisong, et al. DetDiffRS: A detail-enhanced diffusion model for remote sensing image super-resolution[J]. Journal of Electronics & Information Technology, 2025, 47(12): 4763–4778. doi: 10.11999/JEIT250995. [10] BEJIGA M B, MELGANI F, and VASCOTTO A. Retro-remote sensing: Generating images from ancient texts[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019, 12(3): 950–960. doi: 10.1109/JSTARS.2019.2895693. [11] CHRISTIE G, FENDLEY N, WILSON J, et al. Functional map of the world[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 6172–6180. doi: 10.1109/CVPR.2018.00646. [12] VAN ETTEN A, LINDENBAUM D, and BACASTOW T M. SpaceNet: A remote sensing dataset and challenge series[J]. arXiv preprint arXiv: 1807.01232, 2018. doi: 10.48550/arXiv.1807.01232. [13] LIU Chenyang, CHEN Keyan, ZHAO Rui, et al. Text2Earth: Unlocking text-driven remote sensing image generation with a global-scale dataset and a foundation model[J]. IEEE Geoscience and Remote Sensing Magazine, 2025, 13(3): 238–259. doi: 10.1109/MGRS.2025.3560455. [14] TANG Datao, CAO Xiangyong, WU Xuan, et al. AeroGen: Enhancing remote sensing object detection with diffusion-driven data generation[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2025: 3614–3624. doi: 10.1109/CVPR52734.2025.00342. [15] 邓梓焌, 何相腾, 彭宇新. 文本到视频生成: 研究现状、进展和挑战[J]. 电子与信息学报, 2024, 46(5): 1632–1644. doi: 10.11999/JEIT240074.DENG Zijun, HE Xiangteng, and PENG Yuxin. Text-to-video generation: Research status, progress and challenges[J]. Journal of Electronics & Information Technology, 2024, 46(5): 1632–1644. doi: 10.11999/JEIT240074. [16] QASIM I, HORSCH A, and PRASAD D. Dense video captioning: A survey of techniques, datasets and evaluation protocols[J]. ACM Computing Surveys, 2025, 57(6): 154. doi: 10.1145/3712059. [17] HO J, SALIMANS T, GRITSENKO A, et al. Video diffusion models[C]. The 36th International Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 628. [18] BLATTMANN A, ROMBACH R, LING Huan, et al. Align your latents: High-resolution video synthesis with latent diffusion models[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 22563–22575. doi: 10.1109/CVPR52729.2023.02161. [19] WANG Yaohui, CHEN Xinyuan, MA Xin, et al. LaVie: High-quality video generation with cascaded latent diffusion models[J]. International Journal of Computer Vision, 2025, 133(5): 3059–3078. doi: 10.1007/s11263-024-02295-1. [20] YANG Zhuoyi, TENG Jiayan, ZHENG Wendi, et al. CogVideoX: Text-to-video diffusion models with an expert transformer[C]. The 13th International Conference on Learning Representations, Singapore, Singapore, 2025. [21] GUO Yuwei, YANG Ceyuan, RAO Anyi, et al. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning[C]. The 12th International Conference on Learning Representations, Vienna, Austria, 2024. [22] HUANG Hanzhuo, FENG Yufan, SHI Cheng, et al. Free-bloom: Zero-shot text-to-video generator with LLM director and LDM animator[C]. The 37th International Conference on Neural Information Processing Systems, New Orleans, USA, 2023: 1138. [23] ZHU Hanxin, HE Tianyu, TANG Anni, et al. Compositional 3D-aware video generation with LLM director[C]. The 38th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc. , 2024: 4184. [24] MA Yong, HAN Huasong, SHEN Shiyuan, et al. T2CV-zero: Zero-shot character video generation via text-to-motion model[C]. International Joint Conference on Neural Networks, Rome, Italy, 2025: 1–8. doi: 10.1109/IJCNN64981.2025.11229098. [25] LI Zhen, LI Chuanhao, MAO Xiaofeng, et al. Sekai: A video dataset towards world exploration[J]. arXiv preprint arXiv: 2506.15675, 2025. doi: 10.48550/arXiv.2506.15675. [26] WEN Longyin, DU Dawei, ZHU Pengfei, et al. Detection, tracking, and counting meets drones in crowds: A benchmark[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 7808–7817. doi: 10.1109/CVPR46437.2021.00772. [27] MOU Lichao, HUA Yuansheng, JIN Pu, et al. ERA: A data set and deep learning benchmark for event recognition in aerial videos [software and data sets][J]. IEEE Geoscience and Remote Sensing Magazine, 2020, 8(4): 125–133. doi: 10.1109/MGRS.2020.3005751. [28] WEI J, WANG Xuezhi, SCHUURMANS D, et al. Chain-of-thought prompting elicits reasoning in large language models[C]. The 36th International Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 1800. [29] RASSIN R, HIRSCH E, GLICKMAN D, et al. Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment[C]. The 37th International Conference on Neural Information Processing Systems, New Orleans, USA, 2023: 157. [30] SONG Jiaming, MENG Chenlin, and ERMON S. Denoising diffusion implicit models[C/OL]. The 9th International Conference on Learning Representations, 2021. [31] BAI Shuai, CAI Yuxuan, CHEN Ruizhe, et al. Qwen3-VL technical report[J]. arXiv preprint arXiv: 2511.21631, 2025. doi: 10.48550/arXiv.2511.21631. [32] Gemini Team Google. Gemini: A family of highly capable multimodal models[J]. arXiv preprint arXiv: 2312.11805, 2025. doi: 10.48550/arXiv.2312.11805. [33] HUANG Ziqi, HE Yinan, YU Jiashuo, et al. VBench: Comprehensive benchmark suite for video generative models[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 21807–21818. doi: 10.1109/CVPR52733.2024.02060. [34] WANG Fuyun, HUANG Zhaoyang, BIAN Weikang, et al. AnimateLCM: Computation-efficient personalized style video generation without personalized video data[C]. SIGGRAPH Asia 2024 Technical Communications, Tokyo, Japan, 2024: 23. doi: 10.1145/3681758.3698013. [35] ZHANG Yabo, WEI Yuxiang, LIN Xianhui, et al. VideoElevator: Elevating video generation quality with versatile text-to-image diffusion models[C]. The 39th AAAI Conference on Artificial Intelligence, Philadelphia, USA, 2025: 10266–10274. doi: 10.1609/aaai.v39i10.33114. -
下载:
下载: