Aerial Spatio-Temporal Image Generation via Latent Diffusion Model
-
Abstract: Aerial Earth observation plays a key role in environmental monitoring, disaster early warning, urban planning, and related fields. However, constrained by practical factors such as flight-platform endurance and mission-window timeliness, the acquired aerial images often fail to fully characterize the long-term evolution of the Earth's surface. Although pre-trained diffusion models have shown great potential in image generation, their application to the aerial domain still faces severe challenges. On the one hand, aerial observation is susceptible to weather and flight conditions, so high-quality long-term annotated samples are difficult to obtain and cannot adequately support model training. On the other hand, aerial platforms fly at flexible, variable altitudes and ground objects span a wide range of scales, so textual semantics are difficult to match precisely with multi-scale visual features, which leads to semantic-visual alignment bias. To address these challenges, this paper proposes a training-free framework for Aerial Spatio-Temporal Image Generation (ASTIG). The framework first uses a vision-language model to construct descriptions of temporal surface evolution in aerial scenes and introduces a dynamic semantic decomposition process that converts complex temporal descriptions into frame-level visual prompts, providing fine-grained semantic guidance for the inference stage. It then proposes a novel linguistic binding strategy that establishes explicit associations between key ground objects in the text and their corresponding visual attributes within the cross-attention mechanism of the diffusion model, strengthening the model's ability to capture semantic features. Finally, a temporal anchor attention mechanism is integrated, which uses dual-reference-frame constraints to keep the subject and background temporally consistent across frames, effectively suppressing inter-frame content drift. Qualitative and quantitative results show that the proposed method outperforms baseline models in both temporal continuity and semantic fidelity, offering a new paradigm for aerial temporal image generation.
Extended Abstract:
Objective: Aerial Earth observation plays a pivotal role in environmental monitoring, disaster warning, and urban planning. However, constrained by flight-platform endurance, mission-window timeliness, and other operational limitations, the acquired aerial imagery often fails to fully characterize the long-term evolutionary processes of the Earth's surface. Although pre-trained diffusion models have demonstrated considerable potential in image generation, their application in the aerial domain remains challenging due to the scarcity of high-quality temporal annotation data and the semantic–visual misalignment arising from variable observation scales. To address these challenges, this paper proposes ASTIG, a training-free framework for Aerial Spatio-Temporal Image Generation. By leveraging the generative priors of pre-trained latent diffusion models and large language models, ASTIG establishes a novel paradigm for semantically controllable aerial temporal image generation.
Methods: ASTIG comprises three synergistically designed components. First, a dynamic semantic decomposition process is introduced to automatically parse complex aerial scene-evolution descriptions into frame-level visual prompts, compensating for the lack of temporal semantic annotations in existing aerial image-text datasets. Second, a linguistic binding strategy is proposed to establish explicit associations between key ground objects and their corresponding visual attributes within the cross-attention mechanism of the diffusion model, thereby enhancing the semantic response precision of generated imagery. Third, a temporal anchor attention mechanism is integrated, which employs dual reference frames to enforce subject stability and background consistency across the generated temporal images, effectively suppressing inter-frame temporal drift under training-free conditions.
Results and Discussions: ASTIG and baseline models are evaluated on 7,236 high-quality aerial spatio-temporal descriptions across six automated metrics: subject consistency, background consistency, temporal flickering, motion smoothness, aesthetic quality, and imaging quality. Quantitative results (Tables 1 and 2) demonstrate the superiority of ASTIG in temporal image generation, with improvements of 3.91% and 4.57% in subject consistency and temporal flickering over the frame-prompt baseline, respectively. Qualitative comparisons (Fig. 4) further highlight its robust capability in modeling long-term geospatial evolutionary imagery. Ablation studies validate the individual effectiveness of the linguistic binding strategy and the temporal anchor attention mechanism (Table 3 and Fig. 5). Sensitivity analyses of the intervention steps (Table 4 and Fig. 6) and the binding strength (Table 5 and Fig. 7) systematically explore the optimal parameter configuration. Furthermore, extensibility experiments under satellite perspectives (Figs. 8 and 9) reveal the potential of ASTIG to generalize beyond aerial platforms to broader Earth observation scenarios.
Conclusions: This paper proposes ASTIG, a training-free framework for aerial spatio-temporal image generation that addresses the scarcity of high-quality long-term temporal data and semantic–visual misalignment.
By leveraging the generative priors of pre-trained latent diffusion models and large language models, ASTIG integrates a dynamic semantic decomposition process, a linguistic binding strategy, and a temporal anchor attention mechanism to jointly address temporal semantic construction, semantic response precision, and inter-frame consistency. Experimental results demonstrate that ASTIG outperforms existing baseline methods across multiple automated evaluation metrics, offering a novel paradigm for aerial spatio-temporal image generation. As a training-free method, the performance of ASTIG is inherently bounded by the prior knowledge of the backbone model. Future work will explore geometric correction and nadir-view prior constraints to better align generated results with the physical properties of satellite imagery.
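The three components are described only at a high level above; the minimal sketches below illustrate one way each idea could be realized on top of an off-the-shelf latent diffusion pipeline. They are reconstructions from the abstract, not the authors' implementation; every function name, model choice, and scaling constant is an assumption. First, dynamic semantic decomposition can be read as a single LLM call that expands one scene-evolution description into per-frame prompts:

```python
# Hypothetical sketch of dynamic semantic decomposition: an LLM expands one
# aerial scene-evolution description into frame-level visual prompts.
# The prompt template and model name are placeholders, not the paper's setup.
from openai import OpenAI

client = OpenAI()

def decompose_description(description: str, num_frames: int = 8) -> list[str]:
    instruction = (
        f"Decompose this aerial scene-evolution description into {num_frames} "
        f"chronologically ordered frame-level prompts, one per line, each "
        f"describing a single static aerial image:\n{description}"
    )
    reply = client.chat.completions.create(
        model="gpt-4o",  # any instruction-following LLM would serve here
        messages=[{"role": "user", "content": instruction}],
    )
    lines = [l.strip() for l in reply.choices[0].message.content.splitlines()]
    return [l for l in lines if l][:num_frames]
```

The linguistic binding strategy associates key ground objects with their visual attributes inside cross-attention; a plausible reading, in the spirit of the attention-map alignment of [29], is to nudge the latent at selected denoising steps so that each attribute token's attention map agrees with its bound noun's map, with $\eta$ (cf. Table 5) controlling the strength of the nudge:

```python
# Hypothetical sketch of linguistic binding via attention-map alignment.
import torch

def binding_loss(attn_maps: torch.Tensor, pairs: list[tuple[int, int]]) -> torch.Tensor:
    """attn_maps: (num_tokens, H, W) cross-attention maps for the text tokens.
    pairs: (noun_idx, attribute_idx) bindings extracted from the prompt."""
    loss = torch.zeros(())
    for noun, attr in pairs:
        a, b = attn_maps[noun].flatten(), attn_maps[attr].flatten()
        loss = loss + (1 - torch.cosine_similarity(a, b, dim=0))
    return loss

def apply_binding(latent, get_attn_maps, pairs, eta=50.0):
    """One gradient step pulling attribute attention toward its bound noun.
    get_attn_maps(latent) -> (num_tokens, H, W); an assumed hook into the UNet."""
    latent = latent.detach().requires_grad_(True)
    grad = torch.autograd.grad(binding_loss(get_attn_maps(latent), pairs), latent)[0]
    return (latent - eta * 1e-3 * grad).detach()  # 1e-3 scale is an arbitrary choice
```

Finally, "dual reference frames", together with the Att_first / Att_former ablations in Table 3, suggests that each frame's self-attention keys and values are drawn from the first frame (global anchor) and the immediately preceding frame (local anchor), similar in spirit to the cross-frame attention of Text2Video-Zero [5]. A sketch, with shapes and equal anchor weighting assumed:

```python
# Hypothetical sketch of temporal anchor attention with dual reference frames.
import torch
import torch.nn.functional as F

def temporal_anchor_attention(q, k, v, frame_idx, k_cache, v_cache):
    """q, k, v: (tokens, dim) self-attention projections of the current frame.
    k_cache / v_cache: K/V tensors of the frames generated so far."""
    if frame_idx == 0:
        k_ref, v_ref = k, v  # the first frame can only attend to itself
    else:
        # dual anchors: first frame (index 0) and former frame (index -1)
        k_ref = torch.cat([k_cache[0], k_cache[-1]], dim=0)
        v_ref = torch.cat([v_cache[0], v_cache[-1]], dim=0)
    attn = F.softmax(q @ k_ref.T / q.shape[-1] ** 0.5, dim=-1)
    k_cache.append(k)
    v_cache.append(v)  # current frame becomes the "former" anchor for the next one
    return attn @ v_ref
```
-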
Table 1 Quantitative comparison between the proposed method and baseline methods under the single-prompt setting
Method                           SC     BC     TF     MS     IQ     AQ
LaVie [19]                       94.87  96.31  95.12  96.89  63.15  50.23
AnimateLCM [34]                  96.13  97.68  94.07  95.94  53.48  52.87
CogVideoX [20]                   95.58  97.48  98.54  98.76  49.57  48.83
VideoElevator (LaVie) [35]       92.34  95.18  93.21  93.45  65.53  60.18
VideoElevator (AnimateLCM) [35]  93.96  96.23  92.43  93.57  57.14  58.23
Text2Video-Zero [5]              89.55  93.89  85.03  87.64  74.82  51.23
Free-Bloom_s [22]                96.68  97.13  93.89  94.21  77.19  61.36
ASTIG_s                          97.01  97.34  96.47  97.24  78.03  61.72

Table 2 Quantitative comparison between the proposed method and baseline methods under the frame-prompt setting
Method      SC     BC     TF     MS     IQ     AQ
Free-Bloom  92.47  94.39  92.14  93.38  76.58  62.37
ASTIG       96.38  95.87  96.71  96.02  77.03  62.80

Table 3 Quantitative ablation results for each module
Method          SC     BC     TF     MS     IQ     AQ
ASTIG           96.02  95.87  96.71  96.02  77.03  62.80
w/o LB          95.53  95.67  95.84  95.12  75.21  57.58
w/o TAA         72.15  85.18  86.84  88.67  76.82  56.23
w/o Att_former  96.38  95.83  96.38  95.74  76.98  61.36
w/o Att_first   89.21  91.58  94.72  96.53  75.98  59.78

Table 4 Quantitative ablation results for the number of intervention steps $N_{\mathrm{inv}}$
$N_{\mathrm{inv}}$  SC     BC     TF     MS     IQ     AQ
1                   96.38  95.87  96.71  96.02  77.03  62.80
2                   96.78  97.56  96.41  97.26  76.86  59.65
3                   97.41  97.86  96.81  97.56  75.66  58.77
4                   97.84  97.91  97.31  97.89  74.64  56.46

Table 5 Quantitative ablation results for the binding strength $\eta$
$\eta$  SC     BC     TF     MS     IQ     AQ
0       96.43  95.47  95.42  96.81  76.03  60.09
30      96.58  95.28  96.54  97.33  76.92  61.93
50      96.38  95.87  96.71  96.02  77.54  62.80
70      96.52  96.39  95.56  97.67  77.31  61.75
100     95.97  96.28  95.43  97.02  76.35  61.86
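For context on how scores of this kind are produced, the sketch below computes a simple frame-consistency score as the mean cosine similarity of adjacent-frame embeddings, scaled to 0–100. VBench [33], which defines the SC/BC/TF/MS metrics above, uses DINO and CLIP features with additional normalization; this CLIP-only version is a simplified stand-in, not the benchmark's exact procedure.

```python
# Simplified frame-consistency metric in the spirit of the SC/BC columns above.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def frame_consistency(frames) -> float:
    """frames: list of PIL images from one generated temporal sequence."""
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize embeddings
    sims = (feats[:-1] * feats[1:]).sum(dim=-1)       # cos sim of adjacent frames
    return float(sims.clamp(min=0).mean()) * 100      # scale to a 0-100 score
```
-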
[1] MOHSAN S A H, OTHMAN N Q H, LI Yanlong, et al. Unmanned aerial vehicles (UAVs): Practical aspects, applications, open challenges, security issues, and future trends[J]. Intelligent Service Robotics, 2023, 16(1): 109–137. doi: 10.1007/s11370-022-00452-4.
[2] LIU Yidan, YUE Jun, XIA Shaobo, et al. Diffusion models meet remote sensing: Principles, methods, and perspectives[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 4708322. doi: 10.1109/TGRS.2024.3464685.
[3] ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 10674–10685. doi: 10.1109/CVPR52688.2022.01042.
[4] ZHAO Hong and LI Wengai. Text-to-image generation model based on diffusion Wasserstein generative adversarial networks[J]. Journal of Electronics & Information Technology, 2023, 45(12): 4371–4381. doi: 10.11999/JEIT221400.
[5] KHACHATRYAN L, MOVSISYAN A, TADEVOSYAN V, et al. Text2Video-Zero: Text-to-image diffusion models are zero-shot video generators[C]. IEEE/CVF International Conference on Computer Vision, Paris, France, 2023: 15908–15918. doi: 10.1109/ICCV51070.2023.01462.
[6] WU J Z, GE Yixiao, WANG Xintao, et al. Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation[C]. IEEE/CVF International Conference on Computer Vision, Paris, France, 2023: 7589–7599. doi: 10.1109/ICCV51070.2023.00701.
[7] KINGMA D P and WELLING M. Auto-encoding variational Bayes[C]. The 2nd International Conference on Learning Representations, Banff, Canada, 2014.
[8] GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks[J]. Communications of the ACM, 2020, 63(11): 139–144. doi: 10.1145/3422622.
[9] SONG Miao, CHEN Zhiqiang, WANG Peisong, et al. DetDiffRS: A detail-enhanced diffusion model for remote sensing image super-resolution[J]. Journal of Electronics & Information Technology, 2025, 47(12): 4763–4778. doi: 10.11999/JEIT250995.
[10] BEJIGA M B, MELGANI F, and VASCOTTO A. Retro-remote sensing: Generating images from ancient texts[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019, 12(3): 950–960. doi: 10.1109/JSTARS.2019.2895693.
[11] CHRISTIE G, FENDLEY N, WILSON J, et al. Functional map of the world[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 6172–6180. doi: 10.1109/CVPR.2018.00646.
[12] VAN ETTEN A, LINDENBAUM D, and BACASTOW T M. SpaceNet: A remote sensing dataset and challenge series[J]. arXiv preprint arXiv:1807.01232, 2018. doi: 10.48550/arXiv.1807.01232.
[13] LIU Chenyang, CHEN Keyan, ZHAO Rui, et al. Text2Earth: Unlocking text-driven remote sensing image generation with a global-scale dataset and a foundation model[J]. IEEE Geoscience and Remote Sensing Magazine, 2025, 13(3): 238–259. doi: 10.1109/MGRS.2025.3560455.
[14] TANG Datao, CAO Xiangyong, WU Xuan, et al. AeroGen: Enhancing remote sensing object detection with diffusion-driven data generation[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2025: 3614–3624. doi: 10.1109/CVPR52734.2025.00342.
[15] DENG Zijun, HE Xiangteng, and PENG Yuxin. Text-to-video generation: Research status, progress and challenges[J]. Journal of Electronics & Information Technology, 2024, 46(5): 1632–1644. doi: 10.11999/JEIT240074.
[16] QASIM I, HORSCH A, and PRASAD D. Dense video captioning: A survey of techniques, datasets and evaluation protocols[J]. ACM Computing Surveys, 2025, 57(6): 154. doi: 10.1145/3712059.
[17] HO J, SALIMANS T, GRITSENKO A, et al. Video diffusion models[C]. Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 628.
[18] BLATTMANN A, ROMBACH R, LING Huan, et al. Align your latents: High-resolution video synthesis with latent diffusion models[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 22563–22575. doi: 10.1109/CVPR52729.2023.02161.
[19] WANG Yaohui, CHEN Xinyuan, MA Xin, et al. LaVie: High-quality video generation with cascaded latent diffusion models[J]. International Journal of Computer Vision, 2025, 133(5): 3059–3078. doi: 10.1007/s11263-024-02295-1.
[20] YANG Zhuoyi, TENG Jiayan, ZHENG Wendi, et al. CogVideoX: Text-to-video diffusion models with an expert transformer[C]. The 13th International Conference on Learning Representations, Singapore, 2025.
[21] GUO Yuwei, YANG Ceyuan, RAO Anyi, et al. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning[C]. The 12th International Conference on Learning Representations, Vienna, Austria, 2024.
[22] HUANG Hanzhuo, FENG Yufan, SHI Cheng, et al. Free-Bloom: Zero-shot text-to-video generator with LLM director and LDM animator[C]. Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, 2023: 1138.
[23] ZHU Hanxin, HE Tianyu, TANG Anni, et al. Compositional 3D-aware video generation with LLM director[C]. Proceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2024: 4184.
[24] MA Yong, HAN Huasong, SHEN Shiyuan, et al. T2CV-zero: Zero-shot character video generation via text-to-motion model[C]. International Joint Conference on Neural Networks, Rome, Italy, 2025: 1–8. doi: 10.1109/IJCNN64981.2025.11229098.
[25] LI Zhen, LI Chuanhao, MAO Xiaofeng, et al. Sekai: A video dataset towards world exploration[J]. arXiv preprint arXiv:2506.15675, 2025. doi: 10.48550/arXiv.2506.15675.
[26] WEN Longyin, DU Dawei, ZHU Pengfei, et al. Detection, tracking, and counting meets drones in crowds: A benchmark[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 7808–7817. doi: 10.1109/CVPR46437.2021.00772.
[27] MOU Lichao, HUA Yuansheng, JIN Pu, et al. ERA: A data set and deep learning benchmark for event recognition in aerial videos[J]. IEEE Geoscience and Remote Sensing Magazine, 2020, 8(4): 125–133. doi: 10.1109/MGRS.2020.3005751.
[28] WEI J, WANG Xuezhi, SCHUURMANS D, et al. Chain-of-thought prompting elicits reasoning in large language models[C]. Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 1800.
[29] RASSIN R, HIRSCH E, GLICKMAN D, et al. Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment[C]. Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, 2023: 157.
[30] SONG Jiaming, MENG Chenlin, and ERMON S. Denoising diffusion implicit models[C]. The 9th International Conference on Learning Representations, Virtual Event, 2021.
[31] BAI Shuai, CAI Yuxuan, CHEN Ruizhe, et al. Qwen3-VL technical report[J]. arXiv preprint arXiv:2511.21631, 2025. doi: 10.48550/arXiv.2511.21631.
[32] Gemini Team, Google. Gemini: A family of highly capable multimodal models[J]. arXiv preprint arXiv:2312.11805, 2025. doi: 10.48550/arXiv.2312.11805.
[33] HUANG Ziqi, HE Yinan, YU Jiashuo, et al. VBench: Comprehensive benchmark suite for video generative models[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 21807–21818. doi: 10.1109/CVPR52733.2024.02060.
[34] WANG Fuyun, HUANG Zhaoyang, BIAN Weikang, et al. AnimateLCM: Computation-efficient personalized style video generation without personalized video data[C]. SIGGRAPH Asia 2024 Technical Communications, Tokyo, Japan, 2024: 23. doi: 10.1145/3681758.3698013.
[35] ZHANG Yabo, WEI Yuxiang, LIN Xianhui, et al. VideoElevator: Elevating video generation quality with versatile text-to-image diffusion models[C]. Proceedings of the 39th AAAI Conference on Artificial Intelligence, Philadelphia, USA, 2025: 10266–10274. doi: 10.1609/aaai.v39i10.33114.
-