SHANG Yuying, HOU Yingyan, LIU Zinan, LU Wanxuan, HUANG Yuhong, WANG Yixiao, YU Hongfeng, FU Kun. Aerial Spatio-Temporal Image Generation via Latent Diffusion Models[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT260165

Aerial Spatio-Temporal Image Generation via Latent Diffusion Models

doi: 10.11999/JEIT260165 cstr: 32379.14.JEIT260165
Funds: The Strategic Priority Research Program of the Chinese Academy of Sciences (XDB1600000, XDB1600102)
  • Received Date: 2026-02-06
  • Revised Date: 2026-03-09
  • Accepted Date: 2026-04-09
  • Available Online: 2026-04-13
Objective: Aerial Earth observation plays a pivotal role in environmental monitoring, disaster warning, and urban planning. However, constraints such as flight-platform endurance and mission-window timeliness often prevent acquired aerial imagery from fully characterizing the long-term evolution of the Earth's surface. Although pre-trained latent diffusion models have shown strong potential for image generation, applying them to aerial scenarios remains challenging for two reasons: high-quality temporal annotation data are scarce, and variable observation scales cause semantic-visual misalignment. To address these challenges, this paper proposes ASTIG, a training-free framework for Aerial Spatio-Temporal Image Generation. By leveraging the generative priors of pre-trained latent diffusion models and Large Language Models (LLMs), ASTIG provides a new paradigm for semantically controllable aerial spatio-temporal image generation.

Methods: ASTIG consists of three coordinated components. First, a dynamic semantic decomposition process parses complex descriptions of aerial scene evolution into frame-level visual prompts, compensating for the lack of temporal semantic annotations in existing aerial image-text datasets. Second, a Linguistic Binding (LB) strategy establishes explicit associations between key ground objects and their corresponding visual attributes within the cross-attention mechanism of the diffusion model, improving the semantic response precision of the generated images. Third, a Temporal Anchor Attention (TAA) mechanism uses dual reference frames to maintain subject stability and background consistency across the generated spatio-temporal image sequence, suppressing inter-frame temporal drift under training-free conditions.
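As an illustration of the dual-reference idea behind TAA, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the two reference frames are the first frame and the immediately preceding frame, and that each frame's queries attend only to the keys and values of those two frames. The abstract does not specify either choice, and the function name `temporal_anchor_attention` is hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_anchor_attention(q, k, v):
    """Attention in which every frame reads from two reference frames.

    q, k, v: arrays of shape (frames, tokens, dim).
    Assumption (not from the paper): the reference frames for frame i
    are frame 0 and frame i - 1 (frame 0 uses itself twice).
    """
    n_frames, _, dim = q.shape
    out = np.empty_like(v)
    for i in range(n_frames):
        prev = max(i - 1, 0)
        # Concatenate keys/values of both reference frames along the token axis.
        k_ref = np.concatenate([k[0], k[prev]], axis=0)
        v_ref = np.concatenate([v[0], v[prev]], axis=0)
        scores = q[i] @ k_ref.T / np.sqrt(dim)
        out[i] = softmax(scores, axis=-1) @ v_ref
    return out
```

Because every frame draws its values from a shared first frame, appearance information is anchored to a common source, while the previous-frame reference keeps neighbouring frames close. Under this assumed design, that is one plausible way to limit temporal drift without any training.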
Results and Discussions: ASTIG and the baseline methods are evaluated on 7,236 high-quality aerial spatio-temporal descriptions using six automated metrics: subject consistency, background consistency, temporal flickering, motion smoothness, aesthetic quality, and imaging quality. Quantitative results (Tables 1 and 2) show that ASTIG outperforms the baseline methods in spatio-temporal image generation, with improvements of 3.91% in subject consistency and 4.57% in motion smoothness over the frame-prompt baseline. Qualitative comparisons (Fig. 4) further show its strong ability to model long-term surface evolution in aerial imagery. Ablation studies validate the individual effectiveness of the LB strategy and the TAA mechanism (Table 3 and Fig. 5). Sensitivity analyses of the intervention steps (Table 4 and Fig. 6) and binding strength (Table 5 and Fig. 7) further identify suitable parameter settings. Extension experiments from satellite perspectives (Figs. 8 and 9) also show that ASTIG has the potential to generalize beyond aerial platforms to broader Earth observation scenarios.

Conclusions: This paper proposes ASTIG, a training-free framework for aerial spatio-temporal image generation that addresses the scarcity of high-quality long-term temporal data and semantic-visual misalignment. By leveraging the generative priors of pre-trained latent diffusion models and LLMs, ASTIG integrates a dynamic semantic decomposition process, an LB strategy, and a TAA mechanism to improve temporal semantic construction, semantic response precision, and inter-frame consistency. Experimental results show that ASTIG outperforms existing baseline methods across multiple automated evaluation metrics. As a training-free method, ASTIG is still limited by the prior knowledge of the backbone model. Future work will examine geometric correction and nadir-view prior constraints to better align the generated results with the physical properties of satellite imagery.