SHANG Yuying, HOU Yingyan, LIU Zinan, LU Wanxuan, HUANG Yuhong, WANG Yixiao, YU Hongfeng, FU Kun. Aerial Spatio-Temporal Image Generation via Latent Diffusion Model[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT260165
Aerial Spatio-Temporal Image Generation via Latent Diffusion Model

doi: 10.11999/JEIT260165 cstr: 32379.14.JEIT260165
  • Received Date: 2026-02-06
  • Accepted Date: 2026-04-09
  • Rev Recd Date: 2026-03-09
  • Available Online: 2026-04-13
Objective  Aerial Earth observation plays a pivotal role in environmental monitoring, disaster warning, and urban planning. However, constrained by flight-platform endurance, mission-window timeliness, and other operational limitations, acquired aerial imagery often fails to fully characterize the long-term evolution of the Earth's surface. Although pre-trained diffusion models have demonstrated considerable potential for image generation, their application in the aerial domain remains challenging owing to the scarcity of high-quality temporally annotated data and the semantic–visual misalignment arising from variable observation scales. To address these challenges, this paper proposes ASTIG, a training-free framework for Aerial Spatio-Temporal Image Generation. By leveraging the generative priors of pre-trained latent diffusion models and large language models, ASTIG establishes a novel paradigm for semantically controllable aerial temporal image generation.

Methods  ASTIG comprises three synergistically designed components. First, a dynamic semantic decomposition process automatically parses complex descriptions of aerial scene evolution into frame-level visual prompts, compensating for the lack of temporal semantic annotations in existing aerial image-text datasets. Second, a linguistic binding strategy establishes explicit associations between key ground objects and their corresponding visual attributes within the cross-attention mechanism of the diffusion model, thereby improving the semantic response precision of the generated imagery. Third, a temporal anchor attention mechanism employs dual reference frames to enforce subject stability and background consistency across the generated temporal images, effectively suppressing inter-frame temporal drift under training-free conditions.
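To illustrate the dynamic semantic decomposition step, the sketch below shows one way a large language model could be prompted to split a scene-evolution description into frame-level visual prompts. The `llm_complete` callable, the JSON output format, and the stub response are illustrative assumptions, not the paper's actual interface.

```python
import json

def decompose_description(description, num_frames, llm_complete):
    """Ask an LLM to split an aerial scene-evolution description into
    per-frame visual prompts (a sketch of dynamic semantic decomposition;
    `llm_complete` is a hypothetical text-in/text-out LLM callable)."""
    instruction = (
        f"Decompose the following aerial scene evolution into "
        f"{num_frames} frame-level visual prompts, returned as a "
        f"JSON list of strings, ordered in time.\n"
        f"Description: {description}"
    )
    raw = llm_complete(instruction)
    frames = json.loads(raw)
    if len(frames) != num_frames:
        raise ValueError("LLM returned the wrong number of frame prompts")
    return frames

# Usage with a stub standing in for the real LLM:
stub = lambda _prompt: json.dumps([
    "aerial view of a bare construction site",
    "aerial view of building foundations and cranes",
    "aerial view of a completed residential block",
])
prompts = decompose_description(
    "a construction site evolving into a residential block", 3, stub)
```

In practice the parsed prompts would each condition one denoising pass of the latent diffusion model, giving the per-frame semantics that existing aerial image-text datasets lack.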
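The linguistic binding strategy can be pictured as a re-weighting of cross-attention: each attribute token's attention is pulled toward its bound object token so the two co-localize spatially. The NumPy sketch below is a simplified stand-in; the shapes, the `bindings` map, and the `strength` parameter are assumptions for illustration rather than the method's exact formulation.

```python
import numpy as np

def bind_attributes(attn_logits, bindings, strength=2.0):
    """Amplify attribute-token attention toward bound object tokens.

    attn_logits: array of shape (num_queries, num_text_tokens), the
        pre-softmax cross-attention logits for one diffusion step.
    bindings: dict mapping object token index -> attribute token index.
    strength: assumed binding weight controlling how strongly the
        attribute map is pulled toward the object map.
    """
    out = attn_logits.copy()
    for obj_idx, attr_idx in bindings.items():
        # Add a scaled copy of the object token's logits to the attribute
        # token's column, encouraging spatially aligned attention maps.
        out[:, attr_idx] += strength * attn_logits[:, obj_idx]
    return out
```

Intervening this way on a few early denoising steps is typically enough to fix attribute-object mismatches without retraining, which is consistent with the training-free setting.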
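The temporal anchor attention mechanism with dual reference frames can be sketched as extended self-attention: the current frame's queries attend over its own tokens plus a fixed anchor frame (for subject stability) and the previous frame (for background continuity). The single-head NumPy form below is a deliberate simplification under those assumptions.

```python
import numpy as np

def anchor_attention(q, k_self, v_self, k_anchor, v_anchor, k_prev, v_prev):
    """Single-head sketch of temporal anchor attention.

    q: queries of the current frame, shape (tokens, dim).
    (k_self, v_self): keys/values of the current frame.
    (k_anchor, v_anchor): keys/values of a fixed anchor frame.
    (k_prev, v_prev): keys/values of the previous frame.
    """
    # Concatenate the two reference frames into the key/value set so the
    # current frame is denoised while "looking at" both references.
    k = np.concatenate([k_self, k_anchor, k_prev], axis=0)
    v = np.concatenate([v_self, v_anchor, v_prev], axis=0)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Numerically stable softmax over the extended token axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because the anchor frame is shared by every generated frame while the previous frame slides forward, this construction gives each frame a common reference to suppress drift without any fine-tuning.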
Results and Discussions  ASTIG and baseline models are evaluated on 7,236 high-quality aerial spatio-temporal descriptions across six automated metrics: subject consistency, background consistency, temporal flickering, motion smoothness, aesthetic quality, and imaging quality. Quantitative results (Tables 1 and 2) demonstrate the superiority of ASTIG in temporal image generation, with improvements of 3.91% in subject consistency and 4.57% in temporal smoothness over the frame-prompt baseline. Qualitative comparisons (Fig. 4) further highlight its robustness in modeling long-term geospatial evolution. Ablation studies validate the individual effectiveness of the linguistic binding strategy and the temporal anchor attention mechanism (Table 3 and Fig. 5). Sensitivity analyses of the intervention steps (Table 4 and Fig. 6) and the binding strength (Table 5 and Fig. 7) systematically explore optimal parameter configurations. Furthermore, extensibility experiments under satellite perspectives (Figs. 8 and 9) reveal the potential of ASTIG to generalize beyond aerial platforms to broader Earth-observation scenarios.

Conclusions  This paper proposes ASTIG, a training-free framework for aerial spatio-temporal image generation that addresses the scarcity of high-quality long-term temporal data and semantic–visual misalignment. By leveraging the generative priors of pre-trained latent diffusion models and large language models, ASTIG integrates a dynamic semantic decomposition process, a linguistic binding strategy, and a temporal anchor attention mechanism to jointly address temporal semantic construction, semantic response precision, and inter-frame consistency. Experimental results demonstrate that ASTIG outperforms existing baselines across multiple automated evaluation metrics, offering a novel paradigm for aerial spatio-temporal image generation. As a training-free method, the performance of ASTIG is inherently bounded by the prior knowledge of the backbone model; future work will explore geometric correction and nadir-view prior constraints to better align generated results with the physical properties of satellite imagery.

    Figures(9)  / Tables(5)
