Aerial Spatio-Temporal Image Generation via Latent Diffusion Models

SHANG Yuying; HOU Yingyan; LIU Zinan; LU Wanxuan; HUANG Yuhong; WANG Yixiao; YU Hongfeng; FU Kun

doi:10.11999/JEIT260165

SHANG Yuying, HOU Yingyan, LIU Zinan, LU Wanxuan, HUANG Yuhong, WANG Yixiao, YU Hongfeng, FU Kun. Aerial Spatio-Temporal Image Generation via Latent Diffusion Models[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT260165

cstr: 32379.14.JEIT260165

SHANG Yuying^{1, 2, 3, 4},
HOU Yingyan^{1, 2, 3, 4},
LIU Zinan^{1, 2, 3},
LU Wanxuan^{1, 4},
HUANG Yuhong^{5, 1, 3, 4},
WANG Yixiao^{1, 2, 3, 4},
YU Hongfeng^{1, 4},
FU Kun^{1, 3, 4}

1. Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, 100094, China
2. School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing, 100190, China
3. University of Chinese Academy of Sciences, Beijing, 100190, China
4. The Key Laboratory of Target Cognition and Application Technology (TCAT), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, 100094, China
5. School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences, Beijing, 100190, China

Funds: The Strategic Priority Research Program of the Chinese Academy of Sciences (XDB1600000, XDB1600102)

Received Date: 2026-02-06
Revised Date: 2026-03-09
Accepted Date: 2026-04-09

Available Online: 2026-04-13

Abstract

Objective: Aerial Earth observation plays a pivotal role in environmental monitoring, disaster warning, and urban planning. However, constraints such as flight-platform endurance and mission-window timeliness often prevent acquired aerial imagery from fully characterizing the long-term evolution of the Earth's surface. Although pre-trained latent diffusion models have shown strong potential for image generation, their application in aerial scenarios remains challenging because of the scarcity of high-quality temporal annotation data and semantic-visual misalignment caused by variable observation scales. To address these challenges, this paper proposes ASTIG, a training-free framework for Aerial Spatio-Temporal Image Generation. By leveraging the generative priors of pre-trained latent diffusion models and Large Language Models (LLMs), ASTIG provides a new paradigm for semantically controllable aerial spatio-temporal image generation.

Methods: ASTIG consists of three coordinated components. First, a dynamic semantic decomposition process is proposed to parse complex descriptions of aerial scene evolution into frame-level visual prompts, thereby compensating for the lack of temporal semantic annotations in existing aerial image-text datasets. Second, a Linguistic Binding (LB) strategy is proposed to establish explicit associations between key ground objects and their corresponding visual attributes within the cross-attention mechanism of the diffusion model, thereby improving the semantic response precision of the generated images. Third, a Temporal Anchor Attention (TAA) mechanism is incorporated. It uses dual reference frames to maintain subject stability and background consistency across the generated spatio-temporal image sequence, thus suppressing inter-frame temporal drift under training-free conditions.

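The idea of attending to dual reference frames can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say which two frames serve as anchors, so the choice below (the first frame and the immediately preceding frame) is an assumption, as are the function name `temporal_anchor_attention` and the tensor layout. In the sketch, each frame keeps its own queries but takes its keys and values from the two anchor frames, so appearance is pulled toward shared references instead of drifting frame by frame.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_anchor_attention(q, k, v):
    """Hypothetical anchor attention over a frame sequence.

    q, k, v: arrays of shape (num_frames, num_tokens, dim).
    Frame t attends to keys/values of two anchor frames.
    ASSUMPTION: the anchors are frame 0 and frame t-1; the source
    only states that "dual reference frames" are used.
    """
    num_frames, num_tokens, dim = q.shape
    out = np.empty((num_frames, num_tokens, v.shape[-1]))
    for t in range(num_frames):
        anchors = (0, max(t - 1, 0))
        # Concatenate anchor keys/values along the token axis: (2 * tokens, dim).
        k_ref = np.concatenate([k[a] for a in anchors], axis=0)
        v_ref = np.concatenate([v[a] for a in anchors], axis=0)
        # Scaled dot-product attention of frame t's queries over the anchors.
        weights = softmax(q[t] @ k_ref.T / np.sqrt(dim), axis=-1)
        out[t] = weights @ v_ref
    return out


rng = np.random.default_rng(0)
q = rng.normal(size=(4, 6, 8))
k = rng.normal(size=(4, 6, 8))
v = rng.normal(size=(4, 6, 8))
out = temporal_anchor_attention(q, k, v)
print(out.shape)  # (4, 6, 8)
```

Because no weights are learned, a substitution of this kind can be applied inside a pre-trained model's attention layers at inference time, which is consistent with the training-free setting described above.
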
Results and Discussions: ASTIG and the baseline methods are evaluated on 7,236 high-quality aerial spatio-temporal descriptions using six automated metrics: subject consistency, background consistency, temporal flickering, motion smoothness, aesthetic quality, and imaging quality. Quantitative results (Tables 1 and 2) show that ASTIG outperforms the baseline methods in spatio-temporal image generation, with improvements of 3.91% in subject consistency and 4.57% in motion smoothness over the frame-prompt baseline. Qualitative comparisons (Fig. 4) further show its strong ability to model long-term surface evolution in aerial imagery. Ablation studies validate the individual effectiveness of the LB strategy and the TAA mechanism (Table 3 and Fig. 5). Sensitivity analyses of the intervention steps (Table 4 and Fig. 6) and binding strength (Table 5 and Fig. 7) further identify suitable parameter settings. Extension experiments from satellite perspectives (Figs. 8 and 9) also show that ASTIG has the potential to generalize beyond aerial platforms to broader Earth observation scenarios.

Conclusions: This paper proposes ASTIG, a training-free framework for aerial spatio-temporal image generation that addresses the scarcity of high-quality long-term temporal data and semantic-visual misalignment. By leveraging the generative priors of pre-trained latent diffusion models and LLMs, ASTIG integrates a dynamic semantic decomposition process, an LB strategy, and a TAA mechanism to improve temporal semantic construction, semantic response precision, and inter-frame consistency. Experimental results show that ASTIG outperforms existing baseline methods across multiple automated evaluation metrics, providing a new paradigm for aerial spatio-temporal image generation. As a training-free method, ASTIG is still limited by the prior knowledge of the backbone model. Future work will examine geometric correction and nadir-view prior constraints to better align the generated results with the physical properties of satellite imagery.
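To make one of the evaluation metrics named in the abstract more concrete, the sketch below computes a simple frame-difference proxy for temporal flickering. It is only an illustration under stated assumptions: the abstract does not define how the metric is computed, and the function name `flicker_score`, the 8-bit pixel range, and the normalization to [0, 1] are all hypothetical choices made here, not the benchmark implementation used in the paper.

```python
import numpy as np


def flicker_score(frames):
    """Frame-difference proxy for temporal flickering (hypothetical).

    frames: array of shape (num_frames, height, width, channels)
            with pixel values in [0, 255] (ASSUMPTION: 8-bit images).
    Returns a value in [0, 1]; 1.0 means consecutive frames are identical.
    """
    frames = np.asarray(frames, dtype=np.float64)
    if frames.shape[0] < 2:
        raise ValueError("need at least two frames")
    # Mean absolute difference between each pair of consecutive frames.
    mae = np.abs(frames[1:] - frames[:-1]).mean()
    return 1.0 - mae / 255.0


# A static sequence scores 1.0; a sequence alternating between
# black and white frames scores 0.0.
static = np.full((3, 4, 4, 3), 128, dtype=np.uint8)
alternating = np.stack(
    [np.full((4, 4, 3), 255 * (i % 2), dtype=np.uint8) for i in range(4)]
)
```

A score of this kind rewards smooth sequences but cannot tell intended scene change from unwanted jitter, which is one reason it is reported alongside separate consistency and smoothness metrics.
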
Keywords:
- Aerial spatio-temporal image generation
- Latent diffusion models
- Text-to-image generation
- Large Language Models
- Training-free generation
