
Aerial Spatio-Temporal Image Generation via Latent Diffusion Model

SHANG Yuying, HOU Yingyan, LIU Zinan, LU Wanxuan, HUANG Yuhong, WANG Yixiao, YU Hongfeng, FU Kun

Citation: SHANG Yuying, HOU Yingyan, LIU Zinan, LU Wanxuan, HUANG Yuhong, WANG Yixiao, YU Hongfeng, FU Kun. Aerial Spatio-Temporal Image Generation via Latent Diffusion Model[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT260165


doi: 10.11999/JEIT260165 cstr: 32379.14.JEIT260165
Details
    Author biographies:

    SHANG Yuying: female, Ph.D. candidate; research interests: vision-language understanding and multimodal semantic alignment

    HOU Yingyan: female, Ph.D. candidate; research interest: remote sensing image interpretation and generation

    LIU Zinan: male, Ph.D. candidate; research interest: 3D reconstruction from SAR images

    LU Wanxuan: female, associate researcher; research interest: remote sensing image interpretation

    HUANG Yuhong: female, Ph.D. candidate; research interest: remote sensing image interpretation

    WANG Yixiao: male, researcher; research interest: remote sensing image interpretation

    YU Hongfeng: male, associate researcher; research interest: remote sensing image interpretation

    FU Kun: male, researcher; research interests: remote sensing image interpretation and geospatial big data mining

    Corresponding author:

    HOU Yingyan, houyy@aircas.ac.cn

  • CLC number: TP391/TP183

Aerial Spatio-Temporal Image Generation via Latent Diffusion Model

  • Abstract: Aerial Earth observation plays a key role in environmental monitoring, disaster early warning, and urban planning. However, constrained by the endurance of flight platforms and the limited time windows of missions, the aerial images acquired in practice rarely capture the full long-term evolution of the land surface. Although pretrained diffusion models have shown great promise for image generation, their application in the aerial domain still faces serious challenges. On the one hand, aerial observation is restricted by weather and flight conditions, so high-quality long-sequence annotated samples are difficult to obtain and cannot support effective model training. On the other hand, aerial platforms fly at flexible, varying altitudes and ground objects span a wide range of scales, making it hard to match textual semantics precisely with multi-scale visual features and causing semantic-visual alignment bias. To address these challenges, this paper proposes a training-free framework for Aerial Spatio-Temporal Image Generation (ASTIG). The framework first uses a vision-language model to construct aerial temporal descriptions of land-surface evolution, and introduces a dynamic semantic decomposition process that converts complex temporal descriptions into frame-level visual prompts, providing fine-grained semantic guidance at inference time. It then proposes a linguistic binding strategy that establishes, within the cross-attention mechanism of the diffusion model, explicit associations between key ground objects in the text and their corresponding visual attributes, strengthening the model's ability to capture semantic features. Finally, it integrates a temporal anchor attention mechanism that uses dual reference-frame constraints to keep subjects and backgrounds consistent across frames, effectively suppressing inter-frame content drift. Qualitative and quantitative experiments show that the proposed method outperforms baseline models in both temporal continuity and semantic fidelity, offering a new paradigm for aerial spatio-temporal image generation.
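    The linguistic binding strategy summarized above ties an attribute token's cross-attention map to the ground-object token it modifies. The sketch below illustrates the general idea on a single attention map; it is not the paper's implementation, and the function name, the interpolation form, and the exact meaning of the binding strength `eta` are assumptions made for illustration only.

    ```python
    import numpy as np

    def bind_attention(attn, entity_idx, attr_idx, eta=0.5):
        """Hypothetical linguistic-binding step on one cross-attention map.

        attn: (num_pixels, num_tokens) array of per-pixel attention over
        text tokens; entity_idx is the ground-object token (e.g. "dome"),
        attr_idx its attribute token (e.g. "golden"). eta in [0, 1] pulls
        the attribute token's spatial attention toward the entity token's,
        so the attribute is rendered on the object it describes.
        """
        out = attn.copy()
        out[:, attr_idx] = (1 - eta) * attn[:, attr_idx] + eta * attn[:, entity_idx]
        # Renormalize so each pixel's token weights still sum to 1.
        out /= out.sum(axis=1, keepdims=True)
        return out
    ```

    With eta = 0 the map is unchanged; larger eta (cf. the ablation over η in Table 5) binds the attribute more tightly to the entity's spatial support.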
  • Fig. 1  The two core difficulties of aerial spatio-temporal image generation

    Fig. 2  Overview of the ASTIG framework

    Fig. 3  Collection pipeline for aerial temporal description data

    Fig. 4  Qualitative comparison of methods for the text prompt "Summer-to-autumn transition: a rocky peninsula extends into the sea, with a circular flat-topped building visible at its tip"

    Fig. 5  Qualitative module ablation results for the text prompt "Snowy scene at dusk: a grid-like building cluster, with a landmark building showing a golden dome and classical architectural elements"

    Fig. 6  Qualitative ablation of the intervention steps $ {N}_{inv} $ for the text prompt "Wave activity at a cliff"

    Fig. 7  Qualitative ablation of the binding strength $ \eta $ for the text prompt "Dense urban skyline at sunset"

    Fig. 8  Comparison of text-driven image generation between ASTIG and Text2Earth from the satellite view

    Fig. 9  Temporal image generation results of ASTIG for various land-surface change scenarios from the satellite view

    Table 1  Quantitative comparison between the proposed method and baseline methods under the single-prompt setting

    | Method | SC | BC | TF | MS | IQ | AQ |
    | LaVie[19] | 94.87 | 96.31 | 95.12 | 96.89 | 63.15 | 50.23 |
    | AnimateLCM[34] | 96.13 | 97.68 | 94.07 | 95.94 | 53.48 | 52.87 |
    | CogVideoX[20] | 95.58 | 97.48 | 98.54 | 98.76 | 49.57 | 48.83 |
    | VideoElevator (LaVie)[35] | 92.34 | 95.18 | 93.21 | 93.45 | 65.53 | 60.18 |
    | VideoElevator (AnimateLCM)[35] | 93.96 | 96.23 | 92.43 | 93.57 | 57.14 | 58.23 |
    | Text2Video-Zero[5] | 89.55 | 93.89 | 85.03 | 87.64 | 74.82 | 51.23 |
    | $ {\text{Free-Bloom}}_{\text{s}} $[22] | 96.68 | 97.13 | 93.89 | 94.21 | 77.19 | 61.36 |
    | $ {\text{ASTIG}}_{\text{s}} $ | 97.01 | 97.34 | 96.47 | 97.24 | 78.03 | 61.72 |

    Table 2  Quantitative comparison between the proposed method and baseline methods under the per-frame prompt setting

    | Method | SC | BC | TF | MS | IQ | AQ |
    | Free-Bloom | 92.47 | 94.39 | 92.14 | 93.38 | 76.58 | 62.37 |
    | ASTIG | 96.38 | 95.87 | 96.71 | 96.02 | 77.03 | 62.80 |

    Table 3  Quantitative ablation results for each module

    | Method | SC | BC | TF | MS | IQ | AQ |
    | ASTIG | 96.02 | 95.87 | 96.71 | 96.02 | 77.03 | 62.80 |
    | w/o LB | 95.53 | 95.67 | 95.84 | 95.12 | 75.21 | 57.58 |
    | w/o TAA | 72.15 | 85.18 | 86.84 | 88.67 | 76.82 | 56.23 |
    | w/o $ {\text{Att}}_{\text{former}} $ | 96.38 | 95.83 | 96.38 | 95.74 | 76.98 | 61.36 |
    | w/o $ {\text{Att}}_{\text{first}} $ | 89.21 | 91.58 | 94.72 | 96.53 | 75.98 | 59.78 |
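    The temporal anchor attention (TAA) module ablated above constrains each generated frame with two reference frames. The sketch below illustrates one plausible dual-reference attention step, assuming, as in related zero-shot video work such as Text2Video-Zero [5], that the current frame's queries attend jointly to the keys/values of the first frame (global anchor) and of the previous frame (local anchor). The function names and the concatenation form are illustrative assumptions, not the paper's exact design.

    ```python
    import numpy as np

    def softmax(x, axis=-1):
        # Numerically stable softmax along the given axis.
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def temporal_anchor_attention(q, k_first, v_first, k_prev, v_prev):
        """Hypothetical dual-reference attention: the current frame's
        queries attend to keys/values of the first frame (stabilizing the
        background across the whole sequence) and of the previous frame
        (smoothing frame-to-frame content drift)."""
        k = np.concatenate([k_first, k_prev], axis=0)
        v = np.concatenate([v_first, v_prev], axis=0)
        scores = q @ k.T / np.sqrt(q.shape[-1])
        return softmax(scores, axis=-1) @ v
    ```

    Removing either anchor corresponds to the "w/o $ {\text{Att}}_{\text{first}} $" and "w/o $ {\text{Att}}_{\text{former}} $" rows in Table 3.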

    Table 4  Quantitative ablation results for the intervention steps $ {N}_{inv} $

    | $ {N}_{inv} $ | SC | BC | TF | MS | IQ | AQ |
    | 1 | 96.38 | 95.87 | 96.71 | 96.02 | 77.03 | 62.80 |
    | 2 | 96.78 | 97.56 | 96.41 | 97.26 | 76.86 | 59.65 |
    | 3 | 97.41 | 97.86 | 96.81 | 97.56 | 75.66 | 58.77 |
    | 4 | 97.84 | 97.91 | 97.31 | 97.89 | 74.64 | 56.46 |

    Table 5  Quantitative ablation results for the binding strength $ \eta $

    | $ \eta $ | SC | BC | TF | MS | IQ | AQ |
    | 0 | 96.43 | 95.47 | 95.42 | 96.81 | 76.03 | 60.09 |
    | 30 | 96.58 | 95.28 | 96.54 | 97.33 | 76.92 | 61.93 |
    | 50 | 96.38 | 95.87 | 96.71 | 96.02 | 77.54 | 62.80 |
    | 70 | 96.52 | 96.39 | 95.56 | 97.67 | 77.31 | 61.75 |
    | 100 | 95.97 | 96.28 | 95.43 | 97.02 | 76.35 | 61.86 |
  • [1] MOHSAN S A H, OTHMAN N Q H, LI Yanlong, et al. Unmanned aerial vehicles (UAVs): Practical aspects, applications, open challenges, security issues, and future trends[J]. Intelligent Service Robotics, 2023, 16(1): 109–137. doi: 10.1007/s11370-022-00452-4.
    [2] LIU Yidan, YUE Jun, XIA Shaobo, et al. Diffusion models meet remote sensing: Principles, methods, and perspectives[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 4708322. doi: 10.1109/TGRS.2024.3464685.
    [3] ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 10674–10685. doi: 10.1109/CVPR52688.2022.01042.
    [4] ZHAO Hong and LI Wengai. Text-to-image generation model based on diffusion Wasserstein generative adversarial networks[J]. Journal of Electronics & Information Technology, 2023, 45(12): 4371–4381. doi: 10.11999/JEIT221400.
    [5] KHACHATRYAN L, MOVSISYAN A, TADEVOSYAN V, et al. Text2Video-zero: Text-to-image diffusion models are zero-shot video generators[C]. IEEE/CVF International Conference on Computer Vision, Paris, France, 2023: 15908–15918. doi: 10.1109/ICCV51070.2023.01462.
    [6] WU J Z, GE Yixiao, WANG Xintao, et al. Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation[C]. IEEE/CVF International Conference on Computer Vision, Paris, France, 2023: 7589–7599. doi: 10.1109/ICCV51070.2023.00701.
    [7] KINGMA D P and WELLING M. Auto-encoding variational Bayes[C]. The 2nd International Conference on Learning Representations, Banff, Canada, 2014.
    [8] GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks[J]. Communications of the ACM, 2020, 63(11): 139–144. doi: 10.1145/3422622.
    [9] SONG Miao, CHEN Zhiqiang, WANG Peisong, et al. DetDiffRS: A detail-enhanced diffusion model for remote sensing image super-resolution[J]. Journal of Electronics & Information Technology, 2025, 47(12): 4763–4778. doi: 10.11999/JEIT250995.
    [10] BEJIGA M B, MELGANI F, and VASCOTTO A. Retro-remote sensing: Generating images from ancient texts[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019, 12(3): 950–960. doi: 10.1109/JSTARS.2019.2895693.
    [11] CHRISTIE G, FENDLEY N, WILSON J, et al. Functional map of the world[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 6172–6180. doi: 10.1109/CVPR.2018.00646.
    [12] VAN ETTEN A, LINDENBAUM D, and BACASTOW T M. SpaceNet: A remote sensing dataset and challenge series[J]. arXiv preprint arXiv: 1807.01232, 2018. doi: 10.48550/arXiv.1807.01232.
    [13] LIU Chenyang, CHEN Keyan, ZHAO Rui, et al. Text2Earth: Unlocking text-driven remote sensing image generation with a global-scale dataset and a foundation model[J]. IEEE Geoscience and Remote Sensing Magazine, 2025, 13(3): 238–259. doi: 10.1109/MGRS.2025.3560455.
    [14] TANG Datao, CAO Xiangyong, WU Xuan, et al. AeroGen: Enhancing remote sensing object detection with diffusion-driven data generation[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2025: 3614–3624. doi: 10.1109/CVPR52734.2025.00342.
    [15] DENG Zijun, HE Xiangteng, and PENG Yuxin. Text-to-video generation: Research status, progress and challenges[J]. Journal of Electronics & Information Technology, 2024, 46(5): 1632–1644. doi: 10.11999/JEIT240074.
    [16] QASIM I, HORSCH A, and PRASAD D. Dense video captioning: A survey of techniques, datasets and evaluation protocols[J]. ACM Computing Surveys, 2025, 57(6): 154. doi: 10.1145/3712059.
    [17] HO J, SALIMANS T, GRITSENKO A, et al. Video diffusion models[C]. Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 628.
    [18] BLATTMANN A, ROMBACH R, LING Huan, et al. Align your latents: High-resolution video synthesis with latent diffusion models[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 22563–22575. doi: 10.1109/CVPR52729.2023.02161.
    [19] WANG Yaohui, CHEN Xinyuan, MA Xin, et al. LaVie: High-quality video generation with cascaded latent diffusion models[J]. International Journal of Computer Vision, 2025, 133(5): 3059–3078. doi: 10.1007/s11263-024-02295-1.
    [20] YANG Zhuoyi, TENG Jiayan, ZHENG Wendi, et al. CogVideoX: Text-to-video diffusion models with an expert transformer[C]. The 13th International Conference on Learning Representations, Singapore, Singapore, 2025.
    [21] GUO Yuwei, YANG Ceyuan, RAO Anyi, et al. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning[C]. The 12th International Conference on Learning Representations, Vienna, Austria, 2024.
    [22] HUANG Hanzhuo, FENG Yufan, SHI Cheng, et al. Free-bloom: Zero-shot text-to-video generator with LLM director and LDM animator[C]. Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, 2023: 1138.
    [23] ZHU Hanxin, HE Tianyu, TANG Anni, et al. Compositional 3D-aware video generation with LLM director[C]. Proceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2024: 4184.
    [24] MA Yong, HAN Huasong, SHEN Shiyuan, et al. T2CV-zero: Zero-shot character video generation via text-to-motion model[C]. International Joint Conference on Neural Networks, Rome, Italy, 2025: 1–8. doi: 10.1109/IJCNN64981.2025.11229098.
    [25] LI Zhen, LI Chuanhao, MAO Xiaofeng, et al. Sekai: A video dataset towards world exploration[J]. arXiv preprint arXiv: 2506.15675, 2025. doi: 10.48550/arXiv.2506.15675.
    [26] WEN Longyin, DU Dawei, ZHU Pengfei, et al. Detection, tracking, and counting meets drones in crowds: A benchmark[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 7808–7817. doi: 10.1109/CVPR46437.2021.00772.
    [27] MOU Lichao, HUA Yuansheng, JIN Pu, et al. ERA: A data set and deep learning benchmark for event recognition in aerial videos [software and data sets][J]. IEEE Geoscience and Remote Sensing Magazine, 2020, 8(4): 125–133. doi: 10.1109/MGRS.2020.3005751.
    [28] WEI J, WANG Xuezhi, SCHUURMANS D, et al. Chain-of-thought prompting elicits reasoning in large language models[C]. Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 1800.
    [29] RASSIN R, HIRSCH E, GLICKMAN D, et al. Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment[C]. Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, 2023: 157.
    [30] SONG Jiaming, MENG Chenlin, and ERMON S. Denoising diffusion implicit models[C]. The 9th International Conference on Learning Representations, Virtual Event, 2021.
    [31] BAI Shuai, CAI Yuxuan, CHEN Ruizhe, et al. Qwen3-VL technical report[J]. arXiv preprint arXiv: 2511.21631, 2025. doi: 10.48550/arXiv.2511.21631.
    [32] Gemini Team Google. Gemini: A family of highly capable multimodal models[J]. arXiv preprint arXiv: 2312.11805, 2025. doi: 10.48550/arXiv.2312.11805.
    [33] HUANG Ziqi, HE Yinan, YU Jiashuo, et al. VBench: Comprehensive benchmark suite for video generative models[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 21807–21818. doi: 10.1109/CVPR52733.2024.02060.
    [34] WANG Fuyun, HUANG Zhaoyang, BIAN Weikang, et al. AnimateLCM: Computation-efficient personalized style video generation without personalized video data[C]. SIGGRAPH Asia 2024 Technical Communications, Tokyo, Japan, 2024: 23. doi: 10.1145/3681758.3698013.
    [35] ZHANG Yabo, WEI Yuxiang, LIN Xianhui, et al. VideoElevator: Elevating video generation quality with versatile text-to-image diffusion models[C]. Proceedings of the 39th AAAI Conference on Artificial Intelligence, Philadelphia, USA, 2025: 10266–10274. doi: 10.1609/aaai.v39i10.33114.
Figures (9) / Tables (5)
Publication history
  • Received: 2026-02-06
  • Revised: 2026-03-09
  • Accepted: 2026-04-09
  • Published online: 2026-04-13
