Multi-code Deep Fusion Attention Generative Adversarial Networks for Text-to-Image Synthesis

GU Guanghua, SUN Wenxing, YI Boyu

Citation: GU Guanghua, SUN Wenxing, YI Boyu. Multi-code Deep Fusion Attention Generative Adversarial Networks for Text-to-Image Synthesis[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250516

doi: 10.11999/JEIT250516  cstr: 32379.14.JEIT250516
Funds: The National Natural Science Foundation of China (62072394), The Hebei Provincial Natural Science Foundation (F2024203049)
About the authors:

    GU Guanghua: male, professor; research interests include image semantic understanding and cross-modal multimedia retrieval

    SUN Wenxing: female, master's student; research interest is text-to-image generation

    YI Boyu: male, master's student; research interest is computer vision

    Corresponding author: GU Guanghua, guguanghua@ysu.edu.cn

  • CLC number: TP391
  • Abstract: Text-to-image synthesis is a highly challenging cross-modal task whose core goal is to generate high-quality, detail-rich images that are closely consistent with a given text description. Most current generative adversarial network (GAN) based methods rely on a single noise input, which limits the fine-grained detail of the generated images; in addition, word-level features are not fully exploited, which restricts the precision of semantic alignment between text and image. To address these issues, this paper proposes a multi-code deep feature fusion generative adversarial network (mDFA-GAN). The method designs a multi-noise-input generator and a multi-code prior fusion module to enhance the detail of generated images, introduces a multi-head attention mechanism into the generator to align words with image sub-regions from multiple perspectives and strengthen semantic consistency, and further proposes a multi-code prior fusion loss to stabilize training. Experimental results on the CUB and COCO datasets show that the proposed method outperforms current mainstream GAN-based methods on both the IS and FID metrics and produces more realistic images with richer detail and stronger semantic consistency.
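To make the two architectural ideas in the abstract concrete, the following is a minimal PyTorch sketch of (a) fusing several noise codes into a single prior vector and (b) multi-head cross-attention in which image sub-region features attend to word features. All module names, dimensions, and the fusion MLP are illustrative assumptions for this sketch, not the authors' released implementation.

```python
# Minimal sketch: multi-code prior fusion + word/region multi-head attention.
# Dimensions and layer choices are assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class MultiCodePriorFusion(nn.Module):
    """Fuse n independent noise codes z_1..z_n into one prior vector."""
    def __init__(self, noise_dim=100, n_codes=3, out_dim=256):
        super().__init__()
        self.noise_dim, self.n_codes = noise_dim, n_codes
        self.fuse = nn.Sequential(                      # assumed fusion MLP
            nn.Linear(noise_dim * n_codes, out_dim),
            nn.LeakyReLU(0.2),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, batch_size, device="cpu"):
        # sample n noise codes, concatenate them, and fuse into one prior
        z = torch.randn(batch_size, self.n_codes * self.noise_dim, device=device)
        return self.fuse(z)

class WordRegionAttention(nn.Module):
    """Multi-head cross-attention: image sub-regions (queries) attend to words (keys/values)."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, region_feats, word_feats):
        # region_feats: [B, H*W, dim], word_feats: [B, T, dim]
        attended, _ = self.attn(region_feats, word_feats, word_feats)
        return region_feats + attended              # residual fusion of text into image features

if __name__ == "__main__":
    B = 4
    prior = MultiCodePriorFusion()(B)               # [4, 256] fused noise prior
    regions = torch.randn(B, 64, 256)               # 8x8 feature map, flattened
    words = torch.randn(B, 18, 256)                 # 18 word embeddings per caption
    out = WordRegionAttention()(regions, words)
    print(prior.shape, out.shape)
```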
  • Fig. 1  Overall model architecture

    Fig. 2  Multi-code deep fusion module

    Fig. 3  Deep text-image fusion module

    Fig. 4  Multi-head attention structure

    Fig. 5  Discriminator structure

    Fig. 6  Images generated on the CUB dataset

    Fig. 7  Images generated on the COCO dataset

    Table 1  IS scores on the CUB dataset compared with state-of-the-art methods

    Method        Inception Score (IS)↑
    StackGAN++    4.04±0.06
    AttnGAN       4.36±0.03
    DM-GAN        4.47±0.19
    DR-GAN        4.66±0.15
    DF-GAN        4.61±0.12
    SSA-GAN       4.70±0.08
    Ours          4.82±0.10
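For reference, the IS↑ values above are derived from Inception-v3 class predictions over generated images. Below is a minimal evaluation sketch using torchmetrics; the package choice, preprocessing, and sample count are assumptions, and the paper's exact evaluation protocol may differ.

```python
# Sketch of Inception Score evaluation with torchmetrics (requires the
# torchmetrics[image] extra, i.e. torch-fidelity). The random tensor stands in
# for a batch of generated images resized to 299x299 and stored as uint8.
import torch
from torchmetrics.image.inception import InceptionScore

metric = InceptionScore(splits=10)  # 10 splits is the common protocol
fake_imgs = torch.randint(0, 255, (100, 3, 299, 299), dtype=torch.uint8)
metric.update(fake_imgs)
is_mean, is_std = metric.compute()
print(f"IS = {is_mean.item():.2f} ± {is_std.item():.2f}")
```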

    Table 2  FID scores on the CUB and COCO datasets compared with state-of-the-art methods

    Method        CUB-FID↓    COCO-FID↓
    StackGAN++    15.30       81.59
    AttnGAN       23.98       35.49
    DM-GAN        16.09       32.64
    DR-GAN        14.96       27.80
    DF-GAN        14.81       19.32
    SSA-GAN       15.61       19.37
    CogView       -           27.10
    LDM-8         -           23.31
    Ours          13.45       16.50
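The FID↓ columns measure the Fréchet distance between Inception feature statistics of real and generated images, FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}). Below is a minimal sketch of that formula applied to pre-extracted feature matrices; the Inception-v3 feature extractor and the image sets themselves are assumed and not shown.

```python
# Frechet Inception Distance from pre-extracted features. Extracting the
# 2048-d Inception-v3 pool3 activations for each image set is assumed.
import numpy as np
from scipy import linalg

def frechet_distance(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    """real_feats, fake_feats: [N, D] Inception activations of real/generated images."""
    mu_r, mu_g = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(fake_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)  # matrix square root
    covmean = np.real(covmean)                                # drop numerical imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))

# Toy usage with random features; real evaluation uses tens of thousands of images.
rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(500, 64)), rng.normal(size=(500, 64))))
```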

    Table 3  Ablation study of the model components

    Baseline    Multi-code fusion module    Multi-head attention    CUB-FID↓    COCO-FID↓
    ✓           -                           -                       14.81       19.32
    ✓           -                           ✓                       14.21       17.94
    ✓           ✓                           -                       13.75       17.19
    ✓           ✓                           ✓                       13.45       16.50

    Table 4  Selection of the number of noise codes n

    n            1        2        3        4        5
    CUB-FID↓     14.21    13.67    13.45    13.65    13.70
    COCO-FID↓    17.94    16.95    16.50    16.86    16.90

    Table 5  Inference speed comparison

    Method            DF-GAN    Stable Diffusion    DALLE2    Ours
    Inference time    0.5 s     1.4 s               2 s       0.8 s
  • [1] GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks[J]. Communications of the ACM, 2020, 63(11): 139–144. doi: 10.1145/3422622.
    [2] TAO Ming, TANG Hao, WU Fei, et al. DF-GAN: A simple and effective baseline for text-to-image synthesis[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 16494–16504. doi: 10.1109/CVPR52688.2022.01602.
    [3] XU Tao, ZHANG Pengchuan, HUANG Qiuyuan, et al. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 1316–1324. doi: 10.1109/CVPR.2018.00143.
    [4] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, USA, 2017: 6000–6010.
    [5] XUE A. End-to-end Chinese landscape painting creation using generative adversarial networks[C]. Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Waikoloa, USA, 2021: 3862–3870. doi: 10.1109/WACV48630.2021.00391.
    [6] SHAHRIAR S. GAN computers generate arts? A survey on visual arts, music, and literary text generation using generative adversarial network[J]. Displays, 2022, 73: 102237. doi: 10.1016/j.displa.2022.102237.
    [7] ISOLA P, ZHU Junyan, ZHOU Tinghui, et al. Image-to-image translation with conditional adversarial networks[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 5967–5976. doi: 10.1109/CVPR.2017.632.
    [8] ALOTAIBI A. Deep generative adversarial networks for image-to-image translation: A review[J]. Symmetry, 2020, 12(10): 1705. doi: 10.3390/sym12101705.
    [9] XIA Weihao, YANG Yujiu, XUE Jinghao, et al. TEDIGAN: Text-guided diverse face image generation and manipulation[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 2256–2265. doi: 10.1109/CVPR46437.2021.00229.
    [10] KOCASARI U, DIRIK A, TIFTIKCI M, et al. StyleMC: Multi-channel based fast text-guided image generation and manipulation[C]. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, USA, 2022: 3441–3450. doi: 10.1109/WACV51458.2022.00350.
    [11] SAHARIA C, CHAN W, CHANG H, et al. Photorealistic text-to-image diffusion models with deep language understanding[C]. IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2023: 15679–15689.
    [12] ZHANG Han, XU Tao, LI Hongsheng, et al. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks[C]. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 2017: 5908–5916. doi: 10.1109/ICCV.2017.629.
    [13] ZHANG Han, XU Tao, LI Hongsheng, et al. StackGAN++: Realistic image synthesis with stacked generative adversarial networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(8): 1947–1962. doi: 10.1109/TPAMI.2018.2856256.
    [14] LIAO Wentong, HU Kai, YANG M Y, et al. Text to image generation with semantic-spatial aware GAN[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 18166–18175. doi: 10.1109/CVPR52688.2022.01765.
    [15] TAO Ming, BAO Bingkun, TANG Hao, et al. GALIP: Generative adversarial CLIPs for text-to-image synthesis[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 14214–14223. doi: 10.1109/CVPR52729.2023.01366.
    [16] LU Cheng, ZHOU Yuhao, BAO Fan, et al. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models[J]. Machine Intelligence Research, 2025, 22(4): 730–751. doi: 10.1007/s11633-025-1562-4.
    [17] DING Ming, YANG Zhuoyi, HONG Wenyi, et al. CogView: Mastering text-to-image generation via transformers[C]. Proceedings of the 35th Conference on Neural Information Processing Systems, 2021: 19822–19835.
    [18] ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 10674–10685. doi: 10.1109/CVPR52688.2022.01042.
    [19] ZHAO Liang, HUANG Pingda, CHEN Tengtuo, et al. Multi-sentence complementarily generation for text-to-image synthesis[J]. IEEE Transactions on Multimedia, 2024, 26: 8323–8332. doi: 10.1109/TMM.2023.3297769.
    [20] DEVLIN J, CHANG Mingwei, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, USA, 2019: 4171–4186. doi: 10.18653/v1/N19-1423.
    [21] LI Bowen, QI Xiaojuan, LUKASIEWICZ T, et al. Controllable text-to-image generation[C]. Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, Canada, 2019: 185.
    [22] RUAN Shulan, ZHANG Yong, ZHANG Kun, et al. DAE-GAN: Dynamic aspect-aware GAN for text-to-image synthesis[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 13940–13949. doi: 10.1109/ICCV48922.2021.01370.
    [23] ZHANG L, ZHANG Y, LIU X, et al. Fine-grained text-to-image synthesis via semantic pyramid alignment[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, 2023: 23415–23425.
    [24] CHEN J, LIU Y, WANG H, et al. Improving text-image semantic consistency in generative adversarial networks via contrastive learning[J]. IEEE Transactions on Multimedia, 2024, 26: 5102–5113. doi: 10.1109/TMM.2024.3356781.
    [25] DENG Zhijun, HE Xiangteng, and PENG Yuxin. LFR-GAN: Local feature refinement based generative adversarial network for text-to-image generation[J]. ACM Transactions on Multimedia Computing, Communications and Applications, 2023, 19(5): 207. doi: 10.1145/358900.
    [26] YANG Bing, XIANG Xueqin, KONG Wangzeng, et al. DMF-GAN: Deep multimodal fusion generative adversarial networks for text-to-image synthesis[J]. IEEE Transactions on Multimedia, 2024, 26: 6956–6967. doi: 10.1109/TMM.2024.3358086.
    [27] WANG Z, ZHOU Y, SHI B, et al. Advances in controllable and disentangled representation learning for generative models[J]. International Journal of Computer Vision, 2023, 131(5): 1245–1263. doi: 10.1007/s11263-023-01785-y.
    [28] YUAN M and PENG Y. T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(5): 2754–2769. doi: 10.1109/TPAMI.2023.3330805.
    [29] SALIMANS T, GOODFELLOW I, ZAREMBA W, et al. Improved techniques for training GANs[C]. Proceedings of the 30th Conference on Neural Information Processing Systems, Barcelona, Spain, 2016: 2234–2242.
    [30] HEUSEL M, RAMSAUER H, UNTERTHINER T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium[C]. Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, USA, 2017: 6629–6640.
    [31] TAN Hongchen, LIU Xiuping, YIN Baocai, et al. DR-GAN: Distribution regularization for text-to-image generation[J]. IEEE Transactions on Neural Networks and Learning Systems, 2023, 34(12): 10309–10323. doi: 10.1109/TNNLS.2022.3165573.
    [32] WAH C, BRANSON S, WELINDER P, et al. The Caltech-UCSD birds-200–2011 dataset[R]. CNS-TR-2010-001, 2011.
    [33] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: Common objects in context[C]. Proceedings of 13th European Conference on Computer Vision -- ECCV 2014, Zurich, Switzerland, 2014: 740–755. doi: 10.1007/978-3-319-10602-1_48.
Publication history
  • Revised: 2025-12-22
  • Accepted: 2025-12-22
  • Published online: 2025-12-29
