Citation: GU Guanghua, SUN Wenxing, YI Boyu. Multi-code Deep Fusion Attention Generative Adversarial Networks for Text-to-Image Synthesis[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250516

Multi-code Deep Fusion Attention Generative Adversarial Networks for Text-to-Image Synthesis

doi: 10.11999/JEIT250516 cstr: 32379.14.JEIT250516
Funds: The National Natural Science Foundation of China (62072394); the Hebei Provincial Natural Science Foundation (F2024203049)
  • Accepted Date: 2025-12-22
  • Rev Recd Date: 2025-12-22
  • Available Online: 2025-12-29
  •   Objective  Text-to-image synthesis, a cornerstone task in multimodal AI, focuses on generating photorealistic images that accurately reflect natural language descriptions. This capability is crucial for applications ranging from creative design and education to data augmentation and human-computer interaction. However, achieving high fidelity in both visual detail and semantic alignment remains a formidable challenge. Predominant Generative Adversarial Network (GAN) based approaches typically condition generation on a single latent noise vector, a design that often proves inadequate for capturing the rich diversity of visual elements described in text. This can lead to outputs that lack intricate textures, nuanced colors, or specific structural details. Simultaneously, while attention mechanisms have improved semantic grounding, many models employ simplistic, single-focus attention that fails to model the complex, many-to-many relationships inherent in language, where a single phrase may describe multiple visual components, or a composite image region may be semantically linked to several words. These combined limitations result in a perceptible gap between textual intent and visual output. To bridge this gap, this work introduces a novel GAN architecture, the Multi-code Deep Feature Fusion Attention GAN (mDFA-GAN). Its primary objective is to fundamentally enhance the synthesis pipeline by simultaneously enriching the source of visual variation through multiple noise codes and deepening the semantic reasoning process via a multifaceted attention mechanism, thereby setting a new benchmark for detail accuracy and textual faithfulness in generated imagery.
  •   Methods  This paper proposes the Multi-code Deep Feature Fusion Attention Generative Adversarial Network (mDFA-GAN). The architecture introduces three key innovations within its generator. First, a multi-noise input mechanism is employed, utilizing several independent noise vectors instead of a single one. This provides a richer latent foundation, with each noise vector potentially specializing in a different visual aspect such as structure, texture, or color. Second, to effectively integrate information from these multiple sources, a Multi-code Prior Fusion Module is designed. It operates on intermediate feature maps using learnable, adaptive channel-wise weights, performing a weighted summation to dynamically combine distinct visual concepts into a coherent, detail-enriched representation. Third, a Multi-head Attention Module is integrated into the later stages of the generator. Unlike conventional single-head attention, which often links an image region to a single word, this module computes attention between image features and word embeddings across multiple independent heads. This allows each image region to be informed by a context derived from several relevant words, enabling more nuanced and accurate semantic grounding. The model is trained with a unidirectional discriminator using a conditional hinge loss augmented with a Matching-Aware zero-centered Gradient Penalty (MA-GP) to stabilize training and enforce text-image matching. Furthermore, a dedicated multi-code fusion loss is introduced to minimize the variance among features from different noise inputs, ensuring their spatial and semantic coherence for stable, high-quality synthesis. (Illustrative code sketches of the fusion module, the attention module, and the training losses are given after this abstract.)
  •   Results and Discussions  The proposed mDFA-GAN is evaluated on the CUB-200-2011 and MS COCO datasets. Qualitative results, as shown in (Fig. 7) and (Fig. 8), demonstrate its superior capability in generating images with accurate colors, fine-grained details (e.g., specific plumage patterns and object shapes), and coherent complex scenes. It successfully renders subtle textual attributes that baselines often miss. Quantitatively, mDFA-GAN achieves state-of-the-art performance. It obtains the highest Inception Score (IS) of 4.82 on CUB (Table 2), indicating better perceptual quality and alignment with text. Crucially, it achieves the lowest Fréchet Inception Distance (FID) scores of 13.45 on CUB and 16.50 on COCO (Table 3), outperforming established GANs such as DF-GAN, AttnGAN, and DM-GAN. This confirms that its generated images are statistically closer to real images. Ablation studies provide clear evidence for each component's contribution. Removing either the fusion module or the attention module degrades performance (Table 4), validating their respective roles in enhancing detail and semantic consistency. The proposed multi-code fusion loss is also shown to be essential for training stability and output quality (Table 5). An analysis of the number of noise codes identifies three as optimal (Table 6). Regarding efficiency, mDFA-GAN maintains a fast inference speed of 0.8 seconds per image (Table 7), retaining the speed advantage of GANs over slower diffusion models while offering significantly improved quality. (A sketch of the IS and FID computation is likewise given after this abstract.)
  •   Conclusions  In this paper, we propose mDFA-GAN, a novel and highly effective framework for text-to-image synthesis that makes significant strides in overcoming two persistent challenges: limited fine-grained detail and imperfect semantic alignment. By architecturally decoupling the latent representation into multiple specialized noise codes and fusing them adaptively, the model gains a superior capacity for generating intricate visual details. By employing a multi-head cross-modal attention mechanism, it achieves a deeper, more context-aware understanding of the text, leading to precise semantic grounding. Extensive experiments on established benchmarks demonstrate that mDFA-GAN sets a new state of the art for GAN-based methods, as quantitatively validated by superior IS and FID scores and qualitatively evidenced by highly detailed and semantically faithful image samples. Comprehensive ablation analyses conclusively validate the necessity and synergy of each proposed component. This work not only presents a powerful model for text-to-image generation but also offers architectural insights that could inform future research in multimodal representation learning and detailed visual synthesis.
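The abstract describes the Multi-code Prior Fusion Module only at the level of learnable, adaptive channel-wise weights and a weighted summation over code-specific feature maps. The following PyTorch sketch shows one plausible reading of that description; the class name MultiCodePriorFusion, the softmax over codes, and all dimensions are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class MultiCodePriorFusion(nn.Module):
    """Fuse feature maps derived from several noise codes with learnable,
    channel-wise weights (a sketch; one plausible reading of the module)."""

    def __init__(self, num_codes: int = 3, channels: int = 256):
        super().__init__()
        # One weight vector per code; a softmax over the code axis turns the
        # learnable logits into an adaptive, convex channel-wise combination.
        self.logits = nn.Parameter(torch.zeros(num_codes, channels))

    def forward(self, feats: list) -> torch.Tensor:
        # feats: list of K tensors of shape [B, C, H, W], one per noise code.
        stacked = torch.stack(feats, dim=0)              # [K, B, C, H, W]
        weights = torch.softmax(self.logits, dim=0)      # [K, C]
        weights = weights.view(len(feats), 1, -1, 1, 1)  # broadcast over B, H, W
        return (weights * stacked).sum(dim=0)            # [B, C, H, W]

# Usage: fuse three feature maps produced from three independent noise codes.
feats = [torch.randn(4, 256, 16, 16) for _ in range(3)]
fused = MultiCodePriorFusion(num_codes=3, channels=256)(feats)
print(fused.shape)  # torch.Size([4, 256, 16, 16])
```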
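The Multi-head Attention Module lets each image region draw context from several words rather than a single one. Below is a minimal sketch of such cross-modal attention built on PyTorch's nn.MultiheadAttention, with image features as queries and word embeddings as keys and values; the residual connection, head count, and dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CrossModalWordAttention(nn.Module):
    """Multi-head attention from image regions (queries) to word embeddings
    (keys/values), so one region can attend to several words (a sketch)."""

    def __init__(self, img_dim: int = 256, word_dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=img_dim, num_heads=heads,
                                          kdim=word_dim, vdim=word_dim,
                                          batch_first=True)

    def forward(self, img_feat: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # img_feat: [B, C, H, W] generator features; words: [B, L, word_dim].
        b, c, h, w = img_feat.shape
        queries = img_feat.flatten(2).transpose(1, 2)  # [B, H*W, C] region tokens
        context, _ = self.attn(queries, words, words)  # word context per region
        out = queries + context                        # residual mix-in (assumed)
        return out.transpose(1, 2).reshape(b, c, h, w)

# Usage with dummy generator features and a 16-word caption embedding.
img = torch.randn(2, 256, 16, 16)
words = torch.randn(2, 16, 256)
print(CrossModalWordAttention()(img, words).shape)  # torch.Size([2, 256, 16, 16])
```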
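The discriminator objective combines a conditional hinge loss with a Matching-Aware zero-centered Gradient Penalty (MA-GP). The sketch below follows the common DF-GAN-style formulation of these terms; the penalty coefficients (k = 2, p = 6) and the weighting of mismatched pairs are conventional settings assumed here, not values reported in the paper.

```python
import torch

def d_hinge_loss(d_real_match, d_fake, d_real_mismatch):
    """Conditional hinge loss for the discriminator: real images with matching
    text are pushed above +1, fake and mismatched real pairs below -1."""
    return (torch.relu(1.0 - d_real_match).mean()
            + 0.5 * torch.relu(1.0 + d_fake).mean()
            + 0.5 * torch.relu(1.0 + d_real_mismatch).mean())

def ma_gradient_penalty(discriminator, real_imgs, sent_emb, k=2.0, p=6.0):
    """Matching-Aware zero-centered Gradient Penalty (MA-GP): penalize the
    gradient norm of D at real images paired with their matching sentence
    embeddings (DF-GAN-style formulation; k and p are assumed defaults)."""
    real_imgs = real_imgs.requires_grad_(True)
    sent_emb = sent_emb.requires_grad_(True)
    scores = discriminator(real_imgs, sent_emb)
    grads = torch.autograd.grad(outputs=scores.sum(),
                                inputs=(real_imgs, sent_emb),
                                create_graph=True)
    grad_norm = torch.cat([g.flatten(1) for g in grads], dim=1).norm(2, dim=1)
    return k * (grad_norm ** p).mean()
```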
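The multi-code fusion loss is described as minimizing the variance among features produced from the different noise inputs. One plausible formulation, assuming the variance is taken across the K code-specific feature maps at each channel and spatial location, is sketched below.

```python
import torch

def multi_code_fusion_loss(feats):
    """Variance penalty across features generated from different noise codes
    (one plausible formulation of the multi-code fusion loss).

    feats: list of K tensors of shape [B, C, H, W]."""
    stacked = torch.stack(feats, dim=0)            # [K, B, C, H, W]
    variance = stacked.var(dim=0, unbiased=False)  # variance across the K codes
    return variance.mean()                         # scalar penalty

# Usage: encourage the three code-specific feature maps to stay coherent.
feats = [torch.randn(4, 256, 16, 16) for _ in range(3)]
print(multi_code_fusion_loss(feats))
```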
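IS and FID are standard generative-image metrics. The paper does not state its evaluation tooling, but a minimal sketch using the torchmetrics implementations (an assumed choice, requiring the torchmetrics[image] extra) looks like this; the dummy uint8 batches stand in for real dataset images and generator samples.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

# Dummy uint8 image batches in [0, 255]; a real evaluation would feed
# dataset images and generated samples at the same resolution.
real = torch.randint(0, 256, (32, 3, 256, 256), dtype=torch.uint8)
fake = torch.randint(0, 256, (32, 3, 256, 256), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)  # lower is better
fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())

inception = InceptionScore()  # higher is better
inception.update(fake)
is_mean, is_std = inception.compute()
print("IS:", is_mean.item(), "+/-", is_std.item())
```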
  • [1]
    GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks[J]. Communications of the ACM, 2020, 63(11): 139–144. doi: 10.1145/3422622.
    [2]
    TAO Ming, TANG Hao, WU Fei, et al. DF-GAN: A simple and effective baseline for text-to-image synthesis[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 16494–16504. doi: 10.1109/CVPR52688.2022.01602.
    [3]
    XU Tao, ZHANG Pengchuan, HUANG Qiuyuan, et al. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 1316–1324. doi: 10.1109/CVPR.2018.00143.
    [4]
    VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, USA, 2017: 6000–6010.
    [5]
    XUE A. End-to-end Chinese landscape painting creation using generative adversarial networks[C]. Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Waikoloa, USA, 2021: 3862–3870. doi: 10.1109/WACV48630.2021.00391.
    [6]
    SHAHRIAR S. GAN computers generate arts? A survey on visual arts, music, and literary text generation using generative adversarial network[J]. Displays, 2022, 73: 102237. doi: 10.1016/j.displa.2022.102237.
    [7]
    ISOLA P, ZHU Junyan, ZHOU Tinghui, et al. Image-to-image translation with conditional adversarial networks[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 5967–5976. doi: 10.1109/CVPR.2017.632.
    [8]
    ALOTAIBI A. Deep generative adversarial networks for image-to-image translation: A review[J]. Symmetry, 2020, 12(10): 1705. doi: 10.3390/sym12101705.
    [9]
    XIA Weihao, YANG Yujiu, XUE Jinghao, et al. TEDIGAN: Text-guided diverse face image generation and manipulation[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 2256–2265. doi: 10.1109/CVPR46437.2021.00229.
    [10]
    KOCASARI U, DIRIK A, TIFTIKCI M, et al. StyleMC: Multi-channel based fast text-guided image generation and manipulation[C]. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, USA, 2022: 3441–3450. doi: 10.1109/WACV51458.2022.00350.
    [11]
    SAHARIA C, CHAN W, CHANG H, et al. Photorealistic text-to-image diffusion models with deep language understanding[C]. IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2023: 15679–15689.
    [12]
    ZHANG Han, XU Tao, LI Hongsheng, et al. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks[C]. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 2017: 5908–5916. doi: 10.1109/ICCV.2017.629.
    [13]
    ZHANG Han, XU Tao, LI Hongsheng, et al. StackGAN++: Realistic image synthesis with stacked generative adversarial networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(8): 1947–1962. doi: 10.1109/TPAMI.2018.2856256.
    [14]
    LIAO Wentong, HU Kai, YANG M Y, et al. Text to image generation with semantic-spatial aware GAN[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 18166–18175. doi: 10.1109/CVPR52688.2022.01765.
    [15]
    TAO Ming, BAO Bingkun, TANG Hao, et al. GALIP: Generative adversarial CLIPs for text-to-image synthesis[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 14214–14223. doi: 10.1109/CVPR52729.2023.01366.
    [16]
    LU Cheng, ZHOU Yuhao, BAO Fan, et al. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models[J]. Machine Intelligence Research, 2025, 22(4): 730–751. doi: 10.1007/s11633-025-1562-4.
    [17]
    DING Ming, YANG Zhuoyi, HONG Wenyi, et al. CogView: Mastering text-to-image generation via transformers[C]. Proceedings of the 35th Conference on Neural Information Processing Systems, 2021: 19822–19835.
    [18]
    ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 10674–10685. doi: 10.1109/CVPR52688.2022.01042.
    [19]
    ZHAO Liang, HUANG Pingda, CHEN Tengtuo, et al. Multi-sentence complementarily generation for text-to-image synthesis[J]. IEEE Transactions on Multimedia, 2024, 26: 8323–8332. doi: 10.1109/TMM.2023.3297769.
    [20]
    DEVLIN J, CHANG Mingwei, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, USA, 2019: 4171–4186. doi: 10.18653/v1/N19-1423.
    [21]
    LI Bowen, QI Xiaojuan, LUKASIEWICZ T, et al. Controllable text-to-image generation[C]. Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, Canada, 2019: 185.
    [22]
    RUAN Shulan, ZHANG Yong, ZHANG Kun, et al. DAE-GAN: Dynamic aspect-aware GAN for text-to-image synthesis[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 13940–13949. doi: 10.1109/ICCV48922.2021.01370.
    [23]
    ZHANG L, ZHANG Y, LIU X, et al. Fine-grained text-to-image synthesis via semantic pyramid alignment[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, 2023: 23415–23425.
    [24]
    CHEN J, LIU Y, WANG H, et al. Improving text-image semantic consistency in generative adversarial networks via contrastive learning[J]. IEEE Transactions on Multimedia, 2024, 26: 5102–5113. doi: 10.1109/TMM.2024.3356781.
    [25]
    DENG Zhijun, HE Xiangteng, PENG Yuxin. LFR-GAN: Local feature refinement based generative adversarial network for text-to-image generation[J]. ACM Transactions on Multimedia Computing, Communications and Applications, 2023, 19(5): 207. doi: 10.1145/358900.
    [26]
    YANG Bing, XIANG Xueqin, KONG Wangzeng, et al. DMF-GAN: Deep multimodal fusion generative adversarial networks for text-to-image synthesis[J]. IEEE Transactions on Multimedia, 2024, 26: 6956–6967. doi: 10.1109/TMM.2024.3358086.
    [27]
    WANG Z, ZHOU Y, SHI B, et al. Advances in controllable and disentangled representation learning for generative models[J]. International Journal of Computer Vision, 2023, 131(5): 1245–1263. doi: 10.1007/s11263-023-01785-y.
    [28]
    YUAN M, PENG Y. T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(5): 2754–2769. doi: 10.1109/TPAMI.2023.3330805.
    [29]
    SALIMANS T, GOODFELLOW I, ZAREMBA W, et al. Improved techniques for training GANs[C]. Proceedings of the 30th Conference on Neural Information Processing Systems, Barcelona, Spain, 2016: 2234–2242.
    [30]
    HEUSEL M, RAMSAUER H, UNTERTHINER T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium[C]. Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, USA, 2017: 6629–6640.
    [31]
    TAN Hongchen, LIU Xiuping, YIN Baocai, et al. DR-GAN: Distribution regularization for text-to-image generation[J]. IEEE Transactions on Neural Networks and Learning Systems, 2023, 34(12): 10309–10323. doi: 10.1109/TNNLS.2022.3165573.
    [32]
    WAH C, BRANSON S, WELINDER P, et al. The Caltech-UCSD Birds-200-2011 dataset[R]. CNS-TR-2010-001, 2011.
    [33]
    LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: Common objects in context[C]. Proceedings of the 13th European Conference on Computer Vision (ECCV 2014), Zurich, Switzerland, 2014: 740–755. doi: 10.1007/978-3-319-10602-1_48.