Multi-code Deep Fusion Attention Generative Adversarial Networks for Text-to-Image Synthesis
Abstract: Text-to-image synthesis is a highly challenging cross-modal task whose core goal is to generate high-quality images that are richly detailed and closely consistent with a given textual description. Current generative adversarial network (GAN) based methods mostly rely on a single noise input, which limits the fine-grained detail of the generated images; at the same time, insufficient use of word-level features constrains the precision of semantic alignment between text and image. To address these issues, this paper proposes a Multi-code Deep Feature Fusion Attention Generative Adversarial Network (mDFA-GAN). The method designs a multi-noise-input generator and a multi-code prior fusion module to enrich the detail of generated images, introduces a multi-head attention mechanism into the generator to align words with image sub-regions from multiple perspectives and thereby strengthen semantic consistency, and further proposes a multi-code prior fusion loss to stabilize training. Experiments on the CUB and COCO datasets show that the proposed method outperforms current mainstream GAN-based methods on both the IS and FID metrics and generates images that are more photorealistic, more detailed, and more semantically consistent.
Objective  Text-to-image synthesis, a cornerstone task in multimodal AI, focuses on generating photorealistic images that accurately reflect natural language descriptions. This capability is crucial for applications ranging from creative design and education to data augmentation and human-computer interaction. However, achieving high fidelity in both visual detail and semantic alignment remains a formidable challenge. Predominant Generative Adversarial Network (GAN) based approaches typically condition generation on a single latent noise vector, a design that often proves inadequate for capturing the rich diversity of visual elements described in text; the resulting outputs can lack intricate textures, nuanced colors, or specific structural details. At the same time, while attention mechanisms have improved semantic grounding, many models employ simplistic, single-focus attention that fails to model the complex, many-to-many relationships inherent in language, where a single phrase may describe multiple visual components or a composite image region may be semantically linked to several words. These combined limitations leave a perceptible gap between textual intent and visual output. To bridge this gap, this work introduces a novel GAN architecture, the Multi-code Deep Feature Fusion Attention GAN (mDFA-GAN). Its primary objective is to enhance the synthesis pipeline by simultaneously enriching the source of visual variation through multiple noise codes and deepening the semantic reasoning process via a multifaceted attention mechanism, thereby improving detail accuracy and textual faithfulness in generated imagery.

Methods  This paper proposes the Multi-code Deep Feature Fusion Attention Generative Adversarial Network (mDFA-GAN), whose generator introduces three key innovations. First, a multi-noise input mechanism employs several independent noise vectors instead of a single one, providing a richer latent foundation in which each noise code can specialize in a different visual aspect such as structure, texture, or color. Second, to integrate information from these multiple sources, a Multi-code Prior Fusion Module is designed; it operates on intermediate feature maps with learnable, adaptive channel-wise weights, performing a weighted summation that dynamically combines distinct visual concepts into a coherent, detail-enriched representation. Third, a Multi-head Attention Module is integrated into the generator's later stages. Unlike conventional single-head attention, which often links an image region to a single word, this module computes attention between image features and word embeddings across multiple independent heads, so that each image region is informed by context derived from several relevant words, enabling more nuanced and accurate semantic grounding. The model is trained with a unidirectional discriminator using a conditional hinge loss augmented with a Matching-Aware zero-centered Gradient Penalty (MA-GP) to stabilize training and enforce text-image matching. In addition, a dedicated multi-code fusion loss minimizes the variance among features generated from different noise inputs, keeping them spatially and semantically coherent for stable, high-quality synthesis.
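The paper does not provide source code; to make the fusion step concrete, the following is a minimal PyTorch sketch of a channel-wise weighted summation over per-code feature maps. The module name MultiCodePriorFusion, the softmax normalization over codes, and the (K, B, C, H, W) tensor layout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiCodePriorFusion(nn.Module):
    """Hypothetical sketch: fuse K intermediate feature maps (one per
    noise code) with learnable, adaptive channel-wise weights."""

    def __init__(self, num_codes: int, channels: int):
        super().__init__()
        # One learnable logit per (code, channel); softmax over the code
        # axis keeps the fusion a convex, weighted-sum combination.
        self.logits = nn.Parameter(torch.zeros(num_codes, channels))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (K, B, C, H, W) -- one feature map per noise code.
        w = torch.softmax(self.logits, dim=0)        # (K, C)
        w = w.view(w.size(0), 1, w.size(1), 1, 1)    # broadcast over B, H, W
        return (w * feats).sum(dim=0)                # fused map: (B, C, H, W)

# Usage sketch: three noise codes, 256-channel 16x16 feature maps.
fusion = MultiCodePriorFusion(num_codes=3, channels=256)
fused = fusion(torch.randn(3, 4, 256, 16, 16))       # -> (4, 256, 16, 16)
```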
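The multi-head word-to-region attention can likewise be approximated with PyTorch's built-in nn.MultiheadAttention, treating flattened image sub-regions as queries and word embeddings as keys and values. This is a hedged sketch only: the residual injection, the feature dimensions, and the head count are illustrative choices rather than the published architecture.

```python
import torch
import torch.nn as nn

class WordRegionAttention(nn.Module):
    """Hypothetical sketch: image sub-regions attend to word embeddings
    through several independent attention heads."""

    def __init__(self, channels: int = 256, word_dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=channels, num_heads=heads,
                                          kdim=word_dim, vdim=word_dim,
                                          batch_first=True)

    def forward(self, img_feat: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, C, H, W) generator feature map
        # words:    (B, L, word_dim) word-level text embeddings
        b, c, h, w = img_feat.shape
        regions = img_feat.flatten(2).transpose(1, 2)   # (B, H*W, C) queries
        ctx, _ = self.attn(regions, words, words)       # word context per region
        ctx = ctx.transpose(1, 2).reshape(b, c, h, w)
        return img_feat + ctx                           # residual injection

# Usage sketch: a 16x16 feature map attending to an 18-word caption.
attn = WordRegionAttention()
out = attn(torch.randn(2, 256, 16, 16), torch.randn(2, 18, 256))
```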
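The two training terms described above can be sketched in the same spirit. The variance penalty below follows the textual description of the multi-code fusion loss (minimizing variance among per-code features), and the gradient-penalty term follows the general MA-GP formulation used by DF-GAN; the hyperparameters k and p and the exact discriminator inputs are assumptions, not values taken from the paper.

```python
import torch

def multi_code_fusion_loss(feats: torch.Tensor) -> torch.Tensor:
    # feats: (K, B, C, H, W) -- intermediate features from the K noise codes.
    # Penalizing their variance across the code axis encourages the per-code
    # features to stay spatially and semantically coherent before fusion.
    return feats.var(dim=0, unbiased=False).mean()

def ma_gradient_penalty(discriminator, real_imgs, sent_emb, k=2.0, p=6.0):
    # Matching-aware zero-centered gradient penalty: penalize the gradient
    # norm of the discriminator at (real image, matching sentence embedding).
    real_imgs = real_imgs.detach().requires_grad_(True)
    sent_emb = sent_emb.detach().requires_grad_(True)
    scores = discriminator(real_imgs, sent_emb)
    grads = torch.autograd.grad(outputs=scores.sum(),
                                inputs=(real_imgs, sent_emb),
                                create_graph=True)
    grad_norm = torch.cat([g.flatten(1) for g in grads], dim=1).norm(2, dim=1)
    return k * grad_norm.pow(p).mean()
```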
Results and Discussions  The proposed mDFA-GAN is evaluated on the CUB-200-2011 and MS COCO datasets. Qualitative results, shown in (Fig. 7) and (Fig. 8), demonstrate its superior capability in generating images with accurate colors, fine-grained details (e.g., specific plumage patterns and object shapes), and coherent complex scenes; it successfully renders subtle textual attributes often missed by the baselines. Quantitatively, mDFA-GAN achieves state-of-the-art performance among GAN-based methods. It obtains the highest Inception Score (IS) of 4.82 on CUB (Table 2), indicating better perceptual quality and alignment with the text, and it achieves the lowest Fréchet Inception Distance (FID) scores of 13.45 on CUB and 16.50 on COCO (Table 3), outperforming established GANs such as DF-GAN, AttnGAN, and DM-GAN. This confirms that its generated images are statistically closer to real images. Ablation studies provide clear evidence for each component's contribution: removing either the fusion module or the attention module degrades performance (Table 4), validating their respective roles in enhancing detail and semantic consistency, and the proposed multi-code fusion loss is shown to be essential for training stability and output quality (Table 5). An analysis of the number of noise codes identifies three as optimal (Table 6). Regarding efficiency, mDFA-GAN maintains a fast inference speed of 0.8 seconds per image (Table 7), retaining the speed advantage of GANs over slower diffusion models while offering significantly improved quality.

Conclusions  In this paper, we propose mDFA-GAN, a novel and effective framework for text-to-image synthesis that addresses two persistent challenges: limited fine-grained detail and imperfect semantic alignment. By decoupling the latent representation into multiple specialized noise codes and fusing them adaptively, the model gains a superior capacity for generating intricate visual details; by employing a multi-head cross-modal attention mechanism, it achieves a deeper, more context-aware understanding of the text, leading to precise semantic grounding. Extensive experiments on established benchmarks demonstrate that mDFA-GAN sets a new state of the art among GAN-based methods, as quantitatively validated by superior IS and FID scores and qualitatively evidenced by highly detailed and semantically faithful image samples. Comprehensive ablation analyses validate the necessity and synergy of each proposed component. This work not only presents a powerful model for text-to-image generation but also offers architectural insights that could inform future research in multimodal representation learning and detailed visual synthesis.
Key words: Generative Adversarial Network; Text-to-Image; Cross-Modal; Multi-Code Prior Fusion
Table 1  IS scores on the CUB dataset compared with state-of-the-art methods
Method       Inception Score (IS)↑
StackGAN++   4.04±0.06
AttnGAN      4.36±0.03
DM-GAN       4.47±0.19
DR-GAN       4.66±0.15
DF-GAN       4.61±0.12
SSA-GAN      4.70±0.08
Ours         4.82±0.10

Table 2  FID scores on the CUB and COCO datasets compared with state-of-the-art methods
Method       CUB-FID↓   COCO-FID↓
StackGAN++   15.30      81.59
AttnGAN      23.98      35.49
DM-GAN       16.09      32.64
DR-GAN       14.96      27.80
DF-GAN       14.81      19.32
SSA-GAN      15.61      19.37
CogView      -          27.10
LDM-8        -          23.31
Ours         13.45      16.50

Table 3  Ablation study of the model components
Baseline   Multi-code fusion module   Multi-head attention   CUB-FID↓   COCO-FID↓
√          -                          -                      14.81      19.32
√          √                          -                      14.21      17.94
√          -                          √                      13.75      17.19
√          √                          √                      13.45      16.50

Table 4  Effect of the number of noise codes n
n            1       2       3       4       5
CUB-FID↓     14.21   13.67   13.45   13.65   13.70
COCO-FID↓    17.94   16.95   16.50   16.86   16.90

Table 5  Inference speed comparison
Method            DF-GAN   Stable Diffusion   DALLE2   Ours
Inference speed   0.5 s    1.4 s              2 s      0.8 s
[1] GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks[J]. Communications of the ACM, 2020, 63(11): 139–144. doi: 10.1145/3422622.
[2] TAO Ming, TANG Hao, WU Fei, et al. DF-GAN: A simple and effective baseline for text-to-image synthesis[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 16494–16504. doi: 10.1109/CVPR52688.2022.01602.
[3] XU Tao, ZHANG Pengchuan, HUANG Qiuyuan, et al. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 1316–1324. doi: 10.1109/CVPR.2018.00143.
[4] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, USA, 2017: 6000–6010.
[5] XUE A. End-to-end Chinese landscape painting creation using generative adversarial networks[C]. Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Waikoloa, USA, 2021: 3862–3870. doi: 10.1109/WACV48630.2021.00391.
[6] SHAHRIAR S. GAN computers generate arts? A survey on visual arts, music, and literary text generation using generative adversarial network[J]. Displays, 2022, 73: 102237. doi: 10.1016/j.displa.2022.102237.
[7] ISOLA P, ZHU Junyan, ZHOU Tinghui, et al. Image-to-image translation with conditional adversarial networks[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 5967–5976. doi: 10.1109/CVPR.2017.632.
[8] ALOTAIBI A. Deep generative adversarial networks for image-to-image translation: A review[J]. Symmetry, 2020, 12(10): 1705. doi: 10.3390/sym12101705.
[9] XIA Weihao, YANG Yujiu, XUE Jinghao, et al. TEDIGAN: Text-guided diverse face image generation and manipulation[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 2256–2265. doi: 10.1109/CVPR46437.2021.00229.
[10] KOCASARI U, DIRIK A, TIFTIKCI M, et al. StyleMC: Multi-channel based fast text-guided image generation and manipulation[C]. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, USA, 2022: 3441–3450. doi: 10.1109/WACV51458.2022.00350.
[11] SAHARIA C, CHAN W, CHANG H, et al. Photorealistic text-to-image diffusion models with deep language understanding[C]. IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2023: 15679–15689.
[12] ZHANG Han, XU Tao, LI Hongsheng, et al. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks[C]. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 2017: 5908–5916. doi: 10.1109/ICCV.2017.629.
[13] ZHANG Han, XU Tao, LI Hongsheng, et al. StackGAN++: Realistic image synthesis with stacked generative adversarial networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(8): 1947–1962. doi: 10.1109/TPAMI.2018.2856256.
[14] LIAO Wentong, HU Kai, YANG M Y, et al. Text to image generation with semantic-spatial aware GAN[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 18166–18175. doi: 10.1109/CVPR52688.2022.01765.
[15] TAO Ming, BAO Bingkun, TANG Hao, et al. GALIP: Generative adversarial CLIPs for text-to-image synthesis[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 14214–14223. doi: 10.1109/CVPR52729.2023.01366.
[16] LU Cheng, ZHOU Yuhao, BAO Fan, et al. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models[J]. Machine Intelligence Research, 2025, 22(4): 730–751. doi: 10.1007/s11633-025-1562-4.
[17] DING Ming, YANG Zhuoyi, HONG Wenyi, et al. CogView: Mastering text-to-image generation via transformers[C]. Proceedings of the 35th Conference on Neural Information Processing Systems, 2021: 19822–19835.
[18] ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 10674–10685. doi: 10.1109/CVPR52688.2022.01042.
[19] ZHAO Liang, HUANG Pingda, CHEN Tengtuo, et al. Multi-sentence complementarily generation for text-to-image synthesis[J]. IEEE Transactions on Multimedia, 2024, 26: 8323–8332. doi: 10.1109/TMM.2023.3297769.
[20] DEVLIN J, CHANG Mingwei, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, USA, 2019: 4171–4186. doi: 10.18653/v1/N19-1423.
[21] LI Bowen, QI Xiaojuan, LUKASIEWICZ T, et al. Controllable text-to-image generation[C]. Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, Canada, 2019: 185.
[22] RUAN Shulan, ZHANG Yong, ZHANG Kun, et al. DAE-GAN: Dynamic aspect-aware GAN for text-to-image synthesis[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 13940–13949. doi: 10.1109/ICCV48922.2021.01370.
[23] ZHANG L, ZHANG Y, LIU X, et al. Fine-grained text-to-image synthesis via semantic pyramid alignment[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, 2023: 23415–23425.
[24] CHEN J, LIU Y, WANG H, et al. Improving text-image semantic consistency in generative adversarial networks via contrastive learning[J]. IEEE Transactions on Multimedia, 2024, 26: 5102–5113. doi: 10.1109/TMM.2024.3356781.
[25] DENG Zhijun, HE Xiangteng, PENG Yuxin. LFR-GAN: Local feature refinement based generative adversarial network for text-to-image generation[J]. ACM Transactions on Multimedia Computing, Communications and Applications, 2023, 19(5): 207. doi: 10.1145/358900.
[26] YANG Bing, XIANG Xueqin, KONG Wangzeng, et al. DMF-GAN: Deep multimodal fusion generative adversarial networks for text-to-image synthesis[J]. IEEE Transactions on Multimedia, 2024, 26: 6956–6967. doi: 10.1109/TMM.2024.3358086.
[27] WANG Z, ZHOU Y, SHI B, et al. Advances in controllable and disentangled representation learning for generative models[J]. International Journal of Computer Vision, 2023, 131(5): 1245–1263. doi: 10.1007/s11263-023-01785-y.
[28] YUAN M, PENG Y. T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(5): 2754–2769. doi: 10.1109/TPAMI.2023.3330805.
[29] SALIMANS T, GOODFELLOW I, ZAREMBA W, et al. Improved techniques for training GANs[C]. Proceedings of the 30th Conference on Neural Information Processing Systems, Barcelona, Spain, 2016: 2234–2242.
[30] HEUSEL M, RAMSAUER H, UNTERTHINER T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium[C]. Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, USA, 2017: 6629–6640.
[31] TAN Hongchen, LIU Xiuping, YIN Baocai, et al. DR-GAN: Distribution regularization for text-to-image generation[J]. IEEE Transactions on Neural Networks and Learning Systems, 2023, 34(12): 10309–10323. doi: 10.1109/TNNLS.2022.3165573.
[32] WAH C, BRANSON S, WELINDER P, et al. The Caltech-UCSD Birds-200-2011 dataset[R]. CNS-TR-2010-001, 2011.
[33] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: Common objects in context[C]. Proceedings of the 13th European Conference on Computer Vision (ECCV 2014), Zurich, Switzerland, 2014: 740–755. doi: 10.1007/978-3-319-10602-1_48.