CRD-CGAN: category-consistent and relativistic constraints for diverse text-to-image generation

Tao HU; Chengjiang LONG; Chunxia XIAO

doi:10.1007/s11704-022-2385-x

Front. Comput. Sci., 2024, Vol. 18, Issue 1: 181304. DOI: 10.1007/s11704-022-2385-x

Artificial Intelligence

RESEARCH ARTICLE

Tao HU¹,²,³, Chengjiang LONG⁴, Chunxia XIAO²

Abstract

Generating photo-realistic images from a text description is a challenging problem in computer vision. Previous work has shown promising results in generating synthetic images conditioned on text with Generative Adversarial Networks (GANs). In this paper, we focus on category-consistent and relativistic diverse constraints to optimize the diversity of synthetic images. Based on those constraints, a category-consistent and relativistic diverse conditional GAN (CRD-CGAN) is proposed to synthesize K photo-realistic images simultaneously. We use an attention loss and a diversity loss to improve the sensitivity of the GAN to word attention and noise. Then, we employ a relativistic conditional loss to estimate the probability that a synthetic image is relatively real or fake, which improves on the basic conditional loss. Finally, we introduce a category-consistent loss to alleviate the over-category issues among the K synthetic images. We evaluate our approach on the Caltech-UCSD Birds-200-2011, Oxford 102 Flower, and MS COCO 2014 datasets, and extensive experiments demonstrate the superiority of the proposed method over state-of-the-art methods in terms of the photorealism and diversity of the generated synthetic images.
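To make the idea of a "relatively real or fake" score concrete, the following is a minimal sketch of the standard (unconditional) relativistic GAN objective, in which the discriminator is trained on the difference between the critic scores of a real and a fake sample. It is an illustration only and not the relativistic conditional loss defined in the paper: the function names are hypothetical, the scores are plain Python floats standing in for critic outputs, and the text conditioning and the K-image setting are omitted.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x)) = -log(1 + exp(-x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def relativistic_d_loss(real_scores, fake_scores):
    """Discriminator loss: push C(real) above C(fake) for each pair.

    Assumption: real_scores[i] and fake_scores[i] are raw critic outputs
    (logits) for a paired real and synthetic image.
    """
    pairs = list(zip(real_scores, fake_scores))
    return -sum(log_sigmoid(r - f) for r, f in pairs) / len(pairs)


def relativistic_g_loss(real_scores, fake_scores):
    """Generator loss: the mirror image, push C(fake) above C(real)."""
    pairs = list(zip(real_scores, fake_scores))
    return -sum(log_sigmoid(f - r) for r, f in pairs) / len(pairs)
```

When the critic cannot tell the two samples apart (equal scores), both losses equal log 2; the discriminator loss shrinks as real samples score higher than fake ones, and the generator loss shrinks in the opposite case.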

Keywords

text-to-image / diverse conditional GAN / relativistic category-consistent

Cite this article

Tao HU, Chengjiang LONG, Chunxia XIAO. CRD-CGAN: category-consistent and relativistic constraints for diverse text-to-image generation. Front. Comput. Sci., 2024, 18(1): 181304 https://doi.org/10.1007/s11704-022-2385-x

Tao Hu received his PhD degree in computer science from the School of Computer Science, Wuhan University, China in 2020. His current research interests include deep learning and image processing.

Chengjiang Long received his PhD degree in computer science from Stevens Institute of Technology, USA in 2015. His research interests span computer vision, computer graphics, machine learning, and artificial intelligence.

Chunxia Xiao is currently a professor at the School of Computer Science, Wuhan University, China. He received his PhD from the State Key Lab of CAD & CG of Zhejiang University, China in 2006. His research areas include computer graphics, computer vision, image processing, virtual reality, and augmented reality. He has published more than 120 papers in journals and conferences.

Acknowledgements

This work was co-supervised by Chengjiang Long and Chunxia Xiao, and supported by the National Natural Science Foundation of China (Grant Nos. 61972298 and 61962019), and by the National Cultural and Tourism Science and Technology Innovation Project (2021064), and the Training Program of High Level Scientific Research Achievements of Hubei Minzu University under Grant PY22011.