A Picture is Worth a Thousand Words: Using Cross-Modal Transformers and Variational AutoEncoders to Generate Images from Text
DOI: https://doi.org/10.47611/jsrhs.v10i4.2106

Keywords: Transformer, VAE, Text-To-Image, multi-modal

Abstract
Text-to-image generation is one of the most complex problems in deep learning, and approaches based on Recurrent Neural Networks (RNNs) and Generative Adversarial Networks (GANs) have seen significant success. However, GANs tend to prioritize image sharpness over faithfully capturing the nuances of the input text. Given that Transformers have recently outperformed RNNs and other neural network models in both the text and image domains, we explored whether Transformer models can also perform better on multi-modal tasks such as text-to-image synthesis. Based on evaluating five Transformer-based models on the MS-COCO dataset, we conclude that Transformers do perform better but require significant memory and compute resources.
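The two-stage, DALL·E-style pipeline the abstract alludes to — a discrete VAE (VQ-VAE) that maps image patches to codebook tokens, and a Transformer that autoregressively models the concatenated text-and-image token sequence — can be sketched as below. This is a minimal illustrative sketch, not the paper's code: `encode_text`, `sample_image_tokens`, and the stand-in `toy_predict` are hypothetical names, and the hash-based predictor merely stands in for a trained Transformer.

```python
# Illustrative sketch of autoregressive cross-modal generation:
# the model conditions on text tokens plus previously generated image tokens,
# emitting one discrete image token (VQ-VAE codebook index) per step.

def encode_text(caption, vocab):
    # Map words to integer token ids (unknown words -> 0).
    return [vocab.get(w, 0) for w in caption.lower().split()]

def sample_image_tokens(text_tokens, n_image_tokens, codebook_size, predict_next):
    # Autoregressive decoding: `predict_next` stands in for a trained
    # Transformer that scores the next token given the full context.
    seq = list(text_tokens)
    image_tokens = []
    for _ in range(n_image_tokens):
        tok = predict_next(seq) % codebook_size
        image_tokens.append(tok)
        seq.append(tok)
    # In a real system these codebook indices would be fed to the
    # VQ-VAE decoder to reconstruct pixels.
    return image_tokens

# Toy deterministic stand-in for a trained Transformer predictor.
toy_predict = lambda seq: (sum(seq) * 31 + len(seq)) % 997

vocab = {"a": 1, "red": 2, "bird": 3}
tokens = sample_image_tokens(encode_text("a red bird", vocab), 16, 512, toy_predict)
```

The key design point this sketch illustrates is that once images are discretized into tokens, text-to-image synthesis reduces to ordinary next-token language modeling over a mixed-modality sequence.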
Copyright (c) 2021 Satyajit Kumar; Ehsan Adeli
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Copyright holder(s) granted JSR a perpetual, non-exclusive license to distribute & display this article.