A Picture is Worth a Thousand Words: Using Cross-Modal Transformers and Variational AutoEncoders to Generate Images from Text

Authors

  • Satyajit Kumar, Portola High School
  • Ehsan Adeli, Stanford University

DOI:

https://doi.org/10.47611/jsrhs.v10i4.2106

Keywords:

Transformer, VAE, Text-To-Image, multi-modal

Abstract

Text-to-image generation is one of the most complex problems in deep learning, and approaches based on Recurrent Neural Networks (RNNs) and Generative Adversarial Networks (GANs) have seen significant success. However, GANs prioritize the sharpness of the generated image over capturing all the nuances of the input text. Given that Transformers have recently outperformed RNNs and other neural network models in both the text and image domains, we explored whether Transformer models can also perform better on multi-modal tasks such as text-to-image synthesis. Based on an evaluation of five Transformer-based models on the MS-COCO dataset, we conclude that Transformers do perform better, but require a significant amount of memory and compute resources.
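The article's code is not reproduced on this page. As a rough illustration of the kind of pipeline the abstract and the cited Ramesh et al. (2021) and Oord et al. (2017) references describe, the following is a minimal sketch: a discrete VAE compresses an image into a grid of codebook tokens, and a causally masked Transformer models the concatenated text-and-image token sequence autoregressively. All module names, dimensions, and vocabulary sizes here are illustrative assumptions, not the authors' actual architecture.

```python
# Minimal sketch (PyTorch) of a discrete-VAE + Transformer text-to-image
# pipeline. Sizes and vocabularies are illustrative, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup in the spirit of VQ-VAE
    (Oord et al., 2017)."""
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                      # z: (B, dim, H, W) encoder output
        flat = z.permute(0, 2, 3, 1).reshape(-1, z.shape[1])
        dists = torch.cdist(flat, self.codebook.weight)
        idx = dists.argmin(dim=1)              # discrete image tokens
        quant = self.codebook(idx).view(z.shape[0], *z.shape[2:], -1)
        return quant.permute(0, 3, 1, 2), idx.view(z.shape[0], -1)

class TextToImageTransformer(nn.Module):
    """Autoregressive Transformer over text tokens followed by image tokens."""
    def __init__(self, text_vocab=30000, image_vocab=512, dim=256, seq_len=320):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, dim)
        self.image_emb = nn.Embedding(image_vocab, dim)
        self.pos_emb = nn.Embedding(seq_len, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.to_logits = nn.Linear(dim, image_vocab)

    def forward(self, text_tokens, image_tokens):
        x = torch.cat([self.text_emb(text_tokens),
                       self.image_emb(image_tokens)], dim=1)
        x = x + self.pos_emb(torch.arange(x.shape[1], device=x.device))
        # Causal mask: each position attends only to earlier positions.
        L = x.shape[1]
        mask = torch.triu(torch.full((L, L), float('-inf'), device=x.device),
                          diagonal=1)
        h = self.blocks(x, mask=mask)
        # Predict each image token from everything that precedes it.
        n_img = image_tokens.shape[1]
        return self.to_logits(h[:, -n_img - 1:-1])

# Illustrative usage: encoder features -> discrete image tokens -> Transformer.
vq = VectorQuantizer(num_codes=512, dim=64)
feats = torch.randn(2, 64, 16, 16)             # stand-in for CNN encoder output
_, img_tokens = vq(feats)                      # (2, 256) codebook indices
text = torch.randint(0, 30000, (2, 64))        # stand-in for tokenized captions
model = TextToImageTransformer()
logits = model(text, img_tokens)               # (2, 256, 512)
loss = F.cross_entropy(logits.reshape(-1, 512), img_tokens.reshape(-1))
```

At generation time, one would sample image tokens one at a time conditioned on the text prefix and decode them back to pixels with the VAE decoder; that loop is omitted here for brevity.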

Author Biography

Ehsan Adeli, Stanford University

Advisor

References or Bibliography

Cornia, M., Stefanini, M., Baraldi, L., & Cucchiara, R. (2020). Meshed-Memory Transformer for Image Captioning. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10575-10584. https://doi.org/10.1109/CVPR42600.2020.01059

Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL. https://doi.org/10.18653/v1/N19-1423

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.C., & Bengio, Y. (2014). Generative Adversarial Networks. ArXiv, abs/1406.2661. https://arxiv.org/abs/1406.2661

Kingma, D.P., & Welling, M. (2014). Auto-Encoding Variational Bayes. CoRR, abs/1312.6114. https://arxiv.org/abs/1312.6114

Kurach, K., Lucic, M., Zhai, X., Michalski, M., & Gelly, S. (2019). A Large-Scale Study on Regularization and Normalization in GANs. ICML. http://proceedings.mlr.press/v97/kurach19a.html

Lin, T., Maire, M., Belongie, S.J., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C.L. (2014). Microsoft COCO: Common Objects in Context. ECCV. https://arxiv.org/abs/1405.0312

Oord, A.V., Vinyals, O., & Kavukcuoglu, K. (2017). Neural Discrete Representation Learning. NIPS. https://arxiv.org/abs/1711.00937

Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N.M., Ku, A., & Tran, D. (2018). Image Transformer. ArXiv, abs/1802.05751. https://arxiv.org/abs/1802.05751

Qiao, T., Zhang, J., Xu, D., & Tao, D. (2019). MirrorGAN: Learning Text-To-Image Generation by Redescription. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1505-1514. https://doi.org/10.1109/CVPR.2019.00160

Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., & Sutskever, I. (2021). Zero-Shot Text-to-Image Generation. ArXiv, abs/2102.12092. https://arxiv.org/abs/2102.12092

Razavi, A., Oord, A.V., & Vinyals, O. (2019). Generating Diverse High-Fidelity Images with VQ-VAE-2. ArXiv, abs/1906.00446. https://arxiv.org/abs/1906.00446

Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. ArXiv, abs/1908.10084. https://doi.org/10.18653/v1/D19-1410

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., & Chen, X. (2016). Improved Techniques for Training GANs. NIPS. https://arxiv.org/abs/1606.03498

Schuster, M., & Paliwal, K. (1997). Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45, 2673-2681. https://ieeexplore.ieee.org/document/650093

Vaswani, A., Shazeer, N.M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., & Polosukhin, I. (2017). Attention is All you Need. ArXiv, abs/1706.03762. https://arxiv.org/abs/1706.03762

Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., & He, X. (2018). AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1316-1324. https://doi.org/10.1109/CVPR.2018.00143

Published

11-30-2021

How to Cite

Kumar, S., & Adeli, E. (2021). A Picture is Worth a Thousand Words: Using Cross-Modal Transformers and Variational AutoEncoders to Generate Images from Text. Journal of Student Research, 10(4). https://doi.org/10.47611/jsrhs.v10i4.2106

Issue

Vol. 10 No. 4 (2021)

Section

HS Research Articles