A Picture is Worth a Thousand Words: Using Cross-Modal Transformers and Variational AutoEncoders to Generate Images from Text
DOI: https://doi.org/10.47611/jsrhs.v10i4.2106

Keywords: Transformer, VAE, Text-To-Image, multi-modal

Abstract
Text-to-image generation is one of the most complex problems in deep learning, and approaches based on Recurrent Neural Networks (RNNs) and Generative Adversarial Networks (GANs) have seen significant success. However, GANs tend to prioritize image sharpness over faithfully capturing the nuances of the input text. Given that Transformers have recently outperformed RNNs and other neural network models in both the text and image domains, we explored whether Transformer models can also perform better on multi-modal tasks such as text-to-image synthesis. Based on evaluating five Transformer-based models on the MS-COCO dataset, we conclude that Transformers do perform better but require significant memory and compute resources.
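The two-stage, DALL·E-style pipeline the abstract alludes to — a discrete VAE (VQ-VAE) that maps image patches to codebook tokens, and a Transformer that autoregressively models the concatenated text-and-image token sequence — can be sketched as below. This is a minimal illustrative sketch, not the paper's code: `encode_text`, `sample_image_tokens`, and the stand-in `toy_predict` are hypothetical names, and the hash-based predictor merely stands in for a trained Transformer.

```python
# Illustrative sketch of autoregressive cross-modal generation:
# the model conditions on text tokens plus previously generated image tokens,
# emitting one discrete image token (VQ-VAE codebook index) per step.

def encode_text(caption, vocab):
    # Map words to integer token ids (unknown words -> 0).
    return [vocab.get(w, 0) for w in caption.lower().split()]

def sample_image_tokens(text_tokens, n_image_tokens, codebook_size, predict_next):
    # Autoregressive decoding: `predict_next` stands in for a trained
    # Transformer that scores the next token given the full context.
    seq = list(text_tokens)
    image_tokens = []
    for _ in range(n_image_tokens):
        tok = predict_next(seq) % codebook_size
        image_tokens.append(tok)
        seq.append(tok)
    # In a real system these codebook indices would be fed to the
    # VQ-VAE decoder to reconstruct pixels.
    return image_tokens

# Toy deterministic stand-in for a trained Transformer predictor.
toy_predict = lambda seq: (sum(seq) * 31 + len(seq)) % 997

vocab = {"a": 1, "red": 2, "bird": 3}
tokens = sample_image_tokens(encode_text("a red bird", vocab), 16, 512, toy_predict)
```

The key design point this sketch illustrates is that once images are discretized into tokens, text-to-image synthesis reduces to ordinary next-token language modeling over a mixed-modality sequence.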
Copyright (c) 2021 Satyajit Kumar; Ehsan Adeli
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Copyright holder(s) granted JSR a perpetual, non-exclusive license to distribute & display this article.