Evaluating State-of-the-Art Visual Question Answering Models' Ability to Answer Complex Counting Questions

Authors

  • Krish Gangaraju, The International School Bangalore
  • Khaled Jedoui (Mentor), Stanford University

DOI:

https://doi.org/10.47611/jsrhs.v10i4.2446

Keywords:

CLEVR Dataset, Counting Questions, Computer Vision, MAC Model, Visual Question Answering

Abstract

Visual Question Answering (VQA) is a relatively new area of computer science that combines computer vision, natural language processing, and deep learning. A VQA system answers questions (currently in English) about the images it is shown. Since the original VQA dataset was made publicly available in 2014, datasets such as OK-VQA, Visual7W, and CLEVR have explored new concepts, and various algorithms have exceeded previous benchmarks alongside new methods for evaluating these models. However, to the best of our knowledge, math or word problems have not been integrated into any of the VQA datasets. In this paper, we incorporate the four basic mathematical operations into the ‘counting’ questions of the CLEVR dataset and compare how different models fare against this modified dataset of 100,000 images and 2.4 million questions. The models we used achieved roughly 50% validation accuracy within 4 epochs, showing room for improvement. If VQA models can assimilate mathematics into their question-understanding ability, this could open new pathways for the future.
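As a concrete illustration of the dataset modification the abstract describes, the Python sketch below shows one way a counting question's numeric answer could be wrapped in one of the four basic arithmetic operations. The question templates and function names here are hypothetical assumptions for demonstration, not the authors' actual generation code.

```python
# Illustrative sketch only -- not the authors' implementation. It extends a
# CLEVR-style counting question with a basic arithmetic operation, so the
# model must both count and apply the operation to answer correctly.
import operator

OPS = {
    "plus": operator.add,
    "minus": operator.sub,
    "multiplied by": operator.mul,
    "divided by": operator.floordiv,  # assumes the operand divides the count evenly
}

def make_arithmetic_question(base_question: str, base_count: int,
                             op_name: str, operand: int) -> tuple[str, int]:
    """Wrap a counting question's numeric answer in an arithmetic operation."""
    question = (f"{base_question.rstrip('?')}? "
                f"Take that number {op_name} {operand}; what do you get?")
    answer = OPS[op_name](base_count, operand)
    return question, answer

q, a = make_arithmetic_question("How many red cubes are there", 3, "plus", 2)
```

Applied over every counting question with each of the four operations, a transformation like this multiplies the question count, which is consistent with the 2.4 million questions reported above.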


References or Bibliography

Bhattacharya, N.; Li, Q.; Gurari, D. Why Does a Visual Question Have Different Answers? In 2019 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE, 2019.

Zhang, P.; Goyal, Y.; Summers-Stay, D.; Batra, D.; Parikh, D. Yin and Yang: Balancing and Answering Binary Visual Questions. arXiv [cs.CL], 2015.

Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; Parikh, D. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE, 2017.

Johnson, J.; Hariharan, B.; van der Maaten, L.; Fei-Fei, L.; Zitnick, C. L.; Girshick, R. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE, 2017.

Agrawal, A.; Batra, D.; Parikh, D. Analyzing the Behavior of Visual Question Answering Models. arXiv [cs.CL], 2016.

Agrawal, A.; Lu, J.; Antol, S.; Mitchell, M.; Zitnick, C. L.; Batra, D.; Parikh, D. VQA: Visual Question Answering. arXiv [cs.CL], 2015.

Marino, K.; Rastegari, M.; Farhadi, A.; Mottaghi, R. OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge. arXiv [cs.CV], 2019.

Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D. A.; Bernstein, M. S.; Fei-Fei, L. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations https://visualgenome.org/static/paper/Visual_Genome.pdf (accessed Oct 5, 2021).

Zhu, Y.; Groth, O.; Bernstein, M.; Fei-Fei, L. Visual7W: Grounded Question Answering in Images. arXiv [cs.CV], 2015.

Yu, L.; Park, E.; Berg, A. C.; Berg, T. L. Visual Madlibs: Fill in the Blank Image Generation and Question Answering. arXiv [cs.CV], 2015.

Du, T.; Cao, J.; Wu, Q.; Li, W.; Shen, B.; Chen, Y. CocoQa: Question Answering for Coding Conventions over Knowledge Graphs. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE); IEEE, 2019; pp 1086–1089.

Gao, H.; Mao, J.; Zhou, J.; Huang, Z.; Wang, L.; Xu, W. Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering. arXiv [cs.CV], 2015.

Tapaswi, M.; Zhu, Y.; Stiefelhagen, R.; Torralba, A.; Urtasun, R.; Fidler, S. MovieQA: Understanding Stories in Movies through Question-Answering. arXiv [cs.CV], 2015.

Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and Tell: A Neural Image Caption Generator. arXiv [cs.CV], 2014.

Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C. L.; Dollár, P. Microsoft COCO: Common Objects in Context http://arxiv.org/abs/1405.0312v3 (accessed Oct 5, 2021).

Ren, M.; Kiros, R.; Zemel, R. Exploring Models and Data for Image Question Answering. arXiv [cs.LG], 2015.

Korchi, A. E.; Ghanou, Y. 2D Geometric Shapes Dataset - for Machine Learning and Pattern Recognition. Data Brief 2020, 32 (106090), 106090.

Kamath, A.; Singh, M.; LeCun, Y.; Misra, I.; Synnaeve, G.; Carion, N. MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding. arXiv [cs.CV], 2021.

Wang, Z.; Wang, K.; Yu, M.; Xiong, J.; Hwu, W.-M.; Hasegawa-Johnson, M.; Shi, H. Interpretable Visual Reasoning via Induced Symbolic Space. arXiv [cs.CV], 2020.

Hudson, D. A.; Manning, C. D. Compositional Attention Networks for Machine Reasoning. arXiv [cs.AI], 2018.

Yi, K.; Wu, J.; Gan, C.; Torralba, A.; Kohli, P.; Tenenbaum, J. B. Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding. arXiv [cs.AI], 2018.

Patel, A.; Bhattamishra, S.; Goyal, N. Are NLP Models Really Able to Solve Simple Math Word Problems? http://arxiv.org/abs/2103.07191v2 (accessed Oct 6, 2021).

MAC-Network: Implementation of the Paper “Compositional Attention Networks for Machine Reasoning” (Hudson and Manning, ICLR 2018).

Published

11-30-2021

How to Cite

Gangaraju, K., & Jedoui, K. (2021). Evaluating State-of-the-Art Visual Question Answering Models' Ability to Answer Complex Counting Questions. Journal of Student Research, 10(4). https://doi.org/10.47611/jsrhs.v10i4.2446

Issue

Section

HS Research Articles