Empirical approach to understanding natural language models
DOI: https://doi.org/10.47611/jsrhs.v13i1.6103
Keywords: NLP Models, Latent Dirichlet Allocation, Categorization, Performance Analysis, Model Evaluation
Abstract
In this paper, we try to understand natural language models by treating them as black boxes. We want to learn about these models without going into technical details such as network architecture, tuning parameters, training datasets, and training schedules. Instead, we take an empirical approach in which we classify the datasets into various categories. For scalability and to avoid subjective bias, we use Latent Dirichlet Allocation (LDA) to categorize language text. We then fine-tune and evaluate natural language models on our tasks, comparing the performance of the same model across multiple categories and of the same category across multiple models. This helps not only in choosing models for the desired categories but also in understanding the model attributes that explain performance variation. We report the observations from this empirical study along with our hypotheses. We find that models do not perform uniformly across all categories, which could be due to uneven representation of these categories in their training datasets. Models specialized or fine-tuned for specific tasks show higher variance in performance across categories than generic models. Some categories show consistently high performance across all models, while others show high variance. The code for this research paper is available at: https://github.com/bhuvishi/llm_understanding
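As a concrete illustration of the categorize-then-evaluate pipeline the abstract describes, the following is a minimal sketch assuming gensim's online LDA (the algorithm of Hoffman, 2010, cited below) and a caller-supplied scoring function. All names here (NUM_TOPICS, categorize, per_category_scores, score_fn) are illustrative and are not taken from the paper's repository.

# Minimal sketch: bucket texts by dominant LDA topic, then score a model
# per bucket. gensim and all helper names are assumptions, not the
# authors' actual code.
from collections import defaultdict

from gensim import corpora, models
from gensim.utils import simple_preprocess

NUM_TOPICS = 10  # hypothetical number of categories


def categorize(texts, num_topics=NUM_TOPICS):
    """Assign each text to its dominant LDA topic, used as its category."""
    tokens = [simple_preprocess(t) for t in texts]
    dictionary = corpora.Dictionary(tokens)
    bow = [dictionary.doc2bow(doc) for doc in tokens]
    # gensim's LdaModel implements the online variational Bayes
    # algorithm of Hoffman (2010).
    lda = models.LdaModel(bow, num_topics=num_topics, id2word=dictionary)
    return [max(lda.get_document_topics(doc, minimum_probability=0.0),
                key=lambda tp: tp[1])[0]
            for doc in bow]


def per_category_scores(texts, labels, categories, score_fn):
    """Evaluate one fine-tuned model separately on each category."""
    buckets = defaultdict(list)
    for text, label, cat in zip(texts, labels, categories):
        buckets[cat].append((text, label))
    # score_fn is any task metric (e.g., accuracy or F1) computed over
    # (text, label) pairs for a given model.
    return {cat: score_fn(items) for cat, items in buckets.items()}

Running per_category_scores once per model fills a model-by-category score matrix, from which both comparisons in the abstract (the same model across categories, and the same category across models) can be read off directly.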
References or Bibliography
Alsentzer, E. (2019, April 6). Publicly available clinical BERT embeddings. arXiv.org. https://arxiv.org/abs/1904.03323
Balestriero, R. (2021, October 18). Learning in high dimension always amounts to extrapolation. arXiv.org. https://arxiv.org/abs/2110.09485
Caselli, T., Basile, V., Mitrović, J., & Granitzer, M. (2021). HateBERT: Retraining BERT for Abusive Language Detection in English. Association for Computational Linguistics, 2021, 17–25. https://doi.org/10.18653/v1/2021.woah-1.3
Chalkidis, I., Fergadiotis, M., Malakasiotis, P., Aletras, N., & Androutsopoulos, I. (2020). LEGAL-BERT: The Muppets straight out of Law School. Association for Computational Linguistics, 2020, 2898–2904. https://doi.org/10.18653/v1/2020.findings-emnlp.261
Devlin, J. (2018, October 11). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv.org. https://arxiv.org/abs/1810.04805
Grezes, F. (2021, December 1). Building astroBERT, a language model for Astronomy & Astrophysics. arXiv.org. https://arxiv.org/abs/2112.00590
Hazourli, A. R. (2022). FinancialBERT - A Pretrained Language Model for Financial Text Mining. ResearchGate. https://doi.org/10.13140/RG.2.2.34032.12803
Hoffman, M. (2010). Online learning for latent Dirichlet allocation. Advances in Neural Information Processing Systems, 23. https://papers.nips.cc/paper_files/paper/2010/hash/71f6278d140af599e06ad9bf1ba03cb0-Abstract.html
Liu, Y. (2019, July 26). RoBERTa: A robustly optimized BERT pretraining approach. arXiv.org. https://arxiv.org/abs/1907.11692
Loukas, L., Fergadiotis, M., Chalkidis, I., Spyropoulou, E., Malakasiotis, P., Androutsopoulos, I., & Paliouras, G. (2022). FiNER: Financial Numeric Entity Recognition for XBRL Tagging. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). https://doi.org/10.18653/v1/2022.acl-long.303
Lu, Y. (2022). Imitation Is Not Enough: Robustifying Imitation with Reinforcement Learning for Challenging Driving Scenarios. arXiv.org. https://arxiv.org/abs/2212.11419
Rajpurkar, P. (2018, June 11). Know what you don’t know: Unanswerable questions for SQuAD. arXiv.org. https://arxiv.org/abs/1806.03822
Rajpurkar, P. (2016, June 16). SQuAD: 100,000+ questions for machine comprehension of text. arXiv.org. https://arxiv.org/abs/1606.05250
Sanh, V. (2019, October 2). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv.org. https://arxiv.org/abs/1910.01108
Copyright (c) 2024 Bhuvishi Bansal
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Copyright holder(s) granted JSR a perpetual, non-exclusive license to distribute & display this article.