Movie Review Sentiment Analysis: Supervised Learning versus Large Language Model
DOI: https://doi.org/10.47611/jsrhs.v13i1.6161
Keywords: sentiment analysis, movie reviews, supervised learning, large language models
Abstract
Sentiment analysis is frequently used to derive insights from natural language, for example by analyzing textual data to measure brand perception, social media trends, or customer opinion about products. This paper evaluates the performance of three supervised machine learning methods and compares them with a next-generation large language model (LLM), a class of models that recently gained popularity with the release of OpenAI ChatGPT. Specifically, we apply Decision Tree, Random Forest, and Support Vector Machine classifiers to a representative sample of 100K movie reviews collected by a well-known website, IMDb.com. Reviews are tagged with numeric ratings, which allows us to formulate a supervised learning problem and to study two tasks: differentiating between strongly opinionated positive and negative reviews, and the more challenging problem of differentiating between weakly opinionated positive and negative reviews. The models are tuned to optimize recall and precision, achieving an accuracy of 0.89 for strong reviews and 0.63 for weak reviews. We then compare these results with ChatGPT, used without specialized training, which reaches a perfect accuracy of 1.00 for strongly opinionated reviews and 0.75 for weakly opinionated reviews. We conclude that ChatGPT outperforms the supervised learning approaches but is also imperfect in distinguishing the more subtle sentiment in weakly opinionated reviews.
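The problem setup described in the abstract (turning numeric ratings into sentiment labels, then scoring a classifier by accuracy, precision, and recall) can be illustrated with a minimal sketch. The rating buckets below are assumptions made for illustration only: the abstract does not state which ratings count as strongly or weakly opinionated. The function names `label_review` and `binary_metrics` and the toy predictions are likewise hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional, Tuple

# ASSUMPTION: a 1-10 rating scale with these buckets. The abstract does not
# give the actual cutoffs, so these sets are purely illustrative.
STRONG_NEGATIVE = {1, 2}
WEAK_NEGATIVE = {4, 5}
WEAK_POSITIVE = {6, 7}
STRONG_POSITIVE = {9, 10}


def label_review(rating: int) -> Optional[Tuple[str, int]]:
    """Map a rating to (strength, label); label 1 is positive, 0 is negative.

    Returns None for ratings that fall outside the assumed buckets.
    """
    if rating in STRONG_POSITIVE:
        return ("strong", 1)
    if rating in STRONG_NEGATIVE:
        return ("strong", 0)
    if rating in WEAK_POSITIVE:
        return ("weak", 1)
    if rating in WEAK_NEGATIVE:
        return ("weak", 0)
    return None


def binary_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Compute accuracy, precision, and recall for binary labels (1 = positive)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Define precision/recall as 0.0 when the denominator is empty.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


if __name__ == "__main__":
    # Invented toy data: ratings and the predictions of some hypothetical model.
    ratings = [10, 9, 1, 2, 7, 6, 4, 5]
    predicted = [1, 1, 0, 0, 1, 0, 0, 1]

    for strength in ("strong", "weak"):
        y_true, y_pred = [], []
        for rating, pred in zip(ratings, predicted):
            labeled = label_review(rating)
            if labeled is not None and labeled[0] == strength:
                y_true.append(labeled[1])
                y_pred.append(pred)
        print(strength, binary_metrics(y_true, y_pred))
```

Scoring the strong and weak subsets separately mirrors the way the abstract reports one accuracy figure per subset. Any real replication would need the paper's actual rating cutoffs and model outputs in place of the toy values used here.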
Copyright (c) 2024 Natalia Kochut
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Copyright holder(s) granted JSR a perpetual, non-exclusive license to distribute & display this article.