Twitter Sentiment Analysis Using Machine Learning
DOI:
https://doi.org/10.47611/jsrhs.v13i2.6819
Keywords:
artificial intelligence, Twitter, sentiment analysis, natural language processing, machine learning, sklearn
Abstract
In an age of social media, online forums, and chats, cyberbullying is a prevalent issue. On Twitter (now X), approximately 500 million tweets are shared per day (Antonakaki et al., 2021). Moderators are responsible for ensuring that these tweets follow community guidelines, but the sheer volume makes it impractical to review every tweet manually. Sentiment analysis and machine learning algorithms can be used to classify these texts automatically as positive or negative. Such models are far more efficient than manual review and may identify hate speech on Twitter more accurately. In this paper, we explore the use of five classical machine learning algorithms to classify Twitter hate speech as neutral, racist, or sexist. We compare model performance on raw tweet data against performance on tweets pre-processed through data cleanup. Furthermore, we highlight two methods for dealing with imbalanced datasets to improve prediction rates. Overall, we achieved 96% accuracy in classifying tweets into the different labels.
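The cleanup and rebalancing steps mentioned in the abstract can be illustrated with a minimal sketch. It rests on assumptions: the abstract does not name the two imbalance-handling methods, so random oversampling stands in here as one common option, and the cleanup rules, the tiny stop-word list, the function names `clean_tweet` and `random_oversample`, and the sample tweets are all hypothetical rather than taken from the paper.

```python
import random
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on"}


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, @mentions, non-letters, and stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop links
    text = re.sub(r"@\w+", " ", text)          # drop user mentions
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters only (also drops '#')
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Made-up placeholder data; not drawn from the paper's dataset.
tweets = [
    "Check THIS out http://t.co/xyz @user #news",
    "The weather is nice",
    "placeholder minority example",
]
labels = ["neutral", "neutral", "sexist"]

cleaned = [clean_tweet(t) for t in tweets]
balanced_x, balanced_y = random_oversample(cleaned, labels)
print(Counter(balanced_y))
```

In practice, oversampling would be applied to the training split only, so that duplicated tweets cannot leak into the evaluation set and inflate the reported accuracy.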
References or Bibliography
Antonakaki, D., Fragopoulou, P., & Ioannidis, S. (2021). A survey of Twitter research: Data model, graph structure, sentiment analysis and attacks. Expert Systems with Applications, 164, 114006. https://doi.org/10.1016/j.eswa.2020.114006
Giachanou, A., & Crestani, F. (2016). Like It or Not. ACM Computing Surveys, 49(2), 1–41. https://doi.org/10.1145/2938640
1.1. Linear Models — scikit-learn 0.22.2 documentation. (n.d.). Scikit-Learn.org. https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
1.6. Nearest Neighbors — scikit-learn 0.21.3 documentation. (2019). Scikit-Learn.org. https://scikit-learn.org/stable/modules/neighbors.html
Scikit-learn. (2019). 1.9. Naive Bayes — scikit-learn 0.21.3 documentation. Scikit-Learn.org. https://scikit-learn.org/stable/modules/naive_bayes.html
1.11. Ensemble methods. (n.d.). Scikit-Learn.org. https://scikit-learn.org/stable/modules/ensemble.html#random-forests
Google Developers. (2019, March 5). Classification: Precision and Recall - Google Developers. https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall
Nothman, J., Qin, H., & Yurchak, R. (2018). Stop Word Lists in Free Open-source Software Packages. Proceedings of Workshop for NLP Open Source Software (NLP-OSS). https://doi.org/10.18653/v1/w18-2502
Willett, P. (2006). The Porter stemming algorithm: then and now. Program, 40(3), 219–223. https://doi.org/10.1108/00330330610681295
Khyani, D., & B S, S. (2021). An Interpretation of Lemmatization and Stemming in Natural Language Processing. Shanghai Ligong Daxue Xuebao/Journal of University of Shanghai for Science and Technology, 22, 350–357.
sklearn.feature_extraction.text.TfidfTransformer — scikit-learn 0.23.1 documentation. (n.d.). Scikit-Learn.org. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
Vargas, V., Aranda, J., Costa, R., Pereira, P., & Luis, J. (2022). Imbalanced data preprocessing techniques for machine learning: a systematic mapping study. Knowledge and Information Systems, 65(1), 31–57. https://doi.org/10.1007/s10115-022-01772-8
Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Müller, A., Grisel, O., Niculae, V., Prettenhofer, P., Gramfort, A., Grobler, J., Layton, R., Vanderplas, J., Joly, A., Holt, B., & Varoquaux, G. (2013). API design for machine learning software: experiences from the scikit-learn project.
Copyright (c) 2024 Srimayi Gupta; Padmavathy Jawahar
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Copyright holder(s) granted JSR a perpetual, non-exclusive license to distribute & display this article.