Dataset for identification of queerphobia
DOI: https://doi.org/10.47611/jsrhs.v12i1.4405

Keywords: queerphobia, hate speech, dataset, natural language processing, NLP, queer, homophobia, transphobia, sentiment analysis, machine learning, supervised learning algorithms

Abstract
While social media platforms have implemented many algorithmic approaches to moderating hate speech, the lack of datasets on queerphobia has impeded efforts to automatically recognize and moderate queerphobic hate speech online. Queerphobic hate speech is speech intended to degrade, insult, or incite violence or prejudicial action against queer people, that is, people belonging to a sexuality, gender, or romantic minority. Such speech worsens mental and emotional outcomes for queer people and can contribute to anti-queer violence. The goal of this study is to create a dataset of queerphobic YouTube comments to further efforts to identify and moderate queerphobic hate speech. To construct this dataset, 10,000 comments were sourced from YouTube videos representing queerness. Volunteers then manually annotated each comment according to specific guidelines. Various natural language processing (NLP) models were used to extract features from the text, and several classifiers used these features to categorize comments as queerphobic or non-queerphobic; these models establish a performance baseline on the data. In making this dataset, we hope to further research into the recognition of digital queerphobia and make social media platforms safer for queer people. The dataset can be found at https://github.com/ShivumB/dataset-for-identification-of-queerphobia.
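The abstract does not specify which feature extractors and classifiers were used, so as a minimal sketch (assuming TF-IDF features and a linear SVM, one common pairing from the cited literature), a baseline of the kind described could look like the following; the comments and labels here are hypothetical stand-ins for the annotated dataset:

```python
# Minimal baseline sketch: TF-IDF features + linear SVM classifier.
# Assumption: the paper's exact models and hyperparameters are not given
# in the abstract; this is one plausible configuration, not the authors' own.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Toy stand-in comments; the real dataset has 10,000 annotated YouTube comments.
comments = [
    "this is disgusting and wrong",     # annotated queerphobic (label 1)
    "love seeing this representation",  # annotated non-queerphobic (label 0)
    "they should not be allowed here",  # annotated queerphobic (label 1)
    "great video, very wholesome",      # annotated non-queerphobic (label 0)
]
labels = [1, 0, 1, 0]

# TF-IDF turns each comment into a sparse weighted bag-of-words vector;
# the linear SVM then learns a separating hyperplane over those features.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(comments, labels)

print(model.predict(["very wholesome video"])[0])
```

With real annotated data, the same pipeline would be evaluated with a train/test split and the classification metrics (precision, recall, F1) discussed in the cited evaluation literature.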
References or Bibliography
Banerjee, S. (2023). Dataset for identification of queerphobia. Github. https://github.com/ShivumB/Dataset-for-Identification-of-Queerphobia
Bird, S. (2006). NLTK: The natural language toolkit. Proceedings of the ACL Interactive Poster and Demonstration Sessions, 1(1), 213-217. https://aclanthology.org/P04-3031/
Chakravarthi, B., Priyadharshini, R., Ponnusamy, R., Kumaresan, P., Sampath, K., Thenmozhi, D., Thangasamy, S., Nallathambi, R., & McCrae, J. (2021). Dataset for Identification of Homophobia and Transphobia in Multilingual YouTube Comments. ArXiv. https://doi.org/10.48550/arXiv.2109.00227
Chen, T., & Guestrin, C. (2016, August). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 22(1), 785-794. https://dl.acm.org/doi/10.1145/2939672.2939785
Davidson, T., Warmsley, D., Macy, M., & Weber, I. (2017). Automated Hate Speech Detection and the Problem of Offensive Language. Proceedings of the 11th International AAAI Conference on Web and Social Media, 11(1), 512-515. https://doi.org/10.1609/icwsm.v11i1.14955
Delgado, R., & Stefancic, J. (1995). Ten Arguments against Hate-Speech Regulation: How Valid? Northern Kentucky Law Review. https://scholarship.law.ua.edu/facarticles/564
Engonopoulos, N., Villalba, M., Titov, I., & Koller, A. (2013). Predicting the resolution of referring expressions from user behavior. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1(1), 1354-1359. Association for Computational Linguistics. https://aclanthology.org/D13-1134
Everytown. (2022). Hate, violence, and stigma against the LGBTQ+ community. Everytown Research & Policy. https://everytownresearch.org/report/remembering-and-honoring-pulse/
de Gibert, O., Perez, N., García-Pablos, A., & Cuadros, M. (2018). Hate speech dataset from a white supremacy forum. Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), 2(1), 11-20. Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-5102
Gurav, V., Parkar, M., & Kharwar, P. (2019). Accessible and Ethical Data Annotation with the Application of Gamification. International conference on recent developments in science, engineering, and technology, 1230(1), 68-78. REDSET. https://doi.org/10.1007/978-981-15-5830-6_6
Halperin, David M. (2019). Queer Love. Critical Inquiry, 45(2), 396-419. https://doi.org/10.1086/700993
Hossin, M., & Sulaiman, M. N. (2015). A review on evaluation metrics for data classification evaluations. International Journal of Data Mining & Knowledge Management Process, 5(2), 1-11. https://doi.org/10.5121/ijdkp.2015.5201
Hovy, D., & Prabhumoye, S. (2021). Five sources of bias in natural language processing. Language and Linguistics Compass, e12432. https://doi.org/10.1111/lnc3.12432
Hubbard, L. (2020). Online hate crime report: Challenging online homophobia, biphobia and transphobia. London: Galop, the LGBT+ anti-violence charity. https://galop.org.uk/resource/online-hate-crime-report-2020/
Kannan, S., & Gurusamy, V. (2014). Preprocessing techniques for text mining. https://www.researchgate.net/publication/273127322_Preprocessing_Techniques_for_Text_Mining
Kenyon-Dean, K., Newell, E., & Cheung, J. C. (2020). Deconstructing word embedding algorithms. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1(1), 8479-8484. Association for Computational Linguistics. https://doi.org/10.48550/arXiv.2011.07013
Noble, S. U. (2018). Algorithms of oppression: How search engines reinforce racism. New York University Press.
Noble, W. S. (2006). What is a support vector machine? Nature Biotechnology, 24(12), 1565-1567. https://doi.org/10.1038/nbt1206-1565
Olteanu, A., Castillo, C., Boy, J., & Varshney, K. R. (2018). The effect of extremist violence on hateful speech online. Proceedings of the 11th International AAAI Conference on Web and Social Media, 11(1), 221-230. https://ojs.aaai.org/index.php/ICWSM/issue/view/271
Patel, A., & Meehan, K. (2021). Fake news detection on Reddit utilising CountVectorizer and term frequency-inverse document frequency with logistic regression, MultinomialNB and support vector machine. 2021 32nd Irish Signals and Systems Conference (ISSC), 32(1), 1-6. https://doi.org/10.1109/ISSC52156.2021.9467842
Petrillo, M., & Baycroft, J. (2010). Introduction to manual annotation. Fairview Research. https://gate.ac.uk/teamware/man-ann-intro.pdf
QMUNITY. (2019). Queer terminology from A to Q. https://qmunity.ca/wp-content/uploads/2019/06/Queer-Glossary_2019_02.pdf
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106. https://doi.org/10.1007/BF00116251
Svetnik, V., Liaw, A., Tong, C., Culberson, J. C., Sheridan, R. P., & Feuston, B. P. (2003). Random forest: A classification and regression tool for compound classification and QSAR modeling. Journal of Chemical Information and Computer Sciences, 43(6), 1947-1958. https://doi.org/10.1021/ci034160g
Swamy, S. D., Jamatia, A., & Gambäck, B. (2019). Studying generalizability across abusive language detection datasets. Proceedings of the Conference on Computational Natural Language Learning (CoNLL), 23(1), 940-950. Association for Computational Linguistics. https://doi.org/10.18653/v1/K19-1088
Tsesis, A. (2002). Destructive messages: How hate speech paves the way for harmful social movements. New York University Press.
Weitzel, L., Prati, R. C., & Aguiar, R. F. (2016). The comprehension of figurative language: What is the influence of irony and sarcasm on NLP techniques? Springer International Publishing Switzerland. https://doi.org/10.1007/978-3-319-30319-2_3
Wissler, L., Almashraee, M., & Monett, D. (2014). The gold standard in corpus annotation. 5th IEEE Germany Student Conference, 1(1), 1-4. https://doi.org/10.13140/2.1.4316.3523
Copyright (c) 2023 Shivum Banerjee; Hieu Nguyen
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Copyright holder(s) granted JSR a perpetual, non-exclusive license to distribute & display this article.