Dataset for identification of queerphobia

Authors

  • Shivum Banerjee, Hinsdale Central High School
  • Hieu Nguyen, University of Colorado Denver

DOI:

https://doi.org/10.47611/jsrhs.v12i1.4405

Keywords:

queerphobia, hate speech, dataset, natural language processing, NLP, queer, homophobia, transphobia, sentiment analysis, machine learning, supervised learning algorithms

Abstract

While social media platforms have implemented many algorithmic approaches to moderating hate speech, the lack of datasets on queerphobia has impeded efforts to automatically recognize and moderate queerphobic hate speech online. Queerphobic hate speech is speech intended to degrade, insult, or incite violence or prejudicial action against queer people, that is, people who belong to a sexual, gender, or romantic minority. Such speech worsens mental and emotional outcomes for queer people and can contribute to anti-queer violence. The goal of this study is to create a dataset of queerphobic YouTube comments to further efforts to identify and moderate queerphobic hate speech. To construct the dataset, 10,000 comments were sourced from YouTube videos that represent queerness, and volunteers then manually annotated each comment according to specific guidelines. Various natural language processing (NLP) models were used to extract features from the text, and several classifiers used these features to categorize comments as queerphobic or non-queerphobic, establishing a performance baseline on the data. By releasing this dataset, we hope to further research on recognizing digital queerphobia and to make social media platforms safer for queer people. The dataset can be found at https://github.com/ShivumB/dataset-for-identification-of-queerphobia.
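
For readers unfamiliar with this kind of pipeline, the sketch below shows one way such a baseline could be built with scikit-learn: TF-IDF features are extracted from the comment text and passed to a simple linear classifier, whose precision, recall, and F1 are then reported. It is an illustration only, not the study's implementation; the file name queerphobia_comments.csv and the column names comment and label are assumptions about the dataset's layout, and logistic regression stands in for whichever classifiers were actually evaluated.

    # Minimal sketch of a queerphobia-classification baseline (assumed schema).
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    # Hypothetical CSV: one annotated YouTube comment per row.
    df = pd.read_csv("queerphobia_comments.csv")  # assumed columns: "comment", "label"

    X_train, X_test, y_train, y_test = train_test_split(
        df["comment"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
    )

    # Turn raw comment text into TF-IDF feature vectors.
    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english", max_features=20_000)
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    # Fit a simple linear classifier as a performance baseline.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train_vec, y_train)

    # Precision, recall, and F1 for the queerphobic / non-queerphobic classes.
    print(classification_report(y_test, clf.predict(X_test_vec)))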


References or Bibliography

Banerjee, S. (2023). Dataset for identification of queerphobia. GitHub. https://github.com/ShivumB/Dataset-for-Identification-of-Queerphobia

Bird, S. (2006). NLTK: The natural language toolkit. Proceedings of the ACL Interactive Poster and Demonstration Sessions, 1(1), 213-217. https://aclanthology.org/P04-3031/

Chakravarthi, B., Priyadharshini, R., Ponnusamy, R., Kumaresan, P., Sampath, K., Thenmozhi, D., Thangasamy, S., Nallathambi, R., & McCrae, J. (2021). Dataset for Identification of Homophobia and Transphobia in Multilingual YouTube Comments. ArXiv. https://doi.org/10.48550/arXiv.2109.00227

Chen, T., & Guestrin, C. (2016, August). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 22(1), 785-794. https://dl.acm.org/doi/10.1145/2939672.2939785

Davidson, T., Warmsley, D., Macy, M., & Weber, I. (2017). Automated Hate Speech Detection and the Problem of Offensive Language. Proceedings of the 11th International AAAI Conference on Web and Social Media, 11(1), 512-515. https://doi.org/10.1609/icwsm.v11i1.14955

Delgado, R., & Stefancic, J. (1995). Ten Arguments against Hate-Speech Regulation: How Valid? Northern Kentucky Law Review. https://scholarship.law.ua.edu/facarticles/564

Engonopoulos, N., Villalba, M., Titov, I., & Koller, A. (2013). Predicting the resolution of referring expressions from user behavior. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1(1), 1354-1359. Association for Computational Linguistics. https://aclanthology.org/D13-1134

Everytown. (2022). Hate, violence, and stigma against the LGBTQ+ community. Everytown Research & Policy. https://everytownresearch.org/report/remembering-and-honoring-pulse/

de Gibert, O., Perez, N., García-Pablos, A., & Cuadros, M. (2018). Hate speech dataset from a white supremacy forum. Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), 2(1), 11-20. Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-5102

Gurav, V., Parkar, M., & Kharwar, P. (2019). Accessible and Ethical Data Annotation with the Application of Gamification. International conference on recent developments in science, engineering, and technology, 1230(1), 68-78. REDSET. https://doi.org/10.1007/978-981-15-5830-6_6

Halperin, D. M. (2019). Queer Love. Critical Inquiry, 45(2), 396-419. https://doi.org/10.1086/700993

Hovy, D., & Prabhumoye, S. (2021). Five sources of bias in natural language processing. Language and Linguistics Compass, e12432. https://doi.org/10.1111/lnc3.12432

Hubbard, L. (2020). Online Hate Crime Report: Challenging online homophobia, biphobia and transphobia. London: Galop, the LGBT+ anti-violence charity. https://galop.org.uk/resource/online-hate-crime-report-2020/

Kannan, S., & Gurusamy, V. (2014). Preprocessing Techniques for Text Mining. https://www.researchgate.net/publication/273127322_Preprocessing_Techniques_for_Text_Mining

Kenyon-Dean, K., Newell, E., & Cheung, J. C. (2020). Deconstructing word embedding algorithms. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1(1), 8479–8484. Association for Computational Linguistics. https://doi.org/10.48550/arXiv.2011.07013

Noble, S. U. (2018). Algorithms of oppression: How search engines reinforce racism. New York University Press

Noble, W. S. (2006). What is a support vector machine? Nature biotechnology, 24(12), 1565-1567. https://doi.org/10.1038/nbt1206-1565

Olteanu, A., Castillo, C., Boy, J., & Varshney, K. R. (2018). The effect of extremist violence on hateful speech online. Proceedings of the 12th International AAAI Conference on Web and Social Media, 12(1), 221-230. https://ojs.aaai.org/index.php/ICWSM/issue/view/271

Patel, A., & Meehan, K. (2021). Fake News Detection on Reddit Utilising CountVectorizer and Term Frequency-Inverse Document Frequency with Logistic Regression, MultinominalNB and Support Vector Machine. 2021 32nd Irish Signals and Systems Conference (ISSC), 32(1), 1-6. https://doi.org/10.1109/ISSC52156.2021.9467842

Petrillo, M., & Baycroft, J. (2010). Introduction to manual annotation. Fairview Research. https://gate.ac.uk/teamware/man-ann-intro.pdf

QMUNITY. (2019). Queer terminology from A to Q. https://qmunity.ca/wp-content/uploads/2019/06/Queer-Glossary_2019_02.pdf

Swamy, S. D., Jamatia, A., & Gambäck, B. (2019). Studying generalizability across abusive language detection datasets. Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), 23(1), 940-950. Association for Computational Linguistics. https://doi.org/10.18653/v1/K19-1088

Tsesis, A. (2002). Destructive messages: How hate speech paves the way for harmful social movements. New York University Press

Weitzel, L., Prati, R. C., & Aguiar, R. F. (2016). The comprehension of figurative language: What is the influence of irony and sarcasm on NLP techniques? Springer International Publishing Switzerland. https://doi.org/10.1007/978-3-319-30319-2_3

Svetnik, V., Liaw, A., Tong, C., Culberson, J. C., Sheridan, R. P., & Feuston, B. P. (2003). Random forest: a classification and regression tool for compound classification and QSAR modeling. Journal of chemical information and computer sciences, 43(6), 1947-1958. https://doi.org/10.1021/ci034160g

Quinlan, J. R. (1986). Induction of decision trees. Machine learning, 1(1), 81-106. https://doi.org/10.1007/BF00116251

Hossin, M., & Sulaiman, M. N. (2015). A review on evaluation metrics for data classification evaluations. International journal of data mining & knowledge management process, 5(2), 1-11. https://doi.org/10.5121/ijdkp.2015.5201

Wissler, L., Almashraee, M., & Monett, D. (2014). The gold standard in corpus annotation. 5th IEEE Germany Student Conference, 1(1), 1-4. https://doi.org/10.13140/2.1.4316.3523

Published

02-28-2023

How to Cite

Banerjee, S., & Nguyen, H. (2023). Dataset for identification of queerphobia. Journal of Student Research, 12(1). https://doi.org/10.47611/jsrhs.v12i1.4405

Issue

Vol. 12 No. 1 (2023)

Section

HS Research Projects