Dataset for identification of queerphobia


  • Shivum Banerjee, Hinsdale Central High School
  • Hieu Nguyen, University of Colorado Denver



Keywords: queerphobia, hate speech, dataset, natural language processing, NLP, queer, homophobia, transphobia, sentiment analysis, machine learning, supervised learning algorithms


Although social media platforms have implemented many algorithmic approaches to moderating hate speech, a lack of datasets on queerphobia has impeded efforts to automatically recognize and moderate queerphobic hate speech online. Queerphobic hate speech is speech intended to degrade, insult, or incite violence or prejudicial action against queer people, that is, people who belong to a sexual, gender, or romantic minority. Such speech worsens mental and emotional outcomes for queer people and can contribute to anti-queer violence. The goal of this study is to create a dataset of queerphobic YouTube comments to further efforts to identify and moderate queerphobic hate speech. To construct this dataset, 10,000 comments were sourced from YouTube videos that represent queerness. Volunteers then manually annotated each comment in accordance with specific guidelines. Various natural language processing (NLP) models were used to extract features from the text, and several classifiers used these features to categorize comments as queerphobic or non-queerphobic. These models establish a performance baseline on this data. In making this dataset, we hope to further research in the recognition of digital queerphobia and make social media platforms safer for queer people. The dataset can be found at
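
The two-stage pipeline described in the abstract (extract features from comment text, then classify each comment as queerphobic or non-queerphobic) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the specific feature extractors or classifiers, so TF-IDF weighting and a nearest-centroid classifier stand in for them, the four training comments are invented, and the label encoding (1 = queerphobic, 0 = non-queerphobic) is hypothetical. It is not the authors' implementation.

```python
import math
from collections import Counter

# Invented toy comments (NOT taken from the dataset).
# Assumed label encoding: 1 = queerphobic, 0 = non-queerphobic.
TRAIN = [
    ("this video made me so happy", 0),
    ("sending love and support to everyone", 0),
    ("people like this are disgusting", 1),
    ("these people are disgusting and wrong", 1),
]

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def fit_idf(docs):
    """Compute smoothed inverse document frequency for each training token."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}

def tfidf(text, idf):
    """Map a comment to an L2-normalized sparse TF-IDF vector (dict)."""
    counts = Counter(tok for tok in tokenize(text) if tok in idf)
    vec = {tok: c * idf[tok] for tok, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm else {}

def centroid(vectors):
    """Average a list of sparse vectors."""
    total = {}
    for vec in vectors:
        for tok, w in vec.items():
            total[tok] = total.get(tok, 0.0) + w
    return {tok: w / len(vectors) for tok, w in total.items()}

def dot(a, b):
    """Dot product of two sparse vectors."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())

def train(examples):
    """Stage 1: fit TF-IDF features. Stage 2: build one centroid per label."""
    idf = fit_idf([text for text, _ in examples])
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append(tfidf(text, idf))
    centroids = {label: centroid(vecs) for label, vecs in by_label.items()}
    return idf, centroids

def predict(text, idf, centroids):
    """Assign the label whose centroid is most similar to the comment."""
    vec = tfidf(text, idf)
    return max(centroids, key=lambda label: dot(vec, centroids[label]))

idf, centroids = train(TRAIN)
print(predict("this is disgusting", idf, centroids))
print(predict("so happy and love", idf, centroids))
```

On real data, the same structure would be evaluated on a held-out split of the annotated comments, and a sketch this small says nothing about the baseline figures reported in the study.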






How to Cite

Banerjee, S., & Nguyen, H. (2023). Dataset for identification of queerphobia. Journal of Student Research, 12(1).



HS Research Projects