<h2 align="center"> HateBR - Offensive Language and Hate Speech Dataset in Brazilian Portuguese </h2> </br> <p align="justify"> HateBR is the first large-scale, expert-annotated dataset of Brazilian Instagram comments for abusive language detection on the web and social media. The corpus was collected from the comment sections of Brazilian politicians' Instagram accounts and manually annotated by specialists. It comprises 7,000 documents annotated according to three layers: a binary classification (offensive versus non-offensive comments), an offensiveness level (highly, moderately, or slightly offensive), and nine hate speech targets (xenophobia, racism, homophobia, sexism, religious intolerance, partyism, apology for the dictatorship, antisemitism, and fatphobia). Each comment was annotated by three different annotators, with high inter-annotator agreement. Baseline experiments reached an F1-score of 85%, outperforming the baselines reported in the literature for Portuguese-language datasets. We hope this expert-annotated dataset will foster research on hate speech detection in Natural Language Processing. </p>
<p align="justify"> This repository contains the corpus and the best models presented in the paper (see the "CITING" section). The <b>HateBr.csv</b> file provides 4 (four) columns, as described above. </p>

The following table describes in detail the labels for each proposed layer of annotation:

<div align="center"> <table> <tr><th>Offensive Language</th><th>Offensiveness Levels</th><th>Hate Speech</th></tr> <tr><td>

| class | label | total |
|---|---|---|
| offensive | 1 | 3,500 |
| non-offensive | 0 | 3,500 |
| Total | | 7,000 |

</td><td>

| class | label | total |
|---|---|---|
| highly | 3 | 778 |
| moderately | 2 | 1,044 |
| slightly | 1 | 1,678 |
| non-offensive | 0 | 3,500 |
| Total | | 7,000 |

</td><td>

| class | label | total |
|---|---|---|
| antisemitism | 1 | 2 |
| apology for the dictatorship | 2 | 32 |
| fatphobia | 3 | 27 |
| homophobia | 4 | 17 |
| partyism | 5 | 496 |
| racism | 6 | 8 |
| religious intolerance | 7 | 47 |
| sexism | 8 | 97 |
| xenophobia | 9 | 1 |
| offensive & non-hate speech | -1 | 2,773 |
| non-offensive | 0 | 3,500 |
| Total | | 7,000 |

</td></tr></table> </div> </br>
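
As a minimal sketch, the label encodings in the table above can be written as Python lookup dictionaries, which also makes it easy to check that each layer sums to 7,000 comments. The names `OFFENSIVE_LANGUAGE`, `OFFENSIVENESS_LEVELS`, `HATE_SPEECH`, `decode`, and `layer_total` are hypothetical helpers introduced for illustration. How the values are actually stored in HateBr.csv (column names, or whether a comment can carry several hate speech targets) is not specified here.

```python
# Label -> (class name, number of comments), transcribed from the table above.
# All names below are illustrative and not part of the repository.
OFFENSIVE_LANGUAGE = {
    1: ("offensive", 3500),
    0: ("non-offensive", 3500),
}

OFFENSIVENESS_LEVELS = {
    3: ("highly", 778),
    2: ("moderately", 1044),
    1: ("slightly", 1678),
    0: ("non-offensive", 3500),
}

HATE_SPEECH = {
    1: ("antisemitism", 2),
    2: ("apology for the dictatorship", 32),
    3: ("fatphobia", 27),
    4: ("homophobia", 17),
    5: ("partyism", 496),
    6: ("racism", 8),
    7: ("religious intolerance", 47),
    8: ("sexism", 97),
    9: ("xenophobia", 1),
    -1: ("offensive & non-hate speech", 2773),
    0: ("non-offensive", 3500),
}


def decode(layer: dict, label: int) -> str:
    """Return the class name for a numeric label in the given layer."""
    return layer[label][0]


def layer_total(layer: dict) -> int:
    """Sum the per-class comment counts of one annotation layer."""
    return sum(total for _, total in layer.values())


if __name__ == "__main__":
    print(decode(HATE_SPEECH, 5))  # partyism
    for layer in (OFFENSIVE_LANGUAGE, OFFENSIVENESS_LEVELS, HATE_SPEECH):
        # Every layer should account for all 7,000 comments.
        assert layer_total(layer) == 7000
```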

In addition, we provide baseline machine learning results for both tasks: offensive language detection and hate speech detection. The best-performing models are available here as .pkl files. File names follow the pattern [classification (offensive or hate)_representation (ngram or tfidf)_algorithm (nb, svm, mlp or lr)]. For example, offensive_tfidf_svm.pkl is the offensive language detection model that uses a tf-idf representation and the support vector machine algorithm.
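
The naming scheme above can be sketched as a small parser that splits a model file name into its three components and rejects names that do not fit the pattern. The function `parse_model_filename` and the `ModelName` type are hypothetical helpers, not part of the repository. The allowed values are taken directly from the pattern described above.

```python
from pathlib import Path
from typing import NamedTuple

# Allowed values for each part of the file name, as listed in the README.
TASKS = {"offensive", "hate"}
REPRESENTATIONS = {"ngram", "tfidf"}
ALGORITHMS = {"nb", "svm", "mlp", "lr"}


class ModelName(NamedTuple):
    task: str
    representation: str
    algorithm: str


def parse_model_filename(filename: str) -> ModelName:
    """Split '<task>_<representation>_<algorithm>.pkl' into its parts."""
    parts = Path(filename).stem.split("_")
    if len(parts) != 3:
        raise ValueError(f"expected three '_'-separated parts: {filename!r}")
    task, representation, algorithm = parts
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if representation not in REPRESENTATIONS:
        raise ValueError(f"unknown representation: {representation!r}")
    if algorithm not in ALGORITHMS:
        raise ValueError(f"unknown algorithm: {algorithm!r}")
    return ModelName(task, representation, algorithm)


if __name__ == "__main__":
    name = parse_model_filename("offensive_tfidf_svm.pkl")
    print(name.task, name.representation, name.algorithm)  # offensive tfidf svm
```

The parser only inspects the file name. How the .pkl files were serialized (for example with `pickle` or `joblib`) is not stated here, and pickled files should only be loaded from sources you trust.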

</br> <h2 align="left"> CITING </h2>
<p align="justify"> Vargas, F., Carvalho, I., Góes, F. R., Pardo, T.A.S., Benevenuto, F. (2022). <b>HateBR: A Large Expert Annotated Corpus of Brazilian Instagram Comments for Offensive Language and Hate Speech Detection</b>. Proceedings of the 13th International Conference on Language Resources and Evaluation (LREC 2022), pp. 7174–7183. Marseille, France. https://aclanthology.org/2022.lrec-1.777/ </p>
<br>
<p align="justify"> Vargas, F., Carvalho, I., Pardo, T.A.S., Benevenuto, F. (2024). <b>Context-Aware and Expert Data Resources for Brazilian Portuguese Hate Speech Detection</b>. Natural Language Processing. Cambridge University Press, pp. 1–22. https://www.cambridge.org/core/journals/natural-language-processing/article/contextaware-and-expert-data-resources-for-brazilian-portuguese-hate-speech-detection/7D9019ED5471CD16E320EBED06A6E923#. </p>
<br>
<h2 align="left"> BIBTEX </h2>
<p align="justify">
@inproceedings{vargas-etal-2022-hatebr,
  title = "{H}ate{BR}: A Large Expert Annotated Corpus of {B}razilian {I}nstagram Comments for Offensive Language and Hate Speech Detection",
  author = "Vargas, Francielle and Carvalho, Isabelle and Rodrigues de G{\'o}es, Fabiana and Pardo, Thiago and Benevenuto, Fabr{\'\i}cio",
  editor = "Calzolari, Nicoletta and B{\'e}chet, Fr{\'e}d{\'e}ric and Blache, Philippe and Choukri, Khalid and Cieri, Christopher and Declerck, Thierry and Goggi, Sara and Isahara, Hitoshi and Maegaard, Bente and Mariani, Joseph and Mazo, H{\'e}l{\`e}ne and Odijk, Jan and Piperidis, Stelios",
  booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
  month = jun,
  year = "2022",
  address = "Marseille, France",
  publisher = "European Language Resources Association",
  url = "https://aclanthology.org/2022.lrec-1.777",
  pages = "7174--7183",
}
</p>
<br> </br>

@article{Vargas_Carvalho_Pardo_Benevenuto_2024,
  author={Vargas, Francielle and Carvalho, Isabelle and Pardo, Thiago A. S. and Benevenuto, Fabrício},
  title={Context-aware and expert data resources for Brazilian Portuguese hate speech detection},
  DOI={10.1017/nlp.2024.18},
  journal={Natural Language Processing},
  year={2024},
  pages={1–22},
  url={https://www.cambridge.org/core/journals/natural-language-processing/article/contextaware-and-expert-data-resources-for-brazilian-portuguese-hate-speech-detection/7D9019ED5471CD16E320EBED06A6E923#},
}

<div></div> <br> </br> <h2 align="left"> FUNDING </h2>

SSC-logo-300x171 </br>