id-abusive-language-detection
About this data
Here we provide our dataset for abusive language detection in the Indonesian language. This dataset is provided in two types of labeling:
- In re_dataset_two_labels.csv, the data is coded with two labels: 1 (not abusive language) and 2 (abusive language);
- In re_dataset_three_labels.csv, the data is coded with three labels: 1 (not abusive language), 2 (abusive but not offensive), and 3 (offensive language).
Due to Twitter's Terms of Service, we do not provide tweet IDs. All usernames and URLs in this dataset have been replaced with USER and URL, respectively.
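If you want to apply the same kind of masking to new tweets before comparing them with this dataset, a minimal sketch could look like the one below. The regular expressions and the function name mask_users_and_urls are assumptions made for illustration; they are not necessarily the rules we used when preparing the data.

```python
import re

# Assumed patterns: a mention is "@" followed by word characters,
# and a URL starts with http:// or https:// and runs to the next whitespace.
MENTION_PATTERN = re.compile(r"@\w+")
URL_PATTERN = re.compile(r"https?://\S+")


def mask_users_and_urls(text):
    """Replace URLs with the token URL and mentions with the token USER."""
    # Mask URLs first so that an "@" inside a link is not treated as a mention.
    text = URL_PATTERN.sub("URL", text)
    return MENTION_PATTERN.sub("USER", text)


print(mask_users_and_urls("@budi lihat https://example.com/a ya"))
```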
For text normalization in our experiment, we built a small dictionary of typos and slang words named kamusalay.csv. It contains two columns: the first holds the typo or slang word, and the second holds its formal equivalent. Some examples of the mapping:
- beud --> banget
- jgn --> jangan
- loe --> kamu
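The mapping above can be applied as a simple token-by-token lookup. The following is a minimal sketch, not the exact code from our experiment: the helper names load_slang_dictionary and normalize are hypothetical, and the loader assumes that kamusalay.csv has no header row and can be read with the given encoding (adjust the encoding argument if decoding fails).

```python
import csv


def load_slang_dictionary(path, encoding="utf-8"):
    """Read a two-column CSV (informal word, formal word) into a dict.

    Assumes no header row; rows with fewer than two columns are skipped.
    """
    mapping = {}
    with open(path, newline="", encoding=encoding) as handle:
        for row in csv.reader(handle):
            if len(row) >= 2:
                mapping[row[0].strip().lower()] = row[1].strip()
    return mapping


def normalize(text, mapping):
    """Replace every whitespace-separated token that appears in the mapping."""
    return " ".join(mapping.get(token.lower(), token) for token in text.split())


# The three example entries listed above, used here instead of the full file.
examples = {"beud": "banget", "jgn": "jangan", "loe": "kamu"}
print(normalize("jgn gitu loe", examples))  # jangan gitu kamu
```

To use the full dictionary, call load_slang_dictionary("kamusalay.csv") and pass the result to normalize. Note that this sketch splits on whitespace only, so a slang word with punctuation attached (for example "loe,") is left unchanged.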
More details
For details on how this dataset was built (including the crawling and annotation techniques) and on our experiments in Indonesian abusive language detection using this dataset, see our paper: https://www.sciencedirect.com/science/article/pii/S1877050918314583.
How to cite us
This dataset is free to use, but if you publish a paper or other work that uses it, please cite this publication:
Ibrohim, M.O., Budi, I. A Dataset and Preliminaries Study for Abusive Language Detection in Indonesian Social Media. Procedia Computer Science 2018;135:222-229. (Citation formats differ between paper templates. LaTeX users can refer to citation.bib.)
License
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.