id-abusive-language-detection

About this data

Here we provide our dataset for abusive language detection in the Indonesian language. This dataset is provided in two types of labeling:

Due to Twitter's Terms of Service, we do not provide the tweet IDs. All usernames and URLs in this dataset have been replaced with USER and URL.
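If you want to apply the same kind of masking to your own text before comparing it with this dataset, the following is a minimal sketch. The regular expressions and the `mask_tweet` helper are assumptions made for illustration; they are not the exact rules used to build the dataset.

```python
import re

# Assumed patterns: http(s) links and @-mentions. The dataset's own
# masking rules are not documented here and may differ.
URL_PATTERN = re.compile(r"https?://\S+")
MENTION_PATTERN = re.compile(r"@\w+")


def mask_tweet(text):
    """Replace URLs with the token URL and @-mentions with the token USER."""
    # Mask URLs first so that an '@' inside a link is not treated as a mention.
    text = URL_PATTERN.sub("URL", text)
    return MENTION_PATTERN.sub("USER", text)


# Hypothetical input, not taken from the dataset.
masked = mask_tweet("@some_user lihat ini https://example.com/page ya")
print(masked)
```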

For text normalization in our experiment, we built a small dictionary of typos and slang words named kamusalay.csv. It contains two columns: the first holds the typo or slang word, and the second holds its formal equivalent. Here are some examples of the mapping:
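A two-column dictionary like this can be applied with a simple token lookup. The sketch below is one possible way to do it, not the procedure used in the paper. It assumes the file has no header row and that text is split on whitespace; the helper names and the sample rows are hypothetical and are not taken from kamusalay.csv.

```python
import csv
import io


def load_slang_dictionary(handle):
    """Read two-column rows (informal word, formal word) into a dict."""
    mapping = {}
    for row in csv.reader(handle):
        if len(row) >= 2:  # skip blank or malformed rows
            mapping[row[0].strip()] = row[1].strip()
    return mapping


def normalize(text, mapping):
    """Replace every whitespace-separated token found in the mapping."""
    return " ".join(mapping.get(token, token) for token in text.split())


# Hypothetical rows standing in for the real file, for illustration only.
sample = io.StringIO("yg,yang\ngk,tidak\n")
mapping = load_slang_dictionary(sample)
print(normalize("USER yg gk tahu", mapping))
```

To use the real dictionary, pass an open file handle for kamusalay.csv instead of the in-memory sample. You may need to set an explicit `encoding` when opening it, depending on how the file was saved.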

More detail

If you want to know how this dataset was built (including the crawling and annotation techniques) and how we ran our abusive language detection experiments on it, you can read our paper here: https://www.sciencedirect.com/science/article/pii/S1877050918314583.

How to cite us

This dataset is free to use, but if you publish a paper or other publication that uses it, please cite this publication:

Ibrohim, M.O., Budi, I. A Dataset and Preliminaries Study for Abusive Language Detection in Indonesian Social Media. Procedia Computer Science 2018;135:222-229. (Citation formats vary between paper templates. LaTeX users can use citation.bib.)

License

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.