idn-tagged-corpus-CSUI

Summary

idn-tagged-corpus-CSUI is a manually tagged Indonesian part-of-speech (POS) corpus consisting of 10,000 sentences.

Data Format

The corpus is distributed as a tab-separated file (.tsv). Each line contains one token and its part-of-speech tag, separated by a single tab character (\t). Sentences are separated by one empty line.
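
The format described above can be read with a few lines of Python. The following is a minimal sketch, not part of the official distribution: the function name `read_tagged_sentences` is hypothetical, and the sample tokens and tags are illustrative rather than copied from the corpus.

```python
from typing import Iterable, Iterator, List, Tuple

# One sentence is a list of (token, tag) pairs.
Sentence = List[Tuple[str, str]]


def read_tagged_sentences(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences from lines in the token<TAB>tag format.

    An empty (or whitespace-only) line ends the current sentence.
    A non-empty line without a tab raises ValueError.
    """
    sentence: Sentence = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # Sentence boundary: emit what has been collected so far.
            if sentence:
                yield sentence
                sentence = []
            continue
        token, tag = line.split("\t", 1)
        sentence.append((token, tag))
    # Emit the last sentence if the input does not end with an empty line.
    if sentence:
        yield sentence


# Illustrative sample (assumed content, not taken from the corpus).
sample = "Saya\tPRP\nmakan\tVB\nnasi\tNN\n.\tZ\n\nDia\tPRP\ntidur\tVB\n.\tZ\n"
sentences = list(read_tagged_sentences(sample.splitlines()))
print(len(sentences))
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of `sample.splitlines()`; opening the file with `encoding="utf-8"` is an assumption about how the corpus is stored.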

References

Authors

Ruli Manurung
Arawinda Dinakaramani
Fam Rashel
Andry Luthfi

@inproceedings{Dinakaramani2014,
author = {Dinakaramani, Arawinda and Rashel, Fam and Luthfi, Andry and Manurung, Ruli},
booktitle = {Proceedings of the International Conference on Asian Language Processing 2014, IALP 2014},
doi = {10.1109/IALP.2014.6973519},
pages = {66--69},
title = {{Designing an Indonesian part of speech tagset and manually tagged Indonesian corpus}},
year = {2014} }

Page

For more details about this work, please visit http://bahasa.cs.ui.ac.id/postag/corpus

Changelog

2022
- The dataset was moved to the IR-NLP Lab repository
- The dataset name was changed from idn-tagged-corpus to idn-tagged-corpus-CSUI
2014
- Initial release at Fam Rashel's repository

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.

You can use this dataset for free, and you do not need our permission to do so. If you use our data in a publication, please cite our paper. Please note that you are not allowed to copy this dataset and share it publicly in your own repository without our permission.

Contact

arawinda [at] cs.ui.ac.id