idn-tagged-corpus-CSUI
Summary
Idn-tagged-corpus-CSUI is a manually tagged Indonesian part-of-speech (POS) corpus consisting of 10,000 sentences.
Data Format
Each line consists of a token and its part-of-speech tag, separated by a tab character (\t). Sentences are separated by an empty line.
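The format described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the function name `read_tagged_sentences` and the sample tokens and tags are assumptions made for illustration only, and are not taken from the corpus files.

```python
from typing import Iterable, List, Tuple

# One sentence is a list of (token, tag) pairs.
Sentence = List[Tuple[str, str]]


def read_tagged_sentences(lines: Iterable[str]) -> List[Sentence]:
    """Group tab-separated token/tag lines into sentences.

    A blank line ends the current sentence. A non-blank line without
    a tab raises ValueError, which surfaces malformed input early.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # Empty line: close the sentence collected so far, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.split("\t", 1)
        current.append((token, tag.strip()))
    # The last sentence may not be followed by an empty line.
    if current:
        sentences.append(current)
    return sentences


# Illustrative sample only (hypothetical tokens and tags).
sample = "Saya\tPRP\nmakan\tVB\nnasi\tNN\n.\tZ\n\nDia\tPRP\ntidur\tVB\n.\tZ\n"
sentences = read_tagged_sentences(sample.splitlines())
```

To read the corpus itself, an open file object (opened with `encoding="utf-8"`) can be passed in place of `sample.splitlines()`, since the function only iterates over lines.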
Format Data (versi Bahasa Indonesia)
Korpus ini menggunakan format tab-separated file (.tsv). Setiap baris berisi token beserta part-of-speech tag dari token tersebut yang terpisahkan oleh satu karakter tab(\t). Antar kalimat dipisahkan oleh satu baris kosong.
References
Authors
- Ruli Manurung
- Arawinda Dinakaramani
- Fam Rashel
- Andry Luthfi
@inproceedings{Dinakaramani2014,
author = {Dinakaramani, Arawinda and Rashel, Fam and Luthfi, Andry and Manurung, Ruli},
booktitle = {Proceedings of the International Conference on Asian Language Processing 2014, IALP 2014},
doi = {10.1109/IALP.2014.6973519},
pages = {66--69},
title = {{Designing an Indonesian part of speech tagset and manually tagged Indonesian corpus}},
year = {2014}
}
Page
For more details about this work, please visit http://bahasa.cs.ui.ac.id/postag/corpus
Changelog
- 2022
  - The dataset was moved to the IR-NLP Lab repository
  - The dataset name was changed from idn-tagged-corpus to idn-tagged-corpus-CSUI
- 2014
  - Initial release at Fam Rashel's repository
Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
License
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.
You can use this dataset for free. You don't need our permission to use it. Please cite our paper if your work uses our data in your publication. Please note that you are not allowed to create a copy of this dataset and share it publicly in your own repository without our permission.
Contact
arawinda [at] cs.ui.ac.id