



Idn-tagged-corpus-CSUI is a manually tagged Indonesian part-of-speech (POS) corpus consisting of 10,000 sentences.

Data Format

Each line consists of a token and its part-of-speech tag, separated by a tab character (\t). Sentences are separated by an empty line.
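
The format described above can be read with a few lines of Python. The following is only a minimal sketch, not code shipped with the corpus: the function name `read_sentences` and the sample tokens and tags are made up for illustration, and it assumes that every non-empty line contains exactly one token and one tag separated by a tab.

```python
from typing import Iterable, Iterator, List, Tuple

# A sentence is a list of (token, tag) pairs.
Sentence = List[Tuple[str, str]]


def read_sentences(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield one sentence at a time from tab-separated token/tag lines."""
    sentence: Sentence = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # An empty line marks the end of the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        # Raises ValueError if the line has no tab character.
        token, tag = line.split("\t", 1)
        sentence.append((token, tag))
    # Emit the last sentence if the input does not end with an empty line.
    if sentence:
        yield sentence


# Illustrative sample only; these lines are not copied from the corpus.
sample = "Saya\tPRP\nmakan\tVB\nnasi\tNN\n.\tZ\n\nDia\tPRP\ntidur\tVB\n.\tZ\n"
sentences = list(read_sentences(sample.splitlines()))
print(len(sentences), "sentences read")
```

To read the corpus itself, the same function could be given an open file handle, for example `open(path, encoding="utf-8")`, where `path` points to your local copy of the data file; the UTF-8 encoding is an assumption.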

Data Format (translated from Bahasa Indonesia)

This corpus uses a tab-separated file format (.tsv). Each line contains a token together with its part-of-speech tag, separated by a single tab character (\t). Sentences are separated by a single empty line.



@inproceedings{Dinakaramani2014,
author = {Dinakaramani, Arawinda and Rashel, Fam and Luthfi, Andry and Manurung, Ruli},
booktitle = {Proceedings of the International Conference on Asian Language Processing 2014, IALP 2014},
doi = {10.1109/IALP.2014.6973519},
pages = {66--69},
title = {{Designing an Indonesian part of speech tagset and manually tagged Indonesian corpus}},
year = {2014}
}


For more details about this work, please visit http://bahasa.cs.ui.ac.id/postag/corpus



This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.

This dataset is free to use, and you do not need our permission to use it. If you use our data in a publication, please cite our paper. Please note that you are not allowed to copy this dataset and share it publicly in your own repository without our permission.


arawinda [at] cs.ui.ac.id