Awesome Indonesia NLP

A collection of datasets, theses, papers, and articles on Indonesian NLP (Natural Language Processing). Inspired by those who came before.

Table of Contents

Indonesian NLP

Getting Started

Introductory NLP Materials

Pengantar NLP - Kang Edi, PENS.
NLTK Book
Text Mining with R - Julia Silge and David Robinson

Articles about NLP

Karena Data Gak Mungkin Bohong - Jim Geovedi. 2014.
NLP Trend 2019 - Janna, Towards Data Science.

Journals

Indonesian News Classification using Support Vector Machine (https://zenodo.org/record/1074439)

Dataset & Language modeling

Words dataset

Word Sastrawi
Word spaCy : id
Word name : random-name
Word Indo name : genderprediction
Word Indo place : Wilayah-Administratif-Indonesia
Word Indo place : Indonesia-Postal-Code
Word Wiktionary : word id
Word sentiment : analisis-sentimen
Word sentiment : ID-OpinionWords
Word sentiment : Analisis-Sentimen-ID
Word Acronyms
Word : serangkai

Sentences Dataset

Leipzig Indonesian sentence collection: news articles, web articles, and Wikipedia data from 2008-2016
wn-msa.sourceforge.net Wordnet Bahasa
Quran indonesian quran translation (id.muntakhab, id.jalalayn, id.indonesian)
Kompas online collection. This corpus contains Kompas online news articles from 2001-2002. See here for more info and citations.
Tempo online collection. This corpus contains Tempo online news articles from 2000-2002. See here for more info and citations.
corpus-frog-storytelling spoken text story telling
TED-Multilingual-Parallel-Corpus Monolingual_data/Indonesian
Opus Opus NLPL
Sealang Sealang dataset
Indonesian News Corpus (https://data.mendeley.com/datasets/2zpbjs22k3/1)
Indonesian Hoax News Detection Dataset (https://data.mendeley.com/datasets/p3hfgr5j3m/1)
Kompas and Tempo online news articles (https://ilps.science.uva.nl/resources/bahasa/)
Raw dataset of Indonesian news articles (https://github.com/feryandi/Dataset-Artikel)
Amazon Reviews
ArXiv
BimaNLP

Tagged dataset

NER : yohanesgultom/nlp-experiments 1700 sentences
NER : yusufsyaifudin/indonesia-ner 1835 sentences
POS-TAG : famrashel/idn-tagged-corpus
POS-TAG : pebbie/pebahasa ~600 sentences
POS-TAG Parser : UniversalDependencies/UD_Indonesian-GSD ~4477 sentences
Sentiment : 1506 sentences
panl10n Pan Localization

Language modeling

POS tagging

PANL10N POS tagging. This corpus has ~39K sentences and ~900K word tokens.
IDN tagged corpus. This corpus contains ~10K sentences and ~250K word tokens. The POS tags are annotated manually.

Syntactic parsing

Indonesian Treebank. This corpus contains ~1K parsed sentences. (constituency parsing)
UD Indonesian. This corpus is provided by Universal Dependencies. Training, development, and test splits are already provided. (dependency parsing)
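
Universal Dependencies treebanks are distributed in the CoNLL-U format: one token per line with ten tab-separated fields, `#` comment lines, and a blank line between sentences. The sketch below reads that format using only the standard library. The function name `read_conllu` and the two-token sample are assumptions made for illustration; the sample is not taken from UD_Indonesian-GSD.

```python
from typing import Dict, Iterator, List

# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts (hypothetical helper)."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." are skipped.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        # Handle a file that does not end with a blank line.
        yield sentence


# Made-up sample ("Saya makan"); the annotations are illustrative only.
SAMPLE = (
    "# text = Saya makan\n"
    "1\tSaya\tsaya\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tmakan\tmakan\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE))
for token in sentences[0]:
    print(token["form"], token["upos"], token["head"], token["deprel"])
```

To use it on the real treebank, pass the contents of one of the provided split files to `read_conllu` instead of `SAMPLE`.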

Machine translation

PANL10N EN-ID news parallel corpus. This corpus has sentences from news articles from several categories: economy (6K sentences), international (6K sentences), science (6K sentences), and sport (4K sentences).
PANL10N Indonesian translation of Penn treebank. This corpus contains Indonesian translation of the Penn treebank. In total there are ~24K sentences.

Speech recognition

TITML-IDN speech corpus. The corpus contains 20 speakers (11 male and 9 female), each of whom speaks 343 utterances. The utterances are phonetically balanced. The corpus is free for academic/non-commercial use, but interested parties should make a formal request via email to the institution. The procedure is listed here.
frankydotid/Indonesian-Speech-Recognition. A small corpus of 50 utterances by a single male speaker.

Automatic Summarization

Frequent Term based Text Summarization for Bahasa Indonesia

M. Fachrurrozi, Novi Yusliani, and Rizky Utami Yoanita. International Conference on Innovations in Engineering and Technology (ICIET'2013), Dec. 25-26, 2013, Bangkok (Thailand).

Parsing

Analisa Struktur Kalimat Bahasa Indonesia dengan Menggunakan Pengurai Kalimat Berbasis Linguistic String Analysis

Shavitri, Shelly. Undergraduate thesis, Computer Science, University of Indonesia, 1999.
INAGP : Pengurai Kalimat Bahasa Indonesia Sebagai Alat Bantu Untuk Pengembangan Aplikasi PBA

Rosalina Paramita N., Dwi H. Widyantoro, Ayu Purwarianti. Undergraduate thesis, JBPTITBPP, Institut Teknologi Bandung, 2007.
Penguraian Bahasa Indonesia dengan Menggunakan Pengurai Collins

Sukamto, Rosa Ariani. Master's thesis, Institut Teknologi Bandung, 2009.

Part-of-speech Tagging

HMM Based Part-of-Speech Tagger for Bahasa Indonesia

Wicaksono, A. Farizki and Purwarianti, Ayu. Proceedings of the 4th International MALINDO (Malay and Indonesian Language) Workshop (2010).
Penggunaan Hidden Markov Model untuk Kompresi Kalimat

Yudi Wibisono. Graduate Thesis. Institute of Technology Bandung. 2008.
Probabilistic Part Of Speech Tagging for Bahasa Indonesia

Femphy Pisceldo, Mirna Adriani, Ruli Manurung. Third International MALINDO Workshop, co-located with ACL-IJCNLP 2009, Singapore, August 1, 2009.

Stemming

Effective Techniques for Indonesian Text Retrieval

Asian J. (2007). PhD thesis School of Computer Science and Information Technology RMIT University Australia.
Enhanced Confix Stripping Stemmer and Ants Algorithm for Classifying News Document in Indonesian Language

Arifin, A.Z., I.P.A.K. Mahendra, and H.T. Ciptaningtyas. 2009. Proceedings of the International Conference on Information & Communication Technology and Systems (ICTS).
Implementasi Modifikasi Enhanced Confix Stripping Stemmer Untuk Bahasa Indonesia dengan Metode Corpus Based Stemming

A. D. Tahitoe, D. Purwitasari. Institut Teknologi Sepuluh Nopember (ITS) – Surabaya.

Word Sense Disambiguation

Building an Indonesian WordNet

Desmond Darma Putra, Abdul Arfan and Ruli Manurung. In Proceedings of the 2nd International MALINDO Workshop. 2008.
English-to-Indonesian Lexical Mapping using Latent Semantic Analysis

Eliza Margaretha, Franky, and Ruli Manurung. In Proceedings of the 2nd International MALINDO Workshop. 2008.

Miscellaneous

A survey of bahasa Indonesia NLP research conducted at the University of Indonesia

Mirna Adriani and Ruli Manurung. Faculty of Computer Science, University of Indonesia.
Indonesian Morphology Tool (MorphInd): Towards an Indonesian Corpus

Septina Dian Larasati, Vladislav Kuboň, and Daniel Zeman. Charles University in Prague.
Research Report on Local Language Computing: Development of Indonesian Language Resources and Translation System

Adriani, Mirna. Riza, Hammam. 2008.
Towards a Semantic Analysis of Bahasa Indonesia for Question Answering

Septina Dian Larasati and Ruli Manurung. Faculty of Computer Science. University of Indonesia. 2007.

Software, Libraries, Dictionaries

Kateglo - Indonesian dictionary, thesaurus, and glossary.
Sastrawi - PHP stemmer for Indonesian.

Parallel corpus Eng-Ind

Morph

Crawler Data

Crawler Indonesian news portal

Sentiment Analysis

Aspect and Opinion Terms Extraction for Hotel Reviews. The corpus consists of 5000 hotel reviews from Airy (78K tokens) with 5 labels. The paper is available on arXiv.
Aspect-Based Sentiment Analysis. A text classification resource for multi-label aspect categorization.

Word Normalization

Colloquial Indonesian Lexicon. This lexicon consists of 3592 unique colloquial tokens that are mapped onto 1742 unique lemmas. The full description of this lexicon can be seen in the paper.
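
A lexicon of this kind is usually applied as a token-level lookup that replaces each colloquial form with its standard lemma. The sketch below shows the idea under simple assumptions: the four entries in `SAMPLE_LEXICON` are hypothetical and not copied from the lexicon, the function name `normalize` is made up for illustration, and whitespace splitting stands in for a real tokenizer. Because the lexicon maps 3592 tokens onto 1742 lemmas, several colloquial forms can share one target, as `gak` and `nggak` do here.

```python
from typing import Dict, List

# Hypothetical colloquial -> standard mappings, for illustration only.
SAMPLE_LEXICON: Dict[str, str] = {
    "gak": "tidak",
    "nggak": "tidak",
    "udah": "sudah",
    "aja": "saja",
}


def normalize(text: str, lexicon: Dict[str, str]) -> str:
    """Lowercase, split on whitespace, and replace tokens found in the lexicon."""
    tokens: List[str] = text.lower().split()
    # Tokens missing from the lexicon are kept unchanged.
    return " ".join(lexicon.get(token, token) for token in tokens)


normalized = normalize("Saya gak tahu", SAMPLE_LEXICON)
print(normalized)
```

A plain dictionary keeps each lookup constant-time; punctuation attached to a token would need to be split off first for the lookup to match.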

Text Summarization

IndoSum. A collection of 20K online news article-summary pairs belonging to 6 categories and 10 sources. It has both abstractive summaries and extractive labels.

Text Classification

SMS Spam. This corpus contains 1143 sentences labeled as normal message, fraud, or promotion. It is provided by http://nlp.yuliadi.pro/dataset
Hate Speech Detection. This dataset consists of 713 Indonesian-language tweets: 453 non-hate-speech and 260 hate-speech tweets.
Abusive Language Detection. A collection of tweets for abusive language detection in Indonesian social media. It consists of two types of labelling, abusive/not abusive and not abusive/abusive but not offensive/offensive. It also has its own colloquial Indonesian lexicon.
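
The Hate Speech Detection counts listed above (453 non-hate-speech and 260 hate-speech tweets) describe an imbalanced two-class problem. The sketch below works out the class shares, the accuracy of always predicting the majority class, and inverse-frequency class weights. The weighting formula, `total / (n_classes * count)`, is an assumption chosen for illustration (it is the same heuristic scikit-learn uses for `class_weight="balanced"`), not something the dataset authors prescribe.

```python
# Label counts as stated in the dataset description.
counts = {"non_hate_speech": 453, "hate_speech": 260}

total = sum(counts.values())  # 453 + 260 = 713 tweets

# Fraction of the dataset belonging to each class.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the larger class.
majority_baseline = max(shares.values())

# Inverse-frequency weights: the rarer class receives the larger weight.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

for label in counts:
    print(f"{label}: share={shares[label]:.3f}, weight={weights[label]:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any classifier trained on this dataset should beat the majority-class baseline to be considered useful.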

Speech recognition

TITML-IDN speech corpus. The corpus contains 20 speakers (11 male and 9 female), each of whom speaks 343 utterances. The utterances are phonetically balanced. The corpus is free for academic/non-commercial use, but interested parties should make a formal request via email to the institution. The procedure is listed here.
Indonesian Speech Recognition. A small corpus of 50 utterances by a single male speaker. Disclaimer: This is a school project, do not use it for any important tasks. The author is not responsible for the undesired results of using the data provided here.
CMU Wilderness Multilingual Speech Dataset. A dataset of over 700 different languages providing audio, aligned texts, and word pronunciations. One of the languages is Indonesian. The utterances are read from the Bible and were recorded by bible.is.

Free Books

Courses

Videos and Lectures

Papers

Tutorials

Sample Code

Datasets

Libraries

Contributing

If you would like to contribute to this repository, pull requests are strongly encouraged, preferably with Indonesian-language resources.

Frequently Asked Questions (FAQ)

The FAQ answers common questions about this repository, from naming conventions and basic questions to more advanced ones.

Awesome NLP Papers

This is a collection/reading-list of awesome Natural Language Processing papers sorted by date.

2018

2017:

Attention Is All You Need, Vaswani et al. Paper
Skip-Gram – Zipf + Uniform = Vector Additivity, Gittens et al. Paper
A Simple but Tough-to-beat Baseline for Sentence Embeddings, Arora et al. Paper
Fast and Accurate Entity Recognition with Iterated Dilated Convolutions, Strubell et al. Paper
Advances in Pre-Training Distributed Word Representations, Mikolov et al. Paper
Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets, Dror et al. Paper

2016:

Towards Universal Paraphrastic Sentence Embeddings, Wieting et al. Paper
Bag of Tricks for Efficient Text Classification, Joulin et al. Paper
Enriching Word Vectors with Subword Information, Bojanowski et al. Paper
Assessing the Corpus Size vs. Similarity Trade-off for Word Embeddings in Clinical NLP, Kirk Roberts Paper
How to Train Good Word Embeddings for Biomedical NLP, Chiu et al. Paper
Log-Linear Models, MEMMs, and CRFs, Michael Collins Paper
Counter-fitting Word Vectors to Linguistic Constraints, Mrkšić et al. Paper
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, Wu et al. Paper

2015:

Semi-supervised Sequence Learning, Dai et al. Paper
Evaluating distributed word representations for capturing semantics of biomedical concepts, Th et al. Paper

2014:

GloVe: Global Vectors for Word Representation, Pennington et al. Paper
Linguistic Regularities in Sparse and Explicit Word Representations, Levy and Goldberg. Paper
Neural Word Embedding as Implicit Matrix Factorization, Levy and Goldberg. Paper
word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method, Goldberg and Levy. Paper
What’s in a p-value in NLP?, Søgaard et al. Paper
How transferable are features in deep neural networks?, Yosinski et al. Paper
Improving lexical embeddings with semantic knowledge, Yu et al. Paper
Retrofitting word vectors to semantic lexicons, Faruqui et al. Paper

2013:

Efficient Estimation of Word Representations in Vector Space, Mikolov et al. Paper
Linguistic Regularities in Continuous Space Word Representations, Mikolov et al. Paper
Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al. Paper

2012:

An Empirical Investigation of Statistical Significance in NLP, Berg-Kirkpatrick et al. Paper

2010:

Word representations: A simple and general method for semi-supervised learning, Turian et al. Paper

2008:

A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning, Collobert and Weston. Paper

2006:

Domain adaptation with structural correspondence learning, Blitzer et al. Paper

2003:

A Neural Probabilistic Language Model, Bengio et al. Paper

1986:

Distributed Representations, Hinton et al. Paper
