Awesome Indonesia NLP
A collection of datasets, theses, papers, and articles on Indonesian-language NLP (Natural Language Processing). Inspired by its predecessors.
Getting Started
Introductory NLP Materials
- Pengantar NLP (Introduction to NLP) - Kang Edi, PENS.
- NLTK Book
- Text Mining with R - Julia Silge and David Robinson
Articles on NLP
- Karena Data Gak Mungkin Bohong (Because Data Can't Lie) - Jim Geovedi. 2014.
- NLP Trend 2019 - Janna, Towards Data Science.
Journals
- Indonesian News Classification using Support Vector Machine (https://zenodo.org/record/1074439)
Dataset & Language modeling
Words dataset
- Word Sastrawi
- Word spaCy : id
- Word name : random-name
- Word Indo name : genderprediction
- Word Indo place : Wilayah-Administratif-Indonesia
- Word Indo place : Indonesia-Postal-Code
- Word Wiktionary : word id
- Word sentiment : analisis-sentimen
- Word sentiment : ID-OpinionWords
- Word sentiment : Analisis-Sentimen-ID
- Word Acronyms
- word : serangkai
Sentences Dataset
- Leipzig Indonesian sentence collection: news articles, web articles, and Wikipedia data from 2008-2016
- wn-msa.sourceforge.net: Wordnet Bahasa
- Quran: Indonesian Quran translations (id.muntakhab, id.jalalayn, id.indonesian)
- Kompas online collection. This corpus contains Kompas online news articles from 2001-2002. See here for more info and citations.
- Tempo online collection. This corpus contains Tempo online news articles from 2000-2002. See here for more info and citations.
- corpus-frog-storytelling spoken text story telling
- TED-Multilingual-Parallel-Corpus Monolingual_data/Indonesian
- Opus Opus NLPL
- Sealang Sealang dataset
- Indonesian News Corpus (https://data.mendeley.com/datasets/2zpbjs22k3/1)
- Indonesian Hoax News Detection Dataset (https://data.mendeley.com/datasets/p3hfgr5j3m/1)
- Kompas and Tempo Online News (https://ilps.science.uva.nl/resources/bahasa/)
- Raw dataset of Indonesian news articles (https://github.com/feryandi/Dataset-Artikel)
- Amazon Reviews
- ArXiv
- BimaNLP
Tagged dataset
- NER : yohanesgultom/nlp-experiments 1700 sentences
- NER : yusufsyaifudin/indonesia-ner 1835 sentences
- POS-TAG : famrashel/idn-tagged-corpus
- POS-TAG : pebbie/pebahasa ~600 sentences
- POS-TAG Parser : UniversalDependencies/UD_Indonesian-GSD ~4477 sentences
- Sentiment : 1506 sentences
- panl10n Pan Localization
Language modeling
POS tagging
- PANL10N POS tagging. This corpus has ~39K sentences and ~900K word tokens.
- IDN tagged corpus. This corpus contains ~10K sentences and ~250K word tokens. The POS tags are annotated manually.
Syntactic parsing
- Indonesian Treebank. This corpus contains ~1K parsed sentences. (constituency parsing)
- UD Indonesian. This corpus is provided by Universal Dependencies. Training, development, and testing splits are already provided. (dependency parsing)
Machine translation
- PANL10N EN-ID news parallel corpus. This corpus has sentences from news articles from several categories: economy (6K sentences), international (6K sentences), science (6K sentences), and sport (4K sentences).
- PANL10N Indonesian translation of Penn treebank. This corpus contains Indonesian translation of the Penn treebank. In total there are ~24K sentences.
Automatic Summarization
- Frequent Term based Text Summarization for Bahasa Indonesia
M. Fachrurrozi, Novi Yusliani, and Rizky Utami Yoanita. International Conference on Innovations in Engineering and Technology (ICIET'2013), Dec. 25-26, 2013, Bangkok (Thailand).
Parsing
- Shavitri, Shelly. Undergraduate thesis, Computer Science, University of Indonesia, 1999.
- INAGP : Pengurai Kalimat Bahasa Indonesia Sebagai Alat Bantu Untuk Pengembangan Aplikasi PBA
Rosalina Paramita N., Dwi H. Widyantoro, Ayu Purwarianti. Undergraduate thesis, JBPTITBPP, Institut Teknologi Bandung, 2007.
- Penguraian Bahasa Indonesia dengan Menggunakan Pengurai Collins
Sukamto, Rosa Ariani. Master's thesis, Institut Teknologi Bandung, 2009.
Part-of-speech Tagging
- HMM Based Part-of-Speech Tagger for Bahasa Indonesia
Wicaksono, A. Farizki and Purwarianti, Ayu. Proceedings of the 4th International MALINDO (Malay and Indonesian Language) Workshop (2010).
- Penggunaan Hidden Markov Model untuk Kompresi Kalimat
Yudi Wibisono. Graduate thesis, Institut Teknologi Bandung, 2008.
- Probabilistic Part Of Speech Tagging for Bahasa Indonesia
Femphy Pisceldo, Mirna Adriani, Ruli Manurung. Third International MALINDO Workshop, co-located with ACL-IJCNLP 2009, Singapore, August 1, 2009.
Stemming
- Effective Techniques for Indonesian Text Retrieval
Asian, J. (2007). PhD thesis, School of Computer Science and Information Technology, RMIT University, Australia.
- Arifin, A.Z., I.P.A.K. Mahendra, and H.T. Ciptaningtyas. 2009. Proceedings of the International Conference on Information & Communication Technology and Systems (ICTS).
- A. D. Tahitoe, D. Purwitasari. Institut Teknologi Sepuluh Nopember (ITS), Surabaya.
Word Sense Disambiguation
- Building an Indonesian WordNet
Desmond Darma Putra, Abdul Arfan, and Ruli Manurung. In Proceedings of the 2nd International MALINDO Workshop. 2008.
- English-to-Indonesian Lexical Mapping using Latent Semantic Analysis
Eliza Margaretha, Franky, and Ruli Manurung. In Proceedings of the 2nd International MALINDO Workshop. 2008.
Miscellaneous
- A survey of bahasa Indonesia NLP research conducted at the University of Indonesia
Mirna Adriani and Ruli Manurung. Faculty of Computer Science, University of Indonesia.
- Indonesian Morphology Tool (MorphInd): Towards an Indonesian Corpus
Septina Dian Larasati, Vladislav Kuboň, and Daniel Zeman. Charles University in Prague.
- Adriani, Mirna. Riza, Hammam. 2008.
- Towards a Semantic Analysis of Bahasa Indonesia for Question Answering
Septina Dian Larasati and Ruli Manurung. Faculty of Computer Science, University of Indonesia. 2007.
Software, Libraries, Dictionaries
- Kateglo - Indonesian dictionary, thesaurus, and glossary.
- Sastrawi - PHP stemmer for Indonesian.
Word reference (Kemdikbud)
- Base entries : 48,748 (44.64%)
- Derived words : 26,312 (24.09%)
- Compound words : 30,625 (28.04%)
- Proverbs : 2,040 (1.87%)
- Figurative expressions : 268 (0.25%)
- Idioms : 1,129 (1.03%)
- Variants : 91 (0.08%)
- Total entries : 109,213 (100.00%)
- Total senses : 127,775
- Total examples : 29,495
- Total categories : 255
- Senses per entry : 1.170
- Examples per sense : 0.231
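The percentages and ratios in the list above can be rechecked from the raw counts. The snippet below is only a small sanity check in Python; the English dictionary keys are labels chosen here for illustration, and the numbers are copied from the list.

```python
# Entry counts by type, copied from the list above.
counts = {
    "base entries": 48748,
    "derived words": 26312,
    "compound words": 30625,
    "proverbs": 2040,
    "figurative expressions": 268,
    "idioms": 1129,
    "variants": 91,
}
total_senses = 127775
total_examples = 29495

# The per-type counts should add up to the stated total of 109,213 entries.
total_entries = sum(counts.values())

# Share of each entry type, as a percentage rounded to two decimals.
shares = {name: round(100 * n / total_entries, 2) for name, n in counts.items()}

senses_per_entry = round(total_senses / total_entries, 3)
examples_per_sense = round(total_examples / total_senses, 3)

for name, share in shares.items():
    print(f"{name}: {share}%")
print("senses per entry:", senses_per_entry)
print("examples per sense:", examples_per_sense)
```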
Parallel corpus Eng-Ind
Morph
Crawler Data
- Crawler Indonesian news portal
Sentiment Analysis
- Aspect and Opinion Terms Extraction for Hotel Reviews. The corpus consists of 5000 hotel reviews from Airy (78K tokens) with 5 labels. The paper is available on arXiv.
- Aspect-Based Sentiment Analysis. A text classification resource for multi-label aspect categorization.
Word Normalization
- Colloquial Indonesian Lexicon. This lexicon consists of 3592 unique colloquial tokens that are mapped onto 1742 unique lemmas. The full description of this lexicon can be seen in the paper.
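A lexicon of this kind is typically applied as a token-level lookup. The sketch below shows the idea in Python under stated assumptions: the three dictionary entries are hypothetical stand-ins written for this example (they are not taken from the lexicon), the `normalize` helper is not part of any library, and the file format of the actual lexicon is not described here, so loading it is left out.

```python
import re

# Hypothetical sample entries for illustration only; replace with the
# colloquial-token -> lemma pairs loaded from the real lexicon.
SAMPLE_LEXICON = {
    "gak": "tidak",
    "udah": "sudah",
    "yg": "yang",
}


def normalize(text, lexicon):
    """Lowercase, split into word/punctuation tokens, and map known colloquial tokens."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Tokens missing from the lexicon are passed through unchanged.
    return [lexicon.get(token, token) for token in tokens]


print(normalize("Saya gak tahu yg mana", SAMPLE_LEXICON))
```

A plain dictionary lookup ignores context, so ambiguous colloquial forms would need additional handling.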
Text Summarization
- IndoSum. A collection of 20K online news article-summary pairs belonging to 6 categories and 10 sources. It has both abstractive summaries and extractive labels.
Text Classification
- SMS Spam. This corpus contains 1143 sentences labeled as normal message, fraud, or promotion. It is provided by http://nlp.yuliadi.pro/dataset
- Hate Speech Detection. This dataset consists of 713 tweets in the Indonesian language with 453 non hate speech and 260 hate speech tweets.
- Abusive Language Detection. A collection of tweets for abusive language detection in Indonesian social media. It consists of two types of labelling, abusive/not abusive and not abusive/abusive but not offensive/offensive. It also has its own colloquial Indonesian lexicon.
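The class counts quoted for the Hate Speech Detection dataset above imply a noticeable class imbalance. As a rough sketch in Python, using only the counts stated in that entry, the accuracy of a trivial majority-class predictor gives a floor that any trained classifier should beat:

```python
# Class counts as stated in the Hate Speech Detection entry above.
non_hate = 453
hate = 260
total = non_hate + hate

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(non_hate, hate) / total

print(f"{total} tweets, majority-class accuracy = {majority_baseline:.3f}")
```

Because of this imbalance, per-class precision and recall are usually more informative than accuracy alone.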
Speech recognition
- TITML-IDN speech corpus. The corpus contains 20 speakers (11 male and 9 female), each of whom speaks 343 utterances. The utterances are phonetically balanced. The corpus is free to use for academic/non-commercial purposes, but interested parties should make a formal request via email to the institution. The procedure is listed here.
- Indonesian Speech Recognition. A small corpus of 50 utterances by a single male speaker. Disclaimer: This is a school project, do not use it for any important tasks. The author is not responsible for the undesired results of using the data provided here.
- CMU Wilderness Multilingual Speech Dataset. A dataset of over 700 different languages providing audio, aligned texts, and word pronunciations. One of the languages is Indonesian. The utterances are read from the Bible and were recorded by bible.is.
Free Books
Courses
Videos and Lectures
- 2016 CS224D Deep Learning For Natural Language Processing Lecture Videos
- Natural Language Processing
Papers
- Breaking Sticks and Ambiguities with Adaptive Skip-gram
- Distributed Representations of Words and Phrases and their Compositionality
- Learning the Dimensionality of Word Embeddings
- Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols
- Skip Thought Vectors
Tutorials
- Natural Language Processing
- Machine Learning, NLP: Text Classification using scikit-learn, python and NLTK
- Multi-Class Classification Tutorial with the Keras Deep Learning Library
- Topic Modeling with Scikit Learn
- Data Science with Python & R: Sentiment Classification Using Linear Methods
Sample Code
- Sentiment
- Name Gender Prediction (Prediksi Gender Nama)
- Topic Modeling
- POS Tagging NLTK (Bahasa Indonesia)
- Naive Bayes Document Classifier (Bahasa Indonesia)
Datasets
Libraries
Contributing
If you would like to contribute to this repository, pull requests are strongly encouraged, provided the resources are in Indonesian.
Frequently Asked Questions (FAQ)
The FAQ answers common questions about this repository, from naming conventions and basic questions to more advanced ones.
Awesome NLP Papers
This is a collection/reading-list of awesome Natural Language Processing papers sorted by date.
2018:
- Unsupervised Machine Translation Using Monolingual Corpora Only, Lample et al.
Paper
- On the Dimensionality of Word Embeddings, Yin et al.
Paper
- An efficient framework for learning sentence representations, Logeswaran et al.
Paper
- Refining Pretrained Word Embeddings Using Layer-wise Relevance Propagation, Akira Utsumi
Paper
- Domain Adapted Word Embeddings for Improved Sentiment Classification, Sarma et al.
Paper
- In-domain Context-aware Token Embeddings Improve Biomedical Named Entity Recognition, Sheikhshab et al.
Paper
- Generalizing Word Embeddings using Bag of Subwords, Zhao et al.
Paper
- What's in Your Embedding, And How It Predicts Task Performance, Rogers et al.
Paper
- On Learning Better Word Embeddings from Chinese Clinical Records: Study on Combining In-Domain and Out-Domain Data, Wang et al.
Paper
- Predicting and interpreting embeddings for out of vocabulary words in downstream tasks, Garneau et al.
Paper
- Addressing Low-Resource Scenarios with Character-aware Embeddings, Papay et al.
Paper
- Domain Adaptation for Disease Phrase Matching with Adversarial Networks, Liu et al.
Paper
- Investigating Effective Parameters for Fine-tuning of Word Embeddings Using Only a Small Corpus, Komiya et al.
Paper
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al.
Paper
- Adapting Word Embeddings from Multiple Domains to Symptom Recognition from Psychiatric Notes, Zhang et al.
Paper
- Evaluation of sentence embeddings in downstream and linguistic probing tasks, Perone et al.
Paper
- Universal Sentence Encoder, Cer et al.
Paper
- Deep Contextualized Word Representations, Peters et al.
Paper
- Learned in Translation: Contextualized Word Vectors, McCann et al.
Paper
- Concatenated p-mean Word Embeddings as Universal Cross-Lingual Sentence Representations, Rücklé et al.
Paper
- A Compressed Sensing View of Unsupervised Text Embeddings, Bag-Of-n-Grams, and LSTMs, Arora et al.
Paper
2017:
- Attention Is All You Need, Vaswani et al.
Paper
- Skip-Gram – Zipf + Uniform = Vector Additivity, Gittens et al.
Paper
- A Simple but Tough-to-beat Baseline for Sentence Embeddings, Arora et al.
Paper
- Fast and Accurate Entity Recognition with Iterated Dilated Convolutions, Strubell et al.
Paper
- Advances in Pre-Training Distributed Word Representations, Mikolov et al.
Paper
- Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets, Dror et al.
Paper
2016:
- Towards Universal Paraphrastic Sentence Embeddings, Wieting et al.
Paper
- Bag of Tricks for Efficient Text Classification, Joulin et al.
Paper
- Enriching Word Vectors with Subword Information, Bojanowski et al.
Paper
- Assessing the Corpus Size vs. Similarity Trade-off for Word Embeddings in Clinical NLP, Kirk Roberts
Paper
- How to Train Good Word Embeddings for Biomedical NLP, Chiu et al.
Paper
- Log-Linear Models, MEMMs, and CRFs, Michael Collins
Paper
- Counter-fitting Word Vectors to Linguistic Constraints, Mrkšić et al.
Paper
- Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, Wu et al.
Paper
2015:
- Semi-supervised Sequence Learning, Dai et al.
Paper
- Evaluating distributed word representations for capturing semantics of biomedical concepts, Th et al.
Paper
2014:
- GloVe: Global Vectors for Word Representation, Pennington et al.
Paper
- Linguistic Regularities in Sparse and Explicit Word Representations, Levy and Goldberg.
Paper
- Neural Word Embedding as Implicit Matrix Factorization, Levy and Goldberg.
Paper
- word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method, Goldberg and Levy.
Paper
- What’s in a p-value in NLP?, Søgaard et al.
Paper
- How transferable are features in deep neural networks?, Yosinski et al.
Paper
- Improving lexical embeddings with semantic knowledge, Yu et al.
Paper
- Retrofitting word vectors to semantic lexicons, Faruqui et al.
Paper
2013:
- Efficient Estimation of Word Representations in Vector Space, Mikolov et al.
Paper
- Linguistic Regularities in Continuous Space Word Representations, Mikolov et al.
Paper
- Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al.
Paper
2012:
- An Empirical Investigation of Statistical Significance in NLP, Berg-Kirkpatrick et al.
Paper
2010:
- Word representations: A simple and general method for semi-supervised learning, Turian et al.
Paper
2008:
- A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning, Collobert and Weston.
Paper
2006:
- Domain adaptation with structural correspondence learning, Blitzer et al.
Paper
2003:
- A Neural Probabilistic Language Model, Bengio et al.
Paper
1986:
- Distributed Representations, Hinton et al.
Paper