Awesome Indonesia NLP

A collection of datasets, theses, papers, and articles on Natural Language Processing (NLP) for Indonesian (Bahasa Indonesia). Inspired by its predecessors.

Table of Contents

Getting Started

Introductory NLP Materials

Articles About NLP

Journals

Dataset & Language modeling

Words dataset

  1. Word Sastrawi
  2. Word spaCy : id
  3. Word name : random-name
  4. Word Indo name : genderprediction
  5. Word Indo place : Wilayah-Administratif-Indonesia
  6. Word Indo place : Indonesia-Postal-Code
  7. Word Wiktionary : word id
  8. Word sentiment : analisis-sentimen
  9. Word sentiment : ID-OpinionWords
  10. Word sentiment : Analisis-Sentimen-ID
  11. Word acronyms
  12. Word : serangkai

Sentences Dataset

  1. Leipzig Indonesian sentence collection: news articles, web articles, and Wikipedia data from 2008-2016
  2. wn-msa.sourceforge.net: WordNet Bahasa
  3. Quran: Indonesian Quran translations (id.muntakhab, id.jalalayn, id.indonesian)
  4. Kompas online collection. This corpus contains Kompas online news articles from 2001-2002. See here for more info and citations.
  5. Tempo online collection. This corpus contains Tempo online news articles from 2000-2002. See here for more info and citations.
  6. corpus-frog-storytelling: spoken-text storytelling
  7. TED-Multilingual-Parallel-Corpus Monolingual_data/Indonesian
  8. Opus Opus NLPL
  9. Sealang Sealang dataset
  10. Indonesian News Corpus (https://data.mendeley.com/datasets/2zpbjs22k3/1)
  11. Indonesian Hoax News Detection Dataset (https://data.mendeley.com/datasets/p3hfgr5j3m/1)
  12. Kompas and Tempo online news articles (https://ilps.science.uva.nl/resources/bahasa/)
  13. Raw dataset of Indonesian news articles (https://github.com/feryandi/Dataset-Artikel)
  14. Amazon Reviews
  15. ArXiv
  16. BimaNLP

Tagged dataset

  1. NER : yohanesgultom/nlp-experiments 1700 sentences
  2. NER : yusufsyaifudin/indonesia-ner 1835 sentences
  3. POS-TAG : famrashel/idn-tagged-corpus
  4. POS-TAG : pebbie/pebahasa ~600 sentences
  5. POS-TAG Parser : UniversalDependencies/UD_Indonesian-GSD ~4477 sentences
  6. Sentiment : 1506 sentences
  7. panl10n Pan Localization

Language modeling

POS tagging

  1. PANL10N POS tagging. This corpus has ~39K sentences and ~900K word tokens.
  2. IDN tagged corpus. This corpus contains ~10K sentences and ~250K word tokens. The POS tags are annotated manually.

Syntactic parsing

  1. Indonesian Treebank. This corpus contains ~1K parsed sentences. (constituency parsing)
  2. UD Indonesian. This corpus is provided by Universal Dependencies. Training, development, and test splits are already provided. (dependency parsing)
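
Universal Dependencies treebanks are distributed in the CoNLL-U format: one token per line with ten tab-separated fields, `#` comment lines, and a blank line between sentences. A minimal reader could look like the sketch below. The sample sentence is hand-written for illustration and is not taken from the treebank, and `read_conllu` is a hypothetical helper name rather than part of any library.

```python
from typing import Dict, Iterator, List

# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Hand-written sample for illustration only (not from the treebank).
SAMPLE = (
    "# text = Saya makan nasi\n"
    "1\tSaya\tsaya\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tmakan\tmakan\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tnasi\tnasi\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "\n"
)


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts keyed by FIELDS."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are skipped.
            continue
        else:
            columns = line.split("\t")
            if len(columns) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(columns)}")
            sentence.append(dict(zip(FIELDS, columns)))
    # Handle a final sentence that is not followed by a blank line.
    if sentence:
        yield sentence


sentences = list(read_conllu(SAMPLE))
for token in sentences[0]:
    print(token["form"], token["upos"], token["head"], token["deprel"])
```

All fields are kept as strings, so multiword-token ranges and empty nodes pass through untouched; a real pipeline would probably want to handle those explicitly.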

Machine translation

  1. PANL10N EN-ID news parallel corpus. This corpus contains sentences from news articles in several categories: economy (6K sentences), international (6K sentences), science (6K sentences), and sport (4K sentences).
  2. PANL10N Indonesian translation of the Penn Treebank. In total there are ~24K sentences.

Speech recognition

  1. TITML-IDN speech corpus. The corpus contains 20 speakers (11 male and 9 female), each of whom speaks 343 utterances. The utterances are phonetically balanced.

    The corpus is free for academic/non-commercial use, but interested parties should make a formal request by email to the institution. The procedure is listed here.

  2. frankydotid/Indonesian-Speech-Recognition. A small corpus of 50 utterances by a single male speaker.

Automatic Summarization

Parsing

Part-of-speech Tagging

Stemming

Word Sense Disambiguation

Miscellaneous

Software, Libraries, Dictionaries

Word reference (Kemdikbud)

  1. Base entries (Entri Dasar): 48,748 (44.64%)
  2. Derived words (Kata Turunan): 26,312 (24.09%)
  3. Compound words (Gabungan Kata): 30,625 (28.04%)
  4. Proverbs (Peribahasa): 2,040 (1.87%)
  5. Figurative expressions (Kiasan): 268 (0.25%)
  6. Idioms (Ungkapan): 1,129 (1.03%)
  7. Variants (Varian): 91 (0.08%)
  8. Total entries (Entri Total): 109,213 (100.00%)
  9. Total senses (Makna Total): 127,775
  10. Total examples (Contoh Total): 29,495
  11. Total categories (Kategori Total): 255
  12. Senses per entry (Makna Per Entri): 1.170
  13. Examples per sense (Contoh Per Makna): 0.231
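
The percentages and ratios in this list follow directly from the raw counts. As a quick sanity check, the sketch below recomputes them; only the counts come from the list above, and the variable names are illustrative.

```python
# Entry counts by type, copied from the list above.
entry_counts = {
    "Entri Dasar": 48748,     # base entries
    "Kata Turunan": 26312,    # derived words
    "Gabungan Kata": 30625,   # compound words
    "Peribahasa": 2040,       # proverbs
    "Kiasan": 268,            # figurative expressions
    "Ungkapan": 1129,         # idioms
    "Varian": 91,             # variants
}
total_senses = 127775    # Makna Total
total_examples = 29495   # Contoh Total

# The total is the sum of the seven entry types.
total_entries = sum(entry_counts.values())

# Share of each entry type, in percent of all entries.
shares = {name: 100 * count / total_entries
          for name, count in entry_counts.items()}

senses_per_entry = total_senses / total_entries
examples_per_sense = total_examples / total_senses

for name, share in shares.items():
    print(f"{name:<14} {entry_counts[name]:>7} {share:6.2f} %")
print(f"Total entries: {total_entries}")
print(f"Senses per entry: {senses_per_entry:.3f}")
print(f"Examples per sense: {examples_per_sense:.3f}")
```

The recomputed total is 109213, matching the listed figure, and the two ratios round to 1.170 and 0.231.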

Parallel corpus Eng-Ind

  1. parallel-corpora-en-id
  2. Indonesian-English-Bilingual-Corpus
  3. TALPCo
  4. opus
  5. Multi-Wiki

Morph

  1. MALINDO_Morph
  2. morphind
  3. INDRA

Crawler Data

  1. Crawler Indonesian news portal

Sentiment Analysis

  1. Aspect and Opinion Terms Extraction for Hotel Reviews. The corpus consists of 5000 hotel reviews from Airy (78K tokens) with 5 labels. The paper is available on arXiv.
  2. Aspect-Based Sentiment Analysis. A text classification resource for multi-label aspect categorization.

Word Normalization

  1. Colloquial Indonesian Lexicon. This lexicon consists of 3592 unique colloquial tokens that are mapped onto 1742 unique lemmas. The full description of this lexicon can be seen in the paper.
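
A lexicon of this kind is usually applied as a lookup table from colloquial tokens to standard forms. The sketch below assumes a plain Python dictionary and whitespace tokenization. The five entries are hypothetical examples chosen for illustration, not rows copied from the lexicon, and `normalize` is an illustrative helper name.

```python
from typing import Dict

# Hypothetical colloquial -> standard mappings for illustration only;
# a real run would load the pairs from the lexicon file instead.
LEXICON: Dict[str, str] = {
    "gak": "tidak",
    "nggak": "tidak",   # several colloquial tokens can share one lemma
    "yg": "yang",
    "udah": "sudah",
    "aja": "saja",
}


def normalize(text: str, lexicon: Dict[str, str] = LEXICON) -> str:
    """Lowercase, split on whitespace, and replace known colloquial tokens."""
    tokens = text.lower().split()
    # Tokens missing from the lexicon are passed through unchanged.
    return " ".join(lexicon.get(token, token) for token in tokens)


print(normalize("Aku udah gak lapar"))  # prints "aku sudah tidak lapar"
```

Because the mapping is many-to-one (3592 tokens onto 1742 lemmas), normalization cannot be reversed; keep the original text if the colloquial form matters downstream.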

Text Summarization

  1. IndoSum. A collection of 20K online news article-summary pairs belonging to 6 categories and 10 sources. It has both abstractive summaries and extractive labels.

Text Classification

  1. SMS Spam. This corpus contains 1143 sentences labeled as normal message, fraud, or promotion. It is provided by http://nlp.yuliadi.pro/dataset
  2. Hate Speech Detection. This dataset consists of 713 tweets in Indonesian: 453 non-hate-speech tweets and 260 hate-speech tweets.
  3. Abusive Language Detection. A collection of tweets for abusive language detection in Indonesian social media. It has two labeling schemes: a binary one (abusive / not abusive) and a three-way one (not abusive / abusive but not offensive / offensive). It also comes with its own colloquial Indonesian lexicon.
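
The class counts quoted for the hate speech dataset imply a moderately imbalanced task. The sketch below works out the class shares and the accuracy of a classifier that always predicts the majority class, which is a useful floor when reading reported results. Only the two counts come from the description above; the label and variable names are illustrative.

```python
# Class counts from the dataset description above.
counts = {"non_hate_speech": 453, "hate_speech": 260}

total = sum(counts.values())  # 453 + 260 = 713 tweets

# Fraction of the dataset in each class.
shares = {label: n / total for label, n in counts.items()}

# Always predicting the most frequent class gives this accuracy.
majority_label = max(counts, key=counts.get)
majority_baseline = counts[majority_label] / total

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

Any model evaluated on this data should beat roughly 63.5% accuracy to be doing better than the trivial baseline, which is why per-class precision and recall are often more informative here than accuracy alone.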

Speech recognition

  1. TITML-IDN speech corpus. The corpus contains 20 speakers (11 male and 9 female), each of whom speaks 343 utterances. The utterances are phonetically balanced. The corpus is free for academic/non-commercial use, but interested parties should make a formal request by email to the institution. The procedure is listed here.
  2. Indonesian Speech Recognition. A small corpus of 50 utterances by a single male speaker. Disclaimer: this is a school project; do not use it for any important tasks. The author is not responsible for undesired results of using the data provided here.
  3. CMU Wilderness Multilingual Speech Dataset. A dataset covering over 700 languages, providing audio, aligned texts, and word pronunciations. One of the languages is Indonesian. The utterances are Bible readings recorded by bible.is.

Free Books

Courses

  1. Natural Language Processing - Coursera
  2. Natural Language Processing - edX
  3. Oxford CS Deep NLP

Videos and Lectures

  1. 2016 CS224D Deep Learning For Natural Language Processing Lecture Videos
  2. Natural Language Processing

Papers

  1. Breaking Sticks and Ambiguities with Adaptive Skip-gram
  2. Distributed Representations of Words and Phrases and their Compositionality
  3. Learning the Dimensionality of Word Embeddings
  4. Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols
  5. Skip Thought Vectors

Tutorials

  1. Natural Language Processing
  2. Machine Learning, NLP: Text Classification using scikit-learn, python and NLTK
  3. Multi-Class Classification Tutorial with the Keras Deep Learning Library
  4. Topic Modeling with Scikit Learn
  5. Data Science with Python & R: Sentiment Classification Using Linear Methods

Sample Code

  1. Sentiment
  2. Name Gender Prediction
  3. Topic Modeling
  4. POS Tagging NLTK (Bahasa Indonesia)
  5. Naive Bayes Document Classifier (Bahasa Indonesia)

Datasets

Libraries

  1. NLTK
  2. Gensim
  3. TextBlob
  4. spaCy
  5. Sastrawi
  6. Nalapa
  7. Polyglot

Contributing

If you would like to contribute to this repository, pull requests are strongly encouraged, provided the resources are in Indonesian.

Frequently Asked Questions (FAQ)

The FAQ answers common questions about this repository, from naming conventions and basic questions to more advanced ones.

Awesome NLP Papers

This is a collection/reading-list of awesome Natural Language Processing papers sorted by date.

2018:

2017:

2016:

2015:

2014:

2013:

2012:

2010:

2008:

2006:

2003:

1986: