Awesome
NLP Bahasa Indonesia Resources
This repository provides links to useful datasets and other resources for NLP in Bahasa Indonesia.
Last Update: 15 Mar 2022
Table of contents
- Corpus
- Dictionary
- Articles and Papers
- Pre-trained Models
- Usable Library
- Spelling Correction
- Twitter Scraping
- Other Resources
Corpus
Named Entity Recognition
- Product NER. https://github.com/dziem/proner-labeled-text
- NER-grit. https://github.com/grit-id/nergrit-corpus
POS-Tagging
- IDN Tagged Corpus. https://github.com/famrashel/idn-tagged-corpus
- Indonesian Part-of-Speech (POS) Tagging. https://github.com/kmkurn/id-pos-tagging/blob/master/data/dataset.tar.gz
Question and Answering
Paraphrasing
- Quora Paraphrasing. https://github.com/louisowen6/quora_paraphrasing_id
- Paraphrase Adversaries from Word Scrambling. https://github.com/Wikidepia/indonesian_datasets/tree/master/paraphrase/paws
Text Summarization
- Indosum. https://github.com/kata-ai/indosum
- Liputan6. https://huggingface.co/datasets/id_liputan6
Hate-speech
- ID Multi Label Hate Speech. https://github.com/okkyibrohim/id-multi-label-hate-speech-and-abusive-language-detection
Word Analogy
Formal-Informal
- STIF-Indonesia. https://github.com/haryoa/stif-indonesia
- IndoCollex. https://github.com/haryoa/indo-collex
- https://github.com/okkyibrohim/id-multi-label-hate-speech-and-abusive-language-detection/blob/master/new_kamusalay.csv
Multilingual Parallel
- https://huggingface.co/datasets/alt
- https://opus.nlpl.eu/bible-uedin.php
- http://www.statmt.org/cc-aligned/
- https://huggingface.co/datasets/id_panl_bppt
- https://huggingface.co/datasets/open_subtitles
- https://huggingface.co/datasets/opus100
- https://huggingface.co/datasets/tapaco
- https://huggingface.co/datasets/wiki_lingua
Unsupervised Corpus
- OSCAR. https://oscar-corpus.com/
- Online Newspaper. https://github.com/feryandi/Dataset-Artikel
- IndoNLU. https://huggingface.co/datasets/indonlu
- IndoNLG. https://github.com/indobenchmark/indonlg
- IndoNLI. https://github.com/ir-nlp-csui/indonli
- IndoBERTweet. https://github.com/indolem/IndoBERTweet
- http://data.statmt.org/cc-100/
- https://huggingface.co/datasets/id_clickbait
- https://huggingface.co/datasets/id_newspapers_2018
- https://opus.nlpl.eu/QED.php
Voice-Text
Puisi and Pantun
Dictionary
Synonym
Sentiment
- (Negative) https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/negatif_ta2.txt
- (Negative) https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/negative_add.txt
- (Negative) https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/negative_keyword.txt
- (Negative) https://github.com/masdevid/ID-OpinionWords/blob/master/negative.txt
- (Positive) https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/positif_ta2.txt
- (Positive) https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/positive_add.txt
- (Positive) https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/positive_keyword.txt
- (Positive) https://github.com/masdevid/ID-OpinionWords/blob/master/positive.txt
- (Score) https://github.com/agusmakmun/SentiStrengthID/blob/master/id_dict/sentimentword.txt
- (InSet Lexicon) https://github.com/fajri91/InSet [Paper]
- (Twitter Labelled Sentiment) https://www.researchgate.net/profile/Ridi_Ferdiana/publication/339936724_Indonesian_Sentiment_Twitter_Dataset/data/5e6d64c6a6fdccf994ca18aa/Indonesian-Sentiment-Twitter-Dataset.zip?origin=publicationDetail_linkedData [Paper]
- https://huggingface.co/datasets/senti_lex
Position or Degree
- https://github.com/panggi/pujangga/blob/master/resource/netagger/contextualfeature/psuf.txt
- https://github.com/panggi/pujangga/blob/master/resource/netagger/contextualfeature/lldr.txt
- https://github.com/panggi/pujangga/blob/master/resource/netagger/contextualfeature/opos.txt
- https://github.com/panggi/pujangga/blob/master/resource/netagger/contextualfeature/ptit.txt
Root Words
- https://github.com/agusmakmun/SentiStrengthID/blob/master/id_dict/rootword.txt
- https://github.com/sastrawi/sastrawi/blob/master/data/kata-dasar.original.txt
- https://github.com/sastrawi/sastrawi/blob/master/data/kata-dasar.txt
- https://github.com/prasastoadi/serangkai/blob/master/serangkai/kamus/data/kamus-kata-dasar.csv
I have compiled a combined root word list from all of the repositories above.
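Combining root word lists like these can be sketched roughly as follows. This is only a minimal illustration, not the script used for this repository: it assumes every source has already been read into a list with one word per entry (the CSV source would need its word column extracted first), and `merge_word_lists` plus the sample words are hypothetical.

```python
def merge_word_lists(*word_lists):
    """Merge several word lists into one sorted, de-duplicated list."""
    merged = set()
    for words in word_lists:
        for word in words:
            cleaned = word.strip().lower()  # normalise whitespace and case
            if cleaned:  # skip blank entries
                merged.add(cleaned)
    return sorted(merged)


# Hypothetical sample data standing in for the downloaded files.
list_a = ["makan", "minum", "Tidur "]
list_b = ["minum", "jalan", ""]

combined = merge_word_lists(list_a, list_b)
print(combined)  # ['jalan', 'makan', 'minum', 'tidur']
```

Lower-casing before de-duplication is a design choice here; drop it if the casing of entries matters for your task.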
Slang Words
- https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/kbba.txt
- https://github.com/agusmakmun/SentiStrengthID/blob/master/id_dict/slangword.txt
- https://github.com/panggi/pujangga/blob/master/resource/formalization/formalizationDict.txt
I have compiled a combined slang word dictionary from all of the repositories above.
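A slang dictionary maps an informal word to its formal form, so merging several sources means deciding which source wins when they disagree. The sketch below is one possible approach, not the script used for this repository. The delimiters, the sample entries, and the helper names (`parse_pairs`, `merge_dictionaries`, `normalize`) are all assumptions for illustration; check the real file formats before reusing it.

```python
def parse_pairs(lines, delimiter):
    """Parse 'slang<delimiter>formal' lines into a dict, skipping malformed lines."""
    pairs = {}
    for line in lines:
        parts = line.strip().split(delimiter, 1)
        if len(parts) != 2:
            continue  # no delimiter found, ignore the line
        slang = parts[0].strip().lower()
        formal = parts[1].strip().lower()
        if slang and formal:
            pairs[slang] = formal
    return pairs


def merge_dictionaries(*dictionaries):
    """Merge mappings; the first source that defines a slang word wins."""
    merged = {}
    for dictionary in dictionaries:
        for slang, formal in dictionary.items():
            merged.setdefault(slang, formal)
    return merged


def normalize(text, mapping):
    """Replace every known slang token in text with its formal form."""
    return " ".join(mapping.get(token, token) for token in text.lower().split())


# Hypothetical sample lines; the delimiters are assumed, not taken from the real files.
source_a = parse_pairs(["gak\ttidak", "bgt\tbanget"], "\t")
source_b = parse_pairs(["gak:enggak", "yg:yang", "broken line"], ":")

slang_dict = merge_dictionaries(source_a, source_b)
print(normalize("Aku gak suka yg itu", slang_dict))  # aku tidak suka yang itu
```

Passing the most trusted source first keeps its entries when two sources map the same slang word differently.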
Stop Words
- https://github.com/yasirutomo/python-sentianalysis-id/blob/master/data/feature_list/stopwordsID.txt
- https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/stopword.txt
- https://github.com/abhimantramb/elang/tree/master/word2vec/utils/stopwords-list
I have compiled a combined stop word list from all of the repositories above.
Swear Words
Composite Words
Number Words
Calendar Words
Emoticon
- https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/emoticon.txt
- https://github.com/jolicode/emoji-search/blob/master/synonyms/cldr-emoji-annotation-synonyms-id.txt
- https://github.com/agusmakmun/SentiStrengthID/blob/master/id_dict/emoticon.txt
Acronym
- https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/acronym.txt
- https://github.com/panggi/pujangga/blob/master/resource/sentencedetector/acronym.txt
- https://id.wiktionary.org/wiki/Lampiran:Daftar_singkatan_dan_akronim_dalam_bahasa_Indonesia#A
Indonesia Region
- https://github.com/abhimantramb/elang/blob/master/word2vec/utils/indonesian-region.txt
- https://github.com/edwardsamuel/Wilayah-Administratif-Indonesia/tree/master/csv
- https://github.com/pentagonal/Indonesia-Postal-Code/tree/master/Csv
Country
Region
Title of Name
Gender by Name
Organization
Articles and Papers
POS-Tagging
- https://medium.com/@puspitakaban/pos-tagging-bahasa-indonesia-dengan-flair-nlp-c12e45542860
- Manually Tagged Indonesian Corpus [Paper] [GitHub]
Word Embedding
- (FastText). https://structilmy.com/2019/08/membuat-model-word-embedding-fasttext-bahasa-indonesia/
- (Word2Vec). https://yudiwbs.wordpress.com/2018/03/31/word2vec-wikipedia-bahasa-indonesia-dengan-python-gensim/
Topic Analysis
- (Introduction to LSA & LDA). https://monkeylearn.com/blog/introduction-to-topic-modeling/
- (Introduction to LDA w/ Code & Tips). https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/
- (Topic Modeling Methods Comparison Paper). https://thesai.org/Downloads/Volume6No1/Paper_21-A_Survey_of_Topic_Modeling_in_Text_Mining.pdf
- (Original LDA Paper). http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
- (LDA Python Library). https://pypi.org/project/lda/; https://radimrehurek.com/gensim/models/ldamodel.html; https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
- (Original CTM Paper). http://people.ee.duke.edu/~lcarin/Blei2005CTM.pdf
- (CTM Python Library). https://pypi.org/project/tomotopy/; https://github.com/kzhai/PyCTM
- (Gaussian LDA Paper). https://www.aclweb.org/anthology/P15-1077.pdf
- (Gaussian LDA Library). https://github.com/rajarshd/Gaussian_LDA
- (Temporal Topic Modeling Comparison Paper). https://thesai.org/Downloads/Volume6No1/Paper_21-A_Survey_of_Topic_Modeling_in_Text_Mining.pdf
- (TOT: A Non-Markov Continuous-Time Model of Topical Trends Paper). https://people.cs.umass.edu/~mccallum/papers/tot-kdd06s.pdf
- (TOT Library). https://github.com/ahmaurya/topics_over_time
- (Example of LDA in Bahasa Project Code). https://github.com/kirralabs/text-clustering
Text Classification
Zero-shot Learning
- (Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach) https://arxiv.org/pdf/1909.00161.pdf | https://github.com/yinwenpeng/BenchmarkingZeroShot
- (Integrating Semantic Knowledge to Tackle Zero-shot Text Classification) https://arxiv.org/abs/1903.12626 | https://github.com/JingqingZ/KG4ZeroShotText
- (Train Once, Test Anywhere: Zero-Shot Learning for Text Classification) https://arxiv.org/abs/1712.05972 | https://amitness.com/2020/05/zero-shot-text-classification/
- (Zero-shot Text Classification With Generative Language Models) https://arxiv.org/abs/1912.10165 | https://amitness.com/2020/06/zero-shot-classification-via-generation/
- (Zero-shot User Intent Detection via Capsule Neural Networks) https://arxiv.org/abs/1809.00385 | https://github.com/congyingxia/ZeroShotCapsule
Few-shot Learning
- (Few-shot Text Classification with Distributional Signatures) https://arxiv.org/pdf/1908.06039.pdf | https://github.com/YujiaBao/Distributional-Signatures
- (Few Shot Text Classification with a Human in the Loop) https://katbailey.github.io/talks/Few-shot%20text%20classification.pdf | https://github.com/katbailey/few-shot-text-classification
- (Induction Networks for Few-Shot Text Classification) https://arxiv.org/pdf/1902.10482v2.pdf | https://github.com/zhongyuchen/few-shot-learning
Pre-trained Models
- Indo-BERT. https://github.com/indobenchmark/indonlu & https://huggingface.co/indobenchmark/indobert-base-p1
- Indo-BERTweet. https://github.com/indolem/IndoBERTweet & https://huggingface.co/indolem/indobertweet-base-uncased
- Transformer-based Pre-trained Model in Bahasa. https://github.com/cahya-wirawan/indonesian-language-models/tree/master/Transformers
- Generate word embeddings or sentence embeddings with a pre-trained multilingual BERT model. (https://colab.research.google.com/drive/1yFphU6PW9Uo6lmDly_ud9a6c4RCYlwdX#scrollTo=Zn0n2S-FWZih). Note: change the model name to 'bert-base-multilingual-uncased'.
- https://github.com/meisaputri21/Indonesian-Twitter-Emotion-Dataset. [Paper]
- https://github.com/Kyubyong/wordvectors
- https://drive.google.com/uc?id=0B5YTktu2dOKKNUY1OWJORlZTcUU&export=download
- https://github.com/deryrahman/word2vec-bahasa-indonesia
- https://sites.google.com/site/rmyeid/projects/polyglot
Usable Library
- Pujangga: Indonesian Natural Language Processing REST API. https://github.com/panggi/pujangga
- Sastrawi Stemmer Bahasa Indonesia. https://github.com/sastrawi/sastrawi
- NLP-ID. https://github.com/kumparan/nlp-id
- MorphInd: Indonesian Morphological Analyzer. http://septinalarasati.com/morphind/
- INDRA: Indonesian Resource Grammar. https://github.com/davidmoeljadi/INDRA
- Typo Checker. https://github.com/mamat-rahmat/checker_id
- Multilingual NLP Package. https://github.com/flairNLP/flair
- spaCy [GitHub] [Tutorial]
- https://github.com/yohanesgultom/nlp-experiments
- https://github.com/yasirutomo/python-sentianalysis-id
- https://github.com/riochr17/Analisis-Sentimen-ID
- https://github.com/yusufsyaifudin/indonesia-ner
Spelling Correction
You can adapt this code to a Bahasa Indonesia corpus to do spelling correction.
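As a rough idea of what such an adaptation could look like, here is a minimal frequency-based corrector in the style of Peter Norvig's well-known spelling corrector. Whether this matches the code referred to above is an assumption, and the tiny inline corpus is a hypothetical stand-in for a real Bahasa Indonesia corpus. It only considers candidates one edit away from the input word.

```python
import re
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def tokenize(text):
    """Lower-case the text and extract alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Return the most frequent known word within one edit, else the word itself."""
    word = word.lower()
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word
    return max(candidates, key=counts.get)


# Hypothetical toy corpus; replace it with word counts from a large Bahasa corpus.
corpus = "saya makan nasi dan saya minum air lalu saya tidur"
counts = Counter(tokenize(corpus))

print(correct("mkan", counts))  # makan
print(correct("sya", counts))   # saya
```

The quality of the corrections depends almost entirely on the corpus used to build `counts`, so a large and clean corpus matters more than the algorithm itself.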
Twitter Scraping
- GetOldTweets3. https://github.com/Mottl/GetOldTweets3
Usage:

    import GetOldTweets3 as got

    # Build the search criteria: hashtag, date range, location, and language.
    tweetCriteria = (
        got.manager.TweetCriteria()
        .setQuerySearch('#CoronaVirusIndonesia')
        .setSince("2020-01-01")
        .setUntil("2020-03-05")
        .setNear("Jakarta, Indonesia")
        .setLang("id")
    )
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)

    # Print the fields of each retrieved tweet.
    for tweet in tweets:
        print(tweet.username)
        print(tweet.text)
        print(tweet.date)
        print(tweet.to)
        print(tweet.retweets)
        print(tweet.favorites)
        print(tweet.mentions)
        print(tweet.hashtags)
        print(tweet.geo)
- Step-by-step how to use Tweepy. https://towardsdatascience.com/how-to-scrape-tweets-from-twitter-59287e20f0f1
- Sign in to Twitter Developer. https://developer.twitter.com/en
- Full List of Tweets Object. https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object
- Increasing Tweepy’s standard API search limit. https://bhaskarvk.github.io/2015/01/how-to-use-twitters-search-rest-api-most-effectively./