Thai NLP Resource
A collection of Thai Natural Language Processing (NLP) software libraries, dictionaries, and corpora. Pull requests are always welcome.
Libraries/Services
Thai Character Cluster
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
JTCC | Thai Character Cluster | Java | | GPL-3.0 | Wittawat |
TCC | Thai Character Cluster | Python | | Apache 2.0 | Wannaphong |
Sentiment Analysis
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
sentiment_analysis_thai | | | | | JagerV3 |
Soundex
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
PyThaiNLP | | Python 3 | LK82 + Udom83 | Apache 2.0 | Korakot, GitHub |
Word Segmentation
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
Chamkho | Lao/Thai word segmentation | Rust | | LGPL | GitHub |
CutKum | Thai word segmentation with Deep Learning in Tensorflow. RNN. | Python | 93% F-measure. | MIT | Pucktada, GitHub |
CutThai | Thai word segmentation written in CoffeeScript | CoffeeScript | | MIT | Pureexe/cutthai GitHub |
DeepCut | A Thai word tokenization library using Deep Neural Network. CNN. | Python | 98.8% F-measure. | MIT | rkcosmos, GitHub |
Lexto: Thai Lexeme Tokenizer | | Java | | LGPL | NECTEC |
Lexto | | Python 2 | | LGPL | GitHub |
Lexto | | Python 3 | | LGPL | GitHub |
Multi-Candidate-Word-Segmentation | Multi Candidate Word Segmentation for Thai language | Python, RNN, LSTM | 97.0% F-measure (Word Level), 98.95% F-measure (Boundary Level) | MIT | paper, GitHub |
PyThaiNLP | | Python 3 | Maximal matching and various other engines | Apache 2.0 | GitHub |
Swath | SWATH (Smart Word Analysis for THai) is a word segmentation tool for Thai | C | Longest Matching, Maximal Matching and Part-of-Speech Bigram. | GPL | Paisarn Charoenpornsawat, CMU |
SynThai | Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning. RNN. LSTM. | Python | 99.2% F-measure | MIT | KenjiroAI, GitHub |
Thai Language Toolkit (tltk) | Based on a paper by Wirote Aroonmanakun in 2002. Word segmentation is based on a maximum collocation approach. Syllable segmentation is based on trigram statistics. (Dataset is included) | Python | 97.86% F-measure. (Tested on a different test set, so not directly comparable with the other models.) | GPLv3 | PyPI |
Wordcut | Thai word breaker for Node.js | JavaScript, Node.js | | LGPL-3.0 | veer66, GitHub |
wordcutpy | A simple Thai word tokenizer written in 1 Python file | Python 3 | | LGPL-3.0 | veer66, GitHub |
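
Several entries in the table above list dictionary-based longest matching among their features (Swath, for example). As a rough illustration only, and not the implementation used by any library listed here, a greedy longest-matching segmenter could be sketched in Python as follows. The function name `longest_matching`, the toy dictionary `TOY_DICT`, and the single-character fallback for unknown text are all assumptions made for this sketch.

```python
def longest_matching(text, dictionary):
    """Greedy left-to-right longest matching over a word set (illustrative sketch)."""
    # Longest dictionary entry bounds how far ahead we need to look.
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink one character at a time.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            # Assumption: unknown characters are emitted one at a time.
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy dictionary; real systems load tens of thousands of entries.
TOY_DICT = {"ตา", "ตาก", "กลม", "ลม"}
print(longest_matching("ตากลม", TOY_DICT))  # ['ตาก', 'ลม']
```

The input "ตากลม" can also be read as "ตา" + "กลม". A purely greedy matcher always commits to the longer first word, which is one reason the tools above combine dictionaries with maximal matching, part-of-speech bigrams, or neural models.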
Part of Speech Tagging (POS Tagging)
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
Chart-POS | Thai POS Tagger | C | | All rights reserved | AIAT, KINDML, Thanaruk T. (thanaruk@siit.tu.ac.th), tchayintr, Demo at iApp |
Jitar+NAiST | A simple Trigram HMM part-of-speech tagger | Java | | | Veer66, Jitar + NAiST, 1 + NAiST, 2 |
SynThai | Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning. RNN. LSTM. | Python | 0.9163 F-measure. RNN. LSTM | MIT | KenjiroAI, GitHub |
Named Entity Recognition
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
Named Entity Tagging (Thai NEST) | Thai Named Entity tagging Specification and Tools | | | GPL | KINDML, SIIT, AIAT |
ThaiNER | Thai Named Entity Recognition for PyThaiNLP | Python | | Apache 2.0 (code) & CC BY 3.0 (Dataset) | ThaiNER |
News Structure Tagging
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
News Structure Tagging Program | Thai News Structure Tagging Program | | Metadata tagging, Structure tagging, Automatic News Title Generation | GPL | AIAT |
Syntactic Parsing & Tools
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
Chart-parser | Extract Syntactic Structure from POS Tagged Sentence. | C | | All rights reserved | AIAT, KINDML, Thanaruk T. (thanaruk@siit.tu.ac.th), tchayintr, Demo at iApp |
Grammar Processing | Labelled Brackets -> Context Free Grammars (CFGs) | Python | Transform and compute probability | | tchayintr |
Word Embedding
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
kobkrit-word-embedding | Tensorflow implementation of Thai word embedding | Python | Source code, Example, Word distance graph | LGPL | Kobkrit V. |
Question Answering (Machine Comprehension)
Service | Description | License | Author & Link |
---|---|---|---|
Thai Machine Comprehension (ThaiMC) | Bidirectional Attention Flow | Copyright (As the service) | iApp-AI |
Emojification
Service | Description | License | Author & Link |
---|---|---|---|
Thai Emotification | LSTM | GPL | Demo at iApp-AI and Source, Github |
Corpus and Dataset
Dictionaries / Translation Pairs
Library | Description | Size | Features | License | Link |
---|---|---|---|---|---|
LEXiTRON | Thai<->English Dictionary | | TH->EN, EN->TH | LEXiTRON License | NECTEC |
Transliteration Corpus | | 31K pairs | Thai-Eng Translation Pair | CC BY-NC-SA 3.0 TH | NECTEC |
Yaitron | LEXiTRON in machine readable format (XML) | | TH->EN, EN->TH | LEXiTRON License | Veer66 Schema, Data & Conversion Code |
Downloadable Text Corpus
Library | Description | Size | Features | License | Link |
---|---|---|---|---|---|
Click Bait Sentences | Thai Click Bait Sentence | 330 sent. (90.7KB) | | MIT | Wannaphongcom |
InterBEST 2009/2010 | | 5M words | Word Seg. | CC BY-NC-SA 3.0 TH | NECTEC |
ORCHID | | 30K sent. | Word Seg., POS Tagged. | CC BY-NC-SA 3.0 TH | NECTEC |
Prime Minister 29 | Prime Minister 29's Speech Sentences | 338KB | Word segmented, Named Entity Tagged | MIT | Wannaphongcom |
thai-jokes-corpus | Cleaned Thai Jokes Corpus | 457 jokes | | GPLv3 | iApp Technology |
Thai named entity corpora | named entity corpora by Wirote Aroonmanakun's students | 266KB-1.5MB | syllable seg., word seg., Named Entity tagged | GPLv3 (not sure, but tltk is using this license) | นัชชา ถิระสาโรช Data<br /> ศศิวิมล กาลันสีมา Data<br /> ณัฐดาพร เลิศชีวะ Data |
THAI-NEST | Thai-NEST: Thai Named Entity tagging Specification and Tools | 45K+ Named Entity Tokens | Named Entity Tagged | LGPL | KINDML |
Thai Sentimental Word List | Thai Sentimental Words List | 52KB | Separated Words as Adj, V | MIT | Wannaphongcom |
Thai Wikipedia | Formal Articles | 1.49GB (~213.1 MB compressed) | XML | GFDL | WIKIPEDIA |
Thai WordNet | THE CONSTRUCTION OF THAI WORDNET OF 1ST ORDER ENTITY COMMON BASE CONCEPTS USING A BI-DIRECTIONAL TRANSLATION METHOD AND WITH DICTIONARIES OF DIFFERENT COMPILATIONAL APPROACHES (ธนนท์ หลีน้อย) <br /> <br /> THE CONSTRUCTION OF THAI WORDNET OF 2ND ORDER ENTITY COMMON BASE CONCEPTS USING A BI-DIRECTIONAL TRANSLATION METHOD : A STUDY OF THE DIVERSITY OF MEANINGS AFFECTING TRANSLATIONAL ACCURACY (ปริศนา อัครพุทธิพร) | | WordNet | N/A | ธนนท์ หลีน้อย 2008<br />ปริศนา อัครพุทธิพร Data 2008 |
TNC Top-5000 Words | Word frequency | 5,000 words | Frequency of Thai words in various genres, EXCEL | All rights reserved | CHULA |
Toxicity in Thai Tweet Corpus | Tokyo Metropolitan University Natural Language Processing Group | | Each tweet is labeled as toxic or non-toxic | CC BY-NC 4.0 | tmu-nlp |
Wisesight Sentiment Corpus | Social media message with sentiment label (positive, neutral, negative, question). | ~26,700 messages | Sentiment label, Question label | Public domain | PyThaiNLP |
Web Query Text Corpus
Library | Description | Size | Features | License | Link |
---|---|---|---|---|---|
Thai National Corpus 2 | | 32M words | Query text by genre, domain | All rights reserved | CHULA |
Thai Medical Document | | 3,594 docs | Document and dynamic keyword map | All rights reserved | KINDML, SIIT |
Southeast Asian Languages Library | Thai News, Web Text, Pop Music, Literature, Toponyms | 20M chars | Phrase around a search text | | SEALang |
HSE Thai Corpus | Modern texts written in Thai language (mostly news websites) | 50M tokens | Query by word form, lexeme, translation, grammatical attributes, lexical attributes | | HSE School of Linguistics |
Parallel Corpus
Library | Description | Size | Features | License | Link |
---|---|---|---|---|---|
TALPCo | TUFS Asian Language Parallel Corpus | 1327 sent | open parallel corpus consisting of Japanese sentences and their translations into Burmese (Myanmar; the official language of the Republic of the Union of Myanmar), Malay (the national language of Malaysia, Singapore and Brunei), Indonesian, Thai, Vietnamese and English | CC BY 4.0 | TALPCo |
Pre-trained Language Models
Pre-trained Model | Description | Size | Dimensions | License | Link |
---|---|---|---|---|---|
fastText | Skip-Gram model trained on Wikipedia using fastText | | 300 | CC BY-SA 3.0 | Facebook + Bin & Text + Text Only |
thai2fit | ULMFit on Wikipedia. Perplexity of 46.80959 with 60,002 embeddings. | 70MB | 300 | MIT | thai2vec / PyThaiNLP |
thbert | Yet another pre-trained BERT particularly in Thai | | | Apache 2.0 | tchayintr |
Benchmarks
Thai Text Classification Benchmarks
- wongnai-corpus
- prachathai-67k
- wisesight-sentiment
- truevoice-intent: destination
Tools
Corpus extractors
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
BEST2010 cooker | A tool for extracting segmented words from Thai segmented BEST2010 corpus | Python3 | Extracting segmented words, features, and data divisions | Apache 2.0 | tchayintr |