Thai NLP Resource

A collection of Thai Natural Language Processing (NLP) software libraries, dictionaries, and corpora. Pull requests are always welcome.

Libraries/Services

Thai Character Cluster

Library	Description	Programming Languages	Features	License	Author & Link
JTCC	Thai Character Cluster	Java		GPL-3.0	Wittawat
TCC	Thai Character Cluster	Python		Apache 2.0	Wannaphong

Sentiment Analysis

Library	Description	Programming Languages	Features	License	Author & Link
sentiment_analysis_thai					JagerV3

Soundex

Library	Description	Programming Languages	Features	License	Author & Link
PyThaiNLP		Python 3	LK82 + Udom83	Apache 2.0	Korakot, GitHub

Word Segmentation

Library	Description	Programming Languages	Features	License	Author & Link
Chamkho	Lao/Thai word segmentation	Rust		LGPL	GitHub
CutKum	Thai word segmentation with Deep Learning in TensorFlow. RNN.	Python	93% F-measure.	MIT	Pucktada, GitHub
CutThai	Thai word segmentation written in CoffeeScript	CoffeeScript		MIT	Pureexe/cutthai GitHub
DeepCut	A Thai word tokenization library using Deep Neural Network. CNN.	Python	98.8% F-measure.	MIT	rkcosmos, GitHub
Lexto: Thai Lexeme Tokenizer		Java		LGPL	NECTEC
Lexto		Python 2		LGPL	GitHub
Lexto		Python 3		LGPL	GitHub
Multi-Candidate-Word-Segmentation	Multi Candidate Word Segmentation for Thai language	Python, RNN, LSTM	97.0% F-measure (Word Level), 98.95% F-measure (Boundary Level)	MIT	paper, GitHub
PyThaiNLP		Python 3	Maximal matching and various other engines	Apache 2.0	GitHub
Swath	SWATH (Smart Word Analysis for THai) is a word segmentation tool for Thai	C	Longest Matching, Maximal Matching and Part-of-Speech Bigram.	GPL	Paisarn Charoenpornsawat, CMU
SynThai	Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning. RNN. LSTM.	Python	99.2% F-measure	MIT	KenjiroAI, GitHub
Thai Language Toolkit (tltk)	Based on a paper by Wirote Aroonmanakun in 2002. Word segmentation is based on a maximum collocation approach. Syllable segmentation is based on trigram statistics. (Dataset is included)	Python	97.86% F-measure. (Tested on a different test set, so not directly comparable with the other models.)	GPLv3	PyPI
Wordcut	Thai word breaker for Node.js	JavaScript, Node.js		LGPL-3.0	veer66, GitHub
wordcutpy	A simple Thai word tokenizer written in 1 Python file	Python 3		LGPL-3.0	veer66, GitHub
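Several entries above list dictionary-based longest matching among their features. As a rough illustration only, the sketch below shows the general idea in plain Python. It is not the implementation used by any library in this list; the function name `longest_matching` and the four-word toy dictionary are assumptions made up for the example.

```python
def longest_matching(text, dictionary, max_word_len=None):
    """Greedy left-to-right segmentation.

    At each position, take the longest substring found in the dictionary.
    Hypothetical helper for illustration; not taken from any listed library.
    """
    if max_word_len is None:
        max_word_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first and shrink until a word is found.
        for end in range(min(len(text), i + max_word_len), i, -1):
            if text[i:end] in dictionary:
                match = text[i:end]
                break
        if match is None:
            # Unknown character: emit it as a single-character token.
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens


# Toy dictionary (assumption): real systems use lexicons with tens of thousands of words.
toy_dictionary = {"ตา", "ตาก", "กลม", "ลม"}
print(longest_matching("ตากลม", toy_dictionary))
```

With this toy dictionary the greedy strategy commits to "ตาก" first and therefore never considers the alternative split "ตา" + "กลม". Resolving that kind of ambiguity is what maximal matching and the statistical or neural approaches in the table aim to do. The single-character fallback can also separate Thai combining marks from their base consonant, so a real tokenizer would handle unknown spans more carefully.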

Part of Speech Tagging (POS Tagging)

Library	Description	Programming Languages	Features	License	Author & Link
Chart-POS	Thai POS Tagger	C		All rights reserved	AIAT, KINDML, Thanaruk T. (thanaruk@siit.tu.ac.th), tchayintr, Demo at iApp
Jitar+NAiST	A simple Trigram HMM part-of-speech tagger	Java			Veer66, Jitar + NAiST, 1 + NAiST, 2
SynThai	Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning. RNN. LSTM.	Python	0.9163 F-measure. RNN. LSTM	MIT	KenjiroAI, GitHub

Named Entity Recognition

Library	Description	Programming Languages	Features	License	Author & Link
Named Entity Tagging (Thai NEST)	Thai Named Entity tagging Specification and Tools			GPL	KINDML, SIIT, AIAT
ThaiNER	Thai Named Entity Recognition for PyThaiNLP	Python		Apache 2.0 (code) & CC BY 3.0 (Dataset)	ThaiNER

News Structure Tagging

Library	Description	Programming Languages	Features	License	Author & Link
News Structure Tagging Program	Thai News Structure Tagging Program		Metadata tagging, Structure tagging, Automatic News Title Generation	GPL	AIAT

Syntactic Parsing & Tools

Library	Description	Programming Languages	Features	License	Author & Link
Chart-parser	Extract Syntactic Structure from POS Tagged Sentence.	C		All rights reserved	AIAT, KINDML, Thanaruk T. (thanaruk@siit.tu.ac.th), tchayintr, Demo at iApp
Grammar Processing	Labelled Brackets -> Context Free Grammars (CFGs)	Python	Transform and compute probability		tchayintr

Word Embedding

Library	Description	Programming Languages	Features	License	Author & Link
kobkrit-word-embedding	TensorFlow implementation of Thai word embedding	Python	Source code, Example, Word distance graph	LGPL	Kobkrit V.

Question Answering (Machine Comprehension)

Service	Description	License	Author & Link
Thai Machine Comprehension (ThaiMC)	Bidirectional Attention Flow	Copyright (As the service)	iApp-AI

Emojification

Service	Description	License	Author & Link
Thai Emotification	LSTM	GPL	Demo at iApp-AI and Source, GitHub

Corpus and Dataset

Dictionaries / Translation Pairs

Library	Description	Size	Features	License	Link
LEXiTRON	Thai<->English Dictionary		TH->EN, EN->TH	LEXiTRON License	NECTEC
Transliteration Corpus		31K pairs	Thai-Eng Translation Pair	CC BY-NC-SA 3.0 TH	NECTEC
Yaitron	LEXiTRON in machine readable format (XML)		TH->EN, EN->TH	LEXiTRON License	Veer66 Schema, Data & Conversion Code

Downloadable Text Corpus

Library	Description	Size	Features	License	Link
Click Bait Sentences	Thai Click Bait Sentence	330 sent. (90.7KB)		MIT	Wannaphongcom
InterBEST 2009/2010		5M words	Word Seg.	CC BY-NC-SA 3.0 TH	NECTEC
ORCHID		30K sent.	Word Seg., POS Tagged.	CC BY-NC-SA 3.0 TH	NECTEC
Prime Minister 29	Prime Minister 29's Speech Sentences	338KB	Word segmented, named entity tagged	MIT	Wannaphongcom
thai-jokes-corpus	Cleaned Thai Jokes Corpus	457 jokes		GPLv3	iApp Technology
Thai named entity corpora	Named entity corpora by Wirote Aroonmanakun's students	266KB-1.5MB	Syllable segmented, word segmented, named entity tagged	GPLv3 (unconfirmed; tltk uses this license)	นัชชา ถิระสาโรช Data, ศศิวิมล กาลันสีมา Data, ณัฐดาพร เลิศชีวะ Data
THAI-NEST	Thai-NEST: Thai Named Entity tagging Specification and Tools	45K+ named entity tokens	Named entity tagged	LGPL	KINDML
Thai Sentimental Word List	Thai Sentimental Words List	52KB	Words separated by part of speech (Adj, V)	MIT	Wannaphongcom
Thai Wikipedia	Formal Articles	1.49GB (~213.1 MB compressed)	XML	GFDL	WIKIPEDIA
Thai WordNet	THE CONSTRUCTION OF THAI WORDNET OF 1ST ORDER ENTITY COMMON BASE CONCEPTS USING A BI-DIRECTIONAL TRANSLATION METHOD AND WITH DICTIONARIES OF DIFFERENT COMPILATIONAL APPROACHES (ธนนท์ หลีน้อย); THE CONSTRUCTION OF THAI WORDNET OF 2ND ORDER ENTITY COMMON BASE CONCEPTS USING A BI-DIRECTIONAL TRANSLATION METHOD : A STUDY OF THE DIVERSITY OF MEANINGS AFFECTING TRANSLATIONAL ACCURACY (ปริศนา อัครพุทธิพร)		WordNet	N/A	ธนนท์ หลีน้อย 2008, ปริศนา อัครพุทธิพร Data 2008
TNC Top-5000 Words	Word frequency	5,000 words	Frequency of Thai words in various genres, EXCEL	All rights reserved	CHULA
Toxicity in Thai Tweet Corpus	Tokyo Metropolitan University Natural Language Processing Group		Each tweet is labeled as toxic or non-toxic	CC BY-NC 4.0	tmu-nlp
Wisesight Sentiment Corpus	Social media messages with sentiment labels (positive, neutral, negative, question).	~26,700 messages	Sentiment label, Question label	Public domain	PyThaiNLP

Web Query Text Corpus

Library	Description	Size	Features	License	Link
Thai National Corpus 2		32M words	Query text by genre, domain	All rights reserved	CHULA
Thai Medical Document		3,594 docs	Document and dynamic keyword map	All rights reserved	KINDML, SIIT
Southeast Asian Languages Library	Thai News, Web Text, Pop Music, Literature, Toponyms	20M chars	Phrases around a search text		SEALang
HSE Thai Corpus	Modern texts written in the Thai language (mostly news websites)	50M tokens	Query by word form, lexeme, translation, grammatical attributes, lexical attributes		HSE School of Linguistics

Parallel Corpus

Library	Description	Size	Features	License	Link
TALPCo	TUFS Asian Language Parallel Corpus	1327 sent	open parallel corpus consisting of Japanese sentences and their translations into Burmese (Myanmar; the official language of the Republic of the Union of Myanmar), Malay (the national language of Malaysia, Singapore and Brunei), Indonesian, Thai, Vietnamese and English	CC BY 4.0	TALPCo

Pre-trained Language Models

Pre-trained Model	Description	Size	Dimensions	License	Link
fastText	Skip-Gram model trained on Wikipedia using fastText		300	CC BY-SA 3.0	Facebook + Bin & Text + Text Only
thai2fit	ULMFiT on Wikipedia. Perplexity of 46.80959 with 60,002 embeddings.	70MB	300	MIT	thai2vec / PyThaiNLP
thbert	Yet another pre-trained BERT model, specifically for Thai			Apache 2.0	tchayintr

Benchmarks

Thai Text Classification Benchmarks

Tools

Corpus extractors

Library	Description	Programming Languages	Features	License	Author & Link
BEST2010 cooker	A tool for extracting segmented words from the word-segmented Thai BEST2010 corpus	Python 3	Extracting segmented words, features, and data divisions	Apache 2.0	tchayintr

Not found? Try another Thai NLP awesome list or resource (like this one)

https://resources.aiat.or.th/

Acknowledgements

bact - for suggestions on license wording.
C4N
Veer66
Bi89
Tchayintr
PureEXE
Cstorm125
Wannaphongcom
Ekapolc