Thai NLP Resource

A collection of Thai Natural Language Processing (NLP) software libraries, dictionaries, and corpora. Pull requests are always welcome.

Libraries/Services

Thai Character Cluster

| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| JTCC | Thai Character Cluster | Java | | GPL-3.0 | Wittawat |
| TCC | Thai Character Cluster | Python | | Apache 2.0 | Wannaphong |

Sentiment Analysis

| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| sentiment_analysis_thai | | | | | JagerV3 |

Soundex

| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| PyThaiNLP | | Python 3 | LK82 + Udom83 | Apache 2.0 | Korakot, GitHub |

Word Segmentation

| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| Chamkho | Lao/Thai word segmentation | Rust | | LGPL | GitHub |
| CutKum | Thai word segmentation with deep learning in TensorFlow. RNN. | Python | 93% F-measure. | MIT | Pucktada, GitHub |
| CutThai | Thai word segmentation written in CoffeeScript | CoffeeScript | | MIT | Pureexe/cutthai GitHub |
| DeepCut | A Thai word tokenization library using a deep neural network. CNN. | Python | 98.8% F-measure. | MIT | rkcosmos, GitHub |
| Lexto: Thai Lexeme Tokenizer | | Java | | LGPL | NECTEC |
| Lexto | | Python 2 | | LGPL | GitHub |
| Lexto | | Python 3 | | LGPL | GitHub |
| Multi-Candidate-Word-Segmentation | Multi-candidate word segmentation for the Thai language | Python, RNN, LSTM | 97.0% F-measure (word level), 98.95% F-measure (boundary level) | MIT | paper, GitHub |
| PyThaiNLP | | Python 3 | Maximal matching and various other engines | Apache 2.0 | GitHub |
| Swath | SWATH (Smart Word Analysis for THai) is a word segmentation tool for Thai | C | Longest matching, maximal matching, and part-of-speech bigram. | GPL | Paisarn Charoenpornsawat, CMU |
| SynThai | Thai word segmentation and part-of-speech tagging with deep learning. RNN. LSTM. | Python | 99.2% F-measure | MIT | KenjiroAI, GitHub |
| Thai Language Toolkit (tltk) | Based on a 2002 paper by Wirote Aroonmanakun. Word segmentation is based on a maximum collocation approach. Syllable segmentation is based on 3-gram statistics. (Dataset is included.) | Python | 97.86% F-measure. (Measured on a different test set, so not directly comparable with the other models.) | GPLv3 | PyPI |
| Wordcut | Thai word breaker for Node.js | JavaScript, Node.js | | LGPL-3.0 | veer66, GitHub |
| wordcutpy | A simple Thai word tokenizer written in a single Python file | Python 3 | | LGPL-3.0 | veer66, GitHub |
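Several of the libraries above list dictionary-based longest matching or maximal matching among their features. As a rough illustration only (not the implementation used by Swath, PyThaiNLP, or any other library listed here), greedy longest matching can be sketched as follows. The function name `longest_matching` and the toy dictionary are hypothetical and exist only for this example.

```python
def longest_matching(text, dictionary):
    """Greedy left-to-right segmentation: at each position, take the
    longest dictionary word that starts there."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink the window.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            # Unknown character: emit it as a single-character token.
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens


# Toy dictionary (an assumption for illustration, not a real lexicon).
toy_dictionary = {"ไป", "หา", "หาม", "เห", "สี", "มเหสี"}
print(longest_matching("ไปหามเหสี", toy_dictionary))
# → ['ไป', 'หาม', 'เห', 'สี']
```

The greedy choice of "หาม" forces a four-word split here, whereas the intended reading "ไป | หา | มเหสี" needs only three words. Maximal matching, as usually described, addresses this by comparing candidate segmentations and preferring the one with the fewest words.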

Part of Speech Tagging (POS Tagging)

| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| Chart-POS | Thai POS Tagger | C | | All rights reserved | AIAT, KINDML, Thanaruk T. (thanaruk@siit.tu.ac.th), tchayintr, Demo at iApp |
| Jitar+NAiST | A simple Trigram HMM part-of-speech tagger | Java | | | Ver66, Jitar + NAiST, 1 + NAiST, 2 |
| SynThai | Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning. RNN. LSTM. | Python | 0.9163 F-measure. RNN. LSTM | MIT | KenjiroAI, GitHub |

Named Entity Recognition

| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| Named Entity Tagging (Thai NEST) | Thai Named Entity tagging Specification and Tools | | | GPL | KINDML, SIIT, AIAT |
| ThaiNER | Thai Named Entity Recognition for PyThaiNLP | Python | | Apache 2.0 (code) & CC BY 3.0 (Dataset) | ThaiNER |

News Structure Tagging

| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| News Structure Tagging Program | Thai News Structure Tagging Program | | Metadata tagging, Structure tagging, Automatic News Title Generation | GPL | AIAT |

Syntactic Parsing & Tools

| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| Chart-parser | Extract Syntactic Structure from POS Tagged Sentence. | C | | All rights reserved | AIAT, KINDML, Thanaruk T. (thanaruk@siit.tu.ac.th), tchayintr, Demo at iApp |
| Grammar Processing | Labelled Brackets -> Context Free Grammars (CFGs) | Python | Transform and compute probability | | tchayintr |

Word Embedding

| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| kobkrit-word-embedding | TensorFlow implementation of Thai word embedding | Python | Source code, Example, Word distance graph | LGPL | Kobkrit V. |

Question Answering (Machine Comprehension)

| Service | Description | License | Author & Link |
| --- | --- | --- | --- |
| Thai Machine Comprehension (ThaiMC) | Bidirectional Attention Flow | Copyright (As the service) | iApp-AI |

Emojification

| Service | Description | License | Author & Link |
| --- | --- | --- | --- |
| Thai Emotification | LSTM | GPL | Demo at iApp-AI and Source, GitHub |

Corpus and Dataset

Dictionaries / Translation Pairs

| Library | Description | Size | Features | License | Link |
| --- | --- | --- | --- | --- | --- |
| LEXiTRON | Thai<->English Dictionary | | TH->EN, EN->TH | LEXiTRON License | NECTEC |
| Transliteration Corpus | | 31K pairs | Thai-Eng Translation Pair | CC BY-NC-SA 3.0 TH | NECTEC |
| Yaitron | LEXiTRON in machine readable format (XML) | | TH->EN, EN->TH | LEXiTRON License | Veer66 Schema, Data & Conversion Code |

Downloadable Text Corpus

| Library | Description | Size | Features | License | Link |
| --- | --- | --- | --- | --- | --- |
| Click Bait Sentences | Thai clickbait sentences | 330 sent. (90.7KB) | | MIT | Wannaphongcom |
| InterBEST 2009/2010 | | 5M words | Word segmented | CC BY-NC-SA 3.0 TH | NECTEC |
| ORCHID | | 30K sent. | Word segmented, POS tagged | CC BY-NC-SA 3.0 TH | NECTEC |
| Prime Minister 29 | Prime Minister 29's speech sentences | 338KB | Word segmented, named entity tagged | MIT | Wannaphongcom |
| thai-jokes-corpus | Cleaned Thai Jokes Corpus | 457 jokes | | GPLv3 | iApp Technology |
| Thai named entity corpora | Named entity corpora by Wirote Aroonmanakun's students | 266KB-1.5MB | Syllable segmented, word segmented, named entity tagged | GPLv3 (not sure, but tltk is using this license) | นัชชา ถิระสาโรช Data; ศศิวิมล กาลันสีมา Data; ณัฐดาพร เลิศชีวะ Data |
| THAI-NEST | Thai-NEST: Thai Named Entity tagging Specification and Tools | 45K+ named entity tokens | Named entity tagged | LGPL | KINDML |
| Thai Sentimental Word List | Thai Sentimental Words List | 52KB | Words separated into Adj, V | MIT | Wannaphongcom |
| Thai Wikipedia | Formal Articles | 1.49GB (~213.1 MB compressed) | XML | GFDL | WIKIPEDIA |
| Thai WordNet | THE CONSTRUCTION OF THAI WORDNET OF 1ST ORDER ENTITY COMMON BASE CONCEPTS USING A BI-DIRECTIONAL TRANSLATION METHOD AND WITH DICTIONARIES OF DIFFERENT COMPILATIONAL APPROACHES (ธนนท์ หลีน้อย); THE CONSTRUCTION OF THAI WORDNET OF 2ND ORDER ENTITY COMMON BASE CONCEPTS USING A BI-DIRECTIONAL TRANSLATION METHOD: A STUDY OF THE DIVERSITY OF MEANINGS AFFECTING TRANSLATIONAL ACCURACY (ปริศนา อัครพุทธิพร) | | WordNet | N/A | ธนนท์ หลีน้อย 2008; ปริศนา อัครพุทธิพร Data 2008 |
| TNC Top-5000 Words | Word frequency | 5,000 words | Frequency of Thai words in various genres, Excel | All rights reserved | CHULA |
| Toxicity in Thai Tweet Corpus | Tokyo Metropolitan University Natural Language Processing Group | | Each tweet is labeled as toxic or non-toxic | CC BY-NC 4.0 | tmu-nlp |
| Wisesight Sentiment Corpus | Social media messages with sentiment labels (positive, neutral, negative, question). | ~26,700 messages | Sentiment label, Question label | Public domain | PyThaiNLP |

Web Query Text Corpus

| Library | Description | Size | Features | License | Link |
| --- | --- | --- | --- | --- | --- |
| Thai National Corpus 2 | | 32M words | Query text by genre, domain | All rights reserved | CHULA |
| Thai Medical Document | | 3,594 docs | Document and dynamic keyword map | All rights reserved | KINDML, SIIT |
| Southeast Asian Languages Library | Thai News, Web Text, Pop Music, Literature, Toponyms | 20M chars | Phrase around a search text | | SEALang |
| HSE Thai Corpus | Modern texts written in the Thai language (mostly from news websites) | 50M tokens | Query by word form, lexeme, translation, grammatical attributes, lexical attributes | | HSE School of Linguistics |

Parallel Corpus

| Library | Description | Size | Features | License | Link |
| --- | --- | --- | --- | --- | --- |
| TALPCo | TUFS Asian Language Parallel Corpus | 1,327 sent. | Open parallel corpus consisting of Japanese sentences and their translations into Burmese (Myanmar; the official language of the Republic of the Union of Myanmar), Malay (the national language of Malaysia, Singapore and Brunei), Indonesian, Thai, Vietnamese and English | CC BY 4.0 | TALPCo |

Pre-trained Language Models

| Pre-trained Model | Description | Size | Dimensions | License | Link |
| --- | --- | --- | --- | --- | --- |
| fastText | Skip-Gram model trained on Wikipedia using fastText | | 300 | CC BY-SA 3.0 | Facebook + Bin & Text + Text Only |
| thai2fit | ULMFit on Wikipedia. Perplexity of 46.80959 with 60,002 embeddings. | 70MB | 300 | MIT | thai2vec / PyThaiNLP |
| thbert | Yet another pre-trained BERT particularly in Thai | | | Apache 2.0 | tchayintr |

Benchmarks

Thai Text Classification Benchmarks

Tools

Corpus extractors

| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| BEST2010 cooker | A tool for extracting segmented words from Thai segmented BEST2010 corpus | Python 3 | Extracting segmented words, features, and data divisions | Apache 2.0 | tchayintr |

Not found? Try another Thai NLP awesome list or resource (like this one):

https://resources.aiat.or.th/

Acknowledgements