Awesome Lao NLP
A curated list of resources for Lao natural language processing (NLP).
Corpus
Part-of-speech
- SeqLabeling: corpus from https://github.com/FoVNull/SeqLabeling
- yunshan_cup_2020: corpus from https://github.com/GKLMIP/Yunshan-Cup-2020
Dictionary
- Ministry of Posts and Telecommunications (MOPT) dictionary https://github.com/google/language-resources/tree/master/third_party/lo_dictionary_by_mopt_laos
- lo_spellcheck_dict https://github.com/google/language-resources/tree/master/lo/data_sets
Parallel corpus
- Thai Lao Parallel corpus https://github.com/PyThaiNLP/Thai-Lao-Parallel-Corpus
- Asian Language Treebank Parallel Corpus https://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/
Text corpus
- Multilingual Open Text (MOT) https://github.com/bltlab/mot
- OSCAR: Open Super-large Crawled Aggregated coRpus https://oscar-corpus.com/
- Common Crawl corpus https://commoncrawl.org/
Speech Recognition
Libraries
Python
- LaoNLP - Lao language Natural Language Processing toolkit
- Chamkho - Khmer, Lao, Myanmar, and Thai word segmentation/breaking library and command-line tool
- Whisper - A general-purpose speech recognition model
C/C++
- ICU - International Components for Unicode
Perl
- Lingua::LO::NLP - Various Lao text processing functions
Java
- SEANLP - Southeast Asia Natural Language Processing
Pretrained
- GKLMIP - Provides several pretrained Lao language models, including BERT and ELECTRA. For details, see the paper "LaoPLM: Pre-trained Language Models for Lao".