Awesome
Thai Word Tokenizers
This repository is a collection of almost all publicly available Thai tokenisers. Having this collection allows us to try each algorithm with ease via Docker.
Technically, each project (called a vendor) has its own Docker image with an entry script and auxiliary scripts.
These scripts provide a unified interface, allowing us to run the algorithms in the same way.
Vendors
Vendor | Alias | Available Methods | Container Profile |
---|---|---|---|
PyThaiNLP | pythainlp | newmm, longest | |
DeepCut | deepcut | deepcut | |
CutKum | cutkum | cutkum | |
Sertis | sertis | sertis | |
Thai Language Toolkit | tltk | mm, ngram, colloc | |
Smart Word Analysis for Thai (SWATH) | swath | max, long | |
Chrome's v8Breakiterator | chrome | v8breakiterator | |
Please see Usages for more details.
Setup
- Pull the necessary Docker images. Please check Docker Hub for the available images.
$ docker pull pythainlp/word-tokenizers:<vendor-alias>
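To fetch every vendor in one go, a loop over the aliases from the Vendors table is enough. The sketch below is a convenience, not part of the repository: the function name pull_all and the DRY_RUN switch are hypothetical, and it assumes that each alias has a published tag, which should be confirmed on Docker Hub first. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (or run) one pull command per vendor alias.
# Assumption: every alias in the Vendors table has a published image tag.

aliases="pythainlp deepcut cutkum sertis tltk swath chrome"

pull_all() {
  # DRY_RUN=1 (the default here) prints the commands instead of running them.
  local dry_run="${DRY_RUN:-1}" tag
  for tag in $aliases; do
    if [ "$dry_run" = "1" ]; then
      echo "docker pull pythainlp/word-tokenizers:${tag}"
    else
      docker pull "pythainlp/word-tokenizers:${tag}"
    fi
  done
}

pull_all
```

Setting DRY_RUN=0 before calling pull_all runs the pulls for real.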
Usages
- Put the text files that you want to tokenise into ./data.
- Run the following command:
$ ./scripts/tokenise.sh <vendor-alias>:<method> <filename>
Please check the Vendors section for the vendors and methods included here.
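As a rough illustration of what a unified wrapper has to do, the sketch below splits a colon-separated vendor:method argument (the form used in the Example section) and prints a docker run command shaped like the one echoed there. It is not the repository's tokenise.sh: the function name build_tokenise_cmd, the data_dir parameter, and the usage message are hypothetical, and the image name thai-tokeniser:<vendor> is simply copied from that example output.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT the repository's actual tokenise.sh.
# Splits "<vendor-alias>:<method>" and prints the docker command to run.

build_tokenise_cmd() {
  local spec="$1" filename="$2" data_dir="${3:-$PWD/data}"
  local vendor="${spec%%:*}"   # text before the first colon
  local method="${spec#*:}"    # text after the first colon

  # Reject a spec that has no colon or an empty vendor/method part.
  if [ "$spec" = "$vendor" ] || [ -z "$vendor" ] || [ -z "$method" ]; then
    echo "usage: <vendor-alias>:<method> <filename>" >&2
    return 1
  fi

  # Mount the data directory at /data, as in the example's echoed command.
  echo "docker run -v ${data_dir}:/data thai-tokeniser:${vendor} ${method} ${filename}"
}

build_tokenise_cmd pythainlp:newmm example.text /tmp/data
```

Printing the command rather than executing it keeps the sketch safe to run on a machine without Docker.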
Example
Let's say you want to tokenise the text in ./data/example.text using PyThaiNLP's newmm algorithm. You can use the following command:
$ cat ./data/example.text
อันนี้คือตัวอย่าง
$ ./scripts/tokenise.sh pythainlp:newmm example.text
# Please be aware that you don't need to have ./data in front of the filename.
# Command Output
Tokenising example.text using vendor=pythainlp and method=newmm
CMD: docker run -v /Users/heytitle/projects/tokenisers-for-thai/data:/data thai-tokeniser:pythainlp newmm example.text
100%|██████████| 1/1 [00:00<00:00, 151.70it/s]
Tokenising /data/example.text with newmm
Tokenised text is written to /data/example_tokenised-pythainlp-newmm.text
$ cat ./data/example_tokenised-pythainlp-newmm.text
อันนี้|คือ|ตัวอย่าง
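Once several methods have been run on the same file, a quick way to compare them is to count the tokens each one produced. The sketch below assumes that tokens are separated by | as in the output above, and that other vendors write the same delimiter, which this README does not confirm. The function name count_tokens is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count "|"-separated tokens in tokenised output.
# Assumption: every vendor writes tokens separated by "|", one text line per line.

count_tokens() {
  # Sum the number of fields on each line; reads the given files, or stdin if none.
  awk -F'|' '{ n += NF } END { print n + 0 }' "$@"
}

# The sample output above contains three tokens.
printf '%s\n' 'อันนี้|คือ|ตัวอย่าง' | count_tokens
```

It can also be pointed at a result file directly, for example count_tokens ./data/example_tokenised-pythainlp-newmm.text.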
Development
Architecture
TBD.
Build a vendor's new Docker image
$ ./scripts/build <vendor>
Push a new Docker image to Docker Hub
$ ./scripts/push <vendor>
Acknowledgements
- This repository was initially developed by Pattarawat Chormai while interning at Dr. Attapol Thamrongrattanarit's NLP Lab, Chulalongkorn University, Bangkok, Thailand.