Thai Word Tokenizers

This repository is a collection of almost all publicly available Thai tokenisers. Having them in one place lets us try each algorithm with ease via Docker.

Technically, each project (called a vendor) has its own Docker image with an entry script and auxiliary scripts. These scripts provide a unified interface, allowing us to run the algorithms in the same way.
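As a rough illustration of what a unified interface implies, the sketch below derives the output path from a vendor, a method and an input filename. The naming pattern is inferred from the output shown in the Example section; the helper name `output_path` is hypothetical and is not taken from the repository's scripts.

```shell
#!/bin/sh
# Hypothetical helper (not from the repository): derive the path that a
# vendor's entry script writes to. The pattern is inferred from the Example
# section: example.text -> /data/example_tokenised-pythainlp-newmm.text
# Assumes the filename has an extension.
output_path() (
    vendor=$1
    method=$2
    filename=$3
    stem=${filename%.*}   # name without its last extension
    ext=${filename##*.}   # last extension only
    printf '/data/%s_tokenised-%s-%s.%s\n' "$stem" "$vendor" "$method" "$ext"
)

output_path pythainlp newmm example.text
# prints /data/example_tokenised-pythainlp-newmm.text
```

Because every vendor would follow the same pattern, results from different tokenisers for the same input can sit side by side in ./data without overwriting each other.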

Vendors

| Vendor | Alias | Available Methods |
|---|---|---|
| PyThaiNLP | pythainlp | newmm, longest |
| DeepCut | deepcut | deepcut |
| CutKum | cutkum | cutkum |
| Sertis | sertis | sertis |
| Thai Language Toolkit | tltk | mm, ngram, colloc |
| Smart Word Analysis for Thai (SWATH) | swath | max, long |
| Chrome's v8Breakiterator | chrome | v8breakiterator |

Please see Usages for more details.

Setup

Usages

  1. Put text files that you want to tokenise into ./data.
  2. Run the following command:
$ ./scripts/tokenise.sh <vendor-alias>:<method> <filename>

Please check the Vendors section for the vendors and methods available.
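The log output in the Example section suggests that the script splits its first argument at the colon (as in `pythainlp:newmm`) and passes the method and filename to the vendor's image. A minimal sketch of that step follows. The function name `compose_cmd` and its optional third argument are assumptions made for illustration, and the command is only printed, not run.

```shell
#!/bin/sh
# Hypothetical sketch (not the repository's tokenise.sh): turn
# "<vendor-alias>:<method> <filename>" into a docker command line.
# The image name thai-tokeniser:<vendor> follows the CMD line in the Example log.
compose_cmd() (
    target=$1
    filename=$2
    data_dir=${3:-$PWD/data}   # assumption: host directory holding the input files

    # Reject arguments that lack the vendor:method separator.
    case $target in
        *:*) ;;
        *) echo "expected <vendor-alias>:<method>, got '$target'" >&2; exit 1 ;;
    esac

    vendor=${target%%:*}   # text before the first colon
    method=${target#*:}    # text after the first colon
    printf 'docker run -v %s:/data thai-tokeniser:%s %s %s\n' \
        "$data_dir" "$vendor" "$method" "$filename"
)

compose_cmd pythainlp:newmm example.text /tmp/data
# prints docker run -v /tmp/data:/data thai-tokeniser:pythainlp newmm example.text
```

Printing the command first is a cheap way to check which image and method a given argument maps to before starting a container.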

Example

Let's say you want to tokenise text in ./data/example.text using PyThaiNLP's newmm algorithm. You can use the following command:

$ cat ./data/example.text
อันนี้คือตัวอย่าง

$ ./scripts/tokenise.sh pythainlp:newmm example.text
# Please be aware that you don't need to have ./data in front of the filename.
# Command Output
Tokenising example.text using vendor=pythainlp and method=newmm
CMD: docker run -v /Users/heytitle/projects/tokenisers-for-thai/data:/data  thai-tokeniser:pythainlp newmm example.text
100%|██████████| 1/1 [00:00<00:00, 151.70it/s]
Tokenising /data/example.text with newmm
Tokenised text is written to /data/example_tokenised-pythainlp-newmm.text

$ cat ./data/example_tokenised-pythainlp-newmm.text
อันนี้|คือ|ตัวอย่าง
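Since tokens in the output are separated by `|`, a quick way to inspect a result is to count the tokens on each line. The snippet below is a small sketch of that idea; the helper name `count_tokens` is hypothetical, and it assumes `|` never occurs inside a token.

```shell
#!/bin/sh
# Hypothetical helper: print the number of "|"-separated tokens on each line.
# Reads the files given as arguments, or standard input when none are given.
count_tokens() {
    awk -F'|' '{ print NF }' "$@"
}

printf '%s\n' 'อันนี้|คือ|ตัวอย่าง' | count_tokens
# prints 3
```

Running `count_tokens` on the output files of two different vendors gives a first, rough impression of how differently they segment the same text.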

Development

Architecture

TBD.

Build a vendor's new Docker image

$ ./scripts/build <vendor>

Push a new Docker image to Docker Hub

$ ./scripts/push <vendor>
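When several vendors change at once, the two commands above can be chained in a loop. The sketch below is one possible wrapper, not part of the repository. The function name `release_vendors` and the `DRY_RUN` variable are assumptions, as is the idea that `<vendor>` takes the same values as the aliases in the Vendors section. By default it only prints the commands.

```shell
#!/bin/sh
# Hypothetical wrapper: build, then push, each vendor image in turn.
# With DRY_RUN unset or set to 1 the commands are only printed;
# set DRY_RUN=0 to actually run ./scripts/build and ./scripts/push.
release_vendors() (
    for vendor in "$@"; do
        for step in build push; do
            if [ "${DRY_RUN:-1}" = 1 ]; then
                echo "./scripts/$step $vendor"
            else
                # Stop at the first failing step so a broken image is never pushed.
                "./scripts/$step" "$vendor" || exit 1
            fi
        done
    done
)

release_vendors pythainlp deepcut
```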

Acknowledgements