CutThai
If you are looking for a JavaScript library for Thai word segmentation to use in production, I strongly recommend wordcut.
This repository exists to describe how Thai word segmentation works.
This work is based on the wordcut documentation, which you can find on Medium (in Thai).
Algorithm
1. Find a word list
This work uses a dictionary-based approach, so you need a Thai word list. You can find Thai word lists from
2. Build a word trie
Convert the word list from step 1 into a trie to speed up searching.
Read more about tries: Wikipedia - Trie
Note: This step differs from wordcut, which uses binary search.
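As an illustration of this step, here is a minimal trie sketch. The names `buildTrie` and `hasWord` are hypothetical and are not taken from CutThai's source; the three-word list is only sample data.

```javascript
// Minimal trie sketch. buildTrie and hasWord are hypothetical names,
// not part of CutThai's API.
function buildTrie(words) {
  const root = { children: new Map(), isWord: false };
  for (const word of words) {
    let node = root;
    for (let i = 0; i < word.length; i++) {
      const ch = word[i];
      if (!node.children.has(ch)) {
        node.children.set(ch, { children: new Map(), isWord: false });
      }
      node = node.children.get(ch);
    }
    // Mark the node that ends a complete dictionary word.
    node.isWord = true;
  }
  return root;
}

// Walk the trie one character at a time; fail as soon as a branch is missing.
function hasWord(trie, word) {
  let node = trie;
  for (let i = 0; i < word.length; i++) {
    node = node.children.get(word[i]);
    if (!node) return false;
  }
  return node.isWord;
}

const trie = buildTrie(["ฉัน", "กิน", "ข้าว"]);
console.log(hasWord(trie, "กิน")); // true
console.log(hasWord(trie, "กิ")); // false (a prefix, not a word)
```

A lookup costs time proportional to the length of the word, regardless of how many words the dictionary holds.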
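3. Create wordgraph
A wordgraph is a graph used to decide where to split the input into words. Each vertex is a candidate split position and each edge is a word. Edges are created by matching the input against the trie: whenever the text between two positions is a dictionary word, an edge connects those positions.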
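The edge-building idea can be sketched as follows. This is a sketch under assumptions, not CutThai's actual code: `buildTrie` and `buildWordGraph` are hypothetical names, positions are counted in UTF-16 code units, and the word list is sample data.

```javascript
// Hypothetical helper: same trie shape as in step 2.
function buildTrie(words) {
  const root = { children: new Map(), isWord: false };
  for (const word of words) {
    let node = root;
    for (let i = 0; i < word.length; i++) {
      if (!node.children.has(word[i])) {
        node.children.set(word[i], { children: new Map(), isWord: false });
      }
      node = node.children.get(word[i]);
    }
    node.isWord = true;
  }
  return root;
}

// Vertices are positions 0..text.length. An edge { from, to, word } means
// text.slice(from, to) is a dictionary word.
function buildWordGraph(text, trie) {
  const edges = [];
  for (let start = 0; start < text.length; start++) {
    let node = trie;
    for (let end = start; end < text.length; end++) {
      node = node.children.get(text[end]);
      if (!node) break; // no dictionary word continues with this character
      if (node.isWord) {
        edges.push({ from: start, to: end + 1, word: text.slice(start, end + 1) });
      }
    }
  }
  return edges;
}

const trie = buildTrie(["ฉัน", "กิน", "ข้าว", "ข้า"]);
const edges = buildWordGraph("ฉันกินข้าว", trie);
console.log(edges.map((e) => `${e.from}-${e.to}`).join(","));
// 0-3,3-6,6-9,6-10
```

Note that position 6 has two outgoing edges ("ข้า" and "ข้าว"); resolving that ambiguity is the job of the shortest-path step.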
4. Find shortest path
Find the shortest path from the start vertex to the end vertex using SPFA (Shortest Path Faster Algorithm).
Read more about SPFA: Wikipedia - SPFA
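A minimal SPFA sketch over a wordgraph might look like this. The edge weights are an assumption: the text does not say how edges are weighted, so every edge costs 1 here unless it carries its own `weight`, which makes the path with the fewest words the shortest. The function name `spfa` and the edge shape `{ from, to, word }` are likewise hypothetical.

```javascript
// SPFA: a queue-based variant of Bellman-Ford. A vertex is re-queued
// only when its distance improves and it is not already waiting.
function spfa(vertexCount, edges, source) {
  const adjacency = Array.from({ length: vertexCount }, () => []);
  for (const edge of edges) adjacency[edge.from].push(edge);

  const dist = new Array(vertexCount).fill(Infinity);
  const prev = new Array(vertexCount).fill(null); // edge used to reach each vertex
  const inQueue = new Array(vertexCount).fill(false);

  dist[source] = 0;
  const queue = [source];
  inQueue[source] = true;

  while (queue.length > 0) {
    const u = queue.shift();
    inQueue[u] = false;
    for (const edge of adjacency[u]) {
      // Assumption: unit weight when the edge does not specify one.
      const weight = edge.weight === undefined ? 1 : edge.weight;
      if (dist[u] + weight < dist[edge.to]) {
        dist[edge.to] = dist[u] + weight;
        prev[edge.to] = edge;
        if (!inQueue[edge.to]) {
          queue.push(edge.to);
          inQueue[edge.to] = true;
        }
      }
    }
  }
  return { dist, prev };
}

// Sample wordgraph for "ฉันกินข้าว" (10 characters, so 11 vertices).
const sampleEdges = [
  { from: 0, to: 3, word: "ฉัน" },
  { from: 3, to: 6, word: "กิน" },
  { from: 6, to: 9, word: "ข้า" },
  { from: 6, to: 10, word: "ข้าว" },
];
const result = spfa(11, sampleEdges, 0);
console.log(result.dist[10]); // 3
console.log(result.prev[10].word); // ข้าว
```

Because every edge in a wordgraph points forward, the graph is acyclic and a simple left-to-right dynamic program would also work; SPFA is shown because it is the algorithm named above.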
5. Segment the sentence into an array
Use the shortest path from step 4 to segment the sentence, then convert the result to an array.
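One way to turn a shortest path into an array is to walk the predecessor links backwards from the end vertex. The sketch below assumes a hypothetical `prev` array in which `prev[position]` holds the edge `{ from, to, word }` used to reach that position; `pathToWords` is an illustrative name, not a CutThai function.

```javascript
// Walk back from the end vertex to position 0, collecting the word on
// each edge, then reverse to restore reading order.
function pathToWords(prev, end) {
  const words = [];
  let position = end;
  while (position > 0) {
    const edge = prev[position];
    if (!edge) return null; // the end vertex is not reachable from the start
    words.push(edge.word);
    position = edge.from;
  }
  return words.reverse();
}

// Sample predecessor links for "ฉันกินข้าว".
const prev = [];
prev[3] = { from: 0, to: 3, word: "ฉัน" };
prev[6] = { from: 3, to: 6, word: "กิน" };
prev[10] = { from: 6, to: 10, word: "ข้าว" };

console.log(pathToWords(prev, 10)); // ["ฉัน", "กิน", "ข้าว"]
```

Returning `null` for an unreachable end vertex is a design choice for this sketch; a real segmenter would need a fallback for text that contains words missing from the dictionary.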
Usage
CutThai is not recommended for production use, but you can download the latest release from Releases.
Using Node.js or CommonJS:
var CutThai = require("cutthai");
Using a browser:
<script src="path/to/cutthai.min.js"></script>
Run a segmentation:
var cutthai = new CutThai(function (err) {
    if (err) {
        throw err;
    }
    console.log(cutthai.cut("ฉันกินข้าว"));
});
Thanks
wordcut - for the Thai word segmentation algorithm
LibThai - for the Thai word dictionary
Note: This document is not complete yet. It still needs grammar fixes, more pictures to illustrate the algorithm, and more build instructions.