ThaiLMCut - Word Tokenizer for the Thai Language Based on Transfer Learning and a Bidirectional LSTM

About

<p align="center"><img src="https://github.com/meanna/ThaiLMCUT/blob/master/graphic_lmcut/pic_lm.png?raw=true" width="368"><img src="https://github.com/meanna/ThaiLMCUT/blob/master/graphic_lmcut/pic_ws.png?raw=true" width="368"></p>

Update

Requirements

Install LMCut as a package

Download a tokenizer weight file from one of the following links:

Tokenizer models

https://drive.google.com/drive/folders/1rUs765_FzalZWOJRSRL0cbQGW3lrV4JM?usp=sharing
https://drive.google.com/drive/folders/1hJ4jsXdypP4mqZDsEgxEEfO-CLwAzHTj?usp=sharing

Move the weight file to this directory:

lmcut/weight/
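
If you downloaded the weight file somewhere else first, you can copy it into place with a few lines of Python. The sketch below is only an illustration; the file name tokenizer_weight.pt and the Downloads location are hypothetical, so substitute the actual name and path of the file you downloaded from the link above.

# Minimal sketch: copy a downloaded weight file into lmcut/weight/.
# "tokenizer_weight.pt" is a hypothetical file name; replace it with the
# actual name of the file downloaded from the Google Drive link.
import shutil
from pathlib import Path

weight_file = Path.home() / "Downloads" / "tokenizer_weight.pt"
target_dir = Path("lmcut") / "weight"
target_dir.mkdir(parents=True, exist_ok=True)
shutil.copy(weight_file, target_dir / weight_file.name)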

Create a package wheel using:

python setup.py bdist_wheel

Install the package using:

pip install dist/lmcut*

How to use LMCut

Tokenize a given Thai text

from lmcut import tokenize
text = "โรงแรมดี สวยงามน่าอยู่มากๆ"
result = tokenize(text)
print(result)

The result is a list of tokens:

['โรง', 'แรม', 'ดี', 'สวยงาม', 'น่า', 'อยู่', 'มาก', 'ๆ']
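
Since tokenize returns a plain Python list, you can post-process the tokens however you like. The following sketch, for example, joins the tokens with a visible "|" delimiter to inspect the segmentation of several texts; the second sample sentence is only an illustrative addition.

# Minimal sketch: tokenize several texts and print each segmentation
# with a visible "|" delimiter between tokens.
from lmcut import tokenize

texts = [
    "โรงแรมดี สวยงามน่าอยู่มากๆ",
    "อาหารอร่อยมาก",  # "The food is very delicious."
]
for text in texts:
    print("|".join(tokenize(text)))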

Train a language model

Prepare a dataset for training, then run:

python train/LanguageModel.py \
--dataset [dataset name] \
--batchSize 60 \
--char_dropout_prob 0.01 \
--char_embedding_size 200 \
--hidden_dim 500 \
--layer_num 3 \
--learning_rate 0.0001 \
--sequence_length 100 \
--epoch 10 \
--len_lines_per_chunk 1000 \
--optim [adam or sgd] \
--lstm_num_direction [2 for bidirectional, 1 for unidirectional] \
--add_note "[optional note]"

Example command:

python train/LanguageModel.py \
--dataset default \
--batchSize 32 \
--char_dropout_prob 0.01 \
--char_embedding_size 100 \
--hidden_dim 100 \
--layer_num 2 \
--learning_rate 0.0001 \
--sequence_length 100 \
--epoch 3 \
--len_lines_per_chunk 100 \
--optim adam \
--lstm_num_direction 2 \
--add_note "test if code runs properly"

To resume training a language model, run:

python train/LanguageModel.py \
--load_from [model name to resume] \
--dataset [dataset name] \
--epoch 3 \
--learning_rate 0.0001 \
--over_write 1

Pretrained language model

https://drive.google.com/drive/folders/1QKOctAPYIpC7b3beLGvOJ-h43-T85Yjy?usp=sharing

Training command for the pretrained model (note: the ty dataset is not publicly available):

python train/LanguageModel.py \
--dataset ty \
--batchSize 64 \
--char_dropout_prob 0.01 \
--char_embedding_size 200 \
--clip_grad 0.5 \
--hidden_dim 514 \
--layer_num 2 \
--learning_rate 0.0001 \
--sequence_length 150 \
--epoch 20 \
--len_lines_per_chunk 1000 \
--optim adam \
--lstm_num_direction 2 \
--lr_decay 0.01 \
--sgd_momentum 0.02

Train a new tokenizer

To train a new tokenizer, run:

python train/Tokenizer.py \
--dataset default \
--epoch 3 \
--lstm_num_direction 2 \
--batchSize 30 \
--sequence_length 80 \
--char_embedding_size 100 \
--hidden_dim 60 \
--layer_num 2 \
--optim adam \
--learning_rate 0.0001

To load a pre-trained language model (its embedding and recurrent layers) into the tokenizer and train it, run:

python train/Tokenizer.py \
--load_from [language model name] \
--dataset default \
--epoch 2 \
--learning_rate 0.0001 \
--over_write 0

To resume training a tokenizer, run:

python train/Tokenizer.py \
--load_from [tokenizer name] \
--dataset default \
--epoch 2 \
--learning_rate 0.0001 \
--over_write 1

Credits

Acknowledgements

The project is funded by TrustYou. The author would like to sincerely thank TrustYou and other contributors.

Contributors

License

All original code in this project is licensed under the MIT License. See the included LICENSE file.