Awesome
Transliteration related data files and/or models.
Contains:
- Arabic-English transliteration dataset mined from Wikipedia.
- Trained transliteration modules for Arabic-English, English-Japanese and English-IPA.
The models
The transliteration models provided are recurrent neural networks trained with a CTC loss. For a detailed description of the models, see the paper.
Getting the code for loading and training models
If you want to use some of the models provided in this repository you can use the clstm library, as provided in the branch here.
To clone the repository:
git clone git@github.com:mihaelacr-google/clstm.git
How to use the trained models
The binary clstmfilter
can be used to use an already existing model to transliterate your data.
To build the binary, use the command below. For more on how to install clstm read this. You can read more about scons here.
scons -j 4
For example, if you have a list of Arabic words which you want to transliterate to English, you can run the following commands in your shell:
set -a
load="ar2en.clstm"
./clstmfilter your_data.txt
How to train your own models
If you want to train a new model with your data, you can use the clstmfiltertrain
binary.
To build the binary, run:
scons -j 4
To train the model:
set -a
lr=0.1
./clstmfiltertrain your_train_data.txt your_eval_data.txt
Reproduce our results
If you want to reproduce the results from our paper, you can run:
scons -j 4
set -a
load=ar2en.clstm
with_gt=1
./clstmfilter ar2en-test.txt
This will load the one layer Arabic to English model, and produce the character error rate and the word error rate of the model on the test data. You can similary load the 2 layer model and compute the metrics of that model.