
Character-Level Neural Machine Translation

This is an implementation of the models described in the paper "A Character-Level Decoder without Explicit Segmentation for Neural Machine Translation" (http://arxiv.org/abs/1603.06147).

Dependencies:

The majority of the script files are written in pure Theano. The preprocessing pipeline has the following additional dependencies (a sanity-check sketch follows the list):

- Python libraries: NLTK
- MOSES: https://github.com/moses-smt/mosesdecoder
- Subword-NMT (http://arxiv.org/abs/1508.07909): https://github.com/rsennrich/subword-nmt
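To verify your setup, here is a minimal sanity check. The checkout paths (`MOSES_ROOT`, `SUBWORD_NMT_ROOT`) are placeholders, not paths the repository defines; point them at wherever you cloned the two tools.

```python
# A minimal sanity check for the preprocessing dependencies listed above.
# MOSES_ROOT and SUBWORD_NMT_ROOT are hypothetical paths -- adjust them to
# wherever you cloned the mosesdecoder and subword-nmt repositories.
import os

import nltk  # the only Python library required by the preprocessing scripts

MOSES_ROOT = os.path.expanduser("~/src/mosesdecoder")
SUBWORD_NMT_ROOT = os.path.expanduser("~/src/subword-nmt")

for path in (MOSES_ROOT, SUBWORD_NMT_ROOT):
    if not os.path.isdir(path):
        raise RuntimeError("missing dependency checkout: %s" % path)
print("NLTK %s and both tool checkouts found" % nltk.__version__)
```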

This code is based on the dl4mt library: https://github.com/nyu-dl/dl4mt-tutorial

Be sure to include the path to this library in your PYTHONPATH.
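One way to do this, sketched below for illustration (the checkout path is an assumption, not part of the repository), is to prepend the library's directory to `sys.path` before importing from it; setting the `PYTHONPATH` environment variable in your shell has the same effect.

```python
import os
import sys

# Hypothetical location of your dl4mt-tutorial checkout; adjust as needed.
DL4MT_PATH = os.path.expanduser("~/src/dl4mt-tutorial")

# Prepending to sys.path is equivalent to putting the directory on
# PYTHONPATH before launching Python.
sys.path.insert(0, DL4MT_PATH)
```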

We recommend using the latest version of Theano. If you want exact reproduction, however, please use the following version of Theano (commit hash: fdfbab37146ee475b3fd17d8d104fb09bf3a8d5c).
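As a quick sanity check (a sketch, not part of the repository), you can print which Theano version your environment actually runs:

```python
# Print the installed Theano version; for exact reproduction, build Theano
# from the commit hash above rather than from a release.
import theano

print(theano.__version__)
```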

Preparing Text Corpora:

The original text corpora can be downloaded from http://www.statmt.org/wmt15/translation-task.html. Once the download is finished, use 'preprocess.sh' in the 'preprocess' directory to preprocess the text files. For the character-level decoders, preprocessing is not strictly necessary; however, to make the results comparable with the subword-level decoders and other word-level approaches, we apply the same preprocessing to all of the target corpora. Finally, build the vocabularies with 'build_dictionary_char.py' for the character case and 'build_dictionary_word.py' for the subword case.

Updating...
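While this section is being updated, the following sketch shows one way the steps above could be driven from Python. The script names match the repository, but the corpus file name and the scripts' command-line arguments are assumptions; inspect each script for its actual usage.

```python
# A hypothetical end-to-end run of the preprocessing steps described above.
# The corpus name and the argument conventions are assumptions, not the
# scripts' documented interface -- check preprocess.sh and the
# build_dictionary_* scripts for their real arguments.
import subprocess

CORPUS = "all_de-en.en"  # placeholder for a downloaded WMT'15 corpus file

# 1. Tokenization / cleaning with the Moses-based shell pipeline.
subprocess.check_call(["bash", "preprocess/preprocess.sh", CORPUS])

# 2. Vocabulary construction: character-level for the character decoder,
#    subword-level (BPE) for the subword baseline.
subprocess.check_call(["python", "preprocess/build_dictionary_char.py", CORPUS])
subprocess.check_call(["python", "preprocess/build_dictionary_word.py", CORPUS])
```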