NEMATUS

Attention-based encoder-decoder model for neural machine translation built in TensorFlow.

Notable features include support for RNN and Transformer architectures, factored inputs, minimum risk training, ensemble decoding, and tied embeddings, all described below.

SUPPORT

For general support requests, there is a Google Groups mailing list at https://groups.google.com/d/forum/nematus-support. You can also send an e-mail to nematus-support@googlegroups.com.

INSTALLATION

Nematus requires Python 3 and TensorFlow.

To install TensorFlow, we recommend following the steps at https://www.tensorflow.org/install/ .

The following packages are optional, but highly recommended for fast training: CUDA and CuDNN, which enable TensorFlow to run on a GPU (see TRAINING SPEED below).

LEGACY THEANO VERSION

Nematus originated as a fork of dl4mt-tutorial by Kyunghyun Cho et al. ( https://github.com/nyu-dl/dl4mt-tutorial ), and was implemented in Theano. See https://github.com/EdinburghNLP/nematus/tree/theano for this Theano-based version of Nematus.

To use models trained with Theano in the current TensorFlow codebase, use the script nematus/theano_tf_convert.py.

DOCKER USAGE

You can also create a Docker image by running the following command, replacing suffix with either cpu or gpu:

docker build -t nematus-docker -f Dockerfile.suffix .
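
For example, to build the CPU image:

docker build -t nematus-docker -f Dockerfile.cpu .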

To run a CPU docker instance with the current working directory shared with the Docker container, execute:

docker run -v `pwd`:/playground -it nematus-docker

For GPU support, you need to have nvidia-docker installed; then run:

nvidia-docker run -v `pwd`:/playground -it nematus-docker

TRAINING SPEED

Training speed depends heavily on having appropriate hardware (ideally a recent NVIDIA GPU), and having installed the appropriate software packages.

To test your setup, we provide some speed benchmarks with test/test_train.sh, run on an Intel Xeon CPU E5-2620 v4 with an NVIDIA GeForce GTX Titan X (Pascal) and CUDA 9.0:

GPU, CuDNN 5.1, TensorFlow 1.0.1:

CUDA_VISIBLE_DEVICES=0 ./test_train.sh

225.25 sentences/s

USAGE INSTRUCTIONS

All of the scripts below can be run with the --help flag to get usage information.

Sample commands with toy examples are available in the test directory; for training a full-scale RNN system, consider the training scripts at http://data.statmt.org/wmt17_systems/training/

An updated version of these scripts that uses the Transformer model can be found at https://github.com/EdinburghNLP/wmt17-transformer-scripts

nematus/train.py : use to train a new model

data sets; model loading and saving

parameter | description
--source_dataset PATH | parallel training corpus (source)
--target_dataset PATH | parallel training corpus (target)
--dictionaries PATH [PATH ...] | network vocabularies (one per source factor, plus target vocabulary)
--save_freq INT | save frequency (default: 30000)
--model PATH | model file name (default: model)
--reload PATH | load existing model from this path. Set to "latest_checkpoint" to reload the latest checkpoint in the same directory as --model
--no_reload_training_progress | don't reload training progress (only used if --reload is enabled)
--summary_dir PATH | directory for saving summaries (default: same directory as the --model file)
--summary_freq INT | save summaries after INT updates; if 0, do not save summaries (default: 0)
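
As a minimal sketch, a training run combining the data and saving options above might look as follows (the corpus, vocabulary, and model paths are placeholders):

python3 nematus/train.py \
    --source_dataset data/corpus.bpe.src \
    --target_dataset data/corpus.bpe.trg \
    --dictionaries data/vocab.src.json data/vocab.trg.json \
    --model models/model \
    --save_freq 30000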

network parameters (all model types)

parameter | description
--model_type {rnn,transformer} | model type (default: rnn)
--embedding_size INT | embedding layer size (default: 512)
--state_size INT | hidden state size (default: 1000)
--source_vocab_sizes INT [INT ...] | source vocabulary sizes (one per input factor) (default: None)
--target_vocab_size INT | target vocabulary size (default: -1)
--factors INT | number of input factors (default: 1) - CURRENTLY ONLY WORKS FOR 'rnn' MODEL
--dim_per_factor INT [INT ...] | list of word vector dimensionalities (one per factor): '--dim_per_factor 250 200 50' for total dimensionality of 500 (default: None)
--tie_encoder_decoder_embeddings | tie the input embeddings of the encoder and the decoder (first factor only). Source and target vocabulary sizes must be the same
--tie_decoder_embeddings | tie the input embeddings of the decoder with the softmax output embeddings
--output_hidden_activation {tanh,relu,prelu,linear} | activation function in hidden layer of the output network (default: tanh) - CURRENTLY ONLY WORKS FOR 'rnn' MODEL
--softmax_mixture_size INT | number of softmax components to use (default: 1) - CURRENTLY ONLY WORKS FOR 'rnn' MODEL
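
To illustrate the factor options, a hypothetical two-factor RNN setup could combine these flags as follows (the vocabulary sizes and paths are placeholders, and the per-factor dimensionalities are assumed to sum to the embedding size, as in the '--dim_per_factor 250 200 50' example above):

python3 nematus/train.py \
    --model_type rnn \
    --factors 2 \
    --source_vocab_sizes 40000 1000 \
    --dim_per_factor 450 50 \
    --embedding_size 500 \
    --dictionaries data/vocab.factor0.json data/vocab.factor1.json data/vocab.trg.json \
    --source_dataset data/corpus.factored.src \
    --target_dataset data/corpus.bpe.trg \
    --model models/model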

network parameters (rnn-specific)

parameter | description
--rnn_enc_depth INT | number of encoder layers (default: 1)
--rnn_enc_transition_depth INT | number of GRU transition operations applied in the encoder. Minimum is 1. (Only applies to gru). (default: 1)
--rnn_dec_depth INT | number of decoder layers (default: 1)
--rnn_dec_base_transition_depth INT | number of GRU transition operations applied in the first layer of the decoder. Minimum is 2. (Only applies to gru_cond). (default: 2)
--rnn_dec_high_transition_depth INT | number of GRU transition operations applied in the higher layers of the decoder. Minimum is 1. (Only applies to gru). (default: 1)
--rnn_dec_deep_context | pass context vector (from first layer) to deep decoder layers
--rnn_dropout_embedding FLOAT | dropout for input embeddings (0: no dropout) (default: 0.0)
--rnn_dropout_hidden FLOAT | dropout for hidden layer (0: no dropout) (default: 0.0)
--rnn_dropout_source FLOAT | dropout source words (0: no dropout) (default: 0.0)
--rnn_dropout_target FLOAT | dropout target words (0: no dropout) (default: 0.0)
--rnn_layer_normalisation | use layer normalisation in encoder and decoder
--rnn_lexical_model | enable feedforward lexical model (Nguyen and Chiang, 2018)

network parameters (transformer-specific)

parameter | description
--transformer_enc_depth INT | number of encoder layers (default: 6)
--transformer_dec_depth INT | number of decoder layers (default: 6)
--transformer_ffn_hidden_size INT | inner dimensionality of feed-forward sub-layers (default: 2048)
--transformer_num_heads INT | number of attention heads used in multi-head attention (default: 8)
--transformer_dropout_embeddings FLOAT | dropout applied to sums of word embeddings and positional encodings (default: 0.1)
--transformer_dropout_residual FLOAT | dropout applied to residual connections (default: 0.1)
--transformer_dropout_relu FLOAT | dropout applied to the internal activation of the feed-forward sub-layers (default: 0.1)
--transformer_dropout_attn FLOAT | dropout applied to attention weights (default: 0.1)
--transformer_drophead FLOAT | dropout of entire attention heads (default: 0.0)
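
As a sketch, a Transformer model with the default depths and the 'transformer' learning schedule (described under training parameters below) could be trained as follows, reusing the placeholder paths from the earlier example:

python3 nematus/train.py \
    --model_type transformer \
    --embedding_size 512 \
    --transformer_enc_depth 6 \
    --transformer_dec_depth 6 \
    --transformer_num_heads 8 \
    --learning_schedule transformer \
    --source_dataset data/corpus.bpe.src \
    --target_dataset data/corpus.bpe.trg \
    --dictionaries data/vocab.src.json data/vocab.trg.json \
    --model models/model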

training parameters

parameter | description
--loss_function {cross-entropy,per-token-cross-entropy,MRT} | loss function. MRT: Minimum Risk Training (https://www.aclweb.org/anthology/P/P16/P16-1159.pdf) (default: cross-entropy)
--decay_c FLOAT | L2 regularization penalty (default: 0.0)
--map_decay_c FLOAT | MAP-L2 regularization penalty towards original weights (default: 0.0)
--prior_model PATH | prior model for MAP-L2 regularization. Unless using --reload, this will also be used for initialization.
--clip_c FLOAT | gradient clipping threshold (default: 1.0)
--label_smoothing FLOAT | label smoothing (default: 0.0)
--exponential_smoothing FLOAT | exponential smoothing factor; use 0 to disable (default: 0.0)
--optimizer {adam} | optimizer (default: adam)
--adam_beta1 FLOAT | exponential decay rate for the first moment estimates (default: 0.9)
--adam_beta2 FLOAT | exponential decay rate for the second moment estimates (default: 0.999)
--adam_epsilon FLOAT | constant for numerical stability (default: 1e-08)
--learning_schedule {constant,transformer,warmup-plateau-decay} | learning schedule (default: constant)
--learning_rate FLOAT | learning rate (default: 0.0001)
--warmup_steps INT | number of initial updates during which the learning rate is increased linearly during learning rate scheduling (default: 8000)
--plateau_steps INT | number of updates after warm-up before the learning rate starts to decay (applies to 'warmup-plateau-decay' learning schedule only) (default: 0)
--maxlen INT | maximum sequence length for training and validation (default: 100)
--batch_size INT | minibatch size (default: 80)
--token_batch_size INT | minibatch size (expressed in number of source or target tokens). Sentence-level minibatch size will be dynamic. If this is enabled, batch_size only affects sorting by length. (default: 0)
--max_sentences_per_device INT | maximum size of minibatch subset to run on a single device, in number of sentences (default: 0)
--max_tokens_per_device INT | maximum size of minibatch subset to run on a single device, in number of tokens (either source or target, whichever is higher) (default: 0)
--gradient_aggregation_steps INT | number of times to accumulate gradients before aggregating and applying; the minibatch is split between steps, so adding more steps allows larger minibatches to be used (default: 1)
--maxibatch_size INT | size of maxibatch (number of minibatches that are sorted by length) (default: 20)
--no_sort_by_length | do not sort sentences in maxibatch by length
--no_shuffle | disable shuffling of training data (for each epoch)
--keep_train_set_in_memory | keep training dataset lines stored in RAM during training
--max_epochs INT | maximum number of epochs (default: 5000)
--finish_after INT | maximum number of updates (minibatches) (default: 10000000)
--print_per_token_pro PATH | PATH to store the probability of each target token given the source sentences, computed over the training dataset (without training). If set to False, this function is not triggered (default: False). Note that the 1.0 values at the end of each list are the probabilities of padding and should be discarded.
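
For instance, the batching options can be combined as in this sketch (values illustrative, paths as in the earlier placeholders): dynamic minibatches of up to 4096 tokens, with at most 2048 tokens per device and gradients accumulated over two steps.

python3 nematus/train.py \
    --source_dataset data/corpus.bpe.src \
    --target_dataset data/corpus.bpe.trg \
    --dictionaries data/vocab.src.json data/vocab.trg.json \
    --model models/model \
    --token_batch_size 4096 \
    --max_tokens_per_device 2048 \
    --gradient_aggregation_steps 2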

minimum risk training parameters (MRT)

parameter | description
--mrt_reference | add the reference to the MRT candidate sentences (default: False)
--mrt_alpha FLOAT | MRT alpha, which controls the sharpness of the distribution over the sampled subspace (default: 0.005)
--samplesN INT | number of candidate sentences sampled per source sentence (default: 100)
--mrt_loss | evaluation metric used to compute the loss between the candidate translation and the reference translation (default: SENTENCEBLEU n=4)
--mrt_ml_mix FLOAT | mix in the MLE objective during MRT training with this scaling factor (default: 0)
--sample_way {beam_search,randomly_sample} | sampling strategy used to generate candidate sentences (default: beam_search)
--max_len_a INT | generate candidate sentences with maximum length ax + b, where x is the length of the source sentence (default: 1.5)
--max_len_b INT | generate candidate sentences with maximum length ax + b, where x is the length of the source sentence (default: 5)
--max_sentences_of_sampling INT | maximum number of source sentences for which to generate candidate sentences at one time (limited by device memory capacity) (default: 0)
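
With the defaults --max_len_a 1.5 and --max_len_b 5, a 10-token source sentence yields candidates of at most 1.5 * 10 + 5 = 20 tokens. Putting these options together, a hypothetical MRT fine-tuning run could reload a trained model and switch the loss function (paths as in the earlier placeholders):

python3 nematus/train.py \
    --source_dataset data/corpus.bpe.src \
    --target_dataset data/corpus.bpe.trg \
    --dictionaries data/vocab.src.json data/vocab.trg.json \
    --model models/model \
    --reload latest_checkpoint \
    --loss_function MRT \
    --mrt_reference \
    --samplesN 100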

validation parameters

parameter | description
--valid_source_dataset PATH | source validation corpus (default: None)
--valid_target_dataset PATH | target validation corpus (default: None)
--valid_batch_size INT | validation minibatch size (default: 80)
--valid_token_batch_size INT | validation minibatch size (expressed in number of source or target tokens). Sentence-level minibatch size will be dynamic. If this is enabled, valid_batch_size only affects sorting by length. (default: 0)
--valid_freq INT | validation frequency (default: 10000)
--valid_script PATH | path to script for external validation (default: None). The script will be passed an argument specifying the path of a file that contains translations of the source validation corpus. It must write a single score to standard output.
--valid_bleu_source_dataset PATH | source validation corpus for external validation (default: None). If set to None, the dataset for calculating validation loss (valid_source_dataset) will be used
--patience INT | early stopping patience (default: 10)
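
As an illustration of the --valid_script contract, the following hypothetical wrapper scores translations with sacrebleu (the tool and the reference path are assumptions): Nematus passes the translation file path as the first argument and reads the single score from standard output.

#!/bin/sh
# $1: path to the file containing translations of the validation source corpus.
# Print a single score to standard output, as Nematus expects.
sacrebleu --score-only data/valid.ref < "$1"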

display parameters

parameter | description
--disp_freq INT | display loss after INT updates (default: 1000)
--sample_freq INT | display some samples after INT updates (default: 10000)
--beam_freq INT | display some beam search samples after INT updates (default: 10000)
--beam_size INT | size of the beam (default: 12)

translate parameters

parameter | description
--normalization_alpha [ALPHA] | normalize scores by sentence length (with argument, exponentiate lengths by ALPHA)
--n_best | print full beam
--translation_maxlen INT | maximum length of translation output sentence (default: 200)
--translation_strategy {beam_search,sampling} | translation strategy, either beam_search or sampling (default: beam_search)

nematus/translate.py : use an existing model to translate a source text

parameter | description
-v, --verbose | verbose mode
-m PATH [PATH ...], --models PATH [PATH ...] | model to use; provide multiple models (with same vocabulary) for ensemble decoding
-b INT, --minibatch_size INT | minibatch size (default: 80)
-i PATH, --input PATH | input file (default: standard input)
-o PATH, --output PATH | output file (default: standard output)
-k INT, --beam_size INT | beam size (default: 5)
-n [ALPHA], --normalization_alpha [ALPHA] | normalize scores by sentence length (with argument, exponentiate lengths by ALPHA)
--n_best | write n-best list (of size k)
--maxibatch_size INT | size of maxibatch (number of minibatches that are sorted by length) (default: 20)
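
For example, to translate a test set with an ensemble of two models, a beam of 5, and length normalization (paths are placeholders):

python3 nematus/translate.py \
    -m models/model.ens1 models/model.ens2 \
    -i test.src \
    -o test.out \
    -k 5 \
    -n 0.6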

nematus/score.py : use an existing model to score a parallel corpus

parameter | description
-v, --verbose | verbose mode
-m PATH [PATH ...], --models PATH [PATH ...] | model to use; provide multiple models (with same vocabulary) for ensemble decoding
-b INT, --minibatch_size INT | minibatch size (default: 80)
-n [ALPHA], --normalization_alpha [ALPHA] | normalize scores by sentence length (with argument, exponentiate lengths by ALPHA)
-o PATH, --output PATH | output file (default: standard output)
-s PATH, --source PATH | source text file
-t PATH, --target PATH | target text file
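
For example, to score a parallel test corpus (paths are placeholders):

python3 nematus/score.py \
    -m models/model \
    -s test.src \
    -t test.ref \
    -o test.scores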

nematus/rescore.py : use an existing model to rescore an n-best list.

The n-best list is assumed to have the same format as Moses:

sentence-ID (starting from 0) ||| translation ||| scores

New scores will be appended to the end. rescore.py accepts the same arguments as score.py, with the exception of this additional parameter:

parameter | description
-i PATH, --input PATH | input n-best list file (default: standard input)
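
A sketch of rescoring a Moses-style n-best list (paths are placeholders; since the remaining arguments match score.py, the source side is assumed to be passed with -s):

python3 nematus/rescore.py \
    -m models/model \
    -s test.src \
    -i test.nbest \
    -o test.nbest.rescored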

nematus/theano_tf_convert.py : convert an existing Theano model to a TensorFlow model

If you have a Theano model (model.npz) whose network architecture features are currently supported, you can convert it into a TensorFlow model using nematus/theano_tf_convert.py.

parameter | description
--from_theano | convert from Theano to TensorFlow format
--from_tf | convert from TensorFlow to Theano format
--in PATH | path to input model
--out PATH | path to output model
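
For example, to convert a Theano model to the TensorFlow format (paths are placeholders):

python3 nematus/theano_tf_convert.py \
    --from_theano \
    --in theano_model.npz \
    --out tf_model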

PUBLICATIONS

If you use Nematus, please cite the following paper:

Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry and Maria Nadejde (2017): Nematus: a Toolkit for Neural Machine Translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, pp. 65-68.

@InProceedings{sennrich-EtAl:2017:EACLDemo,
  author    = {Sennrich, Rico  and  Firat, Orhan  and  Cho, Kyunghyun  and  Birch, Alexandra  and  Haddow, Barry  and  Hitschler, Julian  and  Junczys-Dowmunt, Marcin  and  L\"{a}ubli, Samuel  and  Miceli Barone, Antonio Valerio  and  Mokry, Jozef  and  Nadejde, Maria},
  title     = {Nematus: a Toolkit for Neural Machine Translation},
  booktitle = {Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics},
  month     = {April},
  year      = {2017},
  address   = {Valencia, Spain},
  publisher = {Association for Computational Linguistics},
  pages     = {65--68},
  url       = {http://aclweb.org/anthology/E17-3017}
}

The code is based on the following models:

Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio (2015): Neural Machine Translation by Jointly Learning to Align and Translate, Proceedings of the International Conference on Learning Representations (ICLR).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (2017): Attention is All You Need, Advances in Neural Information Processing Systems (NIPS).

Please refer to the Nematus paper for a description of how the implementation differs from the published RNN model.

ACKNOWLEDGMENTS

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreements 645452 (QT21), 644333 (TraMOOC), 644402 (HimL) and 688139 (SUMMA).