Sequence-to-Sequence Learning with Attentional Neural Networks

UPDATE: Check out the beta release of OpenNMT (http://opennmt.net), a fully supported, feature-complete rewrite of seq2seq-attn. Seq2seq-attn will remain supported, but new features and optimizations will focus on the new codebase.

Torch implementation of a standard sequence-to-sequence model with (optional) attention, where the encoder and decoder are LSTMs. The encoder can be a bidirectional LSTM. There is additionally an option to use characters (instead of input word embeddings) by running a convolutional neural network followed by a highway network over character embeddings and using the result as input.

The attention model is from Effective Approaches to Attention-based Neural Machine Translation, Luong et al. EMNLP 2015. We use the global-general-attention model with the input-feeding approach from the paper. Input-feeding is optional and can be turned off.

The character model is from Character-Aware Neural Language Models, Kim et al. AAAI 2016.

There are a lot of additional options on top of the baseline model, mainly thanks to the fantastic folks at SYSTRAN. See below for more details on how to use them.

This project is maintained by Yoon Kim. Feel free to post any questions/issues on the issues page.

Dependencies

Python

Lua

You will need the following packages:

GPU usage will additionally require:

If running the character model, you should also install:

Quickstart

We are going to be working with some example data in the data/ folder. First run the data-processing code

python preprocess.py --srcfile data/src-train.txt --targetfile data/targ-train.txt
--srcvalfile data/src-val.txt --targetvalfile data/targ-val.txt --outputfile data/demo

This will take the source/target train/valid files (src-train.txt, targ-train.txt, src-val.txt, targ-val.txt) and make some hdf5 files to be consumed by Lua.

demo.src.dict: Dictionary of source vocab to index mappings.
demo.targ.dict: Dictionary of target vocab to index mappings.
demo-train.hdf5: hdf5 file containing the train data.
demo-val.hdf5: hdf5 file containing the validation data.

The *.dict files will be needed when predicting on new data.
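
As a quick sanity check that preprocessing succeeded, you can list the generated files (the expected names are just the ones described above):

ls data/demo.src.dict data/demo.targ.dict data/demo-train.hdf5 data/demo-val.hdf5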

Now run the model

th train.lua -data_file data/demo-train.hdf5 -val_data_file data/demo-val.hdf5 -savefile demo-model

This will run the default model, which consists of a 2-layer LSTM with 500 hidden units for both the encoder and decoder. You can also add -gpuid 1 to use (say) GPU 1 on the cluster.
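
For example, to train the same demo model on GPU 1 (the -gpuid flag is the only addition to the command above):

th train.lua -data_file data/demo-train.hdf5 -val_data_file data/demo-val.hdf5 -savefile demo-model -gpuid 1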

Now you have a model which you can use to predict on new data. To do this, we are going to be running beam search

th evaluate.lua -model demo-model_final.t7 -src_file data/src-val.txt -output_file pred.txt
-src_dict data/demo.src.dict -targ_dict data/demo.targ.dict

This will output predictions into pred.txt. The predictions are going to be quite terrible, as the demo dataset is small. Try running on some larger datasets! For example, you can download millions of parallel sentences for translation or summarization.

Details

Preprocessing options (preprocess.py)

Training options (train.lua)

Data options

Model options

The options below only apply if using the character model.

To build a model with guided alignment (implemented similarly to Guided Alignment Training for Topic-Aware Neural Machine Translation (Chen et al. 2016)):

Optimization options

Other options

Decoding options (beam.lua)

hello|||hallo
ukraine|||ukrainische

This dictionary can be obtained by, for example, running an alignment model as a preprocessing step. We recommend fast_align.
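
As a rough sketch (the file name corpus.src-targ is just a placeholder, and turning the alignments into the one-pair-per-line dictionary shown above is left as a post-processing step, e.g. keeping the most frequently aligned target word for each source word):

# corpus.src-targ has one sentence pair per line: "source sentence ||| target sentence"
fast_align -i corpus.src-targ -d -o -v > forward.align
# forward.align then has one line of Pharaoh-style alignments (e.g. "0-0 1-2 ...") per sentence pair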

1 ||| sentence_1 ||| sentence_1_score
2 ||| sentence_2 ||| sentence_2_score

Using additional input features

Linguistic Input Features Improve Neural Machine Translation (Sennrich et al. 2016) shows that translation performance can be increased by using additional input features.

Similarly to this work, you can annotate each word in the source text by using the -|- separator:

word1-|-feat1-|-feat2 word2-|-feat1-|-feat2

It supports an arbitrary number of features with arbitrary labels. However, all input words must have the same number of annotations. See, for example, data/src-train-case.txt, which annotates each word with case information.

To evaluate the model, the -feature_dict_prefix option is required by evaluate.lua; it points to the prefix of the feature dictionaries generated during preprocessing.
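
For example, building on the Quickstart command (here data/demo is assumed to be the -outputfile prefix used during preprocessing, and the source file should carry the same word-|-feature annotations as the training data):

th evaluate.lua -model demo-model_final.t7 -src_file data/src-val.txt -output_file pred.txt
-src_dict data/demo.src.dict -targ_dict data/demo.targ.dict -feature_dict_prefix data/demo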

Pruning a model

Compression of Neural Machine Translation Models via Pruning (See et al. 2016) shows that a model can be aggressively pruned while keeping the same performance.

To prune a model, you can use prune.lua, which implements the class-blind and class-uniform pruning techniques from the paper.

Note that pruning cuts the connections with the lowest weights in the linear modules by using a boolean mask. The saved file is a little larger than the original model, since it stores both the full weight matrices and the binary masks.

Pruned models can be retrained: typically, you can recover the full capacity of a model pruned at 60% or even 80% with a few additional epochs of training.
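
A hypothetical invocation is sketched below; the flag names are illustrative assumptions rather than the script's documented options, so check the prune.lua source (or its help output) for the real ones:

th prune.lua -model demo-model_final.t7 -savefile demo-model-pruned   # flag names are assumptions, see note above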

Switching between GPU/CPU models

By default, training always saves the final model as a CPU model, but intermediate models are saved as CPU/GPU models depending on how you specified -gpuid. If you want to run beam search on the CPU with an intermediate model trained on the GPU, you can use convert_to_cpu.lua to convert the model to a CPU model and then run beam search.
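
For example (the flag and file names here are assumptions for illustration; consult convert_to_cpu.lua for its actual options):

th convert_to_cpu.lua -model demo-model_checkpoint.t7   # flag and file names are assumptions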

GPU memory requirements/Training speed

Training large sequence-to-sequence models can be memory-intensive. Memory requirements will depend on batch size, maximum sequence length, vocabulary size, and (obviously) model size. Here are some benchmark numbers on a GeForce GTX Titan X (assuming a batch size of 64, a maximum sequence length of 50 on both the source and target side, a vocabulary size of 50,000, and a word embedding size equal to the RNN size):

(prealloc = 0)

Thanks to some fantastic work from the folks at SYSTRAN, turning prealloc on leads to much more memory-efficient training.

(prealloc = 1)

Tokens/sec refers to the total number of tokens (i.e. source + target) processed per second. If using different batch sizes/sequence lengths, you should (linearly) scale the above numbers accordingly. You can make use of memory on multiple GPUs by using the -gpuid2 option in train.lua. This will put the encoder on the GPU specified by -gpuid and the decoder on the GPU specified by -gpuid2.
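
For example, to train the demo model with the encoder on GPU 1 and the decoder on GPU 2 (the -prealloc flag name is an assumption based on the prealloc setting mentioned above):

th train.lua -data_file data/demo-train.hdf5 -val_data_file data/demo-val.hdf5 -savefile demo-model -gpuid 1 -gpuid2 2 -prealloc 1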

Evaluation

For translation, evaluation via BLEU can be done by taking the output from beam.lua and using the multi-bleu.perl script from Moses. For example

perl multi-bleu.perl gold.txt < pred.txt

Evaluation of States and Attention

attention_extraction.lua can be used to extract the attention and the LSTM states. It uses the following (required) options:

The output of the script is two files, encoder.hdf5 and decoder.hdf5. The encoder file contains the states for every layer of the encoder LSTM and the offsets for the start of each source sentence. The decoder file contains the states for the decoder LSTM layers and the offsets for the start of each gold sentence. It additionally contains the attention for each time step (if the model uses attention).
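
For a quick look at what the script wrote, you can list the contents of the two files (assuming the standard HDF5 command-line tools are installed; h5dump -n only prints the object names):

h5dump -n encoder.hdf5
h5dump -n decoder.hdf5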

Pre-trained models

We've uploaded English <-> German models trained on 4 million sentences from the Workshop on Machine Translation 2015. The download link is below:

https://drive.google.com/open?id=0BzhmYioWLRn_aEVnd0ZNcWd0Y2c

These models are 4-layer LSTMs with 1000 hidden units and essentially replicate the results from Effective Approaches to Attention-based Neural Machine Translation, Luong et al. EMNLP 2015.
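
They can be used with evaluate.lua in the same way as the demo model; the file names below are placeholders, since the actual names depend on what is inside the downloaded archive:

th evaluate.lua -model en-de-model.t7 -src_file src-test.txt -output_file pred.txt
-src_dict en.dict -targ_dict de.dict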

Acknowledgments

Our implementation utilizes code from the following:

Licence

MIT