MT Tutorial for the JSALT 2019 Summer School
This is the Machine Translation tutorial given at the 2019 JSALT summer school.
Course Materials
Slides from the lecture are available here.
Lab session
Setup
1. Installing python and CUDA
You will need Python >= 3.6 to run this tutorial. You can get a (relatively lightweight) distribution from miniconda. You will also need a version of CUDA if you want to run the code on a GPU. If you are attending the JSALT 2019 summer school, you should have access to a server with a GPU and CUDA.
2. Installing PyTorch and other python packages
Next, install the required python packages. We recommend setting up a virtual environment beforehand:
pip install virtualenv && virtualenv env && source env/bin/activate
- Install pytorch (for autodiff/GPU/neural networks):
pip install torch>=1.0
(if this doesn't work, see https://pytorch.org/get-started/locally/#start-locally for custom installation options depending on your environment)
- Install tqdm (for progress bars):
pip install tqdm
- Install sentencepiece (for subwords):
pip install sentencepiece
- Install sacrebleu (for evaluating BLEU score):
pip install sacrebleu
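If everything installed correctly, a quick sanity check from Python should work (the CUDA check will only return True on a machine with a GPU and a working CUDA install):

```python
import torch
import sentencepiece
import sacrebleu
import tqdm

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # True if you can train on GPU
```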
3. Get the data
We'll be doing French to English translation on the IWSLT2016 dataset. We've prepared the data for you; you can download it with:
wget https://github.com/pmichel31415/jsalt-2019-mt-tutorial/releases/download/1.0/data.zip
Subwords
Before deciding on our model and training, we need to segment our data into sub-word units. This will allow us to handle out-of-vocabulary words by breaking them down into smaller, known units. In particular, we'll be using the Byte-Pair Encoding (BPE) algorithm (described in the lecture).
We first learn a sub-word model from the data. Specifically, starting from single characters, BPE will greedily group together frequently co-occurring subwords until the specified vocabulary size has been reached. You can train the sub-word model by running the lab/subwords.py script:
python lab/subwords.py train \
--model_prefix data/subwords \
--vocab_size 16000 \
--model_type bpe \
--input data/train.en,data/train.fr
Importantly, you'll notice that we are learning one sub-word model from both the French and English data. This is because we'll want to have a shared vocabulary between the source and target language. Since there is a significant amount of overlap between the French and English vocabularies, this will make it easier for our model to map similar words together (e.g. importance -> importance, docteur -> doctor, etc.).
Take a look at the vocabulary file containing all learned subwords with less data/subwords.vocab. You might recognize common vowels and even frequent words. Had we increased the target vocabulary size, we would've ended up with longer subwords on average. You will also notice the weird ▁ character. This is sentencepiece's way of indicating spaces, which makes it possible to distinguish subwords that occur at the start of a word from those that occur inside a word.
We'll now use the trained sub-word model to segment existing sentences:
echo "JSALT summer school is positively interesting." | python lab/subwords.py segment --model data/subwords.model
produces
▁J SA L T ▁summer ▁school ▁is ▁positive ly ▁interesting .
You can see that:
- spaces are replaced with ▁
- common words like summer and interesting are kept as-is
- less common words like the adverb positively are split (here the adverbializer ly is detached)
- the unknown word JSALT is split into a lot of sub-words
For convenience we've provided the segmented versions of the data files in data/*.bpe.{en,fr}. Take a look to get an idea of what the segmented input looks like!
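Under the hood, lab/subwords.py builds on the sentencepiece library. If you want to experiment with subwords directly from Python, a minimal sketch looks like this (the exact options used by the lab script are an assumption):

```python
import sentencepiece as spm

# Train a joint BPE model on both languages (roughly what
# `python lab/subwords.py train` does; exact flags are an assumption)
spm.SentencePieceTrainer.Train(
    "--input=data/train.en,data/train.fr "
    "--model_prefix=data/subwords --vocab_size=16000 --model_type=bpe"
)

# Load the trained model and segment a sentence into subword pieces
sp = spm.SentencePieceProcessor()
sp.Load("data/subwords.model")
print(" ".join(sp.EncodeAsPieces("JSALT summer school is positively interesting.")))
```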
The Transformer Model
We're going to do translation with a variation on the transformer model from Attention Is All You Need:
<div align="center"> <img height="400" src="images/transformer.png" alt="Transformer architecture"/> </div>This architecture relies on 3 different "modules":
- Word embeddings mapping word indices to learned vectors. This is implemented in pytorch with the nn.Embedding module
- Multi-head attention. This is implemented in MultiHeadAttention in lab/transformer.py
- Position-wise feed-forward layers (a 2-layer MLP). Implemented as FeedForwardTransducer.
Additionally, it relies on position embeddings (to allow the model to consider each token's relative position), residual connections (for better gradient flow/expressivity at high layer counts) and layer normalization (to keep the amplitude of each layer in check).
In particular, our implementation uses a small tweak on the original model: layer normalization is applied before each sub-layer rather than after the residual connection ("pre-norm"). Empirically, this makes the model converge faster.
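As a concrete illustration, here is a minimal sketch of such a "pre-norm" residual block in PyTorch (module and argument names are illustrative, not the ones used in lab/transformer.py):

```python
import torch.nn as nn

class PreNormResidual(nn.Module):
    """Wrap a sub-layer (attention or feed-forward) with layer normalization,
    dropout and a residual connection, pre-norm style (illustrative sketch)."""

    def __init__(self, dim, sublayer, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Pre-norm: normalize first, apply the sub-layer, then add the residual
        return x + self.dropout(self.sublayer(self.norm(x)))
```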
TODO 1: We've implemented most of the transformer except for the forward methods of the encoder and decoder layers (EncoderLayer and DecoderLayer in lab/transformer.py).
To verify that your implementation is correct, first download our pretrained model:
wget https://github.com/pmichel31415/jsalt-2019-mt-tutorial/releases/download/1.0/model.pt
And run
python lab/training.py --cuda --model-file model.pt --validate-only
If your implementation is correct this should give you a perplexity of 5.57 with the provided model.
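As a reminder, perplexity is typically computed as the exponential of the average per-token negative log-likelihood on the validation set, e.g.:

```python
import math

# Perplexity from a summed negative log-likelihood (in nats) over a dataset
def perplexity(total_neg_log_likelihood, num_tokens):
    return math.exp(total_neg_log_likelihood / num_tokens)
```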
Training a model
You can train a transformer model by running python lab/training.py:
python lab/training.py \
--cuda \
--n-layers 4 \
--n-heads 4 \
--embed-dim 512 \
--hidden-dim 512 \
--dropout 0.1 \
--lr 2e-4 \
--n-epochs 15 \
--tokens-per-batch 8000 \
--clip-grad 1.0
This will train a transformer with the following parameters:
- --cuda: use CUDA (train on GPU)
- --n-layers 4: 4 layers (both in the encoder and decoder)
- --n-heads 4: 4 attention heads in each attention layer
- --embed-dim 512: sets the dimension of the model to 512 (including word embeddings)
- --hidden-dim 512: sets the dimension of the hidden layer in the feedforward layers to 512
- --dropout 0.1: dropout (set higher for more regularization)
- --lr 2e-4: learning rate (1/sqrt(embed_dim) is a good heuristic)
- --n-epochs 15: maximum number of epochs
- --tokens-per-batch 8000: batch sentences together so that each batch contains at most 8000 tokens (see the batching sketch after this list)
- --clip-grad 1.0: clip the gradient norm at 1.0
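Token-based batching groups sentences together until a token budget is reached, so that batches have a roughly constant memory footprint regardless of sentence length. A minimal sketch of the idea (not the actual code in lab/training.py, which may e.g. sort sentences by length first) could look like this:

```python
def batch_by_tokens(sentences, max_tokens=8000):
    """Group sentences into batches of at most `max_tokens` tokens
    (illustrative sketch only)."""
    batches, current, current_tokens = [], [], 0
    for sent in sentences:  # each `sent` is a list of tokens
        if current and current_tokens + len(sent) > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(sent)
        current_tokens += len(sent)
    if current:
        batches.append(current)
    return batches
```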
For convenience, we've trained a model for you (https://github.com/pmichel31415/jsalt-2019-mt-tutorial/releases/download/1.0/model.pt)
Sampling from a trained model
One of the easiest ways of generating a translation with the model is to sample from the conditional distribution one word at a time. This is implemented in lab/decoding.py. However, in order for decoding to be efficient, we need to implement another function in DecoderLayer:
TODO 2: Implement the decode_step method in DecoderLayer. This method allows us to perform one step of decoding (returning log p(y_t | x, y_1, ..., y_{t-1})).
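To make the role of decode_step concrete, here is what a sampling loop built on top of it might look like (a rough sketch only: BOS_IDX, EOS_IDX and the model's method names/signatures are assumptions, and the real loop in the lab code may differ):

```python
import torch

BOS_IDX, EOS_IDX = 1, 2  # assumed indices of the start/end-of-sentence tokens

def sample_translation(model, src_tokens, max_len=200):
    """Sample a translation one token at a time (illustrative sketch)."""
    encodings = model.encode(src_tokens)      # run the encoder once
    y = [BOS_IDX]                             # start with the start-of-sentence token
    for _ in range(max_len):
        # log p(y_t | x, y_1, ..., y_{t-1}) over the whole vocabulary
        log_probs = model.decode_step(encodings, y)
        next_token = torch.multinomial(log_probs.exp(), 1).item()
        y.append(next_token)
        if next_token == EOS_IDX:             # stop at end-of-sentence
            break
    return y
```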
You can verify that your implementation is correct by running:
echo "▁J ' ai ▁donc ▁fait ▁le ▁tour ▁pour ▁essayer ▁les ▁autres ▁portes ▁et ▁fenêtres ." |
python lab/translate.py --model-file model.pt --search "random"
Which should return:
So I went all the way to try out the other doors and windows.
Congrats! You've just used your MT model to translate something for the first time.
Since we are doing random sampling, you can get different results by using a different random seed:
echo "▁J ' ai ▁donc ▁fait ▁le ▁tour ▁pour ▁essayer ▁les ▁autres ▁portes ▁et ▁fenêtres ." |
python lab/translate.py --model-file model.pt --search "random" --seed 123456
should give:
Thus, I went around in order to try to other doors and windows.
Greedy decoding
Random sampling is not optimal for decoding. Ideally we'd want to generate the argmax of the conditional distribution p(y|x). However, with auto-regressive models that don't satisfy any kind of Markov property, this is intractable (we would need to explore an infinite number of possible translations).
A first approximation is to do "greedy" decoding: at each step of decoding, instead of sampling, select the most probable token according to the model (side question: why is this not the same as finding the argmax of p(y|x)? Can you come up with a simple example where this would be sub-optimal?).
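In code, the only change from the sampling sketch above is how the next token is picked (again just a sketch; the real thing goes in lab/decoding.py):

```python
# Random sampling: draw the next token from the model's distribution
next_token = torch.multinomial(log_probs.exp(), 1).item()

# Greedy decoding: always take the single most probable token instead
next_token = log_probs.argmax(dim=-1).item()
```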
TODO 3: Implement greedy decoding in lab/decoding.py.
You can test your results by verifying that:
echo "▁J ' ai ▁donc ▁fait ▁le ▁tour ▁pour ▁essayer ▁les ▁autres ▁portes ▁et ▁fenêtres ." |
python lab/translate.py --model-file model.pt --search "greedy"
Gives you
So I went around to try and try the other doors and windows.
Evaluating BLEU score
We're now ready to evaluate the model's BLEU score. You can translate a subset of the test data in data/toy.test.bpe.fr with
python lab/translate.py \
--cuda \
--model-file model.pt \
--search "greedy" \
--input-file data/toy.test.bpe.fr \
--output-file toy.test.greedy.en
Take a look at the output file toy.test.greedy.en to get a feel for the translation quality. Now evaluate the BLEU score with
cat toy.test.greedy.en | sacrebleu data/toy.test.en
You should get a BLEU score of around 26.9.
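If you prefer to compute BLEU from Python rather than the command line, sacrebleu also exposes a corpus-level API (file names below follow the outputs of the commands above):

```python
import sacrebleu

with open("toy.test.greedy.en") as f:
    hypotheses = [line.strip() for line in f]
with open("data/toy.test.en") as f:
    references = [line.strip() for line in f]

# corpus_bleu takes the system outputs and a list of reference streams
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(bleu.score)
```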
TODO 4: Compare the BLEU scores with random and greedy decoding.
Beam Search
As alluded to earlier, greedy decoding (while better than random sampling) is not optimal. For example, the first word completely determines the rest of the generated translation, with no chance to recover from a bad early choice. Beam search is a slightly better approximation of the structured argmax problem.
In beam search, we keep track of the top k hypotheses (or "beams") at every step. Thus, hypotheses that have a lower probability in the first steps have a chance to recover.
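Before you start, here is a sketch of what a single expansion step of beam search does (purely illustrative: it glosses over batching, finished hypotheses, and the model interface):

```python
def beam_search_step(beams, next_log_probs, beam_size):
    """Expand each hypothesis with every possible next token and keep the
    `beam_size` best ones. `beams` is a list of (tokens, score) pairs where
    the score is a sum of log-probabilities; `next_log_probs[i][w]` is
    log p(w | x, beams[i]). Illustrative sketch only."""
    candidates = []
    for (tokens, score), log_probs in zip(beams, next_log_probs):
        for word, log_p in enumerate(log_probs):
            candidates.append((tokens + [word], score + log_p))
    # Keep only the beam_size highest-scoring hypotheses
    candidates.sort(key=lambda cand: cand[1], reverse=True)
    return candidates[:beam_size]
```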
TODO 5: Implement beam search in lab/decoding.py. This is a bit harder than the previous exercises, so don't hesitate to ask for help.
You can test your implementation by verifying that setting the beam size to 1 gives you the same result as greedy decoding.
Can you get a better BLEU score than with greedy decoding? Try:
python lab/translate.py \
--cuda \
--model-file model.pt \
--search "beam_search" \
--beam-size 2 \
--input-file data/toy.test.bpe.fr \
--output-file toy.test.beam.2.en
You should get a BLEU score of around 28.1. This is pretty good considering that we didn't change the model at all! Try higher beam sizes.
You can try to improve your translation results by adding penalties for longer sentences, unknown words, etc. Try running your model on the full test set (data/test.bpe.fr) and report your best BLEU score.
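One common such penalty is length normalization: divide each hypothesis' log-probability by its length raised to some power before ranking, so that beam search stops favouring overly short outputs. A possible sketch (alpha is a tunable hyper-parameter, not something defined in this codebase):

```python
def length_normalized_score(sum_log_prob, length, alpha=0.6):
    """Score a hypothesis by a length-normalized log-probability rather than
    the raw sum (alpha=0 recovers the plain sum, alpha=1 the per-token average)."""
    return sum_log_prob / (length ** alpha)
```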
Solutions
If you are stuck at any point, you can find the solution code in the solutions branch.
What's next?
Feel free to fork this repo and train your own models. You can probably get better results with better training techniques (different optimizers/batch sizes, label smoothing, L2 regularization) and bigger models. Try variations on this transformer architecture (e.g. untie the word embeddings, play around with the residual connections and the layer norm, etc.). If you want to train on much bigger datasets, take a look at fairseq. It is a bit more complicated than this codebase but much more efficient.
Organizers
Jia Xu
<img align="left" height="100" src="images/jia_xu_pic.png" alt="Jia pic"/>Main instructor
Assistant Professor
Graduate Center and Hunter College
City University of New York (CUNY)
<br/>Paul Michel
<img align="left" height="100" src="images/paul_michel_pic.jpg" alt="Paul pic"/>Lab instructor
PhD Student
School of Computer Science
Carnegie Mellon University
<br/>Abdul Rafae Khan
<img align="left" height="100" src="images/abdul_rafae_khan_pic.jpg" alt="Abdul pic"/>Lab instructor
PhD Student
Graduate Center and Hunter College
City University of New York (CUNY)