transformer-aan
Source code for "Accelerating Neural Transformer via an Average Attention Network"
The source code is developed on top of <a href="https://github.com/thumt/THUMT">THUMT</a>.
The THUMT version used for the experiments in our paper was downloaded on Jan 11, 2018.
About AAN Structure
We introduce two sub-layers for the AAN in our ACL paper: an FFN layer (Eq. (1)) and a gating layer (Eq. (2)). However, in extensive follow-up experiments we observed that the FFN layer is redundant and can be removed without loss of translation quality. Removing the FFN layer also reduces the number of model parameters, slightly improves the training speed, and largely improves the decoding speed.
For re-implementation, we suggest that other researchers use the AAN model without the FFN sub-layer. See how we disable this layer.
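For reference, here is a minimal NumPy sketch of the AAN sub-layer described above: the cumulative average of the target-side inputs, the optional FFN transform (Eq. (1)), and the gating layer (Eq. (2)). This is an illustration of the structure only, not the repository's TensorFlow code; the function names and parameter shapes are our own, and the residual connection and layer normalization that follow the sub-layer are omitted.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Position-wise feed-forward transform of the cumulative average (Eq. (1)).
    return np.dot(np.maximum(0.0, np.dot(x, W1) + b1), W2) + b2

def aan_sublayer(Y, Wg, bg, ffn_params=None):
    # Y: target-side layer inputs, shape [n_tgt, d].
    n_tgt, d = Y.shape
    # Cumulative average: g_t = (1 / t) * sum_{k <= t} y_k.
    G = np.cumsum(Y, axis=0) / np.arange(1.0, n_tgt + 1.0)[:, None]
    if ffn_params is not None:  # corresponds to use_ffn=True: apply Eq. (1); otherwise skip it
        G = ffn(G, *ffn_params)
    # Gating layer (Eq. (2)): input gate i_t and forget gate f_t computed from [y_t; g_t].
    gates = 1.0 / (1.0 + np.exp(-(np.dot(np.concatenate([Y, G], axis=-1), Wg) + bg)))
    i, f = np.split(gates, 2, axis=-1)
    return i * Y + f * G  # residual connection and layer normalization are omitted here
```

Here Wg is assumed to have shape [2d, 2d] and bg shape [2d], so the gates split into an input gate and a forget gate of width d each; passing ffn_params=None corresponds to disabling the FFN sub-layer, as we recommend.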
Other Implementations
- Marian: an efficient NMT toolkit implemented in C++.
- Neutron: a PyTorch NMT toolkit.
- translate: a fairseq-based NMT toolkit.
- OpenNMT: a PyTorch NMT toolkit.
File structure:
- train.sh: provides the training script with the configuration we used.
- test.sh: provides the testing script.
- The directories train and test are generated on the WMT14 en-de translation task.
  - train/eval/log records the approximate BLEU score on the development set during training.
  - test/ contains the decoded development and test sets, for researchers interested in the translations generated by our model.
The processed WMT14 en-de dataset can be found at <a href="https://drive.google.com/open?id=15WRLfle66CO1zIGKbyz0FsFmUcINyb4X">Transformer-AAN-Data</a>. (The original files were downloaded from the <a href="https://nlp.stanford.edu/projects/nmt/">Stanford NMT website</a>.)
Requirements
- Python: 2.7
- TensorFlow >= 1.4.1 (the version used for the experiments in our paper is 1.4.1)
Training Parameters
batch_size=3125,device_list=[0],eval_steps=5000,train_steps=100000,save_checkpoint_steps=1500,shared_embedding_and_softmax_weights=true,shared_source_target_embedding=false,update_cycle=8,aan_mask=True,use_ffn=False
- train_steps: the total number of training steps; we used 100000 in most experiments.
- eval_steps: we compute an approximate BLEU score on the development set every 5000 training steps.
- shared_embedding_and_softmax_weights: we share the target-side word embedding and the target-side pre-softmax weights.
- shared_source_target_embedding: we use separate source and target vocabularies, so the source-side and target-side word embeddings are not shared.
- aan_mask:
  - This setting enables the mask-matrix multiplication for the cumulative-average computation (see the sketch after this parameter list).
  - Without this setting, the native tf.cumsum() implementation is used.
  - In practice, the speed of both implementations is similar.
  - For long target sentences, we recommend the native implementation, because it is more memory-efficient.
- use_ffn:
  - With this setting, the AAN model includes the FFN layer presented in Eq. (1) of our paper.
  - Why do we add this option?
    - The FFN layer introduces many model parameters and significantly slows down our model.
    - Without the FFN, our AAN achieves very similar performance, as shown in Table 2 of our paper.
    - Furthermore, we were surprised to find that in some cases removing the FFN improves the AAN's performance.
- batch_size, device_list, update_cycle: these control parallel training. One training step proceeds as follows:

      for device_i in device_list:            # runs in parallel
          for cycle_i in range(update_cycle):  # runs in sequence
              train a batch of size `batch_size`
              collect gradients and costs
      update the model

  Therefore, the actual training batch size is batch_size x len(device_list) x update_cycle (e.g., 3125 x 1 x 8 = 25000 with the configuration above).
- In our paper, we train the model on one GPU card, so we set device_list to [0]. For researchers with more GPU cards available, we encourage you to decrease update_cycle and enlarge device_list, which improves training speed. In particular, training one WMT14 en-de model with
batch_size=3125, device_list=[0,1,2,3,4,5,6,7], update_cycle=1
takes less than one day.
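As referenced in the aan_mask bullet above, the following small NumPy sketch illustrates that the mask-matrix multiplication and the cumulative-sum implementation compute exactly the same cumulative averages (variable names are illustrative, not the repository's code). It also shows why the mask variant is less memory-friendly for long targets: it materializes an n_tgt x n_tgt matrix.

```python
import numpy as np

np.random.seed(0)
n_tgt, d = 5, 8
Y = np.random.randn(n_tgt, d)

# aan_mask=True style: row t of the lower-triangular mask holds the weight 1/t
# for positions 1..t, so M.dot(Y) yields the cumulative averages directly.
M = np.tril(np.ones((n_tgt, n_tgt))) / np.arange(1.0, n_tgt + 1.0)[:, None]
G_mask = np.dot(M, Y)

# aan_mask=False style: the same quantity via a cumulative sum (tf.cumsum in the code),
# divided by the number of accumulated positions.
G_cumsum = np.cumsum(Y, axis=0) / np.arange(1.0, n_tgt + 1.0)[:, None]

assert np.allclose(G_mask, G_cumsum)
```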
Discussions
We have received several questions and comments from other researchers, and we would like to share some of the discussion here.
- Why can the AAN accelerate the Transformer by a factor of 4~7?
  - The acceleration is measured against a Transformer without the caching strategy.
  - In theory, suppose the source and target sentences have lengths n_src and n_tgt respectively, and the model dimension is d. In one decoding step, the original Transformer decoder has a computational complexity of O(n_tgt * d^2) for self-attention, plus O(n_src * d^2) for cross-attention, plus O(d^2) for the FFN. By contrast, the AAN decoder has a complexity of O(d^2) for the AAN (FFN + gate) plus O(n_src * d^2) for cross-attention.
  - Therefore, the theoretical acceleration is around (n_tgt + n_src) / n_src, and the longer the target sentence is, the larger the acceleration will be (see the toy calculation after this list).
- More discussions are welcome :).
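As a rough illustration of the ratio derived in the discussion above, the toy calculation below compares the per-step decoder costs of the two models for a few target lengths. The cost functions count only the d^2 terms listed above and ignore constant factors and the FFN width, so the printed ratios are indicative rather than measured speedups; all names here are illustrative.

```python
def transformer_step_cost(n_src, n_tgt, d):
    # Baseline decoder step without caching: self-attention over all generated
    # tokens, cross-attention over the source, and the position-wise FFN.
    return n_tgt * d ** 2 + n_src * d ** 2 + d ** 2

def aan_step_cost(n_src, n_tgt, d):
    # AAN decoder step: constant-cost AAN (FFN + gate) plus cross-attention.
    return d ** 2 + n_src * d ** 2

if __name__ == "__main__":
    n_src, d = 30, 512
    for n_tgt in (10, 30, 90):
        ratio = float(transformer_step_cost(n_src, n_tgt, d)) / aan_step_cost(n_src, n_tgt, d)
        # The ratio approaches (n_tgt + n_src) / n_src, so longer targets gain more.
        print("n_tgt=%d  per-step ratio=%.2f" % (n_tgt, ratio))
```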
Citation
Please cite the following paper:
Biao Zhang, Deyi Xiong and Jinsong Su. Accelerating Neural Transformer via an Average Attention Network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.
@InProceedings{zhang-Etal:2018:ACL2018accelerating,
author = {Zhang, Biao and Xiong, Deyi and Su, Jinsong},
title = {Accelerating Neural Transformer via an Average Attention Network},
booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics},
month = {July},
year = {2018},
address = {Melbourne, Australia},
publisher = {Association for Computational Linguistics},
}
Contact
For any further comments or questions about AAN, please email <a href="mailto:b.zhang@ed.ac.uk">Biao Zhang</a>.