transformer-aan

Source code for "Accelerating Neural Transformer via an Average Attention Network"

The source code is developed upon <a href="https://github.com/thumt/THUMT">THUMT</a>.

The THUMT version used for the experiments in our paper was downloaded on Jan 11, 2018.

About AAN Structure

We introduce two sub-layers for the AAN in our ACL paper: an FFN layer (Eq. (1)) and a gating layer (Eq. (2)). However, after extensive experiments, we observed that the FFN layer is redundant and can be removed without loss of translation quality. In addition, removing the FFN layer reduces the number of model parameters and slightly improves the training speed; it also largely improves the decoding speed.

For re-implementation, we suggest that other researchers use the AAN model without the FFN sub-layer! See how we disable this layer.
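For concreteness, below is a minimal NumPy sketch of the AAN decoder sub-layer as we read it from the paper: a cumulative average over the target-side inputs followed by the gating layer of Eq. (2), with the FFN of Eq. (1) skipped to match the recommended use_ffn=False setting. The function and variable names are ours for illustration and do not correspond to the THUMT code; the residual connection and layer normalization are omitted.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def aan_sublayer(y, W_i, W_f, b_i, b_f):
        """Average attention sub-layer without the FFN (use_ffn=False).

        y:        target-side inputs, shape (n_tgt, d)
        W_i, W_f: gating weights, shape (2*d, d); b_i, b_f: biases, shape (d,)
        """
        n_tgt, d = y.shape

        # Cumulative average: g_j = (1 / j) * sum_{k <= j} y_k
        g = np.cumsum(y, axis=0) / np.arange(1, n_tgt + 1)[:, None]

        # Gating layer (Eq. (2)): input/forget gates computed from [y_j; g_j]
        yg = np.concatenate([y, g], axis=-1)   # (n_tgt, 2*d)
        i = sigmoid(yg @ W_i + b_i)            # input gate
        f = sigmoid(yg @ W_f + b_f)            # forget gate
        return i * y + f * g                   # gated combination

    # Toy usage with random parameters.
    rng = np.random.default_rng(0)
    n_tgt, d = 5, 8
    y = rng.standard_normal((n_tgt, d))
    W_i, W_f = rng.standard_normal((2 * d, d)), rng.standard_normal((2 * d, d))
    print(aan_sublayer(y, W_i, W_f, np.zeros(d), np.zeros(d)).shape)  # (5, 8)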

Other Implementations

File structure:

  1. train.sh: provides the training script with our used configuration.
  2. test.sh: provides the testing script.
  3. train and test: directories generated on the WMT14 En-De translation task.

The processed WMT14 En-De dataset can be found at <a href="https://drive.google.com/open?id=15WRLfle66CO1zIGKbyz0FsFmUcINyb4X">Transformer-AAN-Data</a>. (The original files were downloaded from the <a href="https://nlp.stanford.edu/projects/nmt/">Stanford NMT website</a>.)

Requirements

Training Parameters

batch_size=3125,device_list=[0],eval_steps=5000,train_steps=100000,save_checkpoint_steps=1500,shared_embedding_and_softmax_weights=true,shared_source_target_embedding=false,update_cycle=8,aan_mask=True,use_ffn=False
  1. train_steps: the total number of training steps; we used 100000 in most experiments.
  2. eval_steps: we compute an approximate BLEU score on the development set every 5000 training steps.
  3. shared_embedding_and_softmax_weights: we shared the target-side word embedding and the target-side pre-softmax parameters.
  4. shared_source_target_embedding: we used separate source and target vocabularies, so the source-side and target-side word embeddings were not shared.
  5. aan_mask: whether to implement the cumulative-average operation of the AAN with a mask matrix during training (rather than a step-by-step accumulation); we set it to True.
  6. use_ffn: whether to keep the FFN sub-layer inside the AAN; we set it to False, as explained in "About AAN Structure" above.
  7. batch_size, device_list, update_cycle: these control parallel training. For one training step, the procedure is as follows (a gradient-accumulation sketch in code is given after this list):
for device_i in device_list:              # this loop runs in parallel
    for cycle_i in range(update_cycle):   # this loop runs in sequence
        train a batch of size `batch_size`
        collect gradients and costs
update the model

Therefore, the actual training batch size is: batch_size x len(device_list) x update_cycle.
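To illustrate the update rule above in code, here is a framework-agnostic NumPy sketch of gradient accumulation over device_list and update_cycle; compute_gradients and apply_gradients are hypothetical callbacks standing in for the actual THUMT machinery. With the configuration above (batch_size=3125, one device, update_cycle=8), the effective batch size per update is 3125 x 1 x 8 = 25000.

    import numpy as np

    def training_step(params, batches, device_list, update_cycle,
                      compute_gradients, apply_gradients):
        """One training step: accumulate gradients over all devices and update
        cycles, then apply a single model update. The effective batch size is
        batch_size * len(device_list) * update_cycle."""
        accumulated = [np.zeros_like(p) for p in params]
        total_cost = 0.0
        for device_i in device_list:                # runs in parallel in the real setup
            for cycle_i in range(update_cycle):     # runs sequentially on each device
                batch = next(batches)               # a batch of size `batch_size`
                grads, cost = compute_gradients(params, batch, device_i)
                accumulated = [a + g for a, g in zip(accumulated, grads)]
                total_cost += cost
        apply_gradients(params, accumulated)        # single update per training step
        return total_cost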

Discussions

We have received several questions and comments from other researchers, and we'd like to share some of the discussion here.

  1. Why can the AAN accelerate the Transformer by a factor of 4~7?
    The reported acceleration is for the Transformer without a caching strategy.
    In theory, suppose the source and target sentences have lengths n_src and n_tgt respectively, and the model dimension is d. For one step of the original Transformer decoder, the computational complexity is O(n_tgt d^2) for self-attention, plus O(n_src d^2) for cross-attention, plus O(d^2) for the FFN. By contrast, the AAN-based decoder costs O(d^2) for the AAN FFN and gate, plus O(n_src d^2) for cross-attention.
    Therefore, the theoretical acceleration is around (n_tgt + n_src) / n_src, and the longer the target sentence is, the larger the acceleration will be.
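To make the per-step cost concrete: during decoding, the cumulative average used by the AAN can be maintained incrementally, so each new target position costs O(d) for the average plus O(d^2) for the gate, independent of how many target words have already been generated. A small sketch with our own (non-THUMT) names:

    import numpy as np

    class RunningAverage:
        """Maintains g_t = (1/t) * sum_{k <= t} y_k incrementally, so the AAN
        never needs to attend over previously generated target positions."""
        def __init__(self, d):
            self.g = np.zeros(d)
            self.t = 0

        def update(self, y_t):
            self.t += 1
            # g_t = ((t - 1) * g_{t-1} + y_t) / t  -- O(d) per decoding step
            self.g = ((self.t - 1) * self.g + y_t) / self.t
            return self.g

    # The running average matches the full cumulative average.
    rng = np.random.default_rng(0)
    ys = rng.standard_normal((4, 8))
    avg = RunningAverage(8)
    for t, y in enumerate(ys, start=1):
        assert np.allclose(avg.update(y), ys[:t].mean(axis=0))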

Citation

Please cite the following paper:

Biao Zhang, Deyi Xiong and Jinsong Su. Accelerating Neural Transformer via an Average Attention Network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.

@InProceedings{zhang-Etal:2018:ACL2018accelerating,
  author    = {Zhang, Biao and Xiong, Deyi and Su, Jinsong},
  title     = {Accelerating Neural Transformer via an Average Attention Network},
  booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics},
  month     = {July},
  year      = {2018},
  address   = {Melbourne, Australia},
  publisher = {Association for Computational Linguistics},
}

Contact

For any further comments or questions about AAN, please email <a href="mailto:b.zhang@ed.ac.uk">Biao Zhang</a>.