Overcoming Catastrophic Forgetting beyond Continual Learning: Balanced Training for Neural Machine Translation
This repository contains the source code for our ACL 2022 paper Overcoming Catastrophic Forgetting beyond Continual Learning: Balanced Training for Neural Machine Translation (pdf). Our method is implemented on top of the open-source toolkit fairseq; the main modifications are in train.py and cokd_loss.py.
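For orientation, below is a minimal, self-contained sketch of the kind of criterion cokd_loss.py implements: a label-smoothed cross-entropy on the gold target interpolated with a KL term towards teacher predictions, weighted by --kd-alpha. The function name, tensor shapes, and the exact weighting convention here are assumptions made for illustration only; see the paper and cokd_loss.py for the actual formulation.

# Illustrative only -- NOT the repository's actual cokd_loss.py.
# Sketch of a KD-interpolated criterion: label-smoothed cross-entropy
# mixed with a KL term towards a teacher's predictions.
import torch
import torch.nn.functional as F

def kd_interpolated_loss(student_logits, teacher_logits, target,
                         kd_alpha=0.95, label_smoothing=0.1, pad_idx=1):
    """student_logits, teacher_logits: (num_tokens, vocab); target: (num_tokens,)."""
    lprobs = F.log_softmax(student_logits.float(), dim=-1)
    non_pad = target != pad_idx

    # Label-smoothed cross-entropy, as in fairseq's label_smoothed_cross_entropy.
    nll = F.nll_loss(lprobs, target, ignore_index=pad_idx, reduction="sum")
    smooth = -lprobs.sum(dim=-1)[non_pad].sum() / lprobs.size(-1)
    ce_loss = (1.0 - label_smoothing) * nll + label_smoothing * smooth

    # Per-token KL divergence towards the (detached) teacher distribution.
    teacher_probs = F.softmax(teacher_logits.float().detach(), dim=-1)
    kd = F.kl_div(lprobs, teacher_probs, reduction="none").sum(dim=-1)
    kd_loss = kd[non_pad].sum()

    # Interpolation weighted by --kd-alpha (which term gets the weight is an
    # assumption in this sketch; check cokd_loss.py for the actual convention).
    return kd_alpha * kd_loss + (1.0 - kd_alpha) * ce_loss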
Requirements
This system has been tested in the following environment.
- Python version = 3.8
- PyTorch version = 1.7
Replicate the TED results
Pre-processing
We use the tokenized TED dataset released by VOLT, which can be downloaded from here and pre-processed into subword units by prepare-ted-bilingual.sh.
We provide the pre-processed TED En-Es dataset in this repository. First, process the data into the fairseq format.
TEXT=./data
python preprocess.py --source-lang en --target-lang es \
--trainpref $TEXT/es-en.train \
--validpref $TEXT/es-en.valid \
--testpref $TEXT/es-en.test \
--destdir data-bin/tedbpe10kenes \
--nwordssrc 10240 --joined-dictionary --workers 16
Training
To train the Transformer baseline, run the following command.
data_dir=data-bin/tedbpe10kenes
save_dir=output/enes_base
python train.py $data_dir \
--fp16 --dropout 0.3 --save-dir $save_dir \
--arch transformer_wmt_en_de --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
--lr 0.0007 --min-lr 1e-09 \
--weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-tokens 4096 --update-freq 1 \
--no-progress-bar --log-format json --log-interval 100 --save-interval-updates 1000 \
--max-update 18000 --keep-interval-updates 10 --no-epoch-checkpoints
python scripts/average_checkpoints.py --inputs $save_dir \
--num-update-checkpoints 5 --output $save_dir/average-model.pt
To train the COKD model, run the following command.
data_dir=data-bin/tedbpe10kenes
save_dir=output/enes_cokd
python train.py $data_dir \
--fp16 --dropout 0.2 --kd-alpha 0.95 --num-teachers 1 --save-dir $save_dir \
--arch transformer_wmt_en_de --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
--lr 0.0007 --min-lr 1e-09 \
--weight-decay 0.0 --criterion cokd_loss --label-smoothing 0.1 --max-tokens 4096 --update-freq 1 \
--no-progress-bar --log-format json --log-interval 100 --save-interval-updates 1000 \
--max-update 18000 --keep-interval-updates 10 --no-epoch-checkpoints
python scripts/average_checkpoints.py --inputs $save_dir \
--num-update-checkpoints 5 --output $save_dir/average-model.pt
The above commands assume a machine with 8 GPUs. If you train with a different number of GPUs, adjust --update-freq so that the effective batch size stays at about 32K tokens per update (number of GPUs x --max-tokens x --update-freq).
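For reference, a quick sketch of this arithmetic (the helper name below is just for illustration): the commands above give 8 GPUs x 4096 tokens x update-freq 1 = 32768 tokens per update.

# Effective batch size (tokens per update) = num_gpus * max_tokens * update_freq.
def required_update_freq(num_gpus, max_tokens=4096, target_tokens=32768):
    return max(1, target_tokens // (num_gpus * max_tokens))

print(required_update_freq(4))  # 2 on a 4-GPU machine
print(required_update_freq(1))  # 8 on a single GPU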
Inference
Run the following command for inference.
python generate.py data-bin/tedbpe10kenes --path output/enes_cokd/average-model.pt --gen-subset test --beam 5 --batch-size 100 --remove-bpe --lenpen 1 > out
# because fairseq's output is unordered, we need to recover its order
grep ^H out | cut -f1,3- | cut -c3- | sort -k1n | cut -f2- > pred.es
sed -r 's/(@@ )|(@@ ?$)//g' data/es-en.test.es > ref.es
perl multi-bleu.perl ref.es < pred.es
The expected BLEU scores are 40.86 for the Transformer baseline and 42.50 for the COKD model.
Citation
If you find the resources in this repository useful, please cite as:
@inproceedings{cokd,
title = {Overcoming Catastrophic Forgetting beyond Continual Learning: Balanced Training for Neural Machine Translation},
author = {Chenze Shao and Yang Feng},
booktitle = {Proceedings of ACL 2022},
year = {2022},
}