ContinualMT

Introduction

We introduce ContinualMT, an extensible continual learning framework for neural machine translation (NMT), designed to facilitate research on continual learning (CL) for NMT.

Our repository provides PyTorch implementations of a suite of state-of-the-art (SoTA) methods, all following a unified training and evaluation protocol. The currently supported methods are:

We are actively working on implementing other methods and adding them to this framework!

Dataset

Currently, our framework focuses on multi-stage domain-incremental training of NMT systems. Within the framework, you can use machine translation data from different domains for domain-incremental training. We provide a representative multi-domain machine translation dataset, the OPUS multi-domain dataset, which comprises German-English parallel data from five domains: Medical, Law, IT, Koran, and Subtitles. The dataset can be found here.
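For orientation, each domain in the extracted dataset is a set of parallel plain-text files with one sentence per line. The layout sketched below is only illustrative; the exact directory and file names depend on the release you download.

medical/    train.de  train.en  dev.de  dev.en  test.de  test.en
law/        train.de  train.en  dev.de  dev.en  test.de  test.en
it/         (same file pattern)
koran/      (same file pattern)
subtitles/  (same file pattern)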

Architecture

Our implementation is built upon fairseq, with the following modifications:

./approaches: code for supported continual learning approaches
./cl_scripts: bash scripts for continual training
./cl_scripts_slurm: slurm scripts for continual training
./lcheckpoints: all training checkpoints are saved in this folder
./logs: training logs
./pretrained_models: folder for pretrained NMT models
./task_sequence: reference sequences for OPUS multi-domain MT data

Installation

First, build the environment from the provided YAML file: conda env create --name CLMT --file CLMT.yaml

Then install fairseq (pip install --editable .), Moses, and fastBPE.
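For reference, a complete setup might look like the commands below. The clone locations are our own choice rather than paths required by this repository, so adjust them to your machine.

# create and activate the environment from the provided YAML file
conda env create --name CLMT --file CLMT.yaml
conda activate CLMT

# install this fairseq-based codebase in editable mode (run from the repository root)
pip install --editable .

# fetch the Moses scripts and fastBPE, and compile the fastBPE binary
git clone https://github.com/moses-smt/mosesdecoder.git
git clone https://github.com/glample/fastBPE.git
cd fastBPE && g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast && cd ..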

Preparing and Preprocessing

Pre-trained Model

Download the pre-trained WMT19 German-English model from fairseq, along with the dictionaries and the bpecodes.
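A sketch of this step, assuming the publicly released fairseq WMT19 De-En ensemble archive (which bundles the checkpoints, dictionaries, and bpecodes). The archive name and URL follow the public fairseq release and may change, so double-check them against the fairseq examples page if the download fails.

# download and unpack the pre-trained WMT19 De-En model into the pretrained_models folder
mkdir -p ./pretrained_models
wget https://dl.fbaipublicfiles.com/fairseq/models/wmt19.de-en.joined-dict.ensemble.tar.gz
tar -xzvf wmt19.de-en.joined-dict.ensemble.tar.gz -C ./pretrained_models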

Data

First, navigate to the data folder: cd ./examples/translation. Make sure the paths to the Moses scripts, fastBPE, the model dictionaries, and the BPE codes are set in the preparation scripts.
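The variable names inside the preparation scripts may differ from the ones shown here, but the paths to fill in are of this kind (all values below are placeholders):

SCRIPTS=/path/to/mosesdecoder/scripts          # Moses tokenization and cleaning scripts
BPEROOT=/path/to/fastBPE                       # directory containing the compiled fast binary
BPE_CODE=/path/to/pretrained_models/bpecodes   # BPE codes released with the WMT19 model
DICT_DE=/path/to/pretrained_models/dict.de.txt # source dictionary of the pre-trained model
DICT_EN=/path/to/pretrained_models/dict.en.txt # target dictionary of the pre-trained model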

For general-domain MT data, simply run the provided preprocessing script prepare-wmt17de2en.sh, which automatically downloads and prepares the data.

For the domain-incremental training data, download the multi-domain data and unzip it. Then process each domain with the prepare-domain-adapt.sh script.
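A minimal sketch of these two steps, assuming the multi-domain archive unpacks into one sub-directory per domain. The archive name, the directory names, and the argument passed to prepare-domain-adapt.sh are assumptions, so check the script header for its actual interface.

# general-domain data: downloads and prepares WMT De-En
bash prepare-wmt17de2en.sh

# multi-domain data: unzip, then run the domain script once per domain
unzip multi_domain.zip -d multi_domain
for domain in medical law it koran subtitles; do
    bash prepare-domain-adapt.sh multi_domain/$domain
done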

Finally, use the preprocess.sh script to prepare the binary files for fairseq.
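preprocess.sh presumably wraps fairseq's standard binarization, reusing the pre-trained model's dictionaries. For a single domain, the underlying call looks roughly like this (file paths and the output directory are placeholders):

fairseq-preprocess \
    --source-lang de --target-lang en \
    --trainpref medical/train.bpe --validpref medical/dev.bpe --testpref medical/test.bpe \
    --srcdict ./pretrained_models/dict.de.txt --tgtdict ./pretrained_models/dict.en.txt \
    --destdir data-bin/medical --workers 8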

Domain Incremental Training

We offer training bash scripts for all supported approaches in ./cl_scripts and ./cl_scripts_slurm. For more detailed information, please refer to the individual readme files located in each directory.
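Launching a stage then amounts to running the script for the chosen approach, either locally or through slurm. The script names below are hypothetical placeholders; consult the readme files for the real names and arguments.

# local run (hypothetical script name)
bash ./cl_scripts/train_<approach>.sh

# cluster run via slurm (hypothetical script name)
sbatch ./cl_scripts_slurm/train_<approach>.slurm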

Extending for New Approaches

Extending our framework is straightforward: to integrate a new CL approach, you only need to modify the ./approaches, ./cl_scripts, and ./cl_scripts_slurm directories, as illustrated below.
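For instance, integrating a hypothetical approach called my_method would roughly involve adding the following pieces (all names are illustrative, not prescribed by the framework):

./approaches/my_method/                    # implementation of the new CL logic
./cl_scripts/train_my_method.sh            # bash script chaining the domain stages for the new approach
./cl_scripts_slurm/train_my_method.slurm   # matching slurm submission script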

Reference

If you find this framework useful, we would greatly appreciate it if you star the repository and cite our paper:

@misc{wu2024fmalloc,
      title={F-MALLOC: Feed-forward Memory Allocation for Continual Learning in Neural Machine Translation}, 
      author={Junhong Wu and Yuchen Liu and Chengqing Zong},
      year={2024},
      eprint={2404.04846},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}