Multilingual-GLM

This repository contains the code of mGLM: a multilingual variant of GLM, a general language model trained with an autoregressive blank infilling objective.

You may want to check out our interactive demo based on mGLM that generates a brief Chinese/English summary for your article in any commonly used language.

The backbone structure of this model is based on GLM: General Language Model Pretraining with Autoregressive Blank Infilling (Du et al., ACL 2022)

The code is mainly based on THUDM/GLM. Parts of it are also based on Megatron-LM and PET.

Parameters

Here we provide a comparison between the sizes of different multilingual language models.

Model        Parameters
mBERT        180M
XLM-R        550M
MT5-Large    1.2B
GLM-Large    1B

Pretrained Models

You can download our pretrained checkpoint and specify the checkpoint path in the scripts. The multilingual tokenizer and the configuration file of our model are already included in this repo.
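
For example, after downloading and extracting the checkpoint, point the model config script at it. The variable name below is an assumption based on the GLM script conventions; check your copy of the config file for the exact name:

  # config_tasks/model_blocklm_multilingual_large.sh
  CHECKPOINT_PATH=/path/to/the/downloaded/checkpoint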

Test Results

Tasks in XTREME Benchmark

Model        XNLI    PAWS-X    XQuAD        MLQA           TyDiQA
GLM-Large    75.6    85.2      83.6/71.9    67.52/54.34    69.6/55.6
MT5-Large    81.1    88.9      77.8/61.5    71.2/51.7      69.9/52.2

Neural Cross Lingual Summarization

The following table contains our test results on the NCLS English-to-Chinese (EN2ZHSUM) dataset. The metrics are Rouge-1/Rouge-2/Rouge-L.

Model                     NCLS English to Chinese
GLM-Large                 50.27/30.94/38.44
MT5-Large (Reproduced)    42.31/22.40/31.33

Get Started

<!-- ### Docker Image We prepare two docker images based on CUDA 10.2 and CUDA 11.2. You can pull the pre-built images from Docker Hub and run with docker v19.03+ ```shell docker run --gpus all --rm -it --ipc=host zxdu20/glm-cuda102 ``` or replace `glm-cuda102` with `glm-cuda112`. You can also modify the image according to your requirements in [docker/cuda102.dockerfile](docker/cuda102.dockerfile) and build the image yourself ```shell docker build -f cuda102.dockerfile . -t glm-cuda102 ``` -->

Manual Installation

Please first install PyTorch:

  pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html --no-cache-dir

then install apex. Finally, install the other dependencies:

  pip3 install -r requirements.txt
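
If apex is not already available, one common way to build it from source is shown below; the exact build flags vary across apex versions, so check the NVIDIA apex README for the options matching your setup:

  git clone https://github.com/NVIDIA/apex
  cd apex
  pip3 install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./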

Usage

XTREME

  bash scripts/ds_finetune_superglue.sh \
     config_tasks/model_blocklm_multilingual_large.sh \
     config_tasks/task_xnli.sh
  bash scripts/ds_finetune_seq2seq.sh  \
    config_tasks/model_blocklm_multilingual_large.sh  \
    config_tasks/seq_mlqa.sh

Cross-lingual Summary

  bash scripts/ds_finetune_summary.sh  \
    config_tasks/model_blocklm_multilingual_large.sh  \
    config_tasks/seq_ncls.sh

Blank Filling (Interactive)

  bash scripts/generate_block.sh  \
    config_tasks/model_blocklm_multilingual_large.sh
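
For example, an interactive session might look roughly like the following (illustrative only; the exact mask tokens and output format depend on the tokenizer and the generation script):

  Context: Tsinghua University is located in [MASK].
  GLM: Beijing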

Model Parallelism

If you encounter a CUDA out-of-memory error, which means your GPU memory is limited, you can try model parallelism to divide the parameters across multiple GPUs. Take two-way model parallelism as an example. First run change_mp.py to split the checkpoint:

  python3 change_mp.py path_to_the_checkpoint 2

Then update the checkpoint path in the model config file (such as config_tasks/model_blocklm_multilingual_large.sh) and change MP_SIZE in the script (such as scripts/ds_finetune_superglue.sh) to 2.
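
Concretely, the two edits look roughly like this (the checkpoint path variable name is an assumption based on the GLM script conventions; the path should point to the split checkpoint produced by change_mp.py):

  # config_tasks/model_blocklm_multilingual_large.sh
  CHECKPOINT_PATH=/path/to/the/split/checkpoint
  # scripts/ds_finetune_superglue.sh
  MP_SIZE=2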

Pretrain

Run the following script to pre-train the mGLM-Large model:

  bash scripts/ds_pretrain_nvidia.sh config/ds_multi_blockta_large.sh

The script scripts/ds_pretrain_nvidia.sh launches the training program with DeepSpeed. You should change NUM_WORKERS and NUM_GPUS_PER_WORKER to the number of workers and the number of GPUs per worker. Also change HOST_FILE_PATH to the path of an OpenMPI-style hostfile. More details about the DeepSpeed launcher can be found here.
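
An OpenMPI-style hostfile lists one worker per line together with the number of GPU slots it provides, for example:

  node1 slots=8
  node2 slots=8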

The file config/ds_multi_blockta_large.sh defines the hyperparameters for pretraining. Most of the arguments are fairly self-explanatory. Specifically, --train-data can be multiple keywords defined in NAMED_CORPORA in data_utils/corpora.py. The hyperparameters of the optimizer are defined in the corresponding json file under config. The semantics of the json file can be found here.
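
As a rough illustration, a DeepSpeed config json of this kind typically contains entries such as the following (the values are placeholders, not the settings used for mGLM):

  {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "optimizer": {
      "type": "Adam",
      "params": {
        "lr": 0.0001,
        "weight_decay": 0.01
      }
    },
    "fp16": {
      "enabled": true
    }
  }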

MT5 Reproduction

The code for reproducing the MT5 experiments is at mt5/finetune_mt5.py. We use a tool called wandb to track our experiments. After signing up for a new account, use wandb login --relogin to log in. You can also use wandb offline to stop wandb from syncing your experiments online.
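
In short (installing wandb via pip is an assumption; it may already be pulled in by requirements.txt):

  pip3 install wandb
  wandb login --relogin   # log in after signing up
  wandb offline           # optional: disable online syncing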

If you only want to use one GPU to train, use

  cd mt5
  python3 finetune_mt5.py scisummnet simple

to train on the scisummnet dataset.

Our distributed training is automated with Accelerate. accelerate config sets up the configuration for distributed training. accelerate test runs a sanity check.

  cd mt5
  accelerate launch finetune_mt5.py scisummnet simple

runs the training on the scisummnet dataset.

Citation

Citation for the GLM paper:

@inproceedings{du-etal-2022-glm,
    title = "{GLM}: General Language Model Pretraining with Autoregressive Blank Infilling",
    author = "Du, Zhengxiao  and
      Qian, Yujie  and
      Liu, Xiao  and
      Ding, Ming  and
      Qiu, Jiezhong  and
      Yang, Zhilin  and
      Tang, Jie",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.26",
    doi = "10.18653/v1/2022.acl-long.26",
    pages = "320--335",
    abstract = "There have been various types of pretraining architectures including autoencoding models (e.g., BERT), autoregressive models (e.g., GPT), and encoder-decoder models (e.g., T5). However, none of the pretraining frameworks performs the best for all tasks of three main categories including natural language understanding (NLU), unconditional generation, and conditional generation. We propose a General Language Model (GLM) based on autoregressive blank infilling to address this challenge. GLM improves blank filling pretraining by adding 2D positional encodings and allowing an arbitrary order to predict spans, which results in performance gains over BERT and T5 on NLU tasks. Meanwhile, GLM can be pretrained for different types of tasks by varying the number and lengths of blanks. On a wide range of tasks across NLU, conditional and unconditional generation, GLM outperforms BERT, T5, and GPT given the same model sizes and data, and achieves the best performance from a single pretrained model with 1.25{\mbox{$\times$}} parameters of BERT Large , demonstrating its generalizability to different downstream tasks.",
}

The citation for the Multilingual GLM paper will be added when the paper is released.