Awesome

📄 Sequence Parallel

🔗Table of Contents

📄 Sequence Parallel

📌 Introduction

In this codebase, we implemented BERT with sequence parallelism. Sequence parallelism splits the input tensor and intermediate activation along the sequence dimension. This method can achieve better memory efficiency and allows us to train with larger batch size and longer sequence length.

Paper: Sequence Parallelism: Long Sequence Training from System Perspective

🛠 Environment Setup

To run this codebase, the following environment is required:

CUDA: 11.3
Python: > 3.7
PyTorch: 1.11.0

You can follow the script below to set up the environment for this codebase.

# create conda environment
conda create -n seq python=3.9
conda activate seq

# install PyTorch
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113

# install nvidia apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" git+https://github.com/NVIDIA/apex.git@22.03

# install apex
pip install -v -r requirements.txt

🪅 Dataset Preparation

For this codebase, we will use the Wikipedia dataset and process the dataset with Megatron-LM's preprocessing script.

Step 1: Download and extract the Wikipedia Dataset

You can use the following script to download and extract the Wikipedia dataset. The extracted corpus will be stored in ./dataset/raw/corpus.json.

Execute the scripts in the root directory of this codebase

# go to the root directory
cd Sequence-Parallelism

# create dataset workspace
mkdir dataset && cd ./dataset

# download
mkdir raw && cd ./raw 
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

# install wiki extractor
pip install git+https://github.com/FrankLeeeee/wikiextractor.git@v3.0.6

# extract text
wikiextractor --json enwiki-latest-pages-articles.xml.bz2
cat text/*/* > ./corpus.json

Step 2: Process the Wikipedia dataset

The following commands can be used to process the Wikipedia dataset. Many thanks to Megatron-LM for providing a preprocessing script to generate the corpus file.

# go to the root directory
cd Sequence-Parallelism

# download vocab file
mkdir vocab && cd ./vocab
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt
cd ..

# clone the Megatron-LM repository
git clone https://github.com/NVIDIA/Megatron-LM.git
cd ./Megatron-LM
git checkout b037a69eb2622bd824d88cc230ada94077b3716f

# run processing
python tools/preprocess_data.py \
    --input ../dataset/raw/corpus.json \
    --output-prefix my-bert \
    --vocab ../vocab/bert-large-uncased-vocab.txt \
    --dataset-impl mmap \
    --tokenizer-type BertWordPieceLowerCase \
    --split-sentences \
    --workers 48

# move the processed data to the dataset directory
cd ..
mkdir -p dataset/processed
mv Megatron-LM/my-bert_text_sentence.* ./dataset/processed

After running these commands, you should see the following files in dataset/processed:

my-bert_text_sentence.bin
my-bert_text_sentence.idx

🚀 Run the codebase

We provided train.py for you to execute training. Before invoking the script, there are several steps to perform.

Step 1. Set data path and vocab path

At the top of config.py, you can see two global variables DATA_PATH and VOCAB_FILE_PATH.

DATA_PATH = './dataset/processed/my-bert_text_sentence'
VOCAB_FILE_PATH = './vocab/bert-large-uncased-vocab.txt'

DATA_PATH refers to the path to the data file generated by Megatron's script. For example, in the section above, you should get two data files (my-bert_text_sentence.bin and my-bert_text_sentence.idx). You just need to DATA_PATH to the path to the bin file without the file extension.

The VOCAB_FILE_PATH refers to the path to the vocabulary downloaded when you prepare the dataset (e.g. bert-large-uncased-vocab.txt).

Step 3. Make Dataset Helper

Build BERT dataset helper. Requirements are CUDA, g++, pybind11 and make.

cd ./data/datasets
make

Step 3. Configure your parameters

In the config.py provided, a set of parameters are defined including training scheme, model, etc. You can also modify the ColossalAI setting. For example, if you wish to parallelize over the sequence dimension on 8 GPUs. You can change size=4 to size=8. If you wish to use pipeline parallelism, you can set pipeline=<num_of_pipeline_stages>.

Step 4. Invoke parallel training

Lastly, you can start training with sequence parallelism with the following command. You need to replace <num-of-gpus> with the number of GPUs you want to use.

python -m torch.distributed.launch --nproc_per_node <num-of-gpus> --master_addr localhost --master_port 29500 train.py