TDS

Introduction

How to use TDS

  1. First, install DeepSpeed. For installation instructions, refer to DeepSpeed Installation.
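
    To check that DeepSpeed is importable afterwards (a quick sanity check, not a required step):

    import deepspeed
    print(deepspeed.__version__)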

  2. Copy the "tds" folder into your project and use "import tds as deepspeed" in place of "import deepspeed" in your code, as in the sketch below.
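
    A minimal sketch of the swap (the rest of your training script stays unchanged):

    # Before: import deepspeed
    # After: import TDS under the same name, so downstream code is untouched.
    import tds as deepspeed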

  3. If you want to use pipeline-parallel training, you must add code that tells your model the essential settings for its forward and backward passes. These settings specify the type of each tensor (covering both input data and hidden states), whether each tensor needs its gradients saved, and whether each tensor can be partitioned across GPUs to save memory. We take training GPT-2 as an example; the detailed code can be found in GPT-2.

    • Code using DeepSpeed
    def model_provider():
        """Build the model for GPT-2."""
        args = get_args()
        print_rank_0('building GPT2 model ...')
        if args.pipe_parallel_size == 0:
            model = GPT2Model(num_tokentypes=0, parallel_output=True)
        else:
            model = GPT2ModelPipe(num_tokentypes=0, parallel_output=True, topology=mpu.get_topology())
            model._megatron_batch_fn = get_batch_pipe
        return model
    
    • Code using TDS
    def model_provider():
        """Build the model for GPT-2."""
        args = get_args()
        print_rank_0('building GPT2 model ...')
        if args.pipe_parallel_size == 0:
            model = GPT2Model(num_tokentypes=0, parallel_output=True)
        else:
            model = GPT2ModelPipe(num_tokentypes=0, parallel_output=True, topology=mpu.get_topology())
            model._megatron_batch_fn = get_batch_pipe
            # The first input tensor carries the input embeddings and hidden states and needs its gradients saved; the second is the attention mask, which does not.
            model._input_grad = [True, False]
            # The first input tensor (input embeddings and hidden states) is float; the second (attention mask) is boolean.
            model._input_type = ['float', 'bool']
            # The input embeddings and hidden states can be partitioned across GPUs to save memory; the attention mask cannot.
            model._input_pipe_partitioned = [True, False]
        return model
    
  4. All other operations follow DeepSpeed and DeepSpeedExamples directly.
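
    As an illustration, here is a minimal sketch of the remaining wiring, assuming the Megatron-style helpers from the example above (get_args, model_provider) and a hypothetical train_data_iterator; the calls themselves are the standard DeepSpeed API, reached through the tds import:

    import tds as deepspeed

    args = get_args()
    model = model_provider()

    # initialize() wraps the model in a DeepSpeed (pipeline) engine, just as
    # with vanilla DeepSpeed.
    model_engine, optimizer, _, _ = deepspeed.initialize(
        args=args,
        model=model,
        model_parameters=[p for p in model.parameters() if p.requires_grad])

    for step in range(args.train_iters):
        # For pipeline-parallel training, train_batch() runs forward, backward,
        # and the optimizer step for one batch, fetching data through the
        # _megatron_batch_fn set in model_provider().
        loss = model_engine.train_batch(data_iter=train_data_iterator)  # train_data_iterator: hypothetical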

Examples

More examples, such as using TDS for GPT-2 and T5, can be found in CPM-Pretrain.

Citation

If you use the code, please cite the following paper:

@article{cpm-v1,
  title={CPM: A Large-scale Generative Chinese Pre-trained Language Model},
  author={Zhang, Zhengyan and Han, Xu and Zhou, Hao and Ke, Pei and Gu, Yuxian and Ye, Deming and Qin, Yujia and Su, Yusheng and Ji, Haozhe and Guan, Jian and Qi, Fanchao and Wang, Xiaozhi and Zheng, Yanan and Zeng, Guoyang and Cao, Huanqi and Chen, Shengqi and Li, Daixuan and Sun, Zhenbo and Liu, Zhiyuan and Huang, Minlie and Han, Wentao and Tang, Jie and Li, Juanzi and Sun, Maosong},
  year={2020}
}