TDS

Introduction

How to use TDS

  1. First, install DeepSpeed. For installation instructions, refer to DeepSpeed Installation.
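
    To check that DeepSpeed is importable afterwards (a quick sanity check, not a required step):

    import deepspeed
    print(deepspeed.__version__)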

  2. Copy the "tds" folder into your project and use "import tds as deepspeed" in place of "import deepspeed" in your code, as in the sketch below.
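
    A minimal sketch of the swap (the rest of your training script stays unchanged):

    # Before: import deepspeed
    # After: import TDS under the same name, so downstream code is untouched.
    import tds as deepspeed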

  3. If you want to use pipeline-parallel training, you must add code that tells your model the essential settings for its forward and backward passes. These settings specify the type of each tensor (covering both input data and hidden states), whether each tensor needs its gradients saved, and whether each tensor can be partitioned across GPUs to save memory. We take training GPT-2 as an example; the detailed code can be found in GPT-2.

    • Code using DeepSpeed
    def model_provider():
        """Build the model for GPT-2."""
        args = get_args()
        print_rank_0('building GPT2 model ...')
        if args.pipe_parallel_size == 0:
            model = GPT2Model(num_tokentypes=0, parallel_output=True)
        else:
            model = GPT2ModelPipe(num_tokentypes=0, parallel_output=True, topology=mpu.get_topology())
            model._megatron_batch_fn = get_batch_pipe
        return model
    
    • Code using TDS
    def model_provider():
        """Build the model for GPT-2."""
        args = get_args()
        print_rank_0('building GPT2 model ...')
        if args.pipe_parallel_size == 0:
            model = GPT2Model(num_tokentypes=0, parallel_output=True)
        else:
            model = GPT2ModelPipe(num_tokentypes=0, parallel_output=True, topology=mpu.get_topology())
            model._megatron_batch_fn = get_batch_pipe
            # The first input tensor carries the input embeddings and hidden states and needs its gradients saved; the second is the attention mask, which does not.
            model._input_grad = [True, False]
            # The first input tensor (input embeddings and hidden states) is float; the second (attention mask) is boolean.
            model._input_type = ['float', 'bool']
            # The input embeddings and hidden states can be partitioned across GPUs to save memory; the attention mask cannot.
            model._input_pipe_partitioned = [True, False]
        return model
    
  4. All other operations follow DeepSpeed and DeepSpeedExamples directly.
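
    As an illustration, here is a minimal sketch of the remaining wiring, assuming the Megatron-style helpers from the example above (get_args, model_provider) and a hypothetical train_data_iterator; the calls themselves are the standard DeepSpeed API, reached through the tds import:

    import tds as deepspeed

    args = get_args()
    model = model_provider()

    # initialize() wraps the model in a DeepSpeed (pipeline) engine, just as
    # with vanilla DeepSpeed.
    model_engine, optimizer, _, _ = deepspeed.initialize(
        args=args,
        model=model,
        model_parameters=[p for p in model.parameters() if p.requires_grad])

    for step in range(args.train_iters):
        # For pipeline-parallel training, train_batch() runs forward, backward,
        # and the optimizer step for one batch, fetching data through the
        # _megatron_batch_fn set in model_provider().
        loss = model_engine.train_batch(data_iter=train_data_iterator)  # train_data_iterator: hypothetical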

Examples

More examples, such as using TDS for GPT-2 and T5, can be found in CPM-Pretrain.

Citation

If you use the code, please cite the following paper:

@article{cpm-v1,
  title={CPM: A Large-scale Generative Chinese Pre-trained Language Model},
  author={Zhang, Zhengyan and Han, Xu and Zhou, Hao and Ke, Pei and Gu, Yuxian and Ye, Deming and Qin, Yujia and Su, Yusheng and Ji, Haozhe and Guan, Jian and Qi, Fanchao and Wang, Xiaozhi and Zheng, Yanan and Zeng, Guoyang and Cao, Huanqi and Chen, Shengqi and Li, Daixuan and Sun, Zhenbo and Liu, Zhiyuan and Huang, Minlie and Han, Wentao and Tang, Jie and Li, Juanzi and Sun, Maosong},
  year={2020}
}