<!--- Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. -->
<!--- SPDX-License-Identifier: Apache-2.0 -->

# Optimizing Multi-task Training through Dynamic Pipelines
Official repository for the paper DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines (camera-ready link pending).
During multi-task training, the model commonly receives input sequences of widely varying lengths due to the diverse contexts of different tasks. Padding (all sequences to the same length) or packing (short examples into long sequences of the same length) is usually adopted to prepare input samples for model training, but neither is space- or computation-efficient. This project instead adopts a dynamic micro-batching approach to tackle sequence length variation: each input global batch is split into multiple variable-length micro-batches, each of which comprises a (potentially different) number of samples with similar sequence lengths. These micro-batches are then organized into pipelines, enabling efficient 3D-parallel (data, tensor, and pipeline) multi-task model training.
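As a toy illustration of the idea only (not the algorithm used in this project, which computes an optimal split with dynamic programming and cost models), the sketch below groups length-sorted samples into variable-length micro-batches under a padded-token budget:

```python
from typing import List

def toy_microbatch_split(sample_lengths: List[int], token_budget: int) -> List[List[int]]:
    """Toy greedy split: sort samples by length, then group consecutive samples
    so that each micro-batch's padded size (num_samples * max_length) stays
    within token_budget. Illustration only; DynaPipe uses dynamic programming
    and cost models rather than this heuristic."""
    order = sorted(range(len(sample_lengths)), key=lambda i: sample_lengths[i])
    microbatches, current = [], []
    for idx in order:
        candidate = current + [idx]
        padded = len(candidate) * max(sample_lengths[i] for i in candidate)
        if current and padded > token_budget:
            microbatches.append(current)
            current = [idx]
        else:
            current = candidate
    if current:
        microbatches.append(current)
    return microbatches

# Short and long sequences end up in separate micro-batches, so short samples
# are not padded to the longest sequence in the global batch.
print(toy_microbatch_split([32, 40, 48, 512, 600, 1024], token_budget=2048))
# -> [[0, 1, 2, 3], [4, 5]]
```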
Main features of this project include:
- An efficient dynamic programming algorithm to compute the optimal micro-batching plan for each input global batch.
- A pipeline schedule robust to variable-sized micro-batches, minimizing pipeline bubbles.
- A pipeline executor supporting highly dynamic pipelines (the pipeline schedule, the size and number of micro-batches can vary each iteration), based on an instruction-based abstraction of pipeline operations.
- Overlapped execution plan generation with model training.
## System Diagram

## Getting Started

### Dependencies

#### Redis
The distributed instruction store uses Redis as the underlying key-value store. A Redis server needs to be installed on every machine participating in training; our code will set up and initialize the Redis server automatically.
#### Python Dependencies

Please see requirements.txt for the required Python packages. Install them by running

```
pip3 install -r requirements.txt
```
### Installation

Clone this repository and run

```
pip3 install -e .
```

Then, build the C++ extensions by running

```
cd dynapipe/data_opt
make
cd ../memory_opt
python3 setup.py build
```
## Pipeline Instructions

To use this project, the Pipeline Instructions (defined here) need to be implemented in the intended training framework (e.g., Megatron-LM). A reference implementation of the instructions in Megatron-LM can be found here.
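For illustration only, a handler for a single instruction type might be structured as in the sketch below. The actual instruction classes, their fields, and the expected handler signatures are those defined in the linked source; every name used here is a hypothetical placeholder.

```python
# Hypothetical sketch of one instruction handler inside a training framework.
# The real instruction types and their fields are defined in dynapipe/pipe;
# the names used here (instruction.microbatch, buffer_store, ...) are
# placeholders for illustration only.
def handle_forward_pass(instruction, model, microbatches, buffer_store):
    # Look up the micro-batch this instruction operates on.
    batch = microbatches[instruction.microbatch]
    # Run the framework's forward computation (e.g., Megatron-LM's forward step).
    output = model(batch)
    # Keep the activations around for the matching backward-pass instruction.
    buffer_store[instruction.microbatch] = output
    return output
```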
## Using this project

Please note that this project is experimental and has only been tested when integrated with Megatron-LM (please refer to the linked repository for detailed usage).
This project interacts with the training framework mainly through the following two interfaces:
### Data Loader
We wrap the micro-batch splitting and execution plan generation process into a `DynaPipeDataLoader`. It takes the normal PyTorch data loader arguments plus a few additional ones; please see here for the full list of arguments. The returned iterator generates a tuple of micro-batched data and the corresponding execution plan for each iteration. This iterator is to be used by the pipeline executor. See here for an example of using the `DynaPipeDataLoader` in Megatron-LM.
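A rough usage sketch follows; apart from the `DynaPipeDataLoader` name and the standard PyTorch data loader arguments, everything below (import path, omitted extra arguments) is an assumption, so consult the linked argument list for the real interface.

```python
import torch
from torch.utils.data import Dataset

# Exact import path is an assumption; only the class name DynaPipeDataLoader
# comes from the description above.
from dynapipe.pipe.data_loader import DynaPipeDataLoader

class ToyDataset(Dataset):
    """Variable-length token sequences, as in multi-task training."""
    def __init__(self, lengths):
        self.samples = [torch.randint(0, 1000, (n,)) for n in lengths]
    def __len__(self):
        return len(self.samples)
    def __getitem__(self, idx):
        return self.samples[idx]

# Standard PyTorch data loader arguments, as described above; the
# DynaPipe-specific arguments (omitted here) are documented in the linked list.
loader = DynaPipeDataLoader(ToyDataset([32, 64, 512, 1024]), batch_size=4)

for microbatched_data, execution_plan in loader:
    # Each iteration yields the micro-batched data together with the
    # execution plan consumed by the pipeline executor.
    pass
```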
### Pipeline Executor
The pipeline executor simply reads in execution plans and calls the Pipeline Instruction implementations. These implementations are registered with the executor through the `register_handler` function. To run the pipeline executor, call the `execute` function with the corresponding execution plan in each iteration. See here for an example of using the pipeline executor in Megatron-LM.
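Putting the pieces together, the executor wiring might look roughly like the sketch below; `register_handler` and `execute` are the interfaces described above, but the executor object, the instruction-type key, and the exact signatures are assumptions.

```python
# Hypothetical wiring; register_handler and execute are the interfaces named
# in the text above, but the executor object, the instruction-type key, and
# the exact signatures used here are assumptions for illustration.
from dynapipe.pipe.executor import PipelineExecutor  # class name/path are assumptions

executor = PipelineExecutor()

# Register one handler per instruction type, e.g., the forward-pass handler
# sketched in the Pipeline Instructions section (plus backward, send, recv, ...).
executor.register_handler("ForwardPass", handle_forward_pass)

# Each training iteration: pass the execution plan produced by the
# DynaPipeDataLoader to the executor, which replays its instructions
# through the registered handlers.
for microbatched_data, execution_plan in loader:
    executor.execute(execution_plan)
```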
### Environment Variables

In addition to the two interfaces above, this project can also be configured through the following environment variables:
- `DYNAPIPE_KV_HOST`: The host IP of the Redis KV store server. Defaults to `localhost` (required for multi-node training).
- `DYNAPIPE_KV_PORT`: The port of the Redis KV store server. Defaults to 29500.
- `DYNAPIPE_DEBUG`: Logging level. Defaults to `INFO`. Set to `DEBUG` for more detailed logging.
- `DYNAPIPE_LOGGING_DEBUG_DIR`: The directory where all generated logs are stored.
- `DYNAPIPE_DEBUG_DUMP_EP_STATS`: If set to true, dump the generated execution plans, seen sequence lengths, shapes of the generated micro-batches, estimated memory, and simulated traces for each iteration during training. Used for debugging and for collecting statistics during our experiments.
- `DYNAPIPE_DEBUG_DUMP_EP_PREFIX`: The directory for dumping the above artifacts.
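For example, a training launch script could set these variables programmatically before DynaPipe is initialized (a minimal sketch; the host, port, and directory values are placeholders):

```python
import os

# Placeholder values; set these on every node before dynapipe is initialized
# (e.g., at the top of the training launch script).
os.environ["DYNAPIPE_KV_HOST"] = "10.0.0.1"   # Redis KV store host (required for multi-node training)
os.environ["DYNAPIPE_KV_PORT"] = "29500"      # Redis KV store port
os.environ["DYNAPIPE_DEBUG"] = "DEBUG"        # more detailed logging
os.environ["DYNAPIPE_LOGGING_DEBUG_DIR"] = "/tmp/dynapipe_logs"
```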
## Code Structure

```
├── dynapipe            : main source folder
│   ├── data_opt        : code for micro-batch splitting and cost models
│   ├── memory_opt      : the modified CUDA caching memory allocator from PyTorch
│   ├── pipe            : pipeline instructions, executor, and the distributed instruction store
│   ├── schedule_opt    : code for computing pipeline schedules
│   └── utils           : other utilities such as the logger
├── scripts             : utility scripts for various purposes
└── tests               : unit tests of the different modules
```
## Security
See CONTRIBUTING for more information.
## License
This project is licensed under the Apache-2.0 License.