This repository is a fork of Megatron-LM. The original README can be found here.

Zero Bubble Pipeline Parallelism & Pipeline Parallelism with Controllable Memory

Zero Bubble Pipeline Parallelism is a novel pipeline parallelism algorithm that reduces the pipeline bubble to almost zero while preserving synchronous training semantics.

Pipeline Parallelism with Controllable Memory is a novel method for building pipeline parallelism schedules with controllable activation memory. Using this method we can significantly reduce the activation memory consumption of pipeline parallelism while maintaining the same or even higher throughput.

Check out our papers at:

A playground for zero bubble schedulers:

Quick settings to enable Zero Bubble:

  --zero-bubble-v-schedule
  --allow-padding-num-layers
  --enable-optimizer-post-validation

You can also try it out with `ZERO_BUBBLE_V_SCHEDULE=1 examples/pretrain_zero_bubble.sh`.

Or add another flag to control the memory consumption of the V schedules:

  --zero-bubble-v-schedule-mem-setup half

Light-weight alternative: enabling the ZB-H1 schedule in your own Megatron fork

# Installed via: pip install zbpp_light
import zbpp_light

# Patch Megatron with the Zero Bubble (ZB-H1) schedule before using it.
zbpp_light.patch_megatron()

import megatron
...

Pushing the Pareto Frontier of Throughput and Memory Forward

Our family of schedules pushes forward the Pareto frontier of throughput and memory.

image

Schedules

The key to achieving zero bubble is to break each backward pass into a $B$ pass (computing the gradient with respect to the input) and a $W$ pass (computing the gradient with respect to the weights). The $B$ pass on one stage depends only on the $B$ pass of the next stage, whereas in 1F1B it depends on both $B$ and $W$.
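As a rough illustration, here is a minimal PyTorch sketch (not the repository's implementation) of splitting one layer's backward pass into a $B$ pass and a $W$ pass; the layer, shapes, and `grad_output` below are illustrative assumptions.

```python
import torch

layer = torch.nn.Linear(1024, 1024)
x = torch.randn(8, 1024, requires_grad=True)
y = layer(x)
grad_output = torch.randn_like(y)  # in a pipeline, this arrives from the next stage

# B pass: only the gradient w.r.t. the input, which the previous stage needs
# in order to start its own backward as early as possible.
grad_input, = torch.autograd.grad(y, x, grad_output, retain_graph=True)

# W pass: the weight/bias gradients, which no other stage depends on and can
# therefore be deferred to fill what would otherwise be pipeline bubbles.
grad_weight, grad_bias = torch.autograd.grad(y, (layer.weight, layer.bias), grad_output)
```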

image

By controlling the lifespan of each building block, we can control and lower the activation memory.

image

Comparison of Schedules

image

|  | 1F1B | ZB1P | ZB2P | ZBV | V-Half | V-Min |
| --- | --- | --- | --- | --- | --- | --- |
| Bubble Rate | $(p-1)/(m+p-1)=B$ | $B/3$ | $0$ | $0$ | $B/2$ | $2B/3 + O(n)$ overhead |
| Activation Memory <br> (Compared to 1F1B) | 1x | 1x | 2x | 1x | 1/2x | 1/3x |
| Pipeline Communication Volume <br> (Compared to 1F1B) | 1x | 1x | 1x | 2x | 2x | 2x |

\* p: number of pipeline stages; m: number of microbatches<br>
\* Assuming $T_F = T_B = T_W$<br>
\* Communication volume of DP and TP stays the same
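For a concrete sense of scale (an illustrative calculation, not a reported measurement): with $p=8$ pipeline stages and $m=32$ microbatches, 1F1B has a bubble rate of $B=(8-1)/(32+8-1)=7/39\approx 18\%$, ZB1P reduces it to $B/3\approx 6\%$, and ZB2P/ZBV eliminate it entirely (ZB2P at the cost of 2x activation memory).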

Zero Bubble Command Line Arguments

Notices

Optimizer Post Validation

In most PP implementations there is an all-reduce across all pipeline stages for numerical robustness, e.g. the global gradient norm for gradient clipping and the INF/NAN check for mixed-precision training. This all-reduce breaks the parallelogram shape of the schedule and makes zero bubble impossible. Based on the observation that during stable training both gradient clipping and INF/NAN checks rarely trigger, we replace the beforehand synchronizations with a post-update validation.

image

We eagerly step the optimizers, assuming the gradient clipping and INF/NAN conditions are not triggered. If an amendment to the gradients turns out to be required, a rollback is issued and the optimizer step is redone based on the fully reduced global state.
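A minimal sketch of this idea (not the repository's code, and assuming a hypothetical `norm_handle` whose `wait()` returns the fully reduced global gradient norm from the cross-stage all-reduce):

```python
import copy
import torch

def post_validated_step(model, optimizer, clip_norm, norm_handle):
    # Snapshot parameters and optimizer state so the eager step can be undone.
    param_backup = [p.detach().clone() for p in model.parameters()]
    opt_backup = copy.deepcopy(optimizer.state_dict())

    # Eager path: step immediately, assuming no clipping and no INF/NAN skip.
    optimizer.step()

    # Only now wait for the all-reduced global gradient norm.
    global_norm = torch.as_tensor(norm_handle.wait())

    if (not torch.isfinite(global_norm)) or global_norm > clip_norm:
        # Post validation failed: roll back the eager step.
        optimizer.load_state_dict(opt_backup)
        with torch.no_grad():
            for p, b in zip(model.parameters(), param_backup):
                p.copy_(b)
        if torch.isfinite(global_norm):
            # Clipping was needed: scale the gradients and redo the step.
            with torch.no_grad():
                for p in model.parameters():
                    if p.grad is not None:
                        p.grad.mul_(clip_norm / (global_norm + 1e-6))
            optimizer.step()
        # On INF/NAN the step stays skipped.
```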

To enable this feature, add `--enable-optimizer-post-validation`. Experiments show that NOT enabling it causes a ~8% performance loss.