
Aceso Artifact Instructions

Overall Workflow

The workflow of Aceso has three steps: profile, search, and train.

In this artifact evaluation, we check the functionality of Aceso by working through the full process (profile + search + train) on a small setup (4 GPUs). We also encourage you to reproduce part of the large-scale experiments from the Aceso paper (the search step only): the search step does not require GPUs, and we provide our profiled database.

Set up the environment

Functionality check with small-scale experiments (4 hours)

Hardware requirement: 4 GPUs.

The models chosen for the small-scale experiments are GPT-3 (1.3B), T5 (770M), and Wide-ResNet (1B). The global batch size is 512 for the GPT-3 and T5 models and 768 for Wide-ResNet.

In this small-scale experiment, we run the search and train steps of Aceso and of the two baselines (Alpa and Megatron-LM), then compare training throughput and search cost. We also check the prediction accuracy of Aceso's performance model.

You can follow the instructions below to perform the experiments step by step, or execute a single script that runs all the steps:

bash scripts/run_all_small.sh

(Optional) Step 1: Profile (40 minutes)

The profile step can be skipped in the artifact because we provide a pre-profiled database (profiler/profiled-time-miniset/) for the small-scale experiment. You can also profile on your own by executing:

cd profiler
bash scripts/profile_small.sh

Step 2: Search (6 minutes)

Run the search for the GPT-3 (1.3B), T5 (770M), and Wide-ResNet (1B) models:

Step 3: Train (6 minutes)

We train each model with the found configurations for 3 iterations to measure the iteration time:

Example output:

-------- gpt End-to-end throughput --------
Size	 Batch Size	 Time(s)	 Thpt(samples/s)
1_3B	 512.0		 58.98		 8.68
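The reported throughput is simply the global batch size divided by the per-iteration time. A quick sanity check in Python, using the numbers from the example output above:

```python
# Throughput (samples/s) = global batch size / iteration time.
batch_size = 512.0    # samples per iteration, from the example output
iter_time_s = 58.98   # measured end-to-end iteration time in seconds

throughput = batch_size / iter_time_s
print(f"{throughput:.2f} samples/s")  # -> 8.68 samples/s, matching the table
```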

Step 4: Compare with Alpa & Megatron-LM (3 hours)

Step 5: Check performance model accuracy

python3 scripts/get_perf_model_acc.py small

Expected results:

-------- [gpt] Time Prediction (s) --------
Size     Actual          Predict
1_3B    59011.61         58760.92
-------- [gpt] Memory Prediction (MB) --------
Size     Actual          Predict (normal + extra)
1_3B    10692.00         11940.64 (10660.64 + 1280.00)
-------- [t5] Time Prediction (s) --------
Size     Actual          Predict
770M    33823.79         33659.50
-------- [t5] Memory Prediction (MB) --------
Size     Actual          Predict (normal + extra)
770M    10470.00         11981.22 (11319.47 + 661.75)
-------- [resnet] Time Prediction (s) --------
Size     Actual          Predict
1B      18662.83         17672.05
-------- [resnet] Memory Prediction (MB) --------
Size     Actual          Predict (normal + extra)
1B      14060.00         11866.20 (10334.83 + 1531.38)
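A convenient way to read these tables is as relative prediction error, |predicted − actual| / actual. The sketch below hard-codes the values from the expected results above and prints the error for each model and metric:

```python
# (model, metric, actual, predicted) rows copied from the expected results above.
rows = [
    ("gpt",    "time (s)",    59011.61, 58760.92),
    ("gpt",    "memory (MB)", 10692.00, 11940.64),
    ("t5",     "time (s)",    33823.79, 33659.50),
    ("t5",     "memory (MB)", 10470.00, 11981.22),
    ("resnet", "time (s)",    18662.83, 17672.05),
    ("resnet", "memory (MB)", 14060.00, 11866.20),
]

for model, metric, actual, predicted in rows:
    # Relative prediction error in percent.
    err = abs(predicted - actual) / actual * 100
    print(f"{model:6s} {metric:11s} error = {err:5.2f}%")
```

With these numbers, the time predictions land within a few percent, while the memory predictions deviate more (the "extra" column accounts for part of the gap).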

Reproducing results with large-scale experiments

Hardware requirement: the full evaluation in the paper requires 4 nodes with 8 V100 (32 GB) GPUs each. Without that hardware you can still check part of the results described in the paper, because the search step does not require GPUs.

(Optional) Step 1: Profile (2 hours)

Hardware requirement: 32 GPUs (4 nodes * 8 GPUs/node).

The profile step can be skipped in the artifact because we provide a pre-profiled database (profiler/profiled-time-eurosys/) for the large-scale experiment. You can also profile on your own; all profiled results are saved into profiler/profiled-time-eurosys-new/ by default.

Step 2: Search (45 minutes)

Hardware requirement: CPU only. The following script runs the search for all the model sizes considered in the paper:

All the found configs are saved into logs/aceso/configs/[model_name]/[model_size]/top_configs/ as .json files. Sec 5.4 of the paper presents two case studies, on the configs of GPT-3 1.3B and Wide-ResNet 6.8B; you can compare the found configs with those case studies.
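The saved configs can also be enumerated programmatically. A minimal sketch (the helper function and the `gpt`/`1_3B` model/size pair are illustrative, not part of Aceso; the JSON keys match the example config shown later in this document):

```python
import glob
import json
import os

def list_top_configs(root, model="gpt", size="1_3B"):
    """Return (path, num_stages, num_gpus) for each saved config JSON."""
    pattern = os.path.join(root, model, size, "top_configs", "*.json")
    results = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            config = json.load(f)
        # Each config records the pipeline layout found by the search.
        results.append((path, config["num_stages"], config["num_gpus"]))
    return results

for path, stages, gpus in list_top_configs("logs/aceso/configs"):
    print(path, "->", stages, "stages, GPUs per stage:", gpus)
```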

Step 3: Train (2 hours)

Hardware requirement: 32 GPUs (4 nodes * 8 GPUs/node).

Step 4: Compare with Alpa & Megatron-LM (29 hours)

Hardware requirement: 32 GPUs (4 nodes * 8 GPUs/node).

Scale to 1K layers (7.5 hours)

Hardware requirement: 8 GPUs (1 node * 8 GPUs/node)

In this experiment, we run the search and train steps of a customized GPT model, scaling the number of layers from 8 to 1024.

Detailed usage of Aceso

This section is a reference on the detailed usage of each Aceso component: profiler, search algorithm, and runtime.

Profiling

Profile op-related information, e.g., forward/backward execution time, input/output size, weight size, and reserved memory size:

## In the `Aceso/profiler` path
python3 op_profiler.py \
    --prof-tp-size 1 \
    --prof-path PATH_TO_RESULT \
    --prof-cache-file PATH_TO_CACHE_FILE \
    --prof-model-name gpt \
    --prof-model-size all \
    --prof-repeat-times 40 10 \
    --prof-repeat-threshold 5000 \
    --prof-warmup-times 10 \
    --prof-warmup-threshold 100000

Arguments:

Profile collective communication time:

## In the `Aceso/profiler` path
python3 comm_profiler.py \
    --prof-path PATH_TO_RESULT \
    --prof-cache-file PATH_TO_CACHE_FILE \
    --prof-op-time-path PATH_TO_OP_PROFILING_RESULT \
    --prof-tp-size 1 \
    --prof-model-name gpt \
    --prof-model-size all \
    --prof-warmup-time 10 \
    --prof-repeat-time 40

Arguments:

All the profiled results will be saved as .csv files under the given --prof-path.

Search

Search for the best configs given model information, hardware information, profiled database, and search-related hyper-parameters:

## In the `Aceso/search` path
python3 aceso_search.py \
    --model-name gpt \
    --model-size 1_3B \
    --global-batch-size 1024 \
    --micro-batch-size 1 2 4 8 \
    --num-nodes 1 \
    --num-gpus-per-node 4 \
    --memory-limit 12000 \
    --log-path PATH_TO_LOGS \
    --profiled-time-path PATH_TO_PROFILING_RESULT \
    --config-save-path PATH_TO_SAVE_CONFIGS \
    --config-suffix UNIQUE_CONFIG_SUFFIX \
    --max-num-hops 7 \
    --time-budget-total 200

All the found configs will be saved into --config-save-path as .json files. Here is one example config:

{
    "model_name": "gpt",
    "model_size": "1_3B",
    "num_layers": 24,
    "seq_length": 2048,
    "max_position_embeddings": 2048,
    "num_attention_heads": 32,
    "hidden_size": 2048,
    "global_batch_size": 32,
    "micro_batch_size": 1,
    "num_stages": 3,
    "num_gpus": [1, 1, 2],
    "checkpoint_activations": [true, true, false],
    "resharding_stages": [false, false, false],
    "num_ops_in_each_stage": [85, 91, 139],
    "model_parallel_size_of_each_op": [
        [1, 1, 1, .... 1], [1, 1, 1, .... 1], [2, 2, 2, .... 2]
    ],
    "data_parallel_size_of_each_op": [
        [1, 1, 1, .... 1], [1, 1, 1, .... 1], [1, 1, 1, .... 1]
    ],
    "recompute_ops": [
        [0, 0, 0, .... 0], [0, 0, 0, .... 0], [0, 0, 0, .... 0]
    ],
    "algo_of_each_op": [
        [0, 0, 0, .... 0], [0, 0, 0, .... 0], [0, 0, 0, .... 0]
    ]
}

Note that algo_of_each_op indicates the tensor-parallelism partition dimension of each op: 0 is the default partition dimension and 1 is the alternative one. Please refer to Sec 4.2 of the paper for more details.
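The per-stage fields of a config are expected to be mutually consistent: each per-stage list has one entry per pipeline stage, and each per-op list has one entry per op in its stage. A small sanity check, written against the example config above (the helper and the toy config are illustrative, not part of Aceso):

```python
def check_config(config):
    """Sanity-check that the per-stage fields of a config agree in shape."""
    num_stages = config["num_stages"]
    # One entry per pipeline stage in every per-stage list.
    for key in ("num_gpus", "checkpoint_activations", "resharding_stages",
                "num_ops_in_each_stage"):
        assert len(config[key]) == num_stages, f"{key} length != num_stages"
    # Per-op lists must match the declared op count of their stage.
    for key in ("model_parallel_size_of_each_op", "data_parallel_size_of_each_op",
                "recompute_ops", "algo_of_each_op"):
        for stage, ops in enumerate(config[key]):
            assert len(ops) == config["num_ops_in_each_stage"][stage], \
                f"{key}[{stage}] length != op count of stage {stage}"

# Toy two-stage config with 2 and 3 ops per stage.
example = {
    "num_stages": 2,
    "num_gpus": [1, 1],
    "checkpoint_activations": [True, False],
    "resharding_stages": [False, False],
    "num_ops_in_each_stage": [2, 3],
    "model_parallel_size_of_each_op": [[1, 1], [2, 2, 2]],
    "data_parallel_size_of_each_op": [[1, 1], [1, 1, 1]],
    "recompute_ops": [[0, 0], [0, 0, 0]],
    "algo_of_each_op": [[0, 0], [0, 0, 0]],
}
check_config(example)
print("config is consistent")
```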

Training

To train a model with the configs found by Aceso, run the following command:

## In the `Aceso/runtime` path
python3 -m torch.distributed.launch $DISTRIBUTED_ARGS \
       pretrain_gpt.py \
       --flexpipe-config CONFIG_FILE \
       --train-iters 5 \
       --eval-iters 0 \
       --lr-decay-iters 320000 \
       --vocab-file vocabs/gpt2-vocab.json \
       --merge-file vocabs/gpt2-merges.txt \
       --data-impl mmap \
       --split 949,50,1 \
       --distributed-backend nccl \
       --lr 0.00015 \
       --lr-decay-style cosine \
       --min-lr 1.0e-5 \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --lr-warmup-fraction .01 \
       --log-interval 1 \
       --DDP-impl local \
       --fp16 \
       --log-path LOG_PATH

Arguments:

For the other arguments, please refer to Megatron-LM's documentation.
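The command above assumes `$DISTRIBUTED_ARGS` is already set to the standard torch.distributed.launch flags. One typical definition for a single node with 4 GPUs (all values are illustrative; adjust them for your cluster):

```shell
## Illustrative launcher settings for one node with 4 GPUs;
## change NNODES/NODE_RANK/MASTER_ADDR for multi-node runs.
GPUS_PER_NODE=4
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6000

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE \
                  --nnodes $NNODES \
                  --node_rank $NODE_RANK \
                  --master_addr $MASTER_ADDR \
                  --master_port $MASTER_PORT"
echo "$DISTRIBUTED_ARGS"
```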

Troubleshooting

Reference

Please cite Aceso in your publications if it helps your research:

@inproceedings{liu2024aceso,
  title={Aceso: Efficient Parallel DNN Training through Iterative Bottleneck Alleviation},
  author={Guodong Liu and Youshan Miao and Zhiqi Lin and Xiaoxiang Shi and Saeed Maleki and Fan Yang and Yungang Bao and Sa Wang},
  booktitle={Proceedings of the Nineteenth European Conference on Computer Systems},
  year={2024}
}

Microsoft Open Source Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct.
