Awesome
Aceso Artifact Instructions
Overall Workflow
The workflow of Aceso is:
- Step 1: profile basic information (e.g., op computation time and communication time).
- Step 2: search for the best configs using the profiled database.
- Step 3: train the model with the found config.
In this artifact evaluation, we will check the functionality of Aceso, by working through the full process (profile + search + train) with a small setup (4GPUs). And we encourage you to perform part of the large-scale experiments (only the search step) in the Aceso paper because the search step does not require GPUs and we will provide our profiled database.
Set up the environment
- Pull the submodules:
git submodule update --init --recursive
- Patch Megatron-LM:
cd external/Megatron-LM cp ../aceso_ae_megatron.patch ./ git apply --whitespace=nowarn aceso_ae_megatron.patch cd ../..
- Install dependencies:
-
Option 1: docker (recommended)
We provide a
Dockerfile
to build the needed docker image, you can build the docker by executing:docker build -t aceso_image:latest .
After image built, launch a container with the docker image on each node in your cluster: (replace
aceso_path
with the path of this repo in your system)docker run -it -d --name=aceso-ae --net=host --ipc=host --gpus=all -v aceso_path:aceso_path aceso-image bash
Without specifically noted, all the following scripts are executed inside docker container.
-
Option 2: manual installation
- Python >=3.7
- CUDA 11.6
- PyTorch 1.12.0:
pip3 install torch==1.12.0+cu116 -f https://download.pytorch.org/whl/torch/
- Dependencies of Aceso runtime & Megatron-LM:
pip3 install -r requirements.txt
- Apex:
cd external/apex pip3 install -r requirements.txt pip3 install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ cd ../..
- Alpa 0.1.5 (Baseline system):
cd external/alpa pip3 install -r requirements.txt pip3 install https://github.com/alpa-projects/alpa/releases/download/v0.1.5/jaxlib-0.3.5%2Bcuda113.cudnn820-cp37-none-manylinux2010_x86_64.whl sudo apt-get update sudo apt install coinor-cbc -y cd ../..
- Parallel-ssh: (for distributed training)
sudo apt-get -y install pssh
-
Functionality check with small-scale experiments (4 hours)
Hardware requirement: 4 GPUs.
The models chosen for the small-scale experiments: GPT-3(1.3B), T5(770M), and Wide-ResNet(1B). The global batch size is set to 512 for GPT-3 and T5 models and 768 for Wide-ResNet.
In this small-scale experiment, we will run the search & train steps of Aceso and the two baselines (Alpa and Megatron-LM), at last we will compare the training throughput and search cost. We will also check the prediction accuracy of Aceso's performance model.
You can follow the instructions to perform the experiments step by step. And you can also execute only one script, which contains all the steps:
bash scripts/run_all_small.sh
(Optional) Step 1: Profile (40 minutes)
The profile step can be skipped in the artifact as we provided a pre-profiled database profiler/profiled-time-miniset/
for the small-scale experiment. But you can also profile on your own by executing:
cd profiler
bash scripts/profile_small.sh
Step 2: Search (6 minutes)
Run the search for GPT-3(1.3B), T5(770M) and Wide-ResNet(1B) models:
- GPT-3 (2 minutes):
bash scripts/aceso_gpt_search.sh small
- Wide-ResNet (1 minute):
bash scripts/aceso_resnet_search.sh small
- T5 (3 minutes):
bash scripts/aceso_t5_search.sh small
Step 3: Train (6 minutes)
We will train each model with the found configurations for 3 iterations to get the iteration time:
- GPT-3 (3 minutes):
bash scripts/aceso_gpt_execute.sh small
- Wide-ResNet (1 minute):
bash scripts/aceso_resnet_execute.sh small
- T5 (2 minutes):
bash scripts/aceso_t5_execute.sh small
Example output:
-------- gpt End-to-end throughput --------
Size Batch Size Time(s) Thpt(samples/s)
1_3B 512.0 58.98 8.68
Step 4: Compare with Alpa & Megatron-LM (3 hours)
- (Optional) Profile database for Alpa (30 minutes)
The profile step can be skipped in the artifact as we provided a pre-profiled database for Alpa (
external/alpa/prof_database.pkl
). You can also profile on your own by executing:cd external/alpa python3 gen_prof_database.py --max-comm-size-intra-node 32 --max-comm-size-inter-node 29
- Run Alpa (Search & Train): (3 hours)
- GPT-3 (1 hour):
bash scripts/alpa_gpt_search_execute.sh small
- Wide-ResNet (2 hours):
bash scripts/alpa_wresnet_search_execute.sh small
- GPT-3 (1 hour):
- Run Megatron-LM (Search & Train): (9 minutes)
- GPT-3 (4 minutes):
bash scripts/megatron_gpt_search.sh small && bash scripts/megatron_gpt_execute.sh small
- Wide-ResNet (2 minutes):
bash scripts/megatron_resnet_search.sh small && bash scripts/megatron_resnet_execute.sh small
- T5 (3 minutes):
bash scripts/megatron_t5_search.sh small && bash scripts/megatron_t5_execute.sh small
- GPT-3 (4 minutes):
- Compare training throughput:
Expected results:python3 scripts/get_e2e_performance.py small
-------- [gpt] End-to-end Throughput (Samples/s) -------- Size Megatron-LM Alpa Aceso 1_3B 6.98 7.46 8.68 -------- [t5] End-to-end Throughput (Samples/s) -------- Size Megatron-LM Alpa Aceso 770M 9.94 - 15.14 -------- [resnet] End-to-end Throughput (Samples/s) -------- Size Megatron-LM Alpa Aceso 1B 34.59 36.00 41.52
- Compare search cost:
Expected results:python3 scripts/get_search_cost.py small
-------- [gpt] Search Cost (s) -------- Size Alpa Aceso 1_3B 4092.34 99.31 -------- [t5] Search Cost (s) -------- Size Alpa Aceso 770M - 200.10 -------- [resnet] Search Cost (s) -------- Size Alpa Aceso 1B 5780.60 45.49
Step 5: Check performance model accuracy
python3 scripts/get_perf_model_acc.py small
Expected results:
-------- [gpt] Time Prediction (s) --------
Size Actual Predict
1_3B 59011.61 58760.92
-------- [gpt] Memory Prediction (MB) --------
Size Actual Predict (normal + extra)
1_3B 10692.00 11940.64 (10660.64 + 1280.00)
-------- [t5] Time Prediction (s) --------
Size Actual Predict
770M 33823.79 33659.50
-------- [t5] Memory Prediction (MB) --------
Size Actual Predict (normal + extra)
770M 10470.00 11981.22 (11319.47 + 661.75)
-------- [resnet] Time Prediction (s) --------
Size Actual Predict
1B 18662.83 17672.05
-------- [resnet] Memory Prediction (MB) --------
Size Actual Predict (normal + extra)
1B 14060.00 11866.20 (10334.83 + 1531.38)
Reproducing results with large-scale experiments
Hardware requirement: The full evaluation conducted in the paper requires 4 nodes with 8 V100(32GB) GPUs in each. But you can still check part of the results described in the paper because the search step does not require GPUs.
(Optional) Step 1: Profile (2 hours)
Hardware requirement: 32 GPUs (4 nodes * 8 GPUs/node).
The profile step can be skipped in the artifact as we provided a pre-profiled database profiler/profiled-time-eurosys/
for the large-scale experiment. But you can also profile on your own: (all the profiled results will be saved into profiler/profiled-time-eurosys-new/
by default)
- Prepare for distributed profiling:
- Modify the
profiler/profile_large_[gpt,t5,resnet].sh
file, editNUM_NODES
andNODE_RANK
for each node. - Modify the
profiler/profile_large_dist_p2p.sh
file, editMASTER_ADDR
andNODE_RANK
for each node.
- Modify the
- Parallel profiling with multiple nodes: (execute this command in each node)
cd profiler bash scripts/profile_large.sh
- Gather the profiled results into one directory
profiler/profiled-time-eurosys-new/
and share the directory among all the nodes.
Step 2: Search (45 minutes)
Hardware requirement: CPU only. The following script will run the search of all the model sizes considered in the paper:
- GPT-3 (15 minutes):
bash scripts/aceso_gpt_search.sh large
- Wide-ResNet (15 minute):
bash scripts/aceso_resnet_search.sh large
- T5 (15 minutes):
bash scripts/aceso_t5_search.sh large
All the found configs will be saved into logs/aceso/configs/[model_name]/[model_size]/top_configs/
as .json
files. We have shown two case studies in our paper, about the config of GPT-3 1.3B and Wide-ResNet 6.8B, in Sec 5.4. You can check the found configs and compare them with the ones in case studies.
Step 3: Train (2hours)
Hardware requirement: 32 GPUs (4 nodes * 8 GPUs/node).
-
Prepare for distributed training:
- Modify the training scripts: On each node, edit the file
runtime/scripts/run_[gpt,t5,resnet]_[2nodes,4nodes].sh
(if using docker, editruntime/scripts/run_[gpt,t5,resnet]_[2nodes,4nodes]_docker.sh
): modifyNODE_RANK
,MASTER_ADDR
,MASTER_PORT
with the information of your cluster. - Create parallel-ssh host file: In the
runtime
path, create a file namedpssh-2nodes.host
, which contains the two nodes you want to use in 2-node distributed training, e.g.,:
Similarly, create a10.0.0.1 10.0.0.2
pssh-4nodes.host
file. - If using docker, please make sure the container is named as
aceso-ae
in all the nodes.
- Modify the training scripts: On each node, edit the file
-
Train: (2 hours) Train all the configs found in the search step:
- If you are NOT using docker:
- GPT-3 (35 minutes):
bash scripts/aceso_gpt_execute.sh large
- Wide-ResNet (25 minutes):
bash scripts/aceso_resnet_execute.sh large
- T5 (1 hour):
bash scripts/aceso_t5_execute.sh large
- GPT-3 (35 minutes):
- If you are using docker: (execute outside of docker container)
- GPT-3 (35 minutes):
bash scripts/aceso_gpt_execute_docker.sh large
- Wide-ResNet (25 minutes):
bash scripts/aceso_resnet_execute_docker.sh large
- T5 (1 hour):
bash scripts/aceso_t5_execute_docker.sh large
- GPT-3 (35 minutes):
- If you are NOT using docker:
Step 4: Compare with Alpa & Megatron-LM (29 hours)
Hardware requirement: 32 GPUs (4 nodes * 8 GPUs/node).
- Alpa (27 hours):
- Prepare for distributed training: Launch a Ray cluster.
- Run Alpa (Search & Train): (27 hours)
- GPT-3 (9 hours):
bash scripts/alpa_gpt_search_execute.sh large
- Wide-ResNet (18 hours):
bash scripts/alpa_wresnet_search_execute.sh large
- GPT-3 (9 hours):
- Megatron-LM (2 hours):
- Prepare for distributed training: Similar to Aceso, modify
NODE_RANK
,MASTER_ADDR
,MASTER_PORT
in the filescripts/megatron_dist_scripts/run_[gpt,t5,resnet]_[2nodes,4nodes].sh
(if using docker, editscripts/megatron_dist_scripts/run_[gpt,t5,resnet]_[2nodes,4nodes]_docker.sh
instead). - Search:
- GPT-3:
bash scripts/megatron_gpt_search.sh large
- Wide-ResNet:
bash scripts/megatron_resnet_search.sh large
- T5:
bash scripts/megatron_t5_search.sh large
- GPT-3:
- Train:
- If you are NOT using docker:
- GPT-3 (30 minutes):
bash scripts/megatron_gpt_execute.sh large
- Wide-ResNet (20 minutes):
bash scripts/megatron_resnet_execute.sh large
- T5 (1 hour):
bash scripts/megatron_t5_execute.sh large
- GPT-3 (30 minutes):
- If you are using docker:
- GPT-3 (30 minutes):
bash scripts/megatron_gpt_execute_docker.sh large
- Wide-ResNet (20 minutes):
bash scripts/megatron_resnet_execute_docker.sh large
- T5 (1 hour):
bash scripts/megatron_t5_execute_docker.sh large
- GPT-3 (30 minutes):
- If you are NOT using docker:
- Prepare for distributed training: Similar to Aceso, modify
- Compare training throughput: (Figure 7)
python3 scripts/get_e2e_performance.py large bash scripts/plot_fig7.sh
- Compare search cost: (Figure 8)
python3 scripts/get_search_cost.py large bash scripts/plot_fig8.sh
Scale to 1K layers (7.5 hours)
Hardware requirement: 8 GPUs (1 node * 8 GPUs/node)
In this experiment, we will run the search and train step of a customized GPT model, scaling the number of layers from 8 to 1024.
- Aceso:(1.5 hours)
bash scripts/aceso_gpt_search.sh scale bash scripts/aceso_gpt_execute.sh scale
- Alpa: (5 hours)
ray start --head bash scripts/alpa_gpt_search_execute.sh scale ray stop
- Megatron-LM: (1 hour)
bash scripts/megatron_gpt_search.sh scale bash scripts/megatron_gpt_execute.sh scale
- Compare training throughput and search cost: (Figure 9)
python3 scripts/get_e2e_performance.py scale python3 scripts/get_search_cost.py scale bash scripts/plot_fig9.sh
Detailed usage of Aceso
This section is a reference on the detailed usage of each Aceso component: profiler, search algorithm, and runtime.
Profiling
Profile op-related information: (e.g., forward/backward execution time, input/output size, weight size, and reserved memory size)
## In the `Aceso/profiler` path
python3 op_profiler.py \
--prof-tp-size 1 \
--prof-path PATH_TO_RESULT \
--prof-cache-file PATH_TO_CACHE_FILE \
--prof-model-name gpt \
--prof-model-size all \
--prof-repeat-times 40 10 \
--prof-repeat-threshold 5000 \
--prof-warmup-times 10 \
--prof-warmup-threshold 100000
Arguments:
--prof-tp-size
: tensor-parallelism size--prof-path
: path to store the results--prof-cache-file
: path to the cache file.- If the path exists, use the cached results to speed up the profiling, otherwise cache the profiling results in the path.
--prof-model-name
: one of [gpt
,t5
,resnet
]--prof-model-size
:- for
gpt
: one of ["350M", "1_3B", "2_6B", "6_7B", "13B"] - for
t5
: one of ["770M", "3B", "6B", "11B"] - for
resnet
: one of ["500M", "2B", "4B", "6_8B", "13B"] - You can also specify
all
to profile all the model sizes in the list.
- for
--prof-repeat-times
: number of repeat times for each operator.- This argument requires two numbers in the form "num1 num2", which represents the repeat times for smaller operators and larger operators.
--prof-repeat-threshold
: the boundary execution time (us) of smaller and larger operators.--prof-warmup-times
: number of warmup times for small operators.--prof-warmup-threshold
: the boundary execution time (us) of smaller and larger operators.
Profile collective communication time:
## In the `Aceso/profiler` path
python3 comm_profiler.py \
--prof-path PATH_TO_RESULT \
--prof-cache-file PATH_TO_CACHE_FILE \
--prof-op-time-path PATH_TO_OP_PROFILING_RESULT \
--prof-tp-size 1 \
--prof-model-name gpt \
--prof-model-size all \
--prof-warmup-time 10 \
--prof-repeat-time 40
Arguments:
--prof-path
: path to store the results.--prof-cache-file
: path to the cache file.- If the path exists, use the cached results to speed up the profiling, otherwise cache the profiling results in the path.
--prof-op-time-path
: path to op-profiling results, which contains all the data sizes that need to be profiled.--prof-tp-size
: collective communication world size.--prof-model-name
and--prof-model-size
are the same as op-profiling command.--prof-warmup-time
: warmup times for each collective.--prof-repeat-time
: repeat times for each collective.
All the profiled results will be saved as .csv
files under the given --prof-path
.
Search
Search for the best configs given model information, hardware information, profiled database, and search-related hyper-parameters:
## In the `Aceso/search` path
python3 aceso_search.py \
--model-name gpt \
--model-size 1_3B \
--global-batch-size 1024 \
--micro-batch-size 1 2 4 8 \
--num-nodes 1 \
--num-gpus-per-node 4 \
--memory-limit 12000 \
--log-path PATH_TO_LOGS \
--profiled-time-path PATH_TO_PROFILING_RESULT \
--config-save-path PATH_TO_SAVE_CONFIGS \
--config-suffix UNIQUE_CONFIG_SUFFIX \
--max-num-hops 7 \
--time-budget-total 200
- Model information:
--model-name
and--model-size
: same as the arguments above for profiling.--global-batch-size
: global batch size.--micro-batch-size
: a list of considered micro batch size.
- Hardware information:
--num-nodes
: number of nodes.--num-gpus-per-node
: number of GPUs in each node.--memory-limit
: memory limit in each GPU.
- Profiled database:
--profiled-time-path
: path of the profiled database.
- Search-related hyper-parameters:
--max-num-hops
: maximum number of search hops (depth).--time-budget-total
: budget of search.
- Other arguments:
--log-path
: path of logs.--config-save-path
: path to save the found configs.--config-suffix
: a unique id for the configs.
All the found configs will be saved into --config-save-path
as .json
files. Here is one example config:
{
"model_name": "gpt",
"model_size": "1_3B",
"num_layers": 24,
"seq_length": 2048,
"max_position_embeddings": 2048,
"num_attention_heads": 32,
"hidden_size": 2048,
"global_batch_size": 32,
"micro_batch_size": 1,
"num_stages": 3,
"num_gpus": [1, 1, 2],
"checkpoint_activations": [true, true, false],
"resharding_stages": [false, false, false],
"num_ops_in_each_stage": [85, 91, 139],
"model_parallel_size_of_each_op": [
[1, 1, 1, .... 1], [1, 1, 1, .... 1], [2, 2, 2, .... 2]
],
"data_parallel_size_of_each_op": [
[1, 1, 1, .... 1], [1, 1, 1, .... 1], [1, 1, 1, .... 1]
],
"recompute_ops": [
[0, 0, 0, .... 0], [0, 0, 0, .... 0], [0, 0, 0, .... 0]
],
"algo_of_each_op": [
[0, 0, 0, .... 0], [0, 0, 0, .... 0], [0, 0, 0, .... 0]
]
}
Note that, algo_of_each_op
indicates the tensor parallelism dimension. 0
is the default partition dimension, while 1
stands for alternative one, please refer to Sec 4.2 for more details.
Training
To train the model with the found configs by Aceso, run the following command:
## In the `Aceso/runtime` path
python3 -m torch.distributed.launch $DISTRIBUTED_ARGS \
pretrain_gpt.py \
--flexpipe-config CONFIG_FILE \
--train-iters 5 \
--eval-iters 0 \
--lr-decay-iters 320000 \
--vocab-file vocabs/gpt2-vocab.json \
--merge-file vocabs/gpt2-merges.txt \
--data-impl mmap \
--split 949,50,1 \
--distributed-backend nccl \
--lr 0.00015 \
--lr-decay-style cosine \
--min-lr 1.0e-5 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--lr-warmup-fraction .01 \
--log-interval 1 \
--DDP-impl local \
--fp16 \
--log-path LOG_PATH
Arguments:
--flexpipe-config
: a config file generated by the search process.
For the other arguments, please refer to Megatron-LM's documentation.
Troubleshooting
-
Hang at distributed training (loading fused kernel).
In Megatron-LM (and Aceso), some kernels are fused and the fused kernels need to be compiled at the first time. So it is OK if you are stuck at fused kernel loading when you run the distributed training for the first time. (The compilation takes ~ 10 minutes).
However, the kernel loading may get stuck if you directly copy the compiled fused kernels to other nodes. To solve this issue, please delete the fused kernel build folder (if it exists):
rm -rf runtime/megatron/fused_kernels/build
. Then the kernels will be recompiled automatically before the next time running. -
NCCL error when training distributedly.
When using docker, you may need to manually set the network interface for NCCL, using the environment
NCCL_SOCKET_IFNAME
. -
Check GPU topology when performance is not as expected.
Aceso always assigns adjacent GPUs to perform tensor-parallelism with the assumption that adjacent GPUs have higher bandwidth. For example, if we have 4 GPUs and want to perform 2-way tensor-parallelism and 2-way data-parallelism, Aceso will let [GPU0, GPU1] in a tensor parallelism group, and [GPU2, GPU3] in another tensor-parallelism group.
However, in some machines with complex GPU topology, the adjacent GPUs may not be connected via the highest bandwidth. You can manually change the GPU order using the environment
CUDA_VISIBLE_DEVICES
.
Reference
Please cite Aceso in your publications if it helps your research:
@inproceedings{liu2024aceso,
title={Aceso: Efficient Parallel DNN Training through Iterative Bottleneck Alleviation},
author={Guodong Liu and Youshan Miao and Zhiqi Lin and Xiaoxiang Shi and Saeed Maleki and Fan Yang and Yungang Bao and Sa Wang},
booktitle={Proceedings of the Nineteenth European Conference on Computer Systems},
year={2024}
}
Microsoft Open Source Code of Conduct
This project has adopted the Microsoft Open Source Code of Conduct.
Resources:
- Microsoft Open Source Code of Conduct
- Microsoft Code of Conduct FAQ
- Contact opencode@microsoft.com with questions or concerns