Artifact Evaluation
Artifact Ver 0.2 - Functional Version
Dataset and AE hosts
You can download the dataset via this link: https://drive.google.com/file/d/1QztbVol4kaQjIL3lpuGvLkBRTB_6Fd1-/view?usp=sharing
You can unzip it to naspipe/[experiment name]/translation/data.
The final data directory should look like: naspipe/[experiment name]/translation/data/wmt14_en_de_joined_dict/...
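As a convenience, here is a hedged sketch of one way to fetch and extract the archive from the command line. gdown is our suggestion and not part of the artifact; the archive name is a placeholder, and you may need to adjust the unzip target so the final layout matches the path above:
pip install gdown
gdown "https://drive.google.com/uc?id=1QztbVol4kaQjIL3lpuGvLkBRTB_6Fd1-"
unzip [downloaded archive].zip -d naspipe/[experiment name]/translation/data/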
If reviewers need a bare-metal host to evaluate our artifact, please leave a message on HotCRP.
Quick Hints
If you want to exit an experiment quickly, please start another shell and use:
sudo pkill -9 python
Installation
Pull and run PyTorch official image:
docker pull pytorch/pytorch:1.9.0-cuda10.2-cudnn7-devel
cd naspipe/
nvidia-docker run -it -v $PWD:/workspace --net=host --ipc=host pytorch/pytorch:1.9.0-cuda10.2-cudnn7-devel
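If the nvidia-docker wrapper is not installed, Docker 19.03+ with the NVIDIA Container Toolkit typically accepts the equivalent --gpus flag; this is an alternative we suggest, not the artifact's tested path:
docker run --gpus all -it -v $PWD:/workspace --net=host --ipc=host pytorch/pytorch:1.9.0-cuda10.2-cudnn7-devel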
Inside docker:
apt-get update
apt-get install -y git
cd /
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
Fix fairseq compatibility issues:
git am < /workspace/0001-compatible.patch
cd /workspace
Enable deterministic execution:
export CUBLAS_WORKSPACE_CONFIG=:4096:8
Kick-off Functional Test
Please make sure the data is copied to naspipe/[experiment name]/translation/data/wmt14_en_de_joined_dict/...
cd /workspace/reproducible/translation
Run sequential execution (no parallelism):
python -m launch --nnodes 1 --node_rank 0 --nproc_per_node 4 main_with_runtime_single.py --data_dir data/wmt14_en_de_joined_dict --master_addr localhost --module gpus=4 --checkpoint_dir output --distributed_backend gloo -b 3840 --lr 0.000060 --lr_policy polynomial --weight-decay 0.000000 --epochs 10 --print-freq 10 --verbose 0 --num_ranks_in_server 4 --config_path gpus=4/mp_conf.json
If you see logs like the following, the artifact is running correctly.
Stage: [3] Epoch: [0][1/49372] Time: 0.784 (0.784) Id: 0 Tokens: 761 Output: 8336.44824218750000000000000000000000
Stage: [3] Epoch: [0][2/49372] Time: 1.943 (1.364) Id: 1 Tokens: 3293 Output: 36547.63671875000000000000000000000000
Stage: [3] Epoch: [0][3/49372] Time: 1.803 (1.510) Id: 2 Tokens: 3136 Output: 34344.91406250000000000000000000000000
Stage: [3] Epoch: [0][4/49372] Time: 1.467 (1.499) Id: 3 Tokens: 1717 Output: 18729.03515625000000000000000000000000
Stage: [3] Epoch: [0][5/49372] Time: 2.070 (1.613) Id: 4 Tokens: 3520 Output: 38475.35156250000000000000000000000000
Functional Experiment
Please make sure the data is copied to naspipe/[experiment name]/translation/data/wmt14_en_de_joined_dict/...
Experiment 1:
Run sequential execution (no parallelism, equivalent to single-GPU training):
python -m launch --nnodes 1 --node_rank 0 --nproc_per_node 4 main_with_runtime_single.py --data_dir data/wmt14_en_de_joined_dict --master_addr localhost --module gpus=4 --checkpoint_dir output --distributed_backend gloo -b 3840 --lr 0.000060 --lr_policy polynomial --weight-decay 0.000000 --epochs 10 --print-freq 10 --verbose 0 --num_ranks_in_server 4 --config_path gpus=4/mp_conf.json
Outputs (forward losses) for steps 96-100:
Stage: [3] Epoch: [0][96/49372] Time: 1.828 (1.748) Id: 95 Tokens: 2976 Output: 32363.60546875000000000000000000000000
Stage: [3] Epoch: [0][97/49372] Time: 1.689 (1.747) Id: 96 Tokens: 2880 Output: 31520.35351562500000000000000000000000
Stage: [3] Epoch: [0][98/49372] Time: 1.893 (1.749) Id: 97 Tokens: 3552 Output: 39054.67968750000000000000000000000000
Stage: [3] Epoch: [0][99/49372] Time: 1.921 (1.751) Id: 98 Tokens: 3456 Output: 37461.26562500000000000000000000000000
Stage: [3] Epoch: [0][100/49372] Time: 1.926 (1.752) Id: 99 Tokens: 3520 Output: 39656.17968750000000000000000000000000
Outputs (forward losses) for steps 196-200:
Stage: [3] Epoch: [0][196/49372] Time: 1.633 (1.752) Id: 195 Tokens: 2208 Output: 26274.00390625000000000000000000000000
Stage: [3] Epoch: [0][197/49372] Time: 1.901 (1.753) Id: 196 Tokens: 3200 Output: 30433.37109375000000000000000000000000
Stage: [3] Epoch: [0][198/49372] Time: 1.972 (1.754) Id: 197 Tokens: 3328 Output: 40601.20703125000000000000000000000000
Stage: [3] Epoch: [0][199/49372] Time: 1.838 (1.755) Id: 198 Tokens: 2912 Output: 33449.57421875000000000000000000000000
Stage: [3] Epoch: [0][200/49372] Time: 1.781 (1.755) Id: 199 Tokens: 2912 Output: 32267.23437500000000000000000000000000
Run parallel execution (under search space c0):
python -m launch --nnodes 1 --node_rank 0 --nproc_per_node 4 main_with_runtime.py --data_dir data/wmt14_en_de_joined_dict --master_addr localhost --module gpus=4 --checkpoint_dir output --distributed_backend gloo -b 3840 --lr 0.000060 --lr_policy polynomial --weight-decay 0.000000 --epochs 10 --print-freq 10 --verbose 0 --num_ranks_in_server 4 --config_path gpus=4/mp_conf.json
Outputs (forward losses) for steps 96-100:
(Note: the parallel execution may re-order samples, so match each Output with its Id.)
Stage: [3] Epoch: [0][96/49372] Time: 1.179 (0.924) Id: 96 Tokens: 2880 Output: 31520.35351562500000000000000000000000
Stage: [3] Epoch: [0][97/49372] Time: 0.842 (0.924) Id: 95 Tokens: 2976 Output: 32363.60546875000000000000000000000000
Stage: [3] Epoch: [0][98/49372] Time: 1.023 (0.925) Id: 97 Tokens: 3552 Output: 39054.67968750000000000000000000000000
Stage: [3] Epoch: [0][99/49372] Time: 0.699 (0.922) Id: 101 Tokens: 2400 Output: 29934.16992187500000000000000000000000
Stage: [3] Epoch: [0][100/49372] Time: 1.329 (0.926) Id: 98 Tokens: 3456 Output: 37461.26562500000000000000000000000000
Outputs (forward losses) for steps 196-200:
(Note: the parallel execution may re-order samples, so match each Output with its Id.)
Stage: [3] Epoch: [0][196/49372] Time: 0.790 (0.966) Id: 200 Tokens: 2688 Output: 30363.91601562500000000000000000000000
Stage: [3] Epoch: [0][197/49372] Time: 1.119 (0.967) Id: 193 Tokens: 3840 Output: 42197.10546875000000000000000000000000
Stage: [3] Epoch: [0][198/49372] Time: 1.385 (0.969) Id: 197 Tokens: 3328 Output: 40601.20703125000000000000000000000000
Stage: [3] Epoch: [0][199/49372] Time: 0.941 (0.969) Id: 196 Tokens: 3200 Output: 30433.37109375000000000000000000000000
Stage: [3] Epoch: [0][200/49372] Time: 1.018 (0.969) Id: 201 Tokens: 3456 Output: 43732.59765625000000000000000000000000
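To check the sequential/parallel equivalence mechanically rather than by eye, here is a minimal sketch. It assumes you captured the two runs' logs yourself, e.g. by appending | tee seq.log and | tee par.log to the commands above; the file names are placeholders of our choosing:
grep -oE 'Id: [0-9]+ Tokens: [0-9]+ Output: [0-9.]+' seq.log | awk '{print $2, $6}' | sort -n > seq_by_id.txt
grep -oE 'Id: [0-9]+ Tokens: [0-9]+ Output: [0-9.]+' par.log | awk '{print $2, $6}' | sort -n > par_by_id.txt
diff seq_by_id.txt par_by_id.txt
An empty diff means every Id produced the same forward loss in both runs; if one run was stopped earlier than the other, entries near the tail may be missing from one file.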
Experiment 2:
In Experiment 2, we provide the original environment in which we conducted our performance evaluation. The throughput trend matches our results in Figure 5 (although the Figure 5 experiments were run on 8 GPUs).
Running 200 steps is enough to obtain a stable throughput.
Pull the transformer image with apex installed:
docker pull zsxhku/transformer:apex
Start the Docker container from the AE directory:
cd naspipe/
nvidia-docker run -it -v $PWD:/workspace --net=host --ipc=host zsxhku/transformer:apex
cd /workspace/throughput/translation
Run the throughput experiment with different search-space configurations by modifying the last argument (e.g., change --input_path [space_config] to --input_path config_4_4.json):
python -m launch --nnodes 1 --node_rank 0 --nproc_per_node 4 main_with_runtime.py --data_dir data/wmt14_en_de_joined_dict --master_addr localhost --module gpus=4 --checkpoint_dir output --distributed_backend gloo -b 3840 --lr 0.000060 --lr_policy polynomial --weight-decay 0.000000 --epochs 10 --print-freq 10 --verbose 0 --num_ranks_in_server 4 --config_path gpus=4/mp_conf.json --input_path [space_config]
space_config c0 file: config_4_4.json
Expected output at step 200; the estimated time to execute 1K subnets is 0.129 hours:
Stage: [3] Epoch: [0][200/1000] Time(1639706538.722864): 0.329 (0.404) Epoch time [hr]: 0.026 (0.129)
space_config c1 file: config_4_3.json
Expected output at step 200; the estimated time to execute 1K subnets is 0.144 hours:
Stage: [3] Epoch: [0][200/1000] Time(1639706693.9953368): 0.349 (0.423) Epoch time [hr]: 0.029 (0.144)
space_config c2 file: config_4_2.json
Expected output at step 200; the estimated time to execute 1K subnets is 0.153 hours:
Stage: [3] Epoch: [0][200/1000] Time(1639706892.0565453): 0.323 (0.442) Epoch time [hr]: 0.031 (0.153)
space_config 4 file: config_4.json
Expected output at step 200; the estimated time to execute 1K subnets is 0.214 hours:
Stage: [3] Epoch: [0][200/1000] Time(1639707112.2461872): 1.513 (0.672) Epoch time [hr]: 0.043 (0.214)
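If you want to run all four configurations back to back and keep the logs, a sketch along these lines should work (the log file names are our own placeholders; each run executes 1000 steps, and throughput stabilizes by roughly step 200):
cd /workspace/throughput/translation
for cfg in config_4_4.json config_4_3.json config_4_2.json config_4.json; do
  python -m launch --nnodes 1 --node_rank 0 --nproc_per_node 4 main_with_runtime.py --data_dir data/wmt14_en_de_joined_dict --master_addr localhost --module gpus=4 --checkpoint_dir output --distributed_backend gloo -b 3840 --lr 0.000060 --lr_policy polynomial --weight-decay 0.000000 --epochs 10 --print-freq 10 --verbose 0 --num_ranks_in_server 4 --config_path gpus=4/mp_conf.json --input_path "$cfg" | tee "throughput_${cfg%.json}.log"
done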
Experiment 2: Auto-generate Figure 5
We have provided the sample raw output (sample) and sample figure (sample_figure.pdf).
You can directly generate the figure from our sample raw output, or you can regenerate the raw output yourself.
The experiments can be run on 4 GPUs, and you should observe the same trend as Figure 5 in our paper.
Start Docker
[NOTE!] Please make sure the data is copied to naspipe/baselines/translation/data/wmt14_en_de_joined_dict/...
cd naspipe/
nvidia-docker run -it -v $PWD:/workspace --net=host --ipc=host zsxhku/transformer:apex
cd /workspace/
Run Experiments
Run Figure 5 script and save the log to a file:
./figure5.sh | tee output
If you encounter any errors, clean up the processes by running naspipe/clean.sh on the host (not inside the Docker container), then re-run the script (see the example below). You can manually select experiments by deleting unwanted ones from the script.
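For example, if a run gets stuck (the paths are illustrative; adjust them to where you cloned the repository):
# on the host, outside the container
cd naspipe/ && bash clean.sh
# back inside the container
cd /workspace && ./figure5.sh | tee output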
Generate Figures
Make sure matplotlib is installed:
python -m pip install matplotlib
Generate Figure 5:
python gen_figure.py output figure5
Then you will get figure5.pdf.
Get Figure PDF from the server via Email
sudo apt-get install mailutils
When the installer prompts for the mail configuration type, choose Internet Site.
Use scp to copy the generated figures (a sketch is given after the mail example below),
or use the command below to send the generated PDFs to your email (remember to set your own email address).
echo "Figure 5" | mail -s "Experiment Figure 5" your_email -A figure5.pdf