Artifact for Lucid

This repository contains the artifact for our ASPLOS '23 paper "Lucid: A Non-Intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs". It includes the following parts:

Getting Started

Results Reproduction (for ASPLOS '23 Artifact Evaluation)

simulation (adopted from Helios) contains instructions for reproducing the Venus cluster experiments shown in Section 4. These scripts have been tested on Ubuntu 20.04 with Python 3.9.

0. Structure

The contents of the simulation folder are summarized as follows:

1. Environment Preparation

We suggest using a conda environment to install the dependencies:

conda create -n lucid python=3.9
conda activate lucid
cd simulation
pip install -r requirements.txt

We also recommend running the Jupyter notebook (.ipynb) files with VSCode or JupyterLab (conda install jupyterlab).

2. Lucid Model Training and Interpretation

We train the Throughput Predict Model as a reproduction example. Please follow the steps below:

More model training code is also provided (estimator/estimator_lucid.ipynb and analyzer/analyzer.py).

3. Reproduce Baseline Results

Use the following command to run all baselines simultaneously:

cd simulation
python simulator.py --sweep 

The output of this script looks like this:

2022 Oct 08 14:32:57 | MainProcess | Total Job Number in Cluster Training: 23859
2022 Oct 08 14:32:59 | ForkPoolWorker-1 | vcEwI | Time: 13220000 | Total Job: 7603 | End job: 13 | Running job: 2 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-2 | vcWoR | Time: 13220000 | Total Job: 2826 | End job: 0 | Running job: 0 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-1 | vcEwI | Time: 13230000 | Total Job: 7603 | End job: 120 | Running job: 4 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-2 | vcWoR | Time: 13230000 | Total Job: 2826 | End job: 0 | Running job: 1 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-1 | vcEwI | Time: 13240000 | Total Job: 7603 | End job: 120 | Running job: 4 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-3 | vcHvQ | Time: 13220000 | Total Job: 2654 | End job: 1 | Running job: 1 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-2 | vcWoR | Time: 13240000 | Total Job: 2826 | End job: 0 | Running job: 1 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-1 | vcEwI | Time: 13250000 | Total Job: 7603 | End job: 121 | Running job: 4 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-4 | vcvGl | Time: 13220000 | Total Job: 1452 | End job: 0 | Running job: 0 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-2 | vcWoR | Time: 13250000 | Total Job: 2826 | End job: 0 | Running job: 2 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-3 | vcHvQ | Time: 13230000 | Total Job: 2654 | End job: 2 | Running job: 0 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-1 | vcEwI | Time: 13260000 | Total Job: 7603 | End job: 162 | Running job: 9 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-5 | vc8Gr | Time: 13220000 | Total Job: 710 | End job: 0 | Running job: 0 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-4 | vcvGl | Time: 13230000 | Total Job: 1452 | End job: 1 | Running job: 2 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-5 | vc8Gr | Time: 13230000 | Total Job: 710 | End job: 0 | Running job: 1 | Pending job: 0
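Because each worker process writes its own VC's progress, the lines above are interleaved. The following is a minimal sketch, not part of the artifact, of how such output could be split into fields and reduced to the latest snapshot per VC. It assumes the " | "-separated layout shown above and that every field after the VC name has the form "name: integer"; the function names `parse_log_line` and `latest_by_vc` are hypothetical.

```python
def parse_log_line(line):
    """Split one simulator log line into a dict.

    Returns None for lines without a VC field, such as the
    "Total Job Number in Cluster Training" header line.
    """
    fields = [f.strip() for f in line.split("|")]
    if len(fields) < 4:
        return None
    record = {"timestamp": fields[0], "process": fields[1], "vc": fields[2]}
    for field in fields[3:]:
        # Assumption: each remaining field looks like "Name: 123".
        key, _, value = field.partition(":")
        record[key.strip()] = int(value)
    return record


def latest_by_vc(lines):
    """Keep only the most recent snapshot seen for each VC."""
    latest = {}
    for line in lines:
        record = parse_log_line(line)
        if record is not None:
            latest[record["vc"]] = record  # later lines overwrite earlier ones
    return latest
```

Calling `latest_by_vc` on the captured output gives one dict per VC, keyed by names such as "End job" and "Pending job", which is convenient for a quick progress check while the sweep is running.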

4. Reproduce Lucid Results

Similarly, use the following command to run Lucid:

python simulator.py -s lucid

The output of this script looks like this:

2022 Oct 08 14:45:07 | MainProcess | Total Job Number in Cluster Training: 23859
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13220000 | Total Job: 23859 | End job: 17 | Running job: 1 | Pending job: 0 | Avail Nodes: 2
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13230000 | Total Job: 23859 | End job: 134 | Running job: 0 | Pending job: 0 | Avail Nodes: 2
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13240000 | Total Job: 23859 | End job: 134 | Running job: 0 | Pending job: 0 | Avail Nodes: 2
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13250000 | Total Job: 23859 | End job: 136 | Running job: 0 | Pending job: 0 | Avail Nodes: 2
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13260000 | Total Job: 23859 | End job: 249 | Running job: 3 | Pending job: 4 | Avail Nodes: 1
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13270000 | Total Job: 23859 | End job: 385 | Running job: 3 | Pending job: 2 | Avail Nodes: 1
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13280000 | Total Job: 23859 | End job: 589 | Running job: 2 | Pending job: 0 | Avail Nodes: 1
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13290000 | Total Job: 23859 | End job: 780 | Running job: 2 | Pending job: 0 | Avail Nodes: 2
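To see how quickly jobs finish between consecutive snapshots, the "End job" counter can be differenced. The sketch below is an illustration only and is not part of the artifact: it assumes the lines belong to a single stream (as in the output above, where every line comes from MainProcess and profvc), and `finished_per_interval` is a hypothetical name.

```python
import re

# Matches the "Time", "Total Job" and "End job" fields of a snapshot line.
SNAPSHOT = re.compile(r"Time: (\d+) \| Total Job: (\d+) \| End job: (\d+)")


def finished_per_interval(lines):
    """Return (time, newly finished jobs) pairs for consecutive snapshots."""
    snapshots = []
    for line in lines:
        match = SNAPSHOT.search(line)
        if match:  # header lines without these fields are skipped
            snapshots.append((int(match.group(1)), int(match.group(3))))
    return [
        (time_now, end_now - end_prev)
        for (_, end_prev), (time_now, end_now) in zip(snapshots, snapshots[1:])
    ]
```

For interleaved multi-VC output, the lines would first need to be grouped by VC, since the counter is tracked per VC.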

After the program is executed, you can check the result in the log folder. The job log and time sequence of each VC are provided separately.

5. Visualize the Key Results

We provide simulation analysis and plotting scripts to generate the figures shown in our paper. Please follow the steps below:

Workloads Profiling

This part, profile, contains code for profiling the metrics of multiple workloads.

Directory

Note that ./result/ will be created when main_co.py or main_single.py is launched.

Basic Usage

Running main_co.py generates the colocated jobs' metrics under ./result/colocate. Running main_single.py generates the single jobs' metrics under ./result/. Workload-specific settings can be changed in each workload's profiling file, e.g., profile_cifar.py. The output looks like this:

imagenet + imagenet
co-locate:
==> Training mobilenet_v3_small model with 32 batchsize, 0 mp..
==> Training mobilenet_v3_small model with 32 batchsize, 0 mp..
co-locate:
==> Training mobilenet_v3_small model with 32 batchsize, 0 mp..
==> Training mobilenet_v3_small model with 32 batchsize, 1 mp..
co-locate:
==> Training mobilenet_v3_small model with 32 batchsize, 1 mp..
==> Training mobilenet_v3_small model with 32 batchsize, 1 mp..
imagenet + cifar10
co-locate:
Files already downloaded and verified
==> Training ResNet18 model with 32 batchsize, 0 mp..
==> Training mobilenet_v3_small model with 32 batchsize, 0 mp..
...
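In the output above, the imagenet + imagenet pair is profiled with the mp settings (0, 0), (0, 1) and (1, 1), which is every unordered pairing of the two mp values. The sketch below reproduces that enumeration under this assumption; it is inferred from the sample output only, is not taken from main_co.py, and `colocation_pairs` is a hypothetical name.

```python
from itertools import combinations_with_replacement


def colocation_pairs(mp_options=(0, 1)):
    """Enumerate unordered pairs of mp settings for two colocated jobs."""
    # Unordered with repetition: (0, 1) and (1, 0) count as the same pairing.
    return list(combinations_with_replacement(mp_options, 2))


for first_mp, second_mp in colocation_pairs():
    print(f"co-locate: job A with {first_mp} mp, job B with {second_mp} mp")
```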

Datasets

The data path storing all datasets is specified in ./workloads/settings.py as data_dir. You can also specify the total runtime of some workloads by changing total_runtime.
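Before launching a long profiling run, it can help to confirm that the configured data path exists. The helper below is a minimal sketch and not part of the artifact: `check_data_dir` is a hypothetical name, and it assumes each dataset lives in its own subdirectory of data_dir, which the text above does not state.

```python
from pathlib import Path


def check_data_dir(data_dir):
    """Return the names of the subdirectories found under data_dir.

    Raises FileNotFoundError if data_dir is not an existing directory.
    """
    path = Path(data_dir)
    if not path.is_dir():
        raise FileNotFoundError(f"data_dir does not exist: {path}")
    # Assumption: one subdirectory per dataset.
    return sorted(p.name for p in path.iterdir() if p.is_dir())
```

Passing the value of data_dir from ./workloads/settings.py to this helper lists what is available before any workload starts.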