Artifact for SC '21

This repository contains the artifact for the SC '21 paper "Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters". It includes the following four parts, detailed below.

We have updated the notebooks and scripts for all four traces!

Detailed Introduction

environment

Provides details on the experimental environment, as described in the Appendix: Artifact Description/Artifact Evaluation.

data

Initially, this folder does NOT exist. You need to download and unzip the dataset from HeliosData. Afterwards, the folder structure should be:

📦data
 ┣ 📂Earth
 ┃ ┣ 📜cluster_gpu_number.csv
 ┃ ┗ 📜cluster_log.csv
 ┣ 📂Saturn
 ┃ ┣ 📜cluster_gpu_number.csv
 ┃ ┗ 📜cluster_log.csv
 ┣ 📂Uranus
 ┃ ┣ 📜cluster_gpu_number.csv
 ┃ ┗ 📜cluster_log.csv
 ┗ 📂Venus
   ┣ 📜cluster_gpu_number.csv
   ┗ 📜cluster_log.csv

analysis

Contains parsing and plotting code to analyze traces.

framework

A prediction-based GPU resource management framework.

This folder contains QSSF Service and CES Service scripts and related data.

Quick Start

These scripts have been tested on Ubuntu 20.04 with Python 3.8 (on the analysis platform).

Here are the step-by-step instructions for the artifact.

Preparing

  1. Download Helios artifact and data repository.

    git clone git@github.com:S-Lab-System-Group/HeliosArtifact.git
    cd HeliosArtifact
    
    git clone git@github.com:S-Lab-System-Group/HeliosData.git
    unzip ./HeliosData/data.zip -d ./
    
  2. Check software dependencies:

    For the analysis part, JupyterLab / Jupyter Notebook is needed.

    For the other Python libraries used in this project, see requirements.txt. A quick way to confirm they are available is sketched below.
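
    A minimal import check (the package names below are assumptions about typical requirements.txt contents, not the authoritative list):

    import importlib

    # Hypothetical subset of requirements.txt; adjust to the actual file.
    for pkg in ["pandas", "numpy", "matplotlib", "seaborn"]:
        importlib.import_module(pkg)
        print(pkg, "is available")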

Reproducing analysis

  1. Prepare and parse the trace files for analysis.

    cd analysis
    python ./trace_parser.py
    
  2. After generating all the required data, you can analyze the traces through the .ipynb files within the 4 sub-folders of analysis: 1_compare with Philly trace, 2_cluster characterization, 3_job characterization, 4_user characterization.

    These Jupyter Notebook scripts generate the figures for the trace analysis part of the paper; they can also be executed headlessly, as sketched below.
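
    A sketch for running a notebook non-interactively with nbformat / nbconvert (the path below is a placeholder for any notebook in those sub-folders):

    import nbformat
    from nbconvert.preprocessors import ExecutePreprocessor

    # Placeholder: substitute a real notebook from one of the four sub-folders.
    folder = "2_cluster characterization"
    path = f"{folder}/notebook.ipynb"

    nb = nbformat.read(path, as_version=4)
    ExecutePreprocessor(timeout=600).preprocess(nb, {"metadata": {"path": folder}})
    nbformat.write(nb, path)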

Reproducing framework

QSSF Service

  1. Before executing the QSSF service simulation, data preparation is needed.

    It generates the VC configuration and job trace for each cluster.

    cd framework/QSSF\ Service/data
    bash prepare_data.sh 
    
  2. Then, you can run all scheduling policies on the Philly trace in sweep mode, as below:

    cd ..
    python simulator.py -e='Philly' -t='./data/Philly' --sweep 
    

    See run.sh for more usage examples on Helios. Note that since we do not release job name information, the estimator and qssf policy are not available for Helios.
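
    To sweep the Helios clusters as well, a loop over the documented flags works. A sketch, assuming prepare_data.sh placed per-cluster traces under ./data/ mirroring the Philly layout:

    import subprocess

    # Assumed layout: prepare_data.sh emits one trace folder per cluster.
    for cluster in ["Earth", "Saturn", "Uranus", "Venus"]:
        subprocess.run(
            ["python", "simulator.py", f"-e={cluster}", f"-t=./data/{cluster}", "--sweep"],
            check=True,
        )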

  3. After the program finishes, you can check the results in the log folder. The job log and the time sequence of each VC are provided separately.
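
    For a first, schema-agnostic look at the output, the sketch below just walks the log folder (assuming the logs are CSV files; no column names are assumed):

    import glob
    import pandas as pd

    # Summarize every CSV the simulator produced.
    for path in sorted(glob.glob("log/**/*.csv", recursive=True)):
        df = pd.read_csv(path)
        print(path, df.shape)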

  4. In addition, we provide a simulation analysis and plotting script in plot.

    You can generate Figure 13 in the paper through this script.

CES Service

  1. Run CES simulation on Helios:

    cd framework/CES\ Service
    python CES_Helios.py
    

    You can specify a different cluster in the script and adjust the configurations of the CES service via the hyperparameter function.

  2. Similarly, run CES simulation on Philly:

    python CES_Philly.py
    
  3. From the code output and the generated figures helios_ces (Figure 14) & philly_ces (Figure 15), you can analyze the CES service performance in detail.