Overview

This repository contains public releases of SenseTime Helios traces for the benefit of the deep learning system research community.

<!-- Note that [Git LFS](https://git-lfs.github.com/) is required for downloading Helios traces. -->

If you do use the Helios traces in your research, please make sure to cite our SC '21 paper "Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters", which includes a comprehensive analysis of the deep learning workloads in Helios from April 2020 to September 2020.

We encourage anyone to use the traces for academic purposes. If you have any questions, feel free to send us an email or file an issue on GitHub.

<!-- > **Note that only the `Venus` trace is public available now. Other traces are being censored. We will release them as soon as possible.** -->

Helios Description

Helios is a private datacenter dedicated to developing DL models for research and production in SenseTime. It contains 8 independent GPU clusters and over 12,000 GPUs in total.

In this repository, we publicly release the workload traces of 4 representative GPU clusters: Earth, Saturn, Uranus, and Venus. You can find a detailed description of the Helios datacenter in the SC '21 paper mentioned above.

We also release the analysis scripts for the Helios traces in HeliosArtifact.

Helios Dataset

The main trace characteristics, dataset structure and schema are:

Main Characteristics:

Dataset Structure

Each cluster provides a job trace file (cluster_log.csv) and a VC configuration file (cluster_gpu_number.csv).

📦data
 ┣ 📂Earth
 ┃ ┣ 📜cluster_gpu_number.csv
 ┃ ┗ 📜cluster_log.csv
 ┣ 📂Saturn
 ┃ ┣ 📜cluster_gpu_number.csv
 ┃ ┗ 📜cluster_log.csv
 ┣ 📂Uranus
 ┃ ┣ 📜cluster_gpu_number.csv
 ┃ ┗ 📜cluster_log.csv
 ┗ 📂Venus
   ┣ 📜cluster_gpu_number.csv
   ┗ 📜cluster_log.csv
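
As a small illustration of the layout above, the sketch below builds the path of each cluster's two files. It assumes the repository has been cloned so that the `data` directory sits in the working directory; `trace_paths` and `CLUSTERS` are hypothetical names introduced only for this example.

```python
from pathlib import Path

# The four released clusters, as listed in the directory tree.
CLUSTERS = ("Earth", "Saturn", "Uranus", "Venus")


def trace_paths(root="data"):
    """Map each cluster name to its job trace and VC configuration paths.

    `root` is an assumption: the location of the `data` directory locally.
    """
    root = Path(root)
    return {
        name: {
            "log": root / name / "cluster_log.csv",
            "gpu_number": root / name / "cluster_gpu_number.csv",
        }
        for name in CLUSTERS
    }


paths = trace_paths()
for name, files in paths.items():
    print(name, files["log"].as_posix(), files["gpu_number"].as_posix())
```

The function only constructs paths and does not touch the file system, so it can be used to check which files are present before loading them.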

Schema and Description

cluster_log.csv

Description

Provides rich information on all jobs submitted to Slurm in each cluster.

Example

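| job_id | user | vc | gpu_num | cpu_num | node_num | state | submit_time | start_time | end_time | duration | queue |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1425511 | uXBbc | vcJkd | 1 | 1 | 1 | COMPLETED | 2020-06-09 18:41:01 | 2020-06-09 18:41:01 | 2020-06-10 04:55:09 | 36848 | 0 |
| 1425512 | uVMrF | vchbv | 4 | 16 | 1 | FAILED | 2020-06-09 18:41:27 | 2020-06-09 18:41:27 | 2020-06-09 18:45:36 | 249 | 0 |
| 1425513 | uzqls | vcpDC | 1 | 1 | 1 | CANCELLED | 2020-06-09 18:41:28 | 2020-06-09 18:41:28 | 2020-06-17 14:15:21 | 675233 | 0 |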

Schema

| Field | Description |
| --- | --- |
| job_id | unique id of the job <sup>1</sup> |
| user | hashed id for the user, prefix is 'u' |
| vc | hashed id for the virtual cluster, prefix is 'vc' |
| gpu_num | number of GPUs required for the job |
| cpu_num | number of CPUs required for the job |
| node_num | number of nodes in the job |
| state | the job's status upon termination <sup>2</sup> |
| submit_time | the job's submission time |
| start_time | the job's start execution time |
| end_time | the job's termination time |
| duration | total execution time of the job <sup>3</sup> |
| queue | total queue time of the job <sup>4</sup> |

Notes

  1. job_id is generated by Slurm and reflects the job submission order in each cluster.
  2. A job ends in one of five statuses: (1) COMPLETED: it finished successfully; (2) CANCELLED: it was terminated by the user; (3) FAILED: it was terminated due to an internal or external error; (4) TIMEOUT: it exceeded its execution time limit; (5) NODE_FAIL: it was terminated due to a node crash. TIMEOUT and NODE_FAIL are very rare in our traces and are regarded as failed in our analysis. (Another status, SUSPENDED, occurs only once, in cluster Uranus, so we ignore it.)
  3. Calculated from the difference between end_time and start_time. (Unit: seconds)
  4. Calculated from the difference between start_time and submit_time. (Unit: seconds)
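
Notes 2 to 4 can be checked with a short, hedged sketch that recomputes `duration` and `queue` from the timestamps and folds TIMEOUT and NODE_FAIL into FAILED. The sample embeds the three example rows above as comma-separated text. The comma delimiter and the `%Y-%m-%d %H:%M:%S` timestamp format are assumptions about the file layout, and `derive` and `FAILED_LIKE` are hypothetical names used only here.

```python
import csv
import io
from datetime import datetime

# The three example rows, written as comma-separated text (assumed layout).
SAMPLE = """job_id,user,vc,gpu_num,cpu_num,node_num,state,submit_time,start_time,end_time,duration,queue
1425511,uXBbc,vcJkd,1,1,1,COMPLETED,2020-06-09 18:41:01,2020-06-09 18:41:01,2020-06-10 04:55:09,36848,0
1425512,uVMrF,vchbv,4,16,1,FAILED,2020-06-09 18:41:27,2020-06-09 18:41:27,2020-06-09 18:45:36,249,0
1425513,uzqls,vcpDC,1,1,1,CANCELLED,2020-06-09 18:41:28,2020-06-09 18:41:28,2020-06-17 14:15:21,675233,0
"""

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format
# Note 2: TIMEOUT and NODE_FAIL are regarded as failed.
FAILED_LIKE = {"FAILED", "TIMEOUT", "NODE_FAIL"}


def parse_time(text):
    return datetime.strptime(text, TIME_FORMAT)


def derive(row):
    """Recompute duration and queue (in seconds) and normalize the state."""
    submit = parse_time(row["submit_time"])
    start = parse_time(row["start_time"])
    end = parse_time(row["end_time"])
    return {
        "job_id": int(row["job_id"]),
        "state": "FAILED" if row["state"] in FAILED_LIKE else row["state"],
        "duration": int((end - start).total_seconds()),  # note 3
        "queue": int((start - submit).total_seconds()),  # note 4
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
jobs = [derive(row) for row in rows]
for row, job in zip(rows, jobs):
    # Compare the recomputed values with the columns stored in the trace.
    consistent = (job["duration"] == int(row["duration"])
                  and job["queue"] == int(row["queue"]))
    print(job, "consistent:", consistent)
```

To run the same check on a released trace, replace `io.StringIO(SAMPLE)` with an open file handle for a `cluster_log.csv`.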

cluster_gpu_number.csv

Description

Lists the number of GPUs per day in each VC.

Example

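| date | vchbv | vc4om | vcVP5 | vc6YE | vchA3 | vccaA | vcTJs | vcvlY | vcSoL | vcMod | vcpDC | vc3sl | vc8Sj | vcJLV | vcLJZ | vcIya | vcJkd | vcdI0 | vcira | vcgkz | vcxS0 | vc7hD | vcXrB | vcvcM | vcp4O | total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2020-09-01 | 64 | 96 | 176 | 216 | 40 | 40 | 48 | 96 | 32 | 64 | 56 | 64 | 0 | 32 | 64 | 0 | 16 | 16 | 16 | 8 | 0 | 0 | 0 | 0 | 0 | 1144 |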

Schema

| Field | Description |
| --- | --- |
| date | record granularity is daily |
| vc*** | the number of GPUs of the VC |
| total | the total number of GPUs of the cluster |
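
A minimal sketch of how this schema can be used: it reads one daily record, keeps the per-VC columns, and checks that they add up to `total`. The sample row mirrors the example above. The comma delimiter is an assumption about the file layout, and `vc_gpus` is a hypothetical helper name.

```python
import csv
import io

# One daily record, written as comma-separated text (assumed layout).
SAMPLE = (
    "date,vchbv,vc4om,vcVP5,vc6YE,vchA3,vccaA,vcTJs,vcvlY,vcSoL,vcMod,"
    "vcpDC,vc3sl,vc8Sj,vcJLV,vcLJZ,vcIya,vcJkd,vcdI0,vcira,vcgkz,vcxS0,"
    "vc7hD,vcXrB,vcvcM,vcp4O,total\n"
    "2020-09-01,64,96,176,216,40,40,48,96,32,64,56,64,0,32,64,0,16,16,16,"
    "8,0,0,0,0,0,1144\n"
)


def vc_gpus(row):
    """Return the GPU count of each VC, skipping the date and total columns."""
    return {name: int(value) for name, value in row.items()
            if name.startswith("vc")}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for row in rows:
    per_vc = vc_gpus(row)
    vc_sum = sum(per_vc.values())
    # The per-VC counts are expected to add up to the cluster total.
    print(row["date"], "VCs:", len(per_vc), "sum:", vc_sum,
          "matches total:", vc_sum == int(row["total"]))
```

Selecting columns by the `vc` prefix relies on the naming rule in the `cluster_log.csv` schema, where every virtual cluster id starts with 'vc'.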