Overview

This repository contains a representative subset of the first-party DNN training workloads on Microsoft's internal Philly clusters. The trace is a sanitized subset of the workload described in "Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads" in ATC’19. This work was done as part of Microsoft Research's Project Fiddle.

We include in this repository a Jupyter notebook that highlights the main characteristics of the traces and shows how to parse them (a huge thank you to Keshav Santhanam for putting this together).

We provide the trace as is. If you do use this trace in your research, please make sure to cite our ATC’19 paper (mentioned above).

Trace Details

Main characteristics:

Schema:

cluster_job_log

Description: Contains information about each job, including each individual successful scheduling attempt.

Format: JSON

Example entry:

{
    "status": "Pass",
    "vc": "ee9e8c",
    "jobid": "application_1506638472019_14199",
    "attempts": [
        {
            "start_time": "2017-10-07 01:12:09",
            "end_time": "2017-10-07 01:13:23",
            "detail": [
                {
                    "ip": "m47",
                    "gpus": [
                        "gpu0",
                        "gpu1",
                        "gpu2",
                        "gpu3",
                        "gpu4",
                        "gpu5",
                        "gpu6",
                        "gpu7"
                    ]
                }
            ]
        },
        {
            "start_time": "2017-10-07 01:13:30",
            "end_time": "2017-10-09 06:53:12",
            "detail": [
                {
                    "ip": "m412",
                    "gpus": [
                        "gpu0",
                        "gpu1",
                        "gpu2",
                        "gpu3",
                        "gpu4",
                        "gpu5",
                        "gpu6",
                        "gpu7"
                    ]
                }
            ]
        }
    ],
    "submitted_time": "2017-10-07 01:11:39",
    "user": "ce2f4c"
}

List of keys:

status: final status of the job (e.g., "Pass" in the example above).
vc: anonymized identifier of the virtual cluster the job was submitted to.
jobid: unique identifier of the job.
attempts: list of scheduling attempts; each records a start_time, an end_time, and detail, a list of the servers used (ip) together with the GPUs allocated on each (gpus).
submitted_time: time at which the job was submitted.
user: anonymized identifier of the user who submitted the job.

Notes:

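For reference, the following is a minimal Python sketch of how this file can be parsed. The path ("cluster_job_log") is hypothetical, and the sketch assumes the file holds a single JSON array of job records shaped like the example above; adjust if your copy differs:

import json
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical path; point it at the extracted trace file.
with open("cluster_job_log") as f:
    jobs = json.load(f)  # assumed: one JSON array of job records

for job in jobs[:5]:
    submitted = datetime.strptime(job["submitted_time"], FMT)
    for i, attempt in enumerate(job["attempts"]):
        # Guard against attempts with missing timestamps, just in case.
        if not attempt["start_time"] or not attempt["end_time"]:
            continue
        start = datetime.strptime(attempt["start_time"], FMT)
        end = datetime.strptime(attempt["end_time"], FMT)
        n_gpus = sum(len(d["gpus"]) for d in attempt["detail"])
        print(job["jobid"], job["status"],
              f"attempt={i}",
              f"queued_s={(start - submitted).total_seconds():.0f}" if i == 0 else "",
              f"run_s={(end - start).total_seconds():.0f}",
              f"gpus={n_gpus}")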

cluster_gpu_util

Description: Provides a per-minute record of each GPU's utilization as reported by nvidia-smi.

Format: CSV

Columns:

time, machineId, gpu0_util, gpu1_util, gpu2_util, gpu3_util, gpu4_util, gpu5_util, gpu6_util, gpu7_util

Example entry:

2017-10-03 00:08:00 PDT,m29,60.8,99.366666667,100.0,63.333333333,100.0,100.0,100.0,100.0,

Notes:

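A minimal pandas sketch for loading this file. The path is hypothetical, and it assumes the CSV ships with a header row matching the columns above (if it does not, pass the column names via names= and header=None):

import pandas as pd

# Hypothetical path; adjust to the extracted file.
df = pd.read_csv("cluster_gpu_util")

# The trailing comma in each row yields an extra, empty unnamed column; drop it.
df = df.loc[:, ~df.columns.str.startswith("Unnamed")]

# Timestamps carry a timezone abbreviation (PDT/PST); strip it before parsing.
df["time"] = pd.to_datetime(df["time"].str.replace(r" P[DS]T$", "", regex=True))

# Coerce the utilization columns to numeric in case of missing entries.
util_cols = [c for c in df.columns if c.endswith("_util")]
df[util_cols] = df[util_cols].apply(pd.to_numeric, errors="coerce")

# Example: mean gpu0 utilization per machine.
print(df.groupby("machineId")["gpu0_util"].mean().head())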

cluster_cpu_util

Description: Provides a per-minute record of each server's CPU utilization.

Format: CSV

Columns:

time, machine_id, cpu_util

Example entry:

2017-11-27 00:04:00 PST,m29,31.845

Notes:

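The same pandas pattern applies here; again, the path and the presence of a header row are assumptions:

import pandas as pd

# Hypothetical path; same header assumption as for cluster_gpu_util.
cpu = pd.read_csv("cluster_cpu_util")
cpu["time"] = pd.to_datetime(cpu["time"].str.replace(r" P[DS]T$", "", regex=True))

# Example: machines with the highest average CPU utilization.
print(cpu.groupby("machine_id")["cpu_util"].mean().sort_values(ascending=False).head())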

cluster_mem_util

Description: Provides a per-minute record of each server's memory utilization.

Format: CSV

Columns:

time, machine_id, mem_total, mem_free

Example entry:

2017-10-03 00:06:00 PDT,m29,528272672.0,2030730.6667

Notes:

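Because the file reports total and free memory rather than a percentage, a utilization fraction can be derived directly, as in this sketch (hypothetical path, header row assumed):

import pandas as pd

# Hypothetical path; same header assumption as for the other CSV files.
mem = pd.read_csv("cluster_mem_util")
mem["time"] = pd.to_datetime(mem["time"].str.replace(r" P[DS]T$", "", regex=True))

# Utilization fraction = 1 - free/total (both columns use the same unit).
mem["mem_util"] = 1.0 - mem["mem_free"] / mem["mem_total"]
print(mem.groupby("machine_id")["mem_util"].mean().head())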

cluster_machine_list

Description: Lists the number of GPUs and per-GPU memory available on each server in the cluster.

Format: CSV

Columns:

machineId, number of GPUs, single GPU mem

Example entry:

m31,8, 24GB
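
A short sketch for loading the machine list and summarizing cluster capacity; the path and the header row are assumptions, and the stray whitespace visible in the example entry is stripped:

import pandas as pd

# Hypothetical path; skipinitialspace handles the space before "24GB".
machines = pd.read_csv("cluster_machine_list", skipinitialspace=True)
machines.columns = [c.strip() for c in machines.columns]

# Example: total GPU count across the cluster and the breakdown by per-GPU memory size.
print(machines["number of GPUs"].sum())
print(machines.groupby("single GPU mem")["number of GPUs"].sum())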