
# Driple

<img src="https://raw.githubusercontent.com/gsyang33/driple/master/others/structure.jpg" alt="*Driple* structure" width="700"/>

## Overview

Driple was introduced at ACM SIGMETRICS 2022. Please refer to the papers listed under Reference for more details.

Driple trains a machine learning model, called the Driple inspector, that predicts 12 resource-consumption metrics. Specifically, Driple predicts the 1) burst duration, 2) idle duration, and 3) burst consumption of each of 1) GPU utilization, 2) GPU memory utilization, 3) network TX throughput, and 4) network RX throughput.
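
The 12 outputs are simply the cross product of the three burst-pattern metrics and the four resource types. A minimal sketch (metric names paraphrased from the description above, not taken from the code):

```python
from itertools import product

# Three burst-pattern metrics predicted for each resource type.
patterns = ["burst duration", "idle duration", "burst consumption"]

# Four monitored resource types.
resources = ["GPU utilization", "GPU memory utilization",
             "network TX throughput", "network RX throughput"]

# 3 patterns x 4 resources = the 12 Driple inspector outputs.
metrics = [f"{p} of {r}" for r, p in product(resources, patterns)]
print(len(metrics))  # 12
```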

Driple applies two key designs: 1) graph neural network (GNN)-based training of the inspector and 2) transfer learning to reduce training time.


## Driple inspector training

The implementation for training the Driple inspector is in the `training` directory.

The Driple inspector can be trained with or without transfer learning. We provide the pre-trained model that we use for transfer learning; note that it is available only for the GCN algorithm.
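
Conceptually, transfer learning here means initializing a new inspector from the pre-trained GCN weights and then fine-tuning on the target dataset. A minimal sketch of that initialization step, with made-up parameter names and shapes (the actual checkpoint layout in `pre-train.pkl` may differ):

```python
import numpy as np

# Hypothetical pre-trained parameters (stand-ins for the provided GCN model).
pretrained = {"gcn.layer1.weight": np.ones((16, 64)),
              "gcn.layer2.weight": np.ones((64, 64)),
              "mlp.out.weight": np.ones((64, 12))}

# Freshly initialized target model to be fine-tuned.
target = {"gcn.layer1.weight": np.zeros((16, 64)),
          "gcn.layer2.weight": np.zeros((64, 64)),
          "mlp.out.weight": np.zeros((64, 12))}

# Copy every pre-trained tensor whose name and shape match the target model;
# anything else keeps its fresh initialization.
transferred = []
for name, w in pretrained.items():
    if name in target and target[name].shape == w.shape:
        target[name] = w.copy()
        transferred.append(name)
```

Fine-tuning then resumes gradient descent from these copied weights instead of from scratch.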

### Environment setup

We implement and test the training part of the Driple inspector in a conda environment. The dependencies and requirements of our conda setting are listed in `driple_training_requirement.txt`. You can set up a similar conda environment with the following command.

```shell
conda install -n <env_name> --file driple_training_requirement.txt
```

### Execute training

To execute training, run one of the commands below.

```shell
# Training from scratch (without transfer learning)
python3 -m driple.train.gcn --variable --gru --epochs=100000 --patience=1000 --variable_conv_layers=Nover2 --only_graph --hidden=64 --mlp_layers=3 --data=[Dataset].pkl

# Training with transfer learning from the provided pre-trained model
python3 -m driple.train.gcn --variable --gru --epochs=100000 --patience=1000 --variable_conv_layers=Nover2 --only_graph --hidden=64 --mlp_layers=3 --pre_trained=training/pre-train.pkl --data=[Dataset].pkl --transfer
```

## Training dataset generation

We provide the 14 datasets used in the paper (`/dataset/examples`). See "Details of the dataset" below to check the distributed training (DT) setting with which each dataset was built.

<details><summary>Details of the dataset</summary>

| Name | GPU | DP topology | Network | # of GPU machines |
| --- | --- | --- | --- | --- |
| V100-P1w2/ho-PCIe | V100 | PS1/w2/homo | Co-located | 1 |
| V100-P2w2/ho-PCIe | V100 | PS2/w2/homo | Co-located | 1 |
| 2080Ti-P1w2/ho-PCIe | 2080Ti | PS1/w2/homo | Co-located | 1 |
| 2080Ti-P1w3/ho-PCIe | 2080Ti | PS1/w3/homo | Co-located | 1 |
| 2080Ti-P2w2/he-PCIe | 2080Ti | PS2/w2/hetero | Co-located | 1 |
| TitanRTX-P2w2/he-PCIe | Titan RTX | PS2/w2/hetero | Co-located | 1 |
| 2080Ti-P2w2/he-40G | 2080Ti | PS2/w2/hetero | 40 GbE | 2 |
| TitanRTX-P2w2/he-40G | Titan RTX | PS2/w2/hetero | 40 GbE | 2 |
| 2080Ti-P4w4/he-40G | 2080Ti | PS4/w4/hetero | 40 GbE | 2 |
| TitanRTX-P4w4/he-40G | Titan RTX | PS4/w4/hetero | 40 GbE | 2 |
| V100-P5w5/he-1G | V100 | PS5/w5/hetero | 1 GbE | 5 |
| 2080Ti-P5w5/he-1G | 2080Ti | PS5/w5/hetero | 1 GbE | 5 |
| V100-P5w10/he-1G | V100 | PS5/w10/hetero | 1 GbE | 5 |
| 2080Ti-P5w10/he-1G | 2080Ti | PS5/w10/hetero | 1 GbE | 5 |
</details>

The dataset consists of representative image classification and natural language processing models. We use `tf_cnn_benchmark` and `OpenNMT` to run the models.

For developers who want to create their own datasets, we provide an example of dataset generation below.

### Input and output feature records

To be updated soon.

### Training dataset generation

We convert computational graphs into adjacency and feature matrices, and then produce the training dataset composed of the converted matrices and output features.

```shell
python3 dataset_builder/generate_dataset.py --perf_result=[Result].csv --batch_size=32 --num_of_groups=100 --num_of_graphs=320 --save_path=[Path] --dataset_name=[Dataset].pkl
```
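
The graph-to-matrix conversion can be sketched as follows. This is a toy illustration with made-up op names and node attributes; the real `generate_dataset.py` derives both from profiled training graphs and may use a different feature layout:

```python
import numpy as np

# Toy computational graph: ops as nodes, data dependencies as edges.
nodes = ["conv1", "relu1", "conv2", "loss"]
edges = [("conv1", "relu1"), ("relu1", "conv2"), ("conv2", "loss")]

# Hypothetical per-op attributes (e.g., compute cost, output size).
features = {"conv1": [1.0, 0.5], "relu1": [0.1, 0.5],
            "conv2": [2.0, 0.25], "loss": [0.05, 0.0]}

idx = {name: i for i, name in enumerate(nodes)}
n = len(nodes)

# Adjacency matrix A: A[i, j] = 1 if op i feeds op j.
A = np.zeros((n, n))
for src, dst in edges:
    A[idx[src], idx[dst]] = 1.0

# Feature matrix X: one row of attributes per node, in node order.
X = np.array([features[v] for v in nodes])

print(A.shape, X.shape)  # (4, 4) (4, 2)
```

The resulting (A, X) pairs, together with the measured output features from `[Result].csv`, are what the `.pkl` training dataset bundles up.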

## Reference