Awesome
Baechi: Fast Device Placement on Machine Learning Graphs (SoCC 2020)
Install dependencies
- Install dependencies with Anaconda
$ conda install -y python=3.6 numpy=1.16 tensorflow-gpu=1.12 bazel=0.20.0 \
networkx future matplotlib cvxopt scikit-learn
- Mosek
$ pip install -f https://download.mosek.com/stable/wheel/index.html Mosek==8.1.82
Our code runs MOSEK as an LP solver for SCT.
MOSEK provides a free personal academic license.
You can request a license at https://www.mosek.com/products/academic-licenses.
The license file (mosek.lic
) should be placed at $HOME/mosek
.
Example usage
This example generates the placement of 4-layer GNMT v2 with a batch size of 128, a maximum sequence length of 40, and a vocabulary size of 30000.
- Build a Python program to place operators of an ML model.
$ bazel build :train
- Generate profiles.
$ ./bazel-bin/train \
--costgen \
--cost_path=/tmp/cost.pkl \
--optimizer=adam \
--batch_size=128 \
--model_name=gnmt_v2 \
--vocab_size=30000 \
--max_seq_length=40 \
--rnn_unit_type=lstm \
--rnn_units=512 \
--num_layers=4 \
--encoder_type=gnmt \
--num_gpus=4 \
--residual \
--colocate_grads_with_ops \
--only_forward
This generates profiles of the forward pass and stores them at /tmp/cost.pkl
.
- Generate a communication cost function between GPUs through the linear regression.
$ bazel build //utils:communication_benchmark
$ ./bazel-bin/utils/communication_benchmark
This runs a benchmark that transfers tensors between different GPUs for various tensor sizes.
By default, the benchmark transfers tensors from GPU:0
to GPU:1
with tensor sizes in the range [2<sup>0</sup>, 2<sup>29</sup>].
After the benchmark finishes, it prints out a generated communication cost function that
should be given as the --comm_cost_coeffs
argument value for the placement.
An example output would be the following.
...
Communication cost function: 0.0001754 x + 134
- Place operators of GNMT v2 and measure average step times.
$ ./bazel-bin/train \
--cost_path=/tmp/cost.pkl \
--optimizer=adam \
--batch_size=128 \
--model_name=gnmt_v2 \
--vocab_size=30000 \
--max_seq_length=40 \
--rnn_unit_type=lstm \
--rnn_units=512 \
--num_layers=4 \
--encoder_type=gnmt \
--num_gpus=4 \
--residual \
--colocate_grads_with_ops \
--only_forward \
--placement_method=m_etf \
--placer_type=fusion \
--grouper=coplace \
--comm_cost_coeffs=0.0001754,134 \
--memory_fraction=1.0
This runs the placement of GNMT v2 operators using m-ETF based on the forward operators. When the placement is done, this measures the average step time of the placement results and prints it out.
Docker image
A Docker image with all dependencies installed is available.
$ docker pull beomyeol/baechi
$ docker run -it --rm --gpus all beomyeol/baechi /bin/bash
This gives you direct access to the container with all GPUs enabled. You can follow the example usage within the container.
License
University of Illinois/NCSA Open Source License