
GNNAdvisor: An Efficient Runtime System for GNN Acceleration on GPUs

@inproceedings{GNNAdvisor,
  title={GNNAdvisor: An Efficient Runtime System for GNN Acceleration on GPUs},
  author={Yuke Wang and Boyuan Feng and Gushu Li and Shuangchen Li and Lei Deng and Yuan Xie and Yufei Ding},
  booktitle={USENIX Symposium on Operating Systems Design and Implementation (OSDI'21)},
  year={2021}
}

1. Getting Started Instructions.

git clone --recursive git@github.com:YukeWang96/OSDI21_AE.git
  • cu102/: dockerfile for sm < 80, such as Quadro P6000 and Tesla V100.
  • cu110/: dockerfile for sm >= 80, such as RTX 3090.
  • GNNConv/: the C++/CUDA source code (GNNAdvisor_kernel.cu) for the GNN sparse computation kernels, the Python binding of the kernels (GNNAdvisor.cpp), and the setup.py installation script.
  • gnn_conv.py: the Python script for defining the GNN convolution at a high level.
  • param.py: the Python script for defining the input-level properties and the rules for handling these properties to generate performance-related configurations, such as warpPerBlock (an illustrative sketch follows this list).
  • dataset.py: the Python loader for datasets from either plain .txt edge-list files or binary .npy files.
  • ./s7-4_1_neighbor_partitioning.py, ./s7-4_2_dimension_partitiong.py, ./s7-4_3_node_renumbering.py and ./s7-5_1_hidden_dimension.py are for running additional studies in our paper.
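For context, the snippet below is a minimal, hypothetical sketch of what param.py is responsible for: mapping input-level graph properties to kernel parameters such as partSize, dimWorker, and warpPerBlock. It is not the actual code in param.py; all names and thresholds are illustrative.

# Hypothetical sketch (not the actual param.py code): map input-level graph
# properties to performance-related kernel parameters.
def decide_params(num_nodes, num_edges, dim):
    avg_degree = num_edges / max(num_nodes, 1)
    # Larger neighbor groups amortize scheduling overhead on denser graphs.
    part_size = 32 if avg_degree > 16 else 16
    # At most one warp lane per embedding dimension.
    dim_worker = min(dim, 32)
    # More warps per block for wider embeddings (illustrative threshold).
    warp_per_block = 8 if dim >= 32 else 4
    return {"partSize": part_size, "dimWorker": dim_worker, "warpPerBlock": warp_per_block}

# Example: 100k nodes, 2M edges, 96-dim input features.
print(decide_params(100_000, 2_000_000, 96))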

Step-1: Environment Setup

There are two ways to set up the environment for GNNAdvisor and the baselines.

+ Method 1: Set up the environment via Docker (Recommended).

curl https://get.docker.com | sh \
  && sudo systemctl --now enable docker

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
If launching a GPU container fails with an error like

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

then you need to install the NVIDIA container toolkit:

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
Then launch the container:

docker run -it --rm --gpus device=1 -v $PWD/../../:/GNNA osdi-ae:latest /bin/bash
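Note that the docker run above assumes a local image tagged osdi-ae:latest. This README does not show the build step, so the following is only a sketch, assuming the cu102/ or cu110/ directory contains a standard Dockerfile:

cd cu110    # or cu102, depending on your GPU's compute capability (see the directory list above)
docker build -t osdi-ae:latest .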

+ Method 2: Setup via conda and pip

1) Install system packages for compiling rabbit reordering (root user required).
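The exact packages are not listed here; as a rough guide (an assumption, not taken from this README), compiling the rabbit reordering module typically needs a C++ toolchain plus Boost and tcmalloc, e.g. on Ubuntu:

sudo apt-get install -y build-essential cmake libboost-all-dev libgoogle-perftools-dev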

2) Install Pytorch environment.

conda create -n env_name python=3.6
conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c conda-forge
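Make sure the new environment is active so that this and the following installs land in it (env_name is just the placeholder used above):

conda activate env_name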

or using pip (make sure the pip you use comes from the current conda environment; you can check this with which pip):

pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install tqdm
pip install scipy
conda install -c dglteam dgl-cuda11.0
pip install torch requests
pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.8.0+cu111.html
pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-1.8.0+cu111.html
pip install torch-cluster -f https://pytorch-geometric.com/whl/torch-1.8.0+cu111.html
pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.8.0+cu111.html
pip install torch-geometric
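As an optional sanity check before building GNNAdvisor, confirm that the installed PyTorch can see your GPU (a plain python one-liner, nothing repo-specific):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"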

Step-2: Download the graph datasets.

wget https://storage.googleapis.com/graph_dataset/osdi-ae-graphs.tar.gz
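The download is a standard gzipped tarball; where the extracted files should live depends on the paths the benchmark scripts expect, so adjust as needed, but extraction itself is just:

tar -zxvf osdi-ae-graphs.tar.gz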

3. Detailed Instructions.

  • ./0_bench_dgl_gcn.py | tee run_dgl_gcn.log to run the script and report the 200-epoch runtime for all evaluated datasets.
  • ./1_log2csv.py run_dgl_gcn.log to convert run_dgl_gcn.log to run_dgl_gcn.csv for ease of visualization.
  • ./0_bench_pyg_gcn.py | tee run_pyg_gcn.log to run the script and report the 200-epoch runtime for all evaluated datasets.
  • ./1_log2csv.py run_pyg_gcn.log to convert the log result to run_pyg_gcn.csv for ease of analysis.
  • ./0_bench_GNNA_GCN.py | tee run_GNNA_GCN.log to run the script and report the 200-epoch runtime for all evaluated datasets. Note that there are also several options (such as enable_rabbit) for configuring a profiling run.
  • ./1_log2csv.py run_GNNA_GCN.log to convert run_GNNA_GCN.log to run_GNNA_GCN.csv for ease of result analysis (the combined workflow is sketched after this list).
  • --dataset: the name of the dataset.
  • --dim: the size of the input embedding dimension, default: 96.
  • --hidden: the size of the hidden dimension, default: 16.
  • --classes: the number of output classes, default: 22.
  • --partSize: the size of neighbor-group, default: 32.
  • --dimWorker: the number of worker threads (<=32), default: 32.
  • --warpPerBlock: the number of warps per block, default: 8; recommended: GCN: 8, GIN: 2 for citeseer, 8 for the remaining datasets.
  • --sharedMem: the shared-memory size of each streaming multiprocessor (SM) on NVIDIA GPUs. A reference for the shared-memory sizes of different GPU architectures can be found here; default: 96KB for RTX3090.
  • --model: gcn or gin. The evaluated example GCN model has 2 layers with 16 hidden dimensions, while the example GIN model has 5 layers with 64 hidden dimensions.
  • --num_epoches: the number of epochs for training, default: 200.
  • --loadFromTxt: If this flag is True, it will load the graph from a plain-text edge list, where each line is an edge s1 d1; default: False (load from .npz, which is faster).
  • --enable_rabbit: If this flag is True, the rabbit-reordering routine can be applied. Otherwise, rabbit reordering is skipped in both auto and manual mode.
  • --manual_mode: If this flag is True, it will use the values of the parameters partSize, dimWorker, and warpPerBlock. Otherwise, it will determine these three performance-related parameters automatically via the Decider. Note that the Decider generates two different sets of parameters for the input and hidden layers based on the GNN model and the input characteristics of the dataset. In manual mode, the values of partSize, dimWorker, and warpPerBlock are applied to both the input and hidden layers.
  • --verbose_mode: If this flag is True, it will print out all details of the configuration used for running the experiments.
  • --single_spmm: If this flag is True, it will only profile a single SpMM kernel for 200 rounds, with the provided --dim as the D in NxNxD, where N is the number of nodes in the graph. Run ./3_single_spmm_bench.py to profile the single neighbor-aggregation (SpMM) kernel in comparison with Gunrock SpMM.
  • --verify_spmm: If this flag is True, it will check the correctness of our SpMM kernel against a CPU reference result. Run ./4_verifying.py to verify our major kernel (neighbor aggregation) against the CPU reference result from torch_sparse.spmm.
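Putting the benchmark and log-conversion steps together, a full comparison run over the three frameworks boils down to the following shell sequence (the individual commands are the ones from the bullets above; only the chaining is new):

./0_bench_dgl_gcn.py | tee run_dgl_gcn.log
./1_log2csv.py run_dgl_gcn.log
./0_bench_pyg_gcn.py | tee run_pyg_gcn.log
./1_log2csv.py run_pyg_gcn.log
./0_bench_GNNA_GCN.py | tee run_GNNA_GCN.log
./1_log2csv.py run_GNNA_GCN.log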

Note

  • Based on our profiling on RTX3090 and Quadro P6000, our design shows only a minor speedup on the simple GCN model (2 layers, 16 hidden dimensions), but a more evident speedup on the more complicated GIN model (5 layers, 64 hidden dimensions), which still demonstrates the effectiveness of our optimizations.
  • Our observation is that on small Type I graphs, our framework achieves significant speedups for both the GCN and GIN models on RTX3090 and Quadro P6000. On larger Type II and Type III datasets, our GIN implementation shows more evident speedups.
  • For neighbor_partitioning: [figure: Neighbor Partitioning study results] (a rough sketch of the neighbor-partitioning idea appears after this list).
  • For dimension_partitiong: [figure: Dimension Partitioning study results]
  • For hidden_dimension: [figure: Hidden Dimension study results]
  • For node_renumbering: [figure: Node Renumbering study results]
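To make the neighbor-partitioning study above concrete, the following is a rough, hypothetical Python illustration of the idea that partSize controls: each node's neighbor list is split into fixed-size groups, each intended to be processed by one warp. It is not the repository's implementation; all names are illustrative.

# Hypothetical sketch of neighbor partitioning: each node's neighbor list is
# split into groups of at most part_size entries, one group per warp.
def neighbor_partition(csr_ptr, part_size=32):
    parts = []  # (node_id, start_offset, end_offset) for each neighbor group
    for node in range(len(csr_ptr) - 1):
        start, end = csr_ptr[node], csr_ptr[node + 1]
        for off in range(start, end, part_size):
            parts.append((node, off, min(off + part_size, end)))
    return parts

# Toy CSR row pointer: node 0 has 5 neighbors, node 1 has 2, node 2 has 70.
print(neighbor_partition([0, 5, 7, 77], part_size=32))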

Reference