QGTC: Accelerating Quantized GNN via GPU Tensor Core
- Cite this project and paper.
@inproceedings{QGTC,
title={QGTC: Accelerating Quantized GNN via GPU Tensor Core},
author={Yuke Wang and Boyuan Feng and Yufei Ding},
booktitle={ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'22)},
year={2022}
}
Clone this project.
git clone git@github.com:YukeWang96/PPoPP22_QGTC.git
Dependencies.
- Python 3.7+.
- PyTorch 1.5.0+.
- Deep Graph Library.
Environment Setup.
[Method-1] Install via Docker (Recommended).
- (i) Pull docker image:
docker pull happy233/qgtc:updated
docker run -it --rm --gpus all -v $PWD/:/qgtc happy233/qgtc:updated /bin/bash
- (ii) Build docker from scratch:
cd Docker/
./build.sh
./launch.sh
[Method-2] Install via Conda.
- Install conda on your system (see the official conda installation tutorial).
- Create a conda environment:
conda create -n env_name python=3.6
- Install PyTorch:
conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c conda-forge
or using pip (make sure the pip you use comes from the current conda environment; you can check this with which pip):
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
- Install Deep Graph Library (DGL):
conda install -c dglteam dgl-cuda11.0
pip install torch requests
Install QGTC. Go to QGTC_module/, then run:
TORCH_CUDA_ARCH_LIST="8.6" python setup.py clean --all install
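The TORCH_CUDA_ARCH_LIST="8.6" setting targets Ampere GPUs (e.g., RTX 3090). Before building, it can help to confirm which compute capability your device actually reports; a minimal check, assuming only that PyTorch with CUDA support is installed:

```python
import torch

# Sanity check before building the QGTC extension: make sure a CUDA device is
# visible and see which compute capability it reports. Adjust
# TORCH_CUDA_ARCH_LIST accordingly (8.6 above assumes an Ampere GPU).
assert torch.cuda.is_available(), "A CUDA-capable GPU is required to build/run QGTC"
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU: {torch.cuda.get_device_name(0)}, compute capability: {major}.{minor}")
print(f'Suggested TORCH_CUDA_ARCH_LIST="{major}.{minor}"')
```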
Running Experiments
Download datasets
./download_dataset.sh
Figure 7: Speedup comparison.
- (a) Cluster GCN. You can change the bitwidth in 0_7a_eval_QGTC_cluster_GCN.py to any of [1, 2, 4, 8] for evaluation (default: 2-bit).
./0_7a_eval_QGTC_cluster_GCN.py
./1_7a_eval_DGL_cluster_GCN.py
Check the results in QGTC_cluster_GCN_*bit.csv and DGL_cluster_GCN.csv. You can expect results like the following:
dataset | Epoch (ms) |
---|---|
artist | 263.646 |
soc-BlogCatalog | 209.495 |
ppi | 189.016 |
ogbn-arxiv | 208.616 |
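To turn these raw epoch times into the speedup numbers reported in Figure 7(a), the QGTC and DGL CSVs can be joined per dataset. A minimal post-processing sketch, assuming both files contain dataset and Epoch (ms) columns as in the table above (the exact column names are an assumption):

```python
import pandas as pd

# Hypothetical post-processing: speedup of the QGTC 2-bit run over DGL per dataset.
# Column names "dataset" and "Epoch (ms)" are assumed from the table above.
qgtc = pd.read_csv("QGTC_cluster_GCN_2bit.csv")
dgl = pd.read_csv("DGL_cluster_GCN.csv")

merged = qgtc.merge(dgl, on="dataset", suffixes=("_qgtc", "_dgl"))
merged["speedup"] = merged["Epoch (ms)_dgl"] / merged["Epoch (ms)_qgtc"]
print(merged[["dataset", "speedup"]])
```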
- (b) Batched GIN. You can change the bitwidth in 0_7b_eval_QGTC_batched_GIN.py to any of [1, 2, 4, 8] for evaluation (default: 2-bit).
./0_7b_eval_QGTC_batched_GIN.py
./1_7b_eval_DGL_batched_GIN.py
Check the results in QGTC_batched_GIN_*bit.csv and DGL_batched_GIN.csv.
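For context, the bitwidth knob controls how many bits each quantized value carries before it is fed to the Tensor Core kernels. The sketch below illustrates plain uniform quantization of a feature tensor to a given bitwidth; it is only meant to convey the idea behind the setting, not QGTC's actual CUDA-side quantization and bit-packing:

```python
import torch

def quantize_uniform(x: torch.Tensor, bitwidth: int):
    """Illustrative uniform quantization of a non-negative float tensor to
    integer levels in [0, 2^bitwidth - 1]. QGTC's own quantization and
    bit-packing live inside the CUDA kernels; this is just a conceptual sketch."""
    qmax = 2 ** bitwidth - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), 0, qmax).to(torch.int32)
    return q, scale

feat = torch.rand(4, 8)            # toy node features
q, scale = quantize_uniform(feat, bitwidth=2)
print(q)                           # integer levels 0..3 for a 2-bit setting
print((q.float() * scale - feat).abs().max())  # quantization error
```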
Figure 8: Additional studies.
- (a) Comparison with cuBLAS GemmEx (INT8) on Tensor Cores.
./3_8a_QGTC_GEMM_INT8.py
cd cuBLASGemmEX/
./compile.sh
./bench_cuBLAS_INT8.py
Running ./3_8a_QGTC_GEMM_INT8.py, you will get results like this:
======== 1 bit ==================
X1_height 1024, X1_width: 1024, X2_width: 16, TFLOPs: 5.847
X1_height 2048, X1_width: 2048, X2_width: 16, TFLOPs: 16.605
X1_height 4096, X1_width: 4096, X2_width: 16, TFLOPs: 40.627
X1_height 1024, X1_width: 1024, X2_width: 32, TFLOPs: 11.724
X1_height 2048, X1_width: 2048, X2_width: 32, TFLOPs: 32.666
X1_height 4096, X1_width: 4096, X2_width: 32, TFLOPs: 35.032
X1_height 1024, X1_width: 1024, X2_width: 64, TFLOPs: 23.219
X1_height 2048, X1_width: 2048, X2_width: 64, TFLOPs: 37.438
X1_height 4096, X1_width: 4096, X2_width: 64, TFLOPs: 46.768
======== 2 bit ==================
X1_height 1024, X1_width: 1024, X2_width: 16, TFLOPs: 3.934
X1_height 2048, X1_width: 2048, X2_width: 16, TFLOPs: 10.086
X1_height 4096, X1_width: 4096, X2_width: 16, TFLOPs: 20.764
X1_height 1024, X1_width: 1024, X2_width: 32, TFLOPs: 7.864
X1_height 2048, X1_width: 2048, X2_width: 32, TFLOPs: 19.762
X1_height 4096, X1_width: 4096, X2_width: 32, TFLOPs: 20.951
X1_height 1024, X1_width: 1024, X2_width: 64, TFLOPs: 15.429
X1_height 2048, X1_width: 2048, X2_width: 64, TFLOPs: 25.055
X1_height 4096, X1_width: 4096, X2_width: 64, TFLOPs: 26.818
======== 4 bit ==================
X1_height 1024, X1_width: 1024, X2_width: 16, TFLOPs: 2.488
X1_height 2048, X1_width: 2048, X2_width: 16, TFLOPs: 6.561
X1_height 4096, X1_width: 4096, X2_width: 16, TFLOPs: 12.409
X1_height 1024, X1_width: 1024, X2_width: 32, TFLOPs: 4.456
X1_height 2048, X1_width: 2048, X2_width: 32, TFLOPs: 12.807
X1_height 4096, X1_width: 4096, X2_width: 32, TFLOPs: 13.929
X1_height 1024, X1_width: 1024, X2_width: 64, TFLOPs: 10.683
X1_height 2048, X1_width: 2048, X2_width: 64, TFLOPs: 12.328
X1_height 4096, X1_width: 4096, X2_width: 64, TFLOPs: 14.196
======== 8 bit ==================
X1_height 1024, X1_width: 1024, X2_width: 16, TFLOPs: 1.541
X1_height 2048, X1_width: 2048, X2_width: 16, TFLOPs: 3.483
X1_height 4096, X1_width: 4096, X2_width: 16, TFLOPs: 6.763
X1_height 1024, X1_width: 1024, X2_width: 32, TFLOPs: 3.074
X1_height 2048, X1_width: 2048, X2_width: 32, TFLOPs: 6.816
X1_height 4096, X1_width: 4096, X2_width: 32, TFLOPs: 7.366
X1_height 1024, X1_width: 1024, X2_width: 64, TFLOPs: 5.046
X1_height 2048, X1_width: 2048, X2_width: 64, TFLOPs: 6.165
X1_height 4096, X1_width: 4096, X2_width: 64, TFLOPs: 7.324
Running ./bench_cuBLAS_INT8.py, you will get results like this:
M: 1024, K: 1024, N: 16, TFLOPS: 0.55
M: 2048, K: 2048, N: 16, TFLOPS: 2.58
M: 4096, K: 4096, N: 16, TFLOPS: 3.60
M: 1024, K: 1024, N: 32, TFLOPS: 3.89
M: 2048, K: 2048, N: 32, TFLOPS: 5.49
M: 4096, K: 4096, N: 32, TFLOPS: 6.49
M: 1024, K: 1024, N: 64, TFLOPS: 4.38
M: 2048, K: 2048, N: 64, TFLOPS: 6.30
M: 4096, K: 4096, N: 64, TFLOPS: 6.65
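Both reports express throughput as TFLOPs derived from the GEMM dimensions and the measured kernel time (2*M*K*N operations per GEMM). A minimal sketch of that bookkeeping, using a plain FP16 matmul as a stand-in for the low-bit kernels benchmarked above:

```python
import torch

def gemm_tflops(M: int, K: int, N: int, iters: int = 100) -> float:
    """Rough TFLOPS estimate for an (M x K) @ (K x N) GEMM on the GPU.
    Uses an FP16 matmul as a stand-in; the scripts above time low-bit kernels."""
    a = torch.randn(M, K, device="cuda", dtype=torch.half)
    b = torch.randn(K, N, device="cuda", dtype=torch.half)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()
    secs = start.elapsed_time(end) / 1e3 / iters   # elapsed_time() returns milliseconds
    return 2 * M * K * N / secs / 1e12             # 2*M*K*N FLOPs per GEMM

print(gemm_tflops(4096, 4096, 64))
```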
- (b) Zero-tile jumping efficiency.
./3_8_zero_tile_jumping.py
Check the results in zerotile_jumping.csv.
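Zero-tile jumping exploits the sparsity of the adjacency matrix by skipping Tensor Core tiles that contain no nonzero entries. The fraction of such skippable tiles, which is what this experiment measures, can be estimated offline as sketched below (an illustration of the idea; the 8x8 tile size is an assumption, and the actual skipping happens inside the kernels):

```python
import torch
import torch.nn.functional as F

def zero_tile_fraction(adj: torch.Tensor, tile: int = 8) -> float:
    """Fraction of (tile x tile) blocks of a dense 0/1 adjacency matrix that are
    all-zero and could therefore be skipped. The 8x8 tile size is an assumption."""
    n, m = adj.shape
    adj = F.pad(adj, (0, (tile - m % tile) % tile, 0, (tile - n % tile) % tile))
    blocks = adj.unfold(0, tile, tile).unfold(1, tile, tile)  # [nb0, nb1, tile, tile]
    nnz_per_block = blocks.reshape(blocks.shape[0], blocks.shape[1], -1).sum(-1)
    return (nnz_per_block == 0).float().mean().item()

# Toy example: a random graph with ~0.1% edge density.
adj = (torch.rand(1024, 1024) < 0.001).float()
print(f"skippable zero tiles: {zero_tile_fraction(adj):.2%}")
```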
- (c) Adjacency matrix size impact.
./3_9_adjmatrix_size.py
You will get results like this:
X1_height 1024, X1_width: 1024, X2_width: 16, TFLOPs: 5.831
X1_height 2048, X1_width: 2048, X2_width: 16, TFLOPs: 16.323
X1_height 4096, X1_width: 4096, X2_width: 16, TFLOPs: 34.425
X1_height 1024, X1_width: 1024, X2_width: 32, TFLOPs: 11.717
X1_height 2048, X1_width: 2048, X2_width: 32, TFLOPs: 32.027
X1_height 4096, X1_width: 4096, X2_width: 32, TFLOPs: 40.175
X1_height 1024, X1_width: 1024, X2_width: 64, TFLOPs: 23.158
X1_height 2048, X1_width: 2048, X2_width: 64, TFLOPs: 37.444
X1_height 4096, X1_width: 4096, X2_width: 64, TFLOPs: 46.759
X1_height 1024, X1_width: 1024, X2_width: 128, TFLOPs: 28.417
X1_height 2048, X1_width: 2048, X2_width: 128, TFLOPs: 40.646
X1_height 4096, X1_width: 4096, X2_width: 128, TFLOPs: 52.517
X1_height 1024, X1_width: 1024, X2_width: 256, TFLOPs: 32.089
X1_height 2048, X1_width: 2048, X2_width: 256, TFLOPs: 44.151
X1_height 4096, X1_width: 4096, X2_width: 256, TFLOPs: 59.508
X1_height 1024, X1_width: 1024, X2_width: 512, TFLOPs: 41.743
X1_height 2048, X1_width: 2048, X2_width: 512, TFLOPs: 49.687
X1_height 4096, X1_width: 4096, X2_width: 512, TFLOPs: 64.172
X1_height 1024, X1_width: 1024, X2_width: 1024, TFLOPs: 37.954
X1_height 2048, X1_width: 2048, X2_width: 1024, TFLOPs: 52.970
X1_height 4096, X1_width: 4096, X2_width: 1024, TFLOPs: 66.490