
APNN-TC: Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Cores

[Paper on arXiv]

@inproceedings{APNN-TC,
  title={APNN-TC: Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Cores},
  author={Boyuan Feng and Yuke Wang and Tong Geng and Ang Li and Yufei Ding},
  booktitle={The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'21)},
  year={2021}
}

Clone this project.

git clone --recursive git@github.com:BoyuanFeng/APNN-TC.git
cd APNN-TC && git checkout main

In case `--recursive` was omitted during the clone, fetch the submodules manually:

git submodule init
git submodule update

OS & Compiler: a Linux host with an NVIDIA Ampere GPU; CUDA 11.0 or later is required for Ampere tensor-core support.

Files & Directory

Setup Environment.

curl https://get.docker.com | sh \
  && sudo systemctl --now enable docker

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
cd Docker/
build.sh
launch.sh

Or pull the Docker image from Docker Hub and launch it:

docker pull happy233/apnn-tc:main
docker run -it --rm --gpus all -v $(pwd):/apnn-tc happy233/apnn-tc:main /bin/bash

Experiments

APNN-TC -- GEMM and CONV kernel


CUTLASS -- GEMM kernel

Select the precision of the CUTLASS GEMM baseline by toggling the `BIT_WIDTH` macro (only one line active at a time):

// #define BIT_WIDTH 1
#define BIT_WIDTH 4

CUTLASS -- CONV kernel

The CONV baseline uses the same `BIT_WIDTH` switch:

// #define BIT_WIDTH 1
#define BIT_WIDTH 4

APNN-TC -- NN model

CUTLASS -- NN model

The NN-model baseline supports 32-, 16-, and 8-bit precision via the same macro:

#define BIT_WIDTH 32
// #define BIT_WIDTH 16
// #define BIT_WIDTH 8
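All of the CUTLASS variants above are configured the same way: exactly one `#define BIT_WIDTH` line is active at a time. A small helper (ours, not part of the repo; the target file name is whatever source file holds the macro) can toggle it from the command line before rebuilding:

```shell
# Hypothetical helper: rewrite the active BIT_WIDTH macro in a source file.
# $1 = new bit width (e.g. 4, 8, 16, 32), $2 = file containing the macro.
set_bit_width () {
  sed -i "s/^#define BIT_WIDTH .*/#define BIT_WIDTH $1/" "$2"
}
```

For example, `set_bit_width 8 main.cu` (file name is an assumption) before recompiling. GNU sed is assumed; on BSD/macOS use `sed -i ''`.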

Expected Result.

APNN-TC vs CUTLASS on GEMM kernel.

CUTLASS-GEMM (4-bit). M:     64, N:    128, K:    128,   Time (ms): 0.01, TOPS: 0.35
CUTLASS-GEMM (4-bit). M:     64, N:    256, K:    256,   Time (ms): 0.01, TOPS: 1.14
CUTLASS-GEMM (4-bit). M:     64, N:    384, K:    384,   Time (ms): 0.01, TOPS: 2.21
CUTLASS-GEMM (4-bit). M:     64, N:    512, K:    512,   Time (ms): 0.01, TOPS: 3.43
CUTLASS-GEMM (4-bit). M:     64, N:    640, K:    640,   Time (ms): 0.01, TOPS: 4.77
CUTLASS-GEMM (4-bit). M:     64, N:    768, K:    768,   Time (ms): 0.01, TOPS: 6.20
CUTLASS-GEMM (4-bit). M:     64, N:    896, K:    896,   Time (ms): 0.01, TOPS: 7.67
CUTLASS-GEMM (4-bit). M:     64, N:   1024, K:   1024,   Time (ms): 0.01, TOPS: 9.18
V30, 64x64. M_GLOBAL: 64, N_GLOBAL: 128, K_GLOBAL: 128, X_BIT: 2, W_BIT: 1, Time: 0.004708 ms, TOPS: 0.45
V30, 64x64. M_GLOBAL: 64, N_GLOBAL: 256, K_GLOBAL: 256, X_BIT: 2, W_BIT: 1, Time: 0.004964 ms, TOPS: 1.69
V30, 64x64. M_GLOBAL: 64, N_GLOBAL: 384, K_GLOBAL: 384, X_BIT: 2, W_BIT: 1, Time: 0.005370 ms, TOPS: 3.52
V30, 64x64. M_GLOBAL: 64, N_GLOBAL: 512, K_GLOBAL: 512, X_BIT: 2, W_BIT: 1, Time: 0.005512 ms, TOPS: 6.09
V30, 64x64. M_GLOBAL: 64, N_GLOBAL: 640, K_GLOBAL: 640, X_BIT: 2, W_BIT: 1, Time: 0.006140 ms, TOPS: 8.54
V30, 64x64. M_GLOBAL: 64, N_GLOBAL: 768, K_GLOBAL: 768, X_BIT: 2, W_BIT: 1, Time: 0.006171 ms, TOPS: 12.23
V30, 64x64. M_GLOBAL: 64, N_GLOBAL: 896, K_GLOBAL: 896, X_BIT: 2, W_BIT: 1, Time: 0.006805 ms, TOPS: 15.10
V30, 64x64. M_GLOBAL: 64, N_GLOBAL: 1024, K_GLOBAL: 1024, X_BIT: 2, W_BIT: 1, Time: 0.007194 ms, TOPS: 18.66
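As a sanity check, the TOPS column in both logs is just the GEMM operation count (2·M·N·K) divided by the measured time. A quick sketch of that arithmetic (ours, not part of the repo):

```shell
# Reproduce the TOPS column: a GEMM performs 2*M*N*K operations,
# and the logs above report time in milliseconds.
tops () {
  awk -v m="$1" -v n="$2" -v k="$3" -v ms="$4" \
      'BEGIN { printf "%.2f\n", 2*m*n*k / (ms * 1e-3) / 1e12 }'
}

tops 64 1024 1024 0.007194   # prints 18.66, matching the last APNN-TC row
```

The CUTLASS rows do not reproduce exactly from their printed times because those are rounded to two decimals.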

APNN-TC vs CUTLASS on CONV kernel.

Precision,      Layer,  N,      H,      W,      C,      K,      R,      S,      Runtime,        TFLOPs
BIT_WIDTH-4,    conv_1, 1,      16,     16,     128,    128,    3,      3,      0.0144896,      5.21046
BIT_WIDTH-4,    conv_2, 1,      16,     16,     256,    256,    3,      3,      0.02304,        13.1072
BIT_WIDTH-4,    conv_3, 1,      16,     16,     384,    384,    3,      3,      0.031592,       21.5079
BIT_WIDTH-4,    conv_4, 1,      16,     16,     512,    512,    3,      3,      0.0401408,      30.0931
BIT_WIDTH-4,    conv_5, 1,      16,     16,     640,    640,    3,      3,      0.04864,        38.8042
BIT_WIDTH-4,    conv_6, 1,      16,     16,     768,    768,    3,      3,      0.0572416,      47.4814
BIT_WIDTH-4,    conv_7, 1,      16,     16,     896,    896,    3,      3,      0.065792,       56.2284
BIT_WIDTH-4,    conv_8, 1,      16,     16,     1024,   1024,   3,      3,      0.0743424,      64.9944
H: 16, W: 16, CIN: 128, COUT: 128, W_BIT: 1, X_BIT: 2, Time: 0.006213 ms, TOPS: 12.15
H: 16, W: 16, CIN: 256, COUT: 256, W_BIT: 1, X_BIT: 2, Time: 0.008126 ms, TOPS: 37.16
H: 16, W: 16, CIN: 384, COUT: 384, W_BIT: 1, X_BIT: 2, Time: 0.010251 ms, TOPS: 66.29
H: 16, W: 16, CIN: 512, COUT: 512, W_BIT: 1, X_BIT: 2, Time: 0.010370 ms, TOPS: 116.48
H: 16, W: 16, CIN: 640, COUT: 640, W_BIT: 1, X_BIT: 2, Time: 0.013166 ms, TOPS: 143.35
H: 16, W: 16, CIN: 768, COUT: 768, W_BIT: 1, X_BIT: 2, Time: 0.024899 ms, TOPS: 109.16
H: 16, W: 16, CIN: 896, COUT: 896, W_BIT: 1, X_BIT: 2, Time: 0.028499 ms, TOPS: 129.81
H: 16, W: 16, CIN: 1024, COUT: 1024, W_BIT: 1, X_BIT: 2, Time: 0.025389 ms, TOPS: 190.31
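Likewise, the TFLOPs/TOPS column for the CONV kernels follows from the convolution operation count. Assuming stride-1, same-padded convolutions (so the output keeps the H×W footprint — consistent with the numbers above), a sketch of the check (ours, not from the repo):

```shell
# Operation count of a same-padded, stride-1 convolution:
# 2 * N * H * W * COUT * CIN * R * S, with time in milliseconds.
conv_tops () {
  awk -v n="$1" -v h="$2" -v w="$3" -v cin="$4" -v cout="$5" \
      -v r="$6" -v s="$7" -v ms="$8" \
      'BEGIN { printf "%.2f\n", 2*n*h*w*cout*cin*r*s / (ms * 1e-3) / 1e12 }'
}

conv_tops 1 16 16 1024 1024 3 3 0.0743424   # prints 64.99, the conv_8 row
conv_tops 1 16 16 1024 1024 3 3 0.025389    # prints 190.31, the last APNN-TC row
```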

APNN-TC vs CUTLASS on NN model.

|                | AlexNet (ms) | VGG (ms) |
|----------------|--------------|----------|
| cutlass-32     | 4.26         | 25.22    |
| cutlass-16     | 3.79         | 24.19    |
| APNN-TC-w1a2   | 0.36         | 1.66     |
| Speedup (FP32) | 11.71x       | 15.24x   |
| Speedup (FP16) | 10.40x       | 14.62x   |
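The speedup rows are the ratio of the CUTLASS latency to the APNN-TC latency. Recomputing them from the rounded table values gives slightly different numbers (e.g. 11.83x rather than 11.71x), so the table was presumably derived from the unrounded timings:

```shell
# Speedup = CUTLASS latency / APNN-TC latency, from the rounded table values.
awk 'BEGIN { printf "%.2f\n", 4.26 / 0.36 }'   # prints 11.83 (table: 11.71x)
```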

Observations.

| Precision | CUTLASS-VGG-variant-b256 (ms) |
|-----------|-------------------------------|
| FP32      | 628.254                       |
| FP16      | 540.707                       |
| INT8      | 368.626                       |

[Updated] BNN for NN model.

|         | Current | Table-2 |
|---------|---------|---------|
| AlexNet | 0.631   | 0.69    |
| VGG     | 2.233   | 2.17    |
| ResNet  | 0.733   | 0.68    |

Note that the BNN-based NN model in our paper submission adopts the design of TCBNN (from TPDS'20), the state-of-the-art BNN implementation on GPU tensor cores; this matches the numbers in Table-2.

[Updated] APNN-TC NN model layer-wise latency breakdown.

Layer configurations (Conv layers: H, W, CIN, COUT, R, S; FC layers: IN, OUT):

Conv1, 224, 224, 3, 64, 11, 11
Conv2, 28, 28, 64, 192, 5, 5
Conv3, 14, 14, 192, 384, 3, 3
Conv4, 14, 14, 384, 256, 3, 3
Conv5, 14, 14, 256, 256, 3, 3
Fc1, 12544, 4096
Fc2, 4096, 4096
Fout, 4096, 1000

==============
AlexNet (ms): 0.372
AlexNet Layer-0 (ms): 0.241
AlexNet Layer-1 (ms): 0.018
AlexNet Layer-2 (ms): 0.003
AlexNet Layer-3 (ms): 0.046
AlexNet Layer-4 (ms): 0.023
AlexNet Layer-5 (ms): 0.010
AlexNet Layer-6 (ms): 0.007
AlexNet Layer-7 (ms): 0.009
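Summing the per-layer times gives slightly less than the end-to-end figure; the remaining ~0.015 ms is presumably kernel-launch and glue overhead. A quick check (ours, not part of the repo):

```shell
# Sum of the AlexNet per-layer latencies reported above (ms).
awk 'BEGIN { printf "%.3f\n", 0.241+0.018+0.003+0.046+0.023+0.010+0.007+0.009 }'
# prints 0.357, vs. 0.372 ms end-to-end
```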