Introduction

This code implements fast CUDA kernels for DNN inference, especially for the convolution layers / residual blocks in ResNet. Specifically, each kernel fuses three operations into a single kernel launch: convolution, batch normalization, and ReLU activation.
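
The point of the fusion is that each output element is written to global memory only once: the batch-normalization scale/shift and the ReLU are applied in registers right after the convolution result is produced. The snippet below is a minimal sketch of such an epilogue, assuming precomputed per-channel BN statistics; the function and parameter names are illustrative, not identifiers from this repository.

// Minimal sketch of a fused BN + ReLU epilogue (illustrative only, not the
// repo's actual kernel). conv_out is one convolution output value; the
// per-channel statistics (mean, var) and affine parameters (gamma, beta)
// are assumed to be precomputed and passed in.
__device__ __forceinline__ float bn_relu_epilogue(float conv_out,
                                                  float gamma, float beta,
                                                  float mean, float var,
                                                  float eps) {
    // Batch normalization using the precomputed per-channel statistics.
    float normalized = (conv_out - mean) * rsqrtf(var + eps);
    float scaled = gamma * normalized + beta;
    // ReLU activation, applied before the single store to global memory.
    return fmaxf(scaled, 0.0f);
}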

For implementation details, please refer to the technical report included in this repo. The Winograd algorithm is used for the 3 * 3 convolution kernels.
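
For intuition, the snippet below shows the 1-D Winograd building block F(2, 3), which produces two outputs of a 3-tap filter with 4 multiplications instead of 6; the 2-D F(2x2, 3x3) variant commonly used for 3 * 3 convolutions nests this transform in both dimensions. It is only an illustration of the algorithm, not code taken from this repository's kernels.

// 1-D Winograd F(2, 3): compute y[0] = d0*g0 + d1*g1 + d2*g2 and
// y[1] = d1*g0 + d2*g1 + d3*g2 using 4 multiplications (illustrative only).
__host__ __device__ void winograd_f2_3(const float d[4], const float g[3],
                                       float y[2]) {
    // Filter transform (in practice precomputed once per filter).
    float u0 = g[0];
    float u1 = 0.5f * (g[0] + g[1] + g[2]);
    float u2 = 0.5f * (g[0] - g[1] + g[2]);
    float u3 = g[2];
    // Elementwise products in the transformed domain.
    float m0 = (d[0] - d[2]) * u0;
    float m1 = (d[1] + d[2]) * u1;
    float m2 = (d[2] - d[1]) * u2;
    float m3 = (d[1] - d[3]) * u3;
    // Inverse (output) transform.
    y[0] = m0 + m1 + m2;
    y[1] = m1 - m2 - m3;
}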

Usage

mkdir data                  # create the directory the generated data is written to
python data_generator.py    # generate the test data
make                        # build the test binary
./Test 0                    # run the tests

Results

3 * 3 Kernels

Kernels       Operations              128 / 128    256 / 256
Cudnn         Gemm + BN + ReLU        214 us       384 us
Cudnn         Winograd + BN + ReLU    95 us        155 us
Our Kernel    Winograd + BN + ReLU    59 us        117 us

1 * 1 Kernels

Kernels       512 / 128           128 / 512    1024 / 256          256 / 1024
Operations    Gemm + BN + ReLU    Gemm + BN    Gemm + BN + ReLU    Gemm + BN + ReLU
Cudnn         119 us              115 us       219 us              214 us
Our Kernel    58 us               55 us        186 us              181 us