Awesome

introduction

A simple high performance CUDA GEMM, Block Sparse GEMM and Non-uniform Quantized GEMM implementation.

C = alpha * A * B + beta * C

algorithm

located in src/cuda/

MatrixMulCUDA
- one element of C is assigned one thread
- global memory coalesce of B
MatrixMulCUDA1
- texture load
MatrixMulCUDA2
- one 4 * 4 grid of C is assigned one thread
MatrixMulCUDA3
- vectorized A B load
MatrixMulCUDA4
- vectorized C store
MatrixMulCUDA5
- block sparse version
MatrixMulCUDA6
- vectorized A B load coalesce
MatrixMulCUDA7
- warp shuffle to enable C store coalesce
MatrixMulCUDAQuantize8bit
- 8 bit non-uniform quantized matmul

experiments

located in benchmark/

benchmark_dense
- Compare My Gemm with Cublas
benchmark_sparse
- Compare My block sparse Gemm with Cusparse
benchmark_quantization_8bit
- Compare My Gemm with Cublas
benchmark_quantization
- Compare My Gemm with My quantized non-uniform 8 bit Gemm

TODO

(MatrixMulCUDA7) write back to C matrix, warp shuffle to enable global memory coalesce
(MatrixMulCUDA8) double buffering

run

mkdir builds
make benchmark_[experiment name]
bash scripts/benchmark_[experiment name].sh

Note

sparsity约为1%的时候, cusparse的性能可以超越cublas
合理分配寄存器尽可能让参数在编译器确定节省计算资源和寄存器数目