# FastGEMV
This repository provides a collection of kernel functions that enable high-speed computation of GEMV (matrix-vector multiplication).
We have implemented and benchmarked the following scenarios:
- matrix: fp16, vector: fp16;
- matrix: int8 (quantized with fp16 scale/zero point), vector: fp16;
- matrix: int4 (quantized with fp16 scale/zero point), vector: fp16.
The matrix and vector sizes range from 512 to 16384.
On P100 GPUs, we achieved a maximum speedup of 2.7x compared to the PyTorch baseline. On 3090 Ti GPUs, we achieved a maximum speedup of 1.4x.
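For the two quantized scenarios above, each stored integer must be converted back to fp16 before it enters the dot product. The snippet below is only a minimal sketch of one plausible affine dequantization; the function names and the exact formula are assumptions for illustration, and the real scheme is defined by the kernels in this repository.

```cuda
#include <cuda_fp16.h>
#include <stdint.h>

// Sketch of affine dequantization for the quantized paths. The form
// below (value = scale * q + zero_point) is an assumption; consult
// this repository's kernels for the exact formula.
__device__ __forceinline__ half dequant_int8(int8_t q, half scale, half zero_point) {
    return __hadd(__hmul(scale, __int2half_rn(q)), zero_point);
}

// int4: two weights are packed per byte, so each byte yields two values.
__device__ __forceinline__ half dequant_int4_low(uint8_t packed, half scale, half zero_point) {
    int q = packed & 0x0F;  // low nibble; the signedness of the packing is assumed
    return __hadd(__hmul(scale, __int2half_rn(q)), zero_point);
}
```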
## Requirements
```bash
sudo apt install -y cuda-11-7 nsight-systems-2023.1.2 nsight-compute-2023.1.1
```
## Usage
### Running the Baseline (PyTorch)
Ensure that PyTorch is correctly installed.
```bash
# using Nsight
nsys profile --stats=true --force-overwrite true -o <report_name> python baseline.py -size <size>
```
For the baseline results, please refer to here.
### Running FastGEMV (this repository)
```bash
make
./gemv [-s <size> -x <blockDim.x> -y <blockDim.y> -i <num_iterations> -b <bits_per_data> -u <scale> -v <zero_point>]

# if using Nsight, the following command will generate a detailed report of each function / kernel
nsys profile --stats=true --force-overwrite true -o <report_name> ./gemv \
[-s <size> -x <blockDim.x> -y <blockDim.y> -i <num_iterations> -b <bits_per_data> -u <scale> -v <zero_point>]
```
- `size`: Should be a power-of-2 integer, e.g. 512, 1024, 2048, ... For example, when `size = 512`, the matrix is 512 × 512 and the vector is 512 × 1.
- `blockDim.x` and `blockDim.y`: The block dimensions used when launching the kernel. Both should be power-of-2 integers.
- `num_iterations`: Number of iterations to perform the test. When profiling the runtime with Nsight, a large iteration count provides more precise values.
- `bits_per_data`: Set to 16 for an fp16 matrix, 8 for an int8 matrix, and 4 for an int4 matrix. The vector is always in fp16.
- `scale` and `zero_point`: Only applicable for the quantized versions. Default values are 0.0625 for `scale` and 0.01 for `zero_point`.
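For instance, a run of the int8 quantized version with the default scale and zero point spelled out explicitly might look like this (the size and block dimensions here are arbitrary illustrative choices):

```bash
./gemv -s 4096 -x 32 -y 8 -i 1000 -b 8 -u 0.0625 -v 0.01
```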
Other constraints:

- `blockDim.x * blockDim.y` should be less than or equal to the maximum number of threads in a block (e.g., 1024).
- `blockDim.y` should be less than or equal to 64, based on the size of the shared memory used in the kernel.
- For fp16 or int8, `size / blockDim.x` should be greater than or equal to 8. For int4, `size / blockDim.x` should be greater than or equal to 16.
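The last constraint is consistent with vectorized memory access: if each thread reads 8 fp16 values per step as a single 128-bit load (and 16 int4 values, since two pack into each byte), every thread needs at least one full vector of the row. The sketch below illustrates that assumed layout; it is not the repository's exact kernel code.

```cuda
#include <cuda_fp16.h>

// Sketch: one thread loads 8 consecutive fp16 values with a single
// 128-bit transaction, which is why size / blockDim.x must be >= 8
// for fp16 (and >= 16 for int4). This layout is an assumption, not
// the repository's exact code.
__device__ float partial_dot_fp16(const half* mat_row, const half* vec, unsigned int n) {
    float sum = 0.0f;
    // Each thread strides over the row in chunks of 8 halfs.
    for (unsigned int i = threadIdx.x * 8; i < n; i += blockDim.x * 8) {
        const float4 m4 = *reinterpret_cast<const float4*>(mat_row + i);
        const float4 v4 = *reinterpret_cast<const float4*>(vec + i);
        const half2* m = reinterpret_cast<const half2*>(&m4);
        const half2* v = reinterpret_cast<const half2*>(&v4);
        #pragma unroll
        for (int j = 0; j < 4; ++j) {
            float2 mf = __half22float2(m[j]);
            float2 vf = __half22float2(v[j]);
            sum += mf.x * vf.x + mf.y * vf.y;
        }
    }
    return sum;  // to be reduced across the block afterwards
}
```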
Example:

```bash
./gemv -s 16384 -x 32 -y 8 -i 10000
```

The above command runs GEMV with a 16k × 16k matrix and a 16k × 1 vector for 10000 iterations, using the following parameters:

- `blockDim` is set to (32, 8).
- `bits_per_data` is set to the default value of 16, indicating that both the matrix and vector are in fp16.
- No scale or zero_point values are required.

These values satisfy the constraints above: 32 × 8 = 256 ≤ 1024, 8 ≤ 64, and 16384 / 32 = 512 ≥ 8.
## Workflow

When running the `./gemv` program, it first generates the matrix and vector data based on the size and bits specified by the user. All data is generated using `curand`. The program then performs GEMV computations for `num_iterations` rounds, using the `blockDim` and `gridDim` derived from the user input. Finally, the program verifies the correctness of the results: if any errors are found, it prints the incorrect indexes and values; if the test passes, no indexes are printed.

Users can try different `blockDim` parameters to find the best combination for their settings and hardware.
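As a rough, self-contained illustration of that generate → compute → verify loop, here is a naive sketch (one thread per output row, plain `rand()` instead of `curand`, hypothetical names throughout); the repository's actual kernels are tiled by `blockDim` and far faster:

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <cmath>

// Illustrative stand-in for the repository's fp16 kernel: one thread
// per output row, no vectorized loads. This sketch only mirrors the
// overall generate -> compute -> verify workflow described above.
__global__ void naive_gemv(const half* mat, const half* vec, half* out, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < n; ++k)
        acc += __half2float(mat[row * n + k]) * __half2float(vec[k]);
    out[row] = __float2half(acc);
}

int main() {
    const int n = 512;
    half *mat, *vec, *out;
    cudaMallocManaged(&mat, sizeof(half) * n * n);
    cudaMallocManaged(&vec, sizeof(half) * n);
    cudaMallocManaged(&out, sizeof(half) * n);

    // The repository uses curand; plain rand() keeps this sketch short.
    for (int i = 0; i < n * n; ++i) mat[i] = __float2half(rand() / (float)RAND_MAX);
    for (int i = 0; i < n; ++i) vec[i] = __float2half(rand() / (float)RAND_MAX);

    naive_gemv<<<(n + 255) / 256, 256>>>(mat, vec, out, n);
    cudaDeviceSynchronize();

    // Verify against a CPU reference, printing any incorrect indexes
    // and values, as the ./gemv program does.
    for (int row = 0; row < n; ++row) {
        float ref = 0.0f;
        for (int k = 0; k < n; ++k)
            ref += __half2float(mat[row * n + k]) * __half2float(vec[k]);
        float got = __half2float(out[row]);
        if (fabsf(got - ref) > 0.5f)  // loose fp16 tolerance
            printf("mismatch at %d: got %f, expected %f\n", row, got, ref);
    }
    cudaFree(mat); cudaFree(vec); cudaFree(out);
    return 0;
}
```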
## Optimization Strategy and Results
Please refer to here for more details.