FastGEMV

This repository provides a collection of kernel functions for high-speed computation of GEMV (general matrix-vector multiplication).
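For orientation, GEMV computes y = A·x, where each output element is the dot product of one matrix row with the vector. The following is a minimal CPU sketch in plain Python, not the repository's CUDA kernels; the name `gemv_reference` is hypothetical and used only for illustration.

```python
def gemv_reference(matrix, vector):
    """Return y = A @ x for a row-major list-of-lists matrix A."""
    if any(len(row) != len(vector) for row in matrix):
        raise ValueError("each matrix row must have len(vector) entries")
    # One dot product per matrix row.
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


A = [[1.0, 2.0], [3.0, 4.0]]
x = [10.0, 1.0]
print(gemv_reference(A, x))  # [12.0, 34.0]
```

Each row is independent of the others, which is the property that makes the operation a natural fit for parallel hardware.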

We have implemented and benchmarked the following scenarios:

The matrix and vector sizes range from 512 to 16384.

On P100 GPUs, we achieved a maximum speedup of 2.7x compared to the PyTorch baseline. On 3090 Ti GPUs, we achieved a maximum speedup of 1.4x.

Requirements

sudo apt install -y cuda-11-7 nsight-systems-2023.1.2 nsight-compute-2023.1.1

Usage

Running the Baseline (PyTorch)

Ensure that PyTorch is correctly installed.

# using Nsight
nsys profile --stats=true --force-overwrite true -o <report_name> python baseline.py -size <size>

For the baseline results, please refer to here.

Running FastGEMV (this repository)

make
./gemv [-s <size> -x <blockDim.x> -y <blockDim.y> -i <num_iterations> -b <bits_per_data> -u <scale> -v <zero_point>]
# if using Nsight, the following command generates a detailed report for each function / kernel
nsys profile --stats=true --force-overwrite true -o <report_name> ./gemv \
    [-s <size> -x <blockDim.x> -y <blockDim.y> -i <num_iterations> -b <bits_per_data> -u <scale> -v <zero_point>]

Other constraints:

Example:

./gemv -s 16384 -x 32 -y 8 -i 10000

The above command runs a GEMV with a 16k × 16k matrix and a 16k × 1 vector for 10000 iterations, using blockDim.x = 32 and blockDim.y = 8.
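As a rough sizing aid for a run like this, the matrix footprint can be estimated from the size and the bits per element. The sketch below assumes densely packed elements with no padding; `matrix_bytes` is a hypothetical helper, and the bit widths shown are illustrative rather than a list of what `-b` accepts.

```python
def matrix_bytes(size, bits_per_data):
    """Bytes needed for a densely packed size x size matrix (assumes no padding)."""
    return size * size * bits_per_data // 8


for bits in (16, 8, 4):
    mib = matrix_bytes(16384, bits) / 2**20
    print(f"{bits}-bit elements: {mib:.0f} MiB")
```

The vector is negligible by comparison: a 16k-element vector at 16 bits occupies 32 KiB.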

Workflow

When the ./gemv program runs, it first generates the matrix and vector data based on the size and bit width specified by the user. All data is generated using cuRAND. The program then runs the GEMV computation for num_iterations iterations, using the blockDim and gridDim derived from the user input. Finally, it verifies the correctness of the result: if any errors are found, it prints the incorrect indexes and values; if the test passes, no indexes are printed.
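The verification step can be pictured as an element-wise comparison against a reference result. The sketch below is an assumption about how such a check might look, not the repository's actual code: `find_mismatches` and its tolerance values are hypothetical.

```python
import math


def find_mismatches(result, expected, rel_tol=1e-2, abs_tol=1e-3):
    """Return (index, got, want) for every element outside the tolerance."""
    mismatches = []
    for i, (got, want) in enumerate(zip(result, expected)):
        if not math.isclose(got, want, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((i, got, want))
    return mismatches


expected = [1.0, 2.0, 3.0]
result = [1.0, 2.5, 3.0001]
# Only index 1 is reported; 3.0001 is within tolerance of 3.0.
for index, got, want in find_mismatches(result, expected):
    print(f"index {index}: got {got}, expected {want}")
```

A tolerance is used instead of exact equality because reduced-precision arithmetic and a different summation order can legitimately change the low-order bits of the result.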

Users can try different blockDim parameters to find the best combinations for different settings and hardware.
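One way to explore the parameter space is to generate a command line per blockDim combination and run them in turn. The sketch below only builds the commands, using the `-s`, `-x`, `-y`, and `-i` flags shown in the usage above. The candidate values and the helper name `sweep_commands` are hypothetical, and the only filter applied is the CUDA limit of 1024 threads per block; any further constraints the repository imposes are not reproduced here.

```python
from itertools import product

MAX_THREADS_PER_BLOCK = 1024  # CUDA limit on threads per block


def sweep_commands(size, iterations, xs=(32, 64, 128, 256), ys=(1, 2, 4, 8)):
    """Build one ./gemv argument list per valid (blockDim.x, blockDim.y) pair."""
    commands = []
    for x, y in product(xs, ys):
        if x * y <= MAX_THREADS_PER_BLOCK:
            commands.append(
                ["./gemv", "-s", str(size), "-x", str(x), "-y", str(y), "-i", str(iterations)]
            )
    return commands


for cmd in sweep_commands(16384, 10000):
    print(" ".join(cmd))
```

Each argument list can be passed to `subprocess.run` directly, or prefixed with the `nsys profile` invocation shown earlier to collect per-kernel timings.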

Optimization Strategy and Results

Please refer to here for more details.