QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference

<p align="middle"> <a href="https://github.com/Squeezebits/QUICK/releases"><img alt="GitHub - Releases" src="https://img.shields.io/github/release/Squeezebits/QUICK.svg"/></a> <a href="https://arxiv.org/abs/2402.10076"><img src="https://img.shields.io/badge/arXiv-2402.10076-b31b1b.svg" alt="arXiv"/></a> </p>

Introducing QUICK, a collection of novel optimized CUDA kernels designed for faster inference of quantized Large Language Models (LLMs). QUICK addresses the shared memory write-back bank conflict issue in state-of-the-art mixed precision General Matrix Multiplication (GEMM) kernels.

Figure: Computation overview of the original kernel and QUICK

🏎️ Why QUICK?

📝 About QUICK

QUICK eliminates shared memory write-back bank conflicts introduced in previous mixed precision GEMM kernels.

The bank conflicts arise when the dequantized weights are written back to shared memory for subsequent computations. These conflicts induce a significant number of stalls, which degrades the overall throughput of mixed precision GEMM, especially for workloads with large batches.
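To make the stall mechanism concrete, the sketch below counts how many serialized passes one warp-wide shared memory access needs. It assumes the usual NVIDIA layout of 32 banks that each serve one 4-byte word per cycle; the function name `conflict_degree` and the two access patterns are illustrative and are not taken from the QUICK or AWQ kernels.

```python
from collections import defaultdict

NUM_BANKS = 32   # assumption: 32 shared memory banks
WORD_BYTES = 4   # assumption: each bank serves one 4-byte word per cycle

def conflict_degree(byte_addresses):
    """Return the number of serialized passes for one warp-wide access.

    Accesses to the same word do not conflict, so only distinct words
    that map to the same bank are counted.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // WORD_BYTES
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())

# 32 threads touching consecutive 4-byte words: one word per bank.
unit_stride = [t * WORD_BYTES for t in range(32)]
# 32 threads touching words 32 apart: every access lands in bank 0.
bank_stride = [t * NUM_BANKS * WORD_BYTES for t in range(32)]

print(conflict_degree(unit_stride))  # 1
print(conflict_degree(bank_stride))  # 32
```

A degree of 1 means the access completes in a single pass; a degree of 32 means the hardware has to replay it 32 times, which is the kind of stall described above.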

QUICK rearranges the quantized weight matrix offline to remove the bank conflicts effectively. This rearrangement aligns with the load and computation pattern of Tensor Cores in NVIDIA GPUs without the need for shared memory write-back.
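The general idea of moving a reordering step from runtime to an offline pass can be sketched in a few lines. The example below packs eight 4-bit weights into one 32-bit word and shows that applying a permutation once before packing gives the same result as unpacking and then shuffling at runtime. The permutation `ORDER` is hypothetical and is not the actual QUICK interleaving pattern, and the runtime shuffle only stands in for the shared memory write-back.

```python
# Hypothetical order in which a consumer wants the eight 4-bit weights of
# one 32-bit word. This is NOT the real QUICK interleaving pattern.
ORDER = [0, 4, 1, 5, 2, 6, 3, 7]

def pack_nibbles(values):
    """Pack eight 4-bit integers into one 32-bit word, slot 0 in the low bits."""
    packed = 0
    for slot, value in enumerate(values):
        packed |= (value & 0xF) << (4 * slot)
    return packed

def unpack_nibbles(packed):
    """Unpack eight 4-bit integers in slot order (the cheap runtime path)."""
    return [(packed >> (4 * slot)) & 0xF for slot in range(8)]

weights = [3, 14, 7, 0, 9, 1, 12, 5]

# Baseline: pack in natural order, then shuffle after unpacking at runtime.
natural = unpack_nibbles(pack_nibbles(weights))
runtime_shuffled = [natural[i] for i in ORDER]

# Offline rearrangement: permute once before packing, so the runtime path
# is a plain unpack with no extra shuffle step.
offline_packed = pack_nibbles([weights[i] for i in ORDER])
no_shuffle = unpack_nibbles(offline_packed)

assert runtime_shuffled == no_shuffle
print(no_shuffle)  # [3, 9, 14, 1, 7, 12, 0, 5]
```

Because the weights are fixed after quantization, the permutation costs nothing at inference time.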

🚀 Install

📖 Prerequisites

🏗️ Build from source

Install directly from GitHub:
pip install git+https://github.com/SqueezeBits/QUICK

Or clone the repository and install it in editable mode:
git clone https://github.com/SqueezeBits/QUICK
cd QUICK
pip install -e .

🔍 Usage

  1. Quantization: Perform AWQ with our kernel (QUICK) or the original AWQ kernel (GEMM).
python examples/basic_quant.py --model_path </path/to/hf-model> --quant_path </path/to/save/quant-model> --version <QUICK or GEMM>
  2. Evaluation: Evaluate the quantized model on several tasks (we tested on the 'wikitext' dataset).
python examples/eval.py --model_path </path/to/quant-model> --tasks <tasks_to_evaluate>
  3. Benchmark: Reproduce the end-to-end benchmark results reported below on your own machine.
python examples/benchmark.py --model_path </path/to/quant-model> --batch_size N
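To sweep several batch sizes with the benchmark script, a small wrapper along these lines may help. It only builds the command shown above; the helper name `benchmark_command`, the placeholder model path, and the chosen batch sizes are assumptions for illustration, and the actual launch is left commented out.

```python
import shlex

def benchmark_command(model_path, batch_size):
    """Build the benchmark command from the usage step above as an argv list."""
    return [
        "python", "examples/benchmark.py",
        "--model_path", model_path,
        "--batch_size", str(batch_size),
    ]

# Placeholder path and batch sizes; replace with your own values.
for batch_size in (1, 64, 256):
    cmd = benchmark_command("/path/to/quant-model", batch_size)
    print(shlex.join(cmd))
    # To actually launch the run from the repository root:
    # import subprocess; subprocess.run(cmd, check=True)
```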

Below is a minimal example that uses auto_awq with QUICK to quantize a model and then run inference with it:

<details> <summary>Quantization & Inference</summary>

Expect this to take 10-15 minutes on smaller 7B models, and around 1 hour for 70B models.

from quick.awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

model_path = 'mistralai/Mistral-7B-v0.1'
quant_path = 'Mistral-7B-v0.1-awq-QUICK'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "QUICK" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

# Convert prompt to tokens
prompt_template = "[INST] {prompt} [/INST]"

# prompt = "Explain quantum physics to a five-year-old using only metaphors."
prompt = "What is the birth year of Albert Einstein? "\
        "and what famous equation is Albert Einstein known for?"

tokens = tokenizer(
    prompt_template.format(prompt=prompt), 
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens, 
    streamer=streamer,
    max_new_tokens=512
)
</details>
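As a rough guide to what the `quant_config` above implies for memory, the sketch below estimates the storage cost per weight. It assumes each group of `q_group_size` weights shares one FP16 scale and, when `zero_point` is enabled, one `w_bit`-wide zero point; the helper name `bits_per_weight` and the round parameter count are hypothetical, and unquantized layers such as embeddings are ignored.

```python
def bits_per_weight(w_bit, q_group_size, zero_point=True, scale_bits=16):
    """Average storage per weight, including per-group scale and zero point.

    Assumption: one FP16 scale and (optionally) one w_bit zero point per group.
    """
    overhead_bits = scale_bits + (w_bit if zero_point else 0)
    return w_bit + overhead_bits / q_group_size

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "QUICK"}

bpw = bits_per_weight(
    quant_config["w_bit"],
    quant_config["q_group_size"],
    zero_point=quant_config["zero_point"],
)

n_params = 7_000_000_000  # hypothetical round number, not an exact model size
quant_gb = n_params * bpw / 8 / 1e9
fp16_gb = n_params * 16 / 8 / 1e9

print(f"{bpw:.5f} bits per weight")                          # 4.15625 bits per weight
print(f"{quant_gb:.2f} GB quantized vs {fp16_gb:.2f} GB FP16")  # 3.64 GB quantized vs 14.00 GB FP16
```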

📊 Benchmarks

These benchmarks highlight the improvement in both single mixed precision GEMM throughput and inference throughput of weight-only quantized LLMs. The results include measurements in Tera Operations per Second (TOPS) for single matrix multiplications across various M sizes, considering a matrix multiplication workload with a shape of (M x K x N), conducted on multiple GPU devices. Additionally, the benchmarks demonstrate the token generation throughput gain of representative weight-only quantized LLMs across diverse GPU devices. In the end-to-end benchmarks, we fixed the prefill length and decode length to 128 each in order to test various batch sizes on single GPUs.
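For reference, a throughput figure for a single (M x K x N) matrix multiplication can be derived from a measured runtime as sketched below. This uses the common convention of counting one multiply and one add per inner-product term; the function name and the timing value are hypothetical and do not reproduce any number from the tables.

```python
def gemm_tera_ops_per_second(m, k, n, seconds):
    """Throughput of one (M x K) @ (K x N) product in tera-operations per second.

    Assumption: 2 * M * K * N operations (one multiply and one add per term).
    """
    operations = 2 * m * k * n
    return operations / seconds / 1e12

# Hypothetical measurement: a 1000 x 1000 x 1000 GEMM that takes 1 ms.
throughput = gemm_tera_ops_per_second(1000, 1000, 1000, 1e-3)
print(f"{throughput:.2f}")  # 2.00
```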

To ensure a fair comparison, we used the same benchmark script as AutoAWQ. Notably, the perplexity of LLMs quantized with QUICK remains consistent with that of AutoAWQ. Benchmark results may vary across GPUs and CPUs, as well as across inference frameworks.
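Perplexity here refers to the standard language modeling metric: the exponential of the mean negative log-likelihood per token. The sketch below shows that definition in isolation; the function name and the toy input are illustrative and are not part of the evaluation script.

```python
import math

def perplexity(token_log_probs):
    """Exponential of the mean negative log-likelihood (natural log)."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Toy check: a model that assigns probability 0.25 to every token
# has a perplexity of 4.
uniform_log_probs = [math.log(0.25)] * 8
print(round(perplexity(uniform_log_probs), 6))  # 4.0
```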

📈 Kernel benchmarks

Kernel Benchmark

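<div align="center">

| Device | Batch Size | FP16 (TFLOPS) | AWQ (TFLOPS) | QUICK (TFLOPS) | Speedup-FP16 | Speedup-AWQ |
|---|---|---|---|---|---|---|
| RTX 4090 | 1 | 0.8 | 3.11 | 3.12 | 290% | 0% |
| RTX 4090 | 64 | 56.4 | 91.39 | 111.99 | 99% | 23% |
| RTX 4090 | 128 | 138.2 | 104.36 | 138.59 | 0% | 33% |
| A6000 | 1 | 0.7 | 2.21 | 2.28 | 226% | 3% |
| A6000 | 64 | 39.4 | 44.85 | 80.42 | 104% | 79% |
| A6000 | 128 | 81.7 | 46.05 | 83.62 | 2% | 82% |
| L40 | 1 | 0.7 | 2.41 | 2.51 | 259% | 4% |
| L40 | 64 | 44.4 | 72.9 | 97.73 | 120% | 34% |
| L40 | 128 | 148 | 64.28 | 107.67 | -27% | 68% |
| A100 | 1 | 1.4 | 2.94 | 3.52 | 151% | 20% |
| A100 | 16 | 22.6 | 34.23 | 48.14 | 113% | 41% |
| A100 | 32 | 46.4 | 48.83 | 69.03 | 49% | 41% |
| A100 | 64 | 91.9 | 57.43 | 76.31 | -17% | 33% |
| A100 | 128 | 157.4 | 58.46 | 94.03 | -40% | 61% |

</div>

"Speedup-FP16" and "Speedup-AWQ" indicate how much faster QUICK is than the FP16 and AWQ kernels, respectively.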

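The speedup columns in the kernel benchmark table are consistent with a relative gain of QUICK over the baseline, rounded to a whole percent. The README does not state the exact formula, so the sketch below is an assumption; it is checked against a few rows of the table.

```python
def speedup_percent(quick, baseline):
    """Relative gain of QUICK over a baseline, as a rounded percentage.

    Assumption: speedup = (QUICK / baseline - 1) * 100.
    """
    return round((quick / baseline - 1) * 100)

# RTX 4090, batch size 1: FP16 0.8, AWQ 3.11, QUICK 3.12
print(speedup_percent(3.12, 0.8))    # 290
print(speedup_percent(3.12, 3.11))   # 0

# A100, batch size 128: FP16 157.4, AWQ 58.46, QUICK 94.03
print(speedup_percent(94.03, 157.4))  # -40
print(speedup_percent(94.03, 58.46))  # 61
```

A negative value means QUICK is slower than that baseline for the given device and batch size.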
📈 End-to-end benchmarks

E2E Benchmark

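<center>

| Model | Device | Batch Size | FP16 (tok/s) | AWQ (tok/s) | QUICK (tok/s) | Speedup-FP16 | Speedup-AWQ |
|---|---|---|---|---|---|---|---|
| Mistral-7B | RTX 4090 | 1 | 52.8 | 154.0 | 137.3 | 160% | -11% |
| Mistral-7B | RTX 4090 | 64 | 2985.6 | 4465.9 | 4539.8 | 52% | 2% |
| Mistral-7B | RTX 4090 | 256 | OOM | 5156.9 | 7316.9 | - | 42% |
| Vicuna-13B | A6000 | 1 | 23.6 | 65.0 | 68.5 | 191% | 5% |
| Vicuna-13B | A6000 | 64 | 1194.0 | 1241.3 | 2023.4 | 69% | 63% |
| Vicuna-13B | A6000 | 256 | OOM | 1332.1 | 2330.2 | - | 75% |
| Llama-2-13B | L40 | 1 | 23.5 | 70.2 | 72.5 | 208% | 3% |
| Llama-2-13B | L40 | 64 | 1315.2 | 1580.4 | 2262.4 | 72% | 43% |
| Llama-2-13B | L40 | 256 | OOM | 1611.3 | 3122.4 | - | 94% |
| Llama-30B | A100 | 1 | 20.0 | 36.7 | 31.1 | 55% | -15% |
| Llama-30B | A100 | 64 | OOM | 695.2 | 1207.9 | - | 74% |
| Llama-30B | A100 | 128 | OOM | 759.4 | 1165.2 | - | 53% |

</center>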

📈 vLLM benchmarks

We are actively working on the integration of QUICK into widely-used LLM frameworks. In this section, we present the throughput benchmark results of our initial version of <a href="https://github.com/vllm-project/vllm">vLLM</a> integrated with QUICK.

<div align="center">

| Model | FP16 (tok/s) | AWQ (tok/s) | QUICK (tok/s) | Speedup-FP16 | Speedup-AWQ |
|---|---|---|---|---|---|
| Vicuna-13B | 985.2 | 1030.4 | 1308.6 | 33% | 27% |
| Llama-2-70B | OOM | 224.3 | 290.2 | - | 29% |

</div>

🙋‍♂️ Frequently Asked Questions

<details> <summary>Inference fails due to the absence of the layernorm kernel.</summary> </details>

📚 Cite

If you find our code or QUICK useful for your research, please consider citing:

@misc{kim2024quick,
      title={QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference}, 
      author={Taesu Kim and Jongho Lee and Daehyun Ahn and Sarang Kim and Jiwoong Choi and Minkyu Kim and Hyungjun Kim},
      year={2024},
      eprint={2402.10076},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}