Large Foundation Model Quantization (LMQuant)

LMQuant is an open-source quantization toolbox for large foundation models, based on PyTorch. Models quantized with LMQuant can be deployed with QServe, an efficient GPU inference library.

The current release supports:

News

Contents

Installation

  1. Clone this repository and navigate to the lmquant folder:
git clone https://github.com/mit-han-lab/lmquant
cd lmquant
  2. Install the package:
conda env create -f environment.yml -n lmquant
conda activate lmquant
poetry install
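
After installation, a quick way to confirm that PyTorch can see a CUDA GPU (needed for quantization runs and for QServe inference) is a generic check like the one below. This is plain PyTorch, not an LMQuant-specific API:

```python
# Generic environment sanity check (plain PyTorch, not an LMQuant API).
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```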

Highlights

QServe: W4A8KV4 Quantization for Efficient LLM Serving

[Website] | [Paper] | [QoQ Algorithm Code] | [QServe GPU System]

Quantization can accelerate large language model (LLM) inference. Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4. Nonetheless, state-of-the-art INT4 quantization techniques only accelerate low-batch, edge LLM inference, failing to deliver performance gains in large-batch, cloud-based LLM serving. We uncover a critical issue: existing INT4 quantization methods suffer from significant runtime overhead (20-90%) when dequantizing either weights or partial sums on GPUs.

To address this challenge, we introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weights, 8-bit activations, and a 4-bit KV cache. QoQ stands for quattuor-octo-quattuor, which represents 4-8-4 in Latin. QoQ is implemented by the QServe inference library, which achieves measured speedup. The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores. Building upon this insight, the QoQ algorithm introduces progressive quantization, which keeps the dequantization overhead in the W4A8 GEMM low. Additionally, we develop SmoothAttention to effectively mitigate the accuracy degradation incurred by 4-bit KV quantization.

In the QServe system, we perform compute-aware weight reordering and take advantage of register-level parallelism to reduce dequantization latency. We also make fused attention memory-bound, harnessing the performance gain brought by KV4 quantization. As a result, QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2× on A100 and 1.4× on L40S, and of Qwen1.5-72B by 2.4× on A100 and 3.5× on L40S, compared to TensorRT-LLM.
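
To make the two key ideas above more concrete, here is a minimal PyTorch sketch of (a) two-level "progressive" weight quantization, where weights are first quantized per output channel to INT8 and then per group to 4 bits, so that restoring INT8 weights inside the W4A8 GEMM needs only cheap integer arithmetic, and (b) SmoothAttention-style scaling that migrates Key outliers into the Query before quantizing the KV cache to 4 bits. Function names, shapes, and the rounding/clipping details are illustrative assumptions, not the exact QoQ recipe or the fused QServe kernels.

```python
import torch


def progressive_weight_quant(w: torch.Tensor, group_size: int = 128):
    """Illustrative two-level (progressive) weight quantization.

    Level 1: per-output-channel symmetric quantization to the INT8 range
             with floating-point scales.
    Level 2: each group of `group_size` INT8 values is further quantized to
             4 bits with integer group scales and zero points, so that the
             4-bit -> 8-bit dequantization at GEMM time uses integer ops only.
    Assumes w is [out_channels, in_channels] with in_channels % group_size == 0;
    everything is simulated in floating point for clarity.
    """
    out_ch, in_ch = w.shape

    # Level 1: per-channel symmetric INT8.
    s_ch = (w.abs().amax(dim=1, keepdim=True) / 127.0).clamp(min=1e-8)
    w_i8 = torch.clamp(torch.round(w / s_ch), -127, 127)

    # Level 2: per-group asymmetric 4-bit over the INT8 values.
    g = w_i8.reshape(out_ch, in_ch // group_size, group_size)
    g_min, g_max = g.amin(dim=-1, keepdim=True), g.amax(dim=-1, keepdim=True)
    s_grp = torch.clamp(torch.round((g_max - g_min) / 15.0), min=1.0)  # integer group scale
    zp = torch.clamp(torch.round(-g_min / s_grp), 0, 15)               # 4-bit zero point
    w_u4 = torch.clamp(torch.round(g / s_grp) + zp, 0, 15)

    # What the GEMM epilogue effectively does: INT4 -> INT8 with integer arithmetic.
    w_i8_restored = ((w_u4 - zp) * s_grp).reshape(out_ch, in_ch)
    return w_u4, zp, s_grp, s_ch, w_i8_restored


def smooth_attention_scale(q: torch.Tensor, k: torch.Tensor, alpha: float = 0.5):
    """Illustrative SmoothAttention-style scaling (alpha is a tunable exponent).

    Keys are divided by a per-channel factor derived from their magnitudes and
    Queries are multiplied by the same factor, so Q @ K^T is unchanged while
    the Keys become much easier to quantize to 4 bits.
    Assumes q and k share the head dimension as their last axis.
    """
    lam = (k.abs().reshape(-1, k.shape[-1]).amax(dim=0).clamp(min=1e-5)) ** alpha
    return q * lam, k / lam
```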

Model Zoo

We provide QoQ-quantized model checkpoints in QServe for your reference.

Perplexity Evaluation

Below is the WikiText2 perplexity evaluated with a sequence length of 2048. Lower is better.

| Methods | Precision | Llama-3 8B | Llama-2 7B | Llama-2 13B | Llama-2 70B | Llama 7B | Llama 13B | Llama 30B | Mistral 7B | Yi 34B |
|---------|-----------|------------|------------|-------------|-------------|----------|-----------|-----------|------------|--------|
| FP16 | | 6.14 | 5.47 | 4.88 | 3.32 | 5.68 | 5.09 | 4.10 | 5.25 | 4.60 |
| SmoothQuant | W8A8 | 6.28 | 5.54 | 4.95 | 3.36 | 5.73 | 5.13 | 4.23 | 5.29 | 4.69 |
| GPTQ-R | W4A16 g128 | 6.56 | 5.63 | 4.99 | 3.43 | 5.83 | 5.20 | 4.22 | 5.39 | 4.68 |
| AWQ | W4A16 g128 | 6.54 | 5.60 | 4.97 | 3.41 | 5.78 | 5.19 | 4.21 | 5.37 | 4.67 |
| QuaRot | W4A4 | 8.33 | 6.19 | 5.45 | 3.83 | 6.34 | 5.58 | 4.64 | 5.77 | NaN |
| Atom | W4A4 g128 | 7.76 | 6.12 | 5.31 | 3.73 | 6.25 | 5.52 | 4.61 | 5.76 | 4.97 |
| QoQ | W4A8KV4 | 6.89 | 5.75 | 5.12 | 3.52 | 5.93 | 5.28 | 4.34 | 5.45 | 4.74 |
| QoQ | W4A8KV4 g128 | 6.76 | 5.70 | 5.08 | 3.47 | 5.89 | 5.25 | 4.28 | 5.42 | 4.76 |

* SmoothQuant is evaluated with per-tensor static KV cache quantization.
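
For context, a standard WikiText2 perplexity evaluation with non-overlapping 2048-token windows looks roughly like the sketch below (using Hugging Face transformers/datasets; the checkpoint name is a placeholder, and lmquant's own evaluation scripts may differ in detail):

```python
# Sketch of a WikiText2 perplexity evaluation with 2048-token windows.
# The checkpoint name is a placeholder; lmquant's evaluation scripts may differ.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
).eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids
seq_len, nlls = 2048, []

with torch.no_grad():
    for i in range(0, (ids.shape[1] // seq_len) * seq_len, seq_len):
        chunk = ids[:, i : i + seq_len].to(model.device)
        nlls.append(model(chunk, labels=chunk).loss)  # HF shifts labels internally

print("WikiText2 perplexity:", torch.exp(torch.stack(nlls).mean()).item())
```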

Efficiency Benchmarks

When serving the large language models Llama-3-8B and Qwen1.5-72B on L40S and A100 GPUs, QServe demonstrates superior performance: it achieves 1.2x-1.4x higher throughput than the leading industry solution, TensorRT-LLM, for Llama-3-8B, and 2.4x-3.5x higher throughput for Qwen1.5-72B.

See the QServe GPU Inference System for more details on the benchmark settings.

| L40S (48G) | Llama-3-8B | Llama-2-7B | Mistral-7B | Llama-2-13B | Llama-30B | Yi-34B | Llama-2-70B | Qwen-1.5-72B |
|------------|------------|------------|------------|-------------|-----------|--------|-------------|--------------|
| TRT-LLM-FP16 | 1326 | 444 | 1566 | 92 | OOM | OOM | OOM | OOM |
| TRT-LLM-W4A16 | 1431 | 681 | 1457 | 368 | 148 | 313 | 119 | 17 |
| TRT-LLM-W8A8 | 2634 | 1271 | 2569 | 440 | 123 | 364 | OOM | OOM |
| Atom-W4A4 | -- | 2120 | -- | -- | -- | -- | -- | -- |
| QuaRot-W4A4 | -- | 805 | -- | 413 | 133 | -- | -- | 15 |
| QServe-W4A8KV4 | 3656 | 2394 | 3774 | 1327 | 504 | 869 | 286 | 59 |
| Throughput Increase* | 1.39x | 1.13x | 1.47x | 3.02x | 3.41x | 2.39x | 2.40x | 3.47x |

| A100 (80G) | Llama-3-8B | Llama-2-7B | Mistral-7B | Llama-2-13B | Llama-30B | Yi-34B | Llama-2-70B | Qwen-1.5-72B |
|------------|------------|------------|------------|-------------|-----------|--------|-------------|--------------|
| TRT-LLM-FP16 | 2503 | 1549 | 2371 | 488 | 80 | 145 | OOM | OOM |
| TRT-LLM-W4A16 | 2370 | 1549 | 2403 | 871 | 352 | 569 | 358 | 143 |
| TRT-LLM-W8A8 | 2396 | 2334 | 2427 | 1277 | 361 | 649 | 235 | 53 |
| Atom-W4A4 | -- | 1160 | -- | -- | -- | -- | -- | -- |
| QuaRot-W4A4 | -- | 1370 | -- | 289 | 267 | -- | -- | 68 |
| QServe-W4A8KV4 | 3005 | 2908 | 2970 | 1741 | 749 | 803 | 419 | 340 |
| Throughput Increase* | 1.20x | 1.25x | 1.22x | 1.36x | 2.07x | 1.23x | 1.17x | 2.38x |

The tables report the absolute token generation throughput of QServe and the baseline systems (unit: tokens/second; -- means unsupported). All experiments were conducted under the same device memory budget. The throughput increase of QServe is calculated with respect to the best baseline in each column.
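
For example, on L40S with Llama-3-8B the strongest baseline is TRT-LLM-W8A8 at 2634 tokens/second, so QServe's 3656 tokens/second corresponds to 3656 / 2634 ≈ 1.39x, which is the first entry in the Throughput Increase row above.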

Support List

Large Language Model Quantization

| Models | Sizes | QoQ (W4A8KV4) | AWQ (W4A16) | GPTQ-R (W4A16) | SmoothQuant (W8A8) |
|--------|-------|---------------|-------------|----------------|--------------------|
| Llama3 | 8B/70B | ✅ | ✅ | ✅ | ✅ |
| Llama2 | 7B/13B/70B | ✅ | ✅ | ✅ | ✅ |
| Llama | 7B/13B/30B | ✅ | ✅ | ✅ | ✅ |
| Mistral | 7B | ✅ | ✅ | ✅ | ✅ |
| Mixtral | 8x7B | ✅ | ✅ | ✅ | ✅ |
| Yi | 34B | ✅ | ✅ | ✅ | ✅ |

Reference

If you find lmquant useful or relevant to your research, please kindly cite our paper:

@article{lin2024qserve,
  title={QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving},
  author={Lin*, Yujun and Tang*, Haotian and Yang*, Shang and Zhang, Zhekai and Xiao, Guangxuan and Gan, Chuang and Han, Song},
  journal={arXiv preprint arXiv:2405.04532},
  year={2024}
}

Related Projects

The following projects are highly related to QServe. Our group has developed full-stack application-algorithm-system-hardware support for efficient large models, receiving 9k+ GitHub stars and over 1M Huggingface community downloads.

You are also welcome to check out MIT HAN LAB for other exciting projects on Efficient Generative AI!

Acknowledgement

LMQuant is inspired by many open-source libraries, including (but not limited to) GPTQ, QuaRot and Atom.