QQQ: Quality Quattuor-Bit Quantization for Large Language Models

[Paper]

Quantization is a proven, effective method for compressing large language models. Although popular techniques such as W8A8 and W4A16 effectively maintain model performance, they often fail to speed up the prefill and decoding stages of inference at the same time. W4A8 is a promising strategy for accelerating both stages, but it usually leads to significant performance degradation. To address these issues, we present QQQ, a Quality Quattuor-bit Quantization method with 4-bit weights and 8-bit activations. QQQ employs adaptive smoothing and Hessian-based compensation, significantly enhancing the performance of quantized models without extensive training. Furthermore, we meticulously engineer W4A8 GEMM kernels to increase inference speed. Our specialized per-channel W4A8 GEMM and per-group W4A8 GEMM achieve speedups of 3.67x and 3.29x, respectively, over FP16 GEMM. Our extensive experiments show that QQQ achieves performance on par with existing state-of-the-art LLM quantization methods while significantly accelerating inference, with speedups of up to 2.24x, 2.10x, and 1.25x compared to FP16, W8A8, and W4A16, respectively.
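To make the W4A8 notation concrete, here is a minimal NumPy sketch of a matrix multiply with 4-bit weights and 8-bit activations. It uses plain round-to-nearest symmetric quantization, with one scale per output channel for weights and one per token for activations. It is not QQQ's algorithm: adaptive smoothing, Hessian-based compensation, and the fused GEMM kernels are all omitted, and the function names `quantize_symmetric` and `w4a8_matmul` are hypothetical, chosen only for illustration.

```python
import numpy as np

def quantize_symmetric(x, n_bits, axis):
    """Round-to-nearest symmetric quantization with one scale per slice along `axis`."""
    qmax = 2 ** (n_bits - 1) - 1                      # 7 for 4 bits, 127 for 8 bits
    scale = np.abs(x).max(axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # avoid division by zero
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def w4a8_matmul(act, weight):
    """act: (tokens, K) float, weight: (N, K) float -> (tokens, N) float."""
    q_a, s_a = quantize_symmetric(act, 8, axis=1)     # per-token INT8 activations
    q_w, s_w = quantize_symmetric(weight, 4, axis=1)  # per-channel INT4 weights
    acc = q_a @ q_w.T                                 # integer accumulation
    return acc * s_a * s_w.T                          # dequantize with both scales

rng = np.random.default_rng(0)
act = rng.standard_normal((4, 64)).astype(np.float32)
weight = rng.standard_normal((16, 64)).astype(np.float32)

out_q = w4a8_matmul(act, weight)
out_fp = act @ weight.T
rel_err = np.linalg.norm(out_q - out_fp) / np.linalg.norm(out_fp)
print(f"relative error of naive W4A8 vs. float: {rel_err:.3f}")
```

The gap between `out_q` and `out_fp` in this naive version is the kind of degradation that the smoothing and compensation steps are meant to reduce.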

News or Update

Install

Prerequisites

Build from source

Currently, this repo only supports building from source.

git clone https://github.com/HandH1998/QQQ.git
cd QQQ
git submodule update --init --recursive
pip install -v -e .

Supported models

Model support list:

Models     Sizes
LLaMA-1    7B/13B/30B/65B
LLaMA-2    7B/13B/70B
LLaMA-3    8B/70B
Qwen2      0.5B/1.5B/7B/72B

Usage

You can quickly perform model quantization, model evaluation, and simple model inference using the scripts (quant_model.sh, eval_model.sh and test_model.sh) in the scripts directory.

Quantize model

Here is an example for quantizing a model with per-channel weight quantization.

python3 examples/quant_model.py \
--model_path ${model_path} \
--tokenizer_path ${tokenizer_path} \
--dtype float16 \
--smooth false \
--rotation true \
--dataset wikitext2 \
--nsamples 128 \
--w_quantizer FixedQuantize \
--w_group_size -1 \
--gptq_mse true \
--gptq_groupsize -1 \
--save_path ${save_path}

Evaluate model

Here is an example for evaluating perplexity on WikiText2 and accuracy on some zero-shot tasks.

# --tasks: comma-separated lm_eval tasks
# --eval_ppl: whether to evaluate perplexity on WikiText2
python3 examples/eval_model.py \
--model_path ${quantized_model_path} \
--tokenizer_path ${tokenizer_path} \
--tasks piqa,winogrande,hellaswag,arc_challenge,arc_easy \
--eval_ppl \
--batch_size 8 \
--max_length 2048
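For readers unfamiliar with the metric behind `--eval_ppl`, the sketch below shows the standard definition of perplexity: the exponential of the mean per-token negative log-likelihood. It is only an illustration of the formula. How `eval_model.py` actually splits WikiText2 into windows and batches them is not shown here, and the helper name `perplexity` is hypothetical.

```python
import numpy as np

def perplexity(logits, targets):
    """logits: (num_tokens, vocab) scores; targets: (num_tokens,) true token ids."""
    # Numerically stable log-softmax over the vocabulary dimension.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of the correct token at each position.
    nll = -log_probs[np.arange(len(targets)), targets]
    return float(np.exp(nll.mean()))

# Sanity check: a model that is uniform over 10 tokens has perplexity 10.
uniform_logits = np.zeros((5, 10))
targets = np.array([0, 1, 2, 3, 4])
ppl = perplexity(uniform_logits, targets)
print(f"perplexity: {ppl:.1f}")  # prints "perplexity: 10.0"
```

Lower is better, so a quantized model whose perplexity stays close to the FP16 baseline has lost little modeling quality.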

Inference

Key results

Model performance

We evaluated the model performance on WikiText2 and five zero-shot tasks.

[Figure: model_performance]

Throughput

We compared the throughput of quantized LLaMA-2 models at the same batch size, across various batch sizes. The input sequence length is 1024 and the output sequence length is 128.

[Figure: speedup]

W4A8 GEMM performance

Here is the speedup of all GEMMs over PyTorch FP16 GEMM (which calls CUTLASS) for different numbers of input tokens. The weight matrix size is (N=8192, K=21760).

[Figure: gemm_performance]
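As a rough back-of-the-envelope illustration of the benchmark shape, the following computes how much weight data a GEMM must read for the (N=8192, K=21760) matrix in FP16 versus packed 4-bit form. It counts weights only. Quantization scales and any kernel-specific packing or padding are ignored, so treat the figures as approximate.

```python
N, K = 8192, 21760          # weight matrix shape from the benchmark above
MIB = 1024 ** 2

num_weights = N * K
fp16_mib = num_weights * 2 / MIB      # 2 bytes per FP16 weight
int4_mib = num_weights * 0.5 / MIB    # 4 bits = 0.5 bytes per weight

print(f"FP16 weights: {fp16_mib:.1f} MiB")   # prints "FP16 weights: 340.0 MiB"
print(f"INT4 weights: {int4_mib:.1f} MiB")   # prints "INT4 weights: 85.0 MiB"
```

The 4x smaller weight footprint is one plausible reason 4-bit weights help most when few input tokens are processed and the GEMM is dominated by reading weights from memory.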

Acknowledgement

Reference

If you find QQQ useful or relevant to your research, please cite our paper:

@article{zhang2024qqq,
      title={QQQ: Quality Quattuor-Bit Quantization for Large Language Models}, 
      author={Ying Zhang and Peng Zhang and Mincong Huang and Jingyang Xiang and Yujie Wang and Chao Wang and Yineng Zhang and Lei Yu and Chuan Liu and Wei Lin},
      journal={arXiv preprint arXiv:2406.09904},
      year={2024}
}