
SqueezeLLM: Dense-and-Sparse Quantization [Paper]

SqueezeLLM is a post-training quantization framework that incorporates a new method called Dense-and-Sparse Quantization to enable efficient LLM serving.

TLDR: Deploying LLMs is difficult due to their large memory footprint. This can be addressed with reduced-precision quantization, but naive quantization methods hurt model performance. We address this with a new Dense-and-Sparse Quantization method. Dense-and-Sparse splits each weight matrix into two components: a dense component that can be heavily quantized without affecting model performance, and a sparse component that preserves the sensitive and outlier entries of the weight matrix. With this approach, we are able to serve larger models with a smaller memory footprint and the same latency, yet with higher accuracy and quality. For instance, the Squeeze variant of the Vicuna models can be served within 6 GB of memory and achieves 2% higher MMLU than the FP16 baseline, which has a 2x larger memory footprint. For more details, please check out our paper.
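The decomposition can be sketched as follows. This is a minimal toy illustration, not the repository's implementation: the function name and thresholds are made up, and simple uniform quantization stands in for the paper's sensitivity-based non-uniform scheme. The point is only to show the split into a full-precision sparse outlier component plus a low-bit dense residual.

```python
import numpy as np

def dense_and_sparse_decompose(W, outlier_frac=0.0045, bits=3):
    """Toy sketch: split W into a sparse outlier part (kept in full
    precision) and a dense residual quantized to a low bit-width."""
    # Treat the largest-magnitude entries as the sparse outlier component.
    k = max(1, int(outlier_frac * W.size))
    threshold = np.partition(np.abs(W).ravel(), -k)[-k]
    sparse_mask = np.abs(W) >= threshold
    sparse = np.where(sparse_mask, W, 0.0)   # stored as CSR in practice

    # Quantize the remaining dense part with simple uniform quantization
    # (the paper uses non-uniform, sensitivity-weighted k-means instead).
    dense = np.where(sparse_mask, 0.0, W)
    levels = 2 ** bits
    scale = (dense.max() - dense.min()) / (levels - 1) or 1.0
    q = np.round((dense - dense.min()) / scale)
    dense_hat = q * scale + dense.min()
    return dense_hat, sparse

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)).astype(np.float32)
W[3, 7] = 25.0                               # plant an outlier weight
dense_hat, sparse = dense_and_sparse_decompose(W)
W_hat = dense_hat + sparse
print(sparse[3, 7])                          # outlier preserved exactly
print(np.abs(W - W_hat).max())               # remaining error stays small
```

Without the sparse component, the single 25.0 outlier would stretch the quantization range and destroy the resolution available for the ordinary weights; carving it out keeps the dense range tight.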

Updates (2/5): The dense-and-sparse quantization and packing code for custom models is now available.

Updates (11/28): Mistral model is now supported.

News (10/21): SqueezeLLM is now supported within the official vLLM framework.

Updates (9/30): The code for quantizing custom models is now available (link).


Installation

  1. Create a conda environment
conda create --name sqllm python=3.9 -y
conda activate sqllm
  2. Clone the repository and install the dependencies
git clone https://github.com/SqueezeAILab/SqueezeLLM
cd SqueezeLLM
pip install -e .
cd squeezellm
python setup_cuda.py install

From-scratch Quantization

To quantize your own models, follow the procedure in this link.

Supported Models

Currently, we support LLaMA 7B, 13B, 30B and 65B, LLaMA-2 7B and 13B, instruction-tuned Vicuna 7B and 13B, XGen 7B with 8K sequence length, and OPT 1.3B to 30B. For each model, we provide 3-bit and 4-bit quantized variants with sparsity levels of 0% (dense-only), 0.05%, and 0.45%. See our paper for more details on these configurations. Below are the links to download the models.
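As a rough sense of why these bit-widths matter, here is a back-of-envelope weight-memory estimate for a 7B-parameter model. This counts weights only and ignores activations, the KV cache, and the sparse-component and metadata overhead, so real footprints are somewhat larger.

```python
# Weights-only memory estimate for a 7B-parameter model at
# different bit-widths (ignores activations, KV cache, and
# sparse/metadata overhead).
params = 7e9
for bits in (16, 4, 3):
    gib = params * bits / 8 / 2**30
    print(f"{bits:>2}-bit weights: {gib:.1f} GiB")
# → 16-bit: 13.0 GiB, 4-bit: 3.3 GiB, 3-bit: 2.4 GiB
```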

LLaMA (v1)

| Model | Bitwidth | Dense-only (0%) | 0.05% Sparsity | 0.45% Sparsity |
| --- | --- | --- | --- | --- |
| LLaMA-7B | 3 | sq-llama-7b-w3-s0 | sq-llama-7b-w3-s5 | sq-llama-7b-w3-s45 |
| LLaMA-7B | 4 | sq-llama-7b-w4-s0 | sq-llama-7b-w4-s5 | sq-llama-7b-w4-s45 |
| LLaMA-13B | 3 | sq-llama-13b-w3-s0 | sq-llama-13b-w3-s5 | sq-llama-13b-w3-s45 |
| LLaMA-13B | 4 | sq-llama-13b-w4-s0 | sq-llama-13b-w4-s5 | sq-llama-13b-w4-s45 |
| LLaMA-30B | 3 | sq-llama-30b-w3-s0 | sq-llama-30b-w3-s5 | sq-llama-30b-w3-s45 |
| LLaMA-30B | 4 | sq-llama-30b-w4-s0 | sq-llama-30b-w4-s5 | sq-llama-30b-w4-s45 |
| LLaMA-65B | 3 | sq-llama-65b-w3-s0 | sq-llama-65b-w3-s5 | sq-llama-65b-w3-s45 |
| LLaMA-65B | 4 | sq-llama-65b-w4-s0 | sq-llama-65b-w4-s5 | sq-llama-65b-w4-s45 |

LLaMA-2

| Model | Bitwidth | Dense-only (0%) |
| --- | --- | --- |
| LLaMA-2-7B | 3 | sq-llama-7b-w3-s0 |
| LLaMA-2-7B | 4 | sq-llama-7b-w4-s0 |
| LLaMA-2-13B | 3 | sq-llama-13b-w3-s0 |
| LLaMA-2-13B | 4 | sq-llama-13b-w4-s0 |

Mistral

| Model | Bitwidth | Dense-only (0%) |
| --- | --- | --- |
| Mistral-7B | 3 | sq-mistral-7b-w3-s0 |
| Mistral-7B | 4 | sq-mistral-7b-w4-s0 |
| Mistral-7B-instruct | 3 | sq-mistral-7b-instruct-w3-s0 |
| Mistral-7B-instruct | 4 | sq-mistral-7b-instruct-w4-s0 |

Vicuna (v1.1)

| Model | Bitwidth | Dense-only (0%) | 0.45% Sparsity |
| --- | --- | --- | --- |
| Vicuna-7B | 3 | sq-vicuna-7b-w3-s0 | sq-vicuna-7b-w3-s45 |
| Vicuna-7B | 4 | sq-vicuna-7b-w4-s0 | sq-vicuna-7b-w4-s45 |
| Vicuna-13B | 3 | sq-vicuna-13b-w3-s0 | sq-vicuna-13b-w3-s45 |
| Vicuna-13B | 4 | sq-vicuna-13b-w4-s0 | sq-vicuna-13b-w4-s45 |

Vicuna (v1.3)

Please refer to the FastChat documentation for more details about the differences between v1.1 and v1.3.

| Model | Bitwidth | Dense-only (0%) |
| --- | --- | --- |
| Vicuna-7B-v1.3 | 3 | sq-vicuna-7b-v1.3-w3-s0 |
| Vicuna-7B-v1.3 | 4 | sq-vicuna-7b-v1.3-w4-s0 |
| Vicuna-13B-v1.3 | 3 | sq-vicuna-13b-v1.3-w3-s0 |
| Vicuna-13B-v1.3 | 4 | sq-vicuna-13b-v1.3-w4-s0 |
| Vicuna-30B-v1.3 | 3 | Coming Soon |
| Vicuna-30B-v1.3 | 4 | Coming Soon |

XGen (8k Sequence length)

XGen-7B-8k-Base is a 7B model pre-trained with an 8K sequence length. XGen-7B-8k-Inst is a model supervised-finetuned on public-domain instructional data for instruction-following applications. Please refer to the blog post from Salesforce AI Research for more details on these models.

| Model | Bitwidth | Dense-only (0%) | 0.45% Sparsity |
| --- | --- | --- | --- |
| XGen-7B-8k-Base | 3 | sq-xgen-7b-8k-base-w3-s0 | sq-xgen-7b-8k-base-w3-s45 |
| XGen-7B-8k-Base | 4 | sq-xgen-7b-8k-base-w4-s0 | sq-xgen-7b-8k-base-w4-s45 |
| XGen-7B-8k-Inst | 3 | sq-xgen-7b-8k-inst-w3-s0 | sq-xgen-7b-8k-inst-w3-s45 |
| XGen-7B-8k-Inst | 4 | sq-xgen-7b-8k-inst-w4-s0 | sq-xgen-7b-8k-inst-w4-s45 |

OPT

| Model | Bitwidth | Dense-only (0%) | 0.45% Sparsity |
| --- | --- | --- | --- |
| OPT-1.3B | 3 | sq-opt-1.3b-w3-s0 | sq-opt-1.3b-w3-s50 |
| OPT-1.3B | 4 | sq-opt-1.3b-w4-s0 | sq-opt-1.3b-w4-s50 |
| OPT-2.7B | 3 | sq-opt-2.7b-w3-s0 | sq-opt-2.7b-w3-s50 |
| OPT-2.7B | 4 | sq-opt-2.7b-w4-s0 | sq-opt-2.7b-w4-s50 |
| OPT-6.7B | 3 | sq-opt-6.7b-w3-s0 | sq-opt-6.7b-w3-s50 |
| OPT-6.7B | 4 | sq-opt-6.7b-w4-s0 | sq-opt-6.7b-w4-s50 |
| OPT-13B | 3 | sq-opt-13b-w3-s0 | sq-opt-13b-w3-s50 |
| OPT-13B | 4 | sq-opt-13b-w4-s0 | sq-opt-13b-w4-s50 |
| OPT-30B | 3 | sq-opt-30b-w3-s0 | sq-opt-30b-w3-s50 |
| OPT-30B | 4 | sq-opt-30b-w4-s0 | sq-opt-30b-w4-s50 |

Running the Models

Benchmarking

The following code runs and benchmarks the 3-bit quantized models on the C4 dataset. The --torch_profile argument can be passed to replicate the runtime results from the paper. First, download the quantized model (e.g. sq-llama-7b-w3-s0.pt or sq-xgen-7b-8k-base-w3-s0.pt) locally from the links above.

Note that for the LLaMA (v1) and Vicuna v1.1 models, you need to first obtain the original, pre-trained LLaMA model in the Huggingface-compatible format locally and provide its path as {model_path}. For other model types (e.g. Vicuna v1.3, LLaMA-2, XGen, etc.), you don't need to download the original models separately, as we provide Huggingface-compatible configs for all supported models in the models directory. The same procedure applies to other model types and quantization settings such as bit-width and sparsity level.

# LLaMA Benchmarking
CUDA_VISIBLE_DEVICES=0 python llama.py {model_path} c4 --wbits 3 --load sq-llama-7b-w3-s0.pt --benchmark 128 --check --torch_profile

# XGen Benchmarking
CUDA_VISIBLE_DEVICES=0 python llama.py models/xgen-7b-8k-base c4 --wbits 3 --load sq-xgen-7b-8k-base-w3-s0.pt --benchmark 128 --check --torch_profile

When using checkpoints with sparsity (i.e. non-zero sparsity level), the --include_sparse flag should also be passed:

# LLaMA Benchmarking
CUDA_VISIBLE_DEVICES=0 python llama.py {model_path} c4 --wbits 3 --load sq-llama-7b-w3-s5.pt --include_sparse --benchmark 128 --check --torch_profile

# XGen Benchmarking
CUDA_VISIBLE_DEVICES=0 python llama.py models/xgen-7b-8k-base c4 --wbits 3 --load sq-xgen-7b-8k-base-w3-s45.pt --include_sparse --benchmark 128 --check --torch_profile

NOTE: In order to reproduce the perplexity numbers in our paper, please use --eval instead of --benchmark, following the instructions below.

Perplexity Evaluation

The following code evaluates the perplexity of the 3-bit quantized models on the C4 dataset, following the same evaluation methodology as GPTQ and GPTQ-For-LLaMA. This reproduces the perplexity numbers reported in our paper. First, download the quantized model (e.g. sq-llama-7b-w3-s0.pt or sq-xgen-7b-8k-base-w3-s0.pt) locally from the links above.

As in the benchmarking section, for the LLaMA (v1) and Vicuna v1.1 models you need to first obtain the original, pre-trained LLaMA model in the Huggingface-compatible format locally and provide its path as {model_path}, while for other model types (e.g. Vicuna v1.3, LLaMA-2, XGen, etc.) the Huggingface-compatible configs in the models directory suffice. The same procedure applies to other model types and quantization settings such as bit-width and sparsity level.

# LLaMA Perplexity Evaluation
CUDA_VISIBLE_DEVICES=0 python llama.py {model_path} c4 --wbits 3 --load sq-llama-7b-w3-s0.pt --eval

# XGen Perplexity Evaluation
CUDA_VISIBLE_DEVICES=0 python llama.py models/xgen-7b-8k-base c4 --wbits 3 --load sq-xgen-7b-8k-base-w3-s0.pt --eval

When using checkpoints with sparsity (i.e. non-zero sparsity level), the --include_sparse flag should also be passed:

# LLaMA Perplexity Evaluation
CUDA_VISIBLE_DEVICES=0 python llama.py {model_path} c4 --wbits 3 --load sq-llama-7b-w3-s5.pt --include_sparse --eval

# XGen Perplexity Evaluation
CUDA_VISIBLE_DEVICES=0 python llama.py models/xgen-7b-8k-base c4 --wbits 3 --load sq-xgen-7b-8k-base-w3-s45.pt --include_sparse --eval
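For reference, the perplexity metric reported by --eval reduces to the exponentiated mean per-token negative log-likelihood. A minimal sketch of the definition (illustrative only; the repository's evaluation follows GPTQ's stride-based protocol over C4 rather than this toy function):

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# If the model assigned every token probability 0.25, each NLL is
# ln(4), so perplexity is 4 (up to float rounding).
nlls = [math.log(4.0)] * 100
print(perplexity(nlls))
```

Lower is better: a lower perplexity means the quantized model assigns higher probability to the held-out text, which is why matching the FP16 model's perplexity is the target.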

The code was tested on A5000 and A6000 GPUs with CUDA 11.3 and cuDNN 8.2.


Acknowledgement

This code reuses components from several libraries including GPTQ as well as GPTQ-For-LLaMA.


Citation

SqueezeLLM was developed as part of the following paper. If you find this library useful for your work, please cite:

@article{kim2023squeezellm,
  title={SqueezeLLM: Dense-and-Sparse Quantization},
  author={Kim, Sehoon and Hooper, Coleman and Gholami, Amir and Dong, Zhen and Li, Xiuyu and Shen, Sheng and Mahoney, Michael and Keutzer, Kurt},
  journal={arXiv},
  year={2023}
}