QuantEase

This repository contains the code for the paper QuantEase: Optimization-based Quantization for Language Models - An Efficient and Intuitive Algorithm.

Abstract

With the growing popularity of Large Language Models (LLMs), there is an increasing interest in compression techniques for their efficient deployment. This study focuses on Post-Training Quantization for LLMs, introducing QuantEase, a layer-wise quantization framework where individual layers undergo separate quantization. Framing the problem as discrete-structured non-convex optimization, our work develops Coordinate Descent techniques, offering high-quality solutions without the need for matrix inversion or decomposition. We also explore an outlier-aware variant, preserving significant weights with complete precision. Our proposal achieves state-of-the-art performance in empirical evaluations across various LLMs and datasets, with up to 15% improvements over methods like GPTQ. With careful linear algebra optimizations, QuantEase quantizes models like Falcon-180B on a single NVIDIA A100 GPU in approximately three hours. The outlier-aware algorithm achieves near or sub-3-bit quantization with an acceptable accuracy drop, outperforming methods like SpQR by up to two times in terms of perplexity.
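
The coordinate-descent update the abstract refers to can be sketched in a few lines of NumPy. The following is a minimal illustration of the layer-wise problem min ||WX - QX||_F^2 over a uniform quantization grid, written from the paper's problem statement; it is not the repository's implementation, and the function name and details (symmetric per-row grid, round-to-nearest snapping) are simplifying assumptions:

```python
import numpy as np

def quantease_cd_sketch(W, X, bits=4, n_iter=30):
    """Layer-wise coordinate-descent quantization sketch (illustrative only).

    Minimizes ||W @ X - Q @ X||_F^2 over Q constrained to a uniform grid,
    updating one weight coordinate at a time -- no matrix inversion needed.
    W: (out_features, in_features), X: (in_features, n_calibration_tokens).
    """
    H = X @ X.T                                  # Gram matrix of activations
    d = np.diag(H).copy()
    d[d == 0] = 1.0                              # guard dead input channels
    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(W).max(axis=1, keepdims=True) / qmax, 1e-12)

    def quant(V):
        # snap to the symmetric per-row uniform grid
        return scale * np.clip(np.round(V / scale), -qmax, qmax)

    Q = quant(W)                                 # round-to-nearest init
    for _ in range(n_iter):
        for j in range(W.shape[1]):
            # correlation of coordinate j with the current reconstruction error
            r = (Q - W) @ H[:, j]
            # exact 1-D minimizer with all other coordinates fixed, then snap
            opt = Q[:, j] - r / d[j]
            Q[:, j] = quant(opt[:, None])[:, 0]
    return Q
```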

Selected WikiText2 perplexity results for the BLOOM, OPT, and Falcon model families, without grouping:

| Model Name | FP16 | 4-bit | 3-bit | 3-bit, structured outliers (1%) | 3-bit, unstructured outliers (1%) |
|------------|------|-------|-------|---------------------------------|-----------------------------------|
| OPT-1.3B   | 14.62 | 15.28 | 21.30 | 18.51 | 16.25 |
| OPT-13B    | 10.13 | 10.32 | 12.41 | 12.07 | 10.37 |
| BLOOM-1B7  | 15.39 | 16.11 | 20.03 | 18.89 | 17.06 |
| BLOOM-7B1  | 11.37 | 11.69 | 13.43 | 12.97 | 12.03 |
| Falcon-7B  | 6.59  | 6.92  | 8.83  | 8.56  | 7.14  |
| Falcon-40B | 5.23  | 5.46  | 6.20  | 5.99  | 5.51  |

Zero-shot accuracy on the LAMBADA benchmark for 3-bit and 4-bit quantization is shown in the `lambada` figure in the repository.

Included Algorithms

How to run the scripts

Prepare datasets and models
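
The quantization commands below expect model weights under `models/<model_name>`. One way to populate that folder is Hugging Face's `snapshot_download`; this is a hypothetical example (the repository may have its own preferred download path, and `facebook/opt-1.3b` is just an illustrative choice):

```python
# Hypothetical helper: download a supported model into models/<model_name>.
# The repo id and target folder are illustrative assumptions.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="facebook/opt-1.3b", local_dir="models/opt-1.3b")
```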

Install required dependencies

pip3 install -r requirements.txt

All scripts have been tested on a single NVIDIA A100 GPU machine with CUDA driver API version 12.0 and runtime API version 11.2.
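
To check that your environment roughly matches this tested setup, a quick PyTorch sanity check (standard torch calls, nothing repository-specific):

```python
import torch

print(torch.cuda.is_available())      # should print True on a GPU machine
print(torch.version.cuda)             # CUDA version PyTorch was built against
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA A100-SXM4-40GB"
```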

Running the quantization scripts with evaluations

We currently support quantization for five model families: BLOOM, OPT, Falcon, Mistral-7B, and LLaMA.

# within the `QuantEase` root folder run:
python3 model_quantizer.py --model models/<model_name> --dataset c4 --wbits 4 --num-iter 30 --nsamples 128 --true-sequential --quantization-method <algorithm_name>
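
In this command, --wbits sets the target bit-width, --nsamples the number of c4 calibration samples, and --num-iter the number of QuantEase iterations; these readings are inferred from the paper's experimental setup, so consult the script's argument parser for the authoritative definitions.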

To enable the outlier-aware algorithm, provide the extra argument --outlier with the fraction of weights to keep in full precision:

# within the `QuantEase` root folder run:
python3 model_quantizer.py --model models/<model_name> --dataset c4 --wbits 4 --num-iter 30 --nsamples 128 --true-sequential --quantization-method <outlier_aware_algorithm_name> --outlier 0.01
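
Here --outlier 0.01 keeps roughly 1% of weights in full precision. As a rough illustration of the unstructured variant, the split below keeps the largest-magnitude weights unquantized; this is a simplified sketch for exposition only, since the paper's algorithm selects and refines outliers jointly with the coordinate-descent updates:

```python
import numpy as np

def split_outliers(W, fraction=0.01):
    """Split W into a dense part to quantize and a sparse full-precision part.

    Illustrative sketch only: keeps the `fraction` largest-magnitude weights
    unquantized, mirroring the 1% unstructured-outlier setting above.
    """
    k = max(1, int(fraction * W.size))
    thresh = np.partition(np.abs(W).ravel(), -k)[-k]   # k-th largest magnitude
    mask = np.abs(W) >= thresh
    W_outlier = np.where(mask, W, 0.0)   # kept in full precision (sparse)
    W_dense = np.where(mask, 0.0, W)     # goes through quantization
    return W_dense, W_outlier
```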

Extra supported arguments

Cite this work

BibTeX entry:

@article{behdin2023quantease,
  title={QuantEase: Optimization-based Quantization for Language Models--An Efficient and Intuitive Algorithm},
  author={Behdin, Kayhan and Acharya, Ayan and Gupta, Aman and Keerthi, Sathiya and Mazumder, Rahul and Zhu, Siyu and Song, Qingquan},
  journal={arXiv preprint arXiv:2309.01885},
  year={2023}
}

Acknowledgements

Kayhan Behdin contributed to this work while he was an intern at LinkedIn during summer 2023. This work is not a part of his MIT research. Rahul Mazumder contributed to this work while he was a consultant for LinkedIn (in compliance with MIT’s outside professional activities policies). This work is not a part of his MIT research.