QTIP: Quantization with Trellises and Incoherence Processing [[arXiv]](https://arxiv.org/abs/2406.11235)

<img src="assets/qtip_overview.PNG" width="800">

This repository contains code for QTIP, a weight-only large language model (LLM) quantization method that achieves a state-of-the-art combination of quantization quality and speed. QTIP uses incoherence processing to make LLM weight matrices approximately i.i.d. Gaussian, and then uses trellis coded quantization (TCQ) to quantize these weights with near-optimal distortion. QTIP solves naive TCQ's inherent slowness by introducing a series of novel compute-based codes for use with the "bitshift trellis." For more details, please see the paper.
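
To give a flavor of why the bitshift trellis admits fast decoding, here is a minimal, illustrative sketch (not the repository's implementation): each decoded weight is produced from an L-bit window over the bitstream, and consecutive windows overlap by L - K bits, so each additional weight costs only K bits. The lookup table below is a hypothetical stand-in for QTIP's compute-based codes, which compute the value from the L-bit state instead of storing a table.

```python
import torch

# Illustrative bitshift-trellis decode (NOT the repo's implementation).
# An L-bit state slides over the bitstream, shifting in K new bits per
# step, so each decoded weight costs only K bits. `lut` is a hypothetical
# stand-in for QTIP's compute-based codes, which map the L-bit state to a
# Gaussian-looking value with a handful of instructions instead of a table.

L, K = 16, 2                    # window size and bitrate used in this README
lut = torch.randn(2 ** L)       # hypothetical stand-in codebook

def decode_bitshift(bits: list[int]) -> torch.Tensor:
    state = 0
    for b in bits[:L]:          # fill the initial L-bit window
        state = (state << 1) | b
    out = [lut[state]]
    for i in range(L, len(bits) - K + 1, K):
        for b in bits[i:i + K]: # shift K new bits into the window
            state = ((state << 1) | b) & ((1 << L) - 1)
        out.append(lut[state])
    return torch.stack(out)

weights = decode_bitshift(torch.randint(0, 2, (64,)).tolist())
```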

How to use this codebase

This codebase is based on the QuIP# codebase, with modifications to support trellis quantization. The main QTIP code is in lib/codebook/bitshift.py, and the QuIP# algorithm files have been merged into lib/algo/finetune.py. Example scripts can be found in examples/.

The main QTIP-related arguments in quantize_llama/quantize_finetune_llama.py control the trellis code used throughout this README: the trellis window size L, the bitrate K (bits per weight), the vector dimension V, and which compute-based code to use (e.g. HYB); see the script's argument parser and the examples/ directory for the full list, and the sketch below for what an invocation might look like.
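
The flag names here (--L, --K, --V) are assumptions inferred from this README's notation, not confirmed script flags; check the script's argparse and the examples/ directory for the exact interface:

```
# hypothetical flags inferred from this README's notation; check the script's argparse
python quantize_llama/quantize_finetune_llama.py --L 16 --K 2 --V 2 ...
```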

Fast inference

QTIP achieves the same inference throughput as QuIP# despite achieving higher quantization quality. The numbers below measure batch-size-1 inference speed on an RTX 6000 Ada with matrix fusion (q, k, and v fused together; up and gate fused together) for QuIP# and QTIP.

| Method | 2-7B (tok/s) | 2-70B (tok/s) |
|---|---|---|
| FP16 | 55.9 | OOM |
| AQLM 2 Bit | 81.5 | 8.78 |
| QuIP# 2 Bit | 186 | 22.2 |
| QTIP 2 Bit | 188 | 23.5 |

This codebase contains 2-bit matrix-vector multiplication kernels for the HYB code with L=16, Q=9, V=2, and $T_x = T_y = 16$; 3- and 4-bit kernels are coming soon. These kernels are located in qtip-kernels and have been integrated into the BitshiftLinear class in lib/codebook/bitshift.py. eval/interactive_gen.py contains a simple generation script that is compatible with those kernels and CUDA graphs (through torch.compile). This script does not implement matrix fusion, so you will not get the speeds in the table if you run it. If you wish to quantize a model with matrix fusion, the QuIP# codebase has plumbing to do so that should mostly translate over to this one.
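
As a rough sketch of how torch.compile can capture a generation step into CUDA graphs (the actual loading and decode logic lives in eval/interactive_gen.py; the function below is illustrative and the model setup is elided):

```python
import torch

# Illustrative sketch: compiling the per-token decode step lets PyTorch
# capture it into CUDA graphs, avoiding per-kernel launch overhead at bs=1.
# `model` is assumed to be a causal LM whose linear layers use the
# quantized BitshiftLinear modules; this is not the repo's exact code.

@torch.compile(mode="reduce-overhead")  # "reduce-overhead" enables CUDA graphs
def decode_step(model, input_ids, past_key_values):
    out = model(input_ids=input_ids,
                past_key_values=past_key_values,
                use_cache=True)
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    return next_token, out.past_key_values
```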

Compiling the kernels

```
cd qtip-kernels
python setup.py install
```

Prequantized Models

Below are some prequantized Llama 2 models using QTIP with the HYB code, L=16, V=2, and K (bitrate) = 2. These models are compatible with the inference kernels in qtip-kernels and can be used by passing the HF Hub path to the --hf-path flag in the eval scripts; a sample invocation follows the table.

| Model | HF Hub Repo | W2 PPL | C4 PPL | ArcC | ArcE | BoolQ | PiQA | Winogrande |
|---|---|---|---|---|---|---|---|---|
| Llama 2 7B | relaxml/Llama-2-7b-QTIP-2Bit | 5.86 | 7.73 | 0.36 | 0.66 | 0.67 | 0.76 | 0.65 |
| Llama 2 7B chat | relaxml/Llama-2-7b-chat-QTIP-2Bit | 7.00 | 9.35 | 0.38 | 0.69 | 0.78 | 0.76 | 0.65 |
| Llama 2 13B | relaxml/Llama-2-13b-QTIP-2Bit | 5.11 | 6.85 | 0.41 | 0.71 | 0.65 | 0.77 | 0.68 |
| Llama 2 13B chat | relaxml/Llama-2-13b-chat-QTIP-2Bit | 6.04 | 8.15 | 0.43 | 0.72 | 0.82 | 0.77 | 0.69 |
| Llama 2 70B | relaxml/Llama-2-70b-QTIP-2Bit | 3.70 | 5.48 | 0.48 | 0.76 | 0.76 | 0.80 | 0.75 |
| Llama 2 70B chat | relaxml/Llama-2-70b-chat-QTIP-2Bit | 4.66 | 6.71 | 0.49 | 0.76 | 0.86 | 0.80 | 0.75 |
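
For example (assuming eval/interactive_gen.py accepts the same --hf-path flag as the other eval scripts):

```
# assumes interactive_gen.py takes the same --hf-path flag as the eval scripts
python eval/interactive_gen.py --hf-path relaxml/Llama-2-7b-QTIP-2Bit
```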

Other

If you found this work useful, please consider citing:

```bibtex
@misc{tseng2024qtipquantizationtrellisesincoherence,
      title={QTIP: Quantization with Trellises and Incoherence Processing},
      author={Albert Tseng and Qingyao Sun and David Hou and Christopher De Sa},
      year={2024},
      eprint={2406.11235},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2406.11235},
}
```