TesseraQ: Ultra Low-Bit LLM Post-Training Quantization with Block Reconstruction

<img src="./imgs/tesseraq.png" alt="TesseraQ" style="zoom:35%;" />

arXiv: https://arxiv.org/abs/2410.19103

TesseraQ is a block reconstruction-based PTQ algorithm for Large Language Models, achieving state-of-the-art uniform quantization performance under the INT2/INT3/INT4 formats.

News

Highlight Features

Usage

Our method has also been integrated into the official release of LLMC; feel free to use it there!

  1. Clone this repository and install packages:

    # install packages
    cd llmc
    pip install -r requirements.txt
    
  2. Prepare models and data.

    # After downloading LLMs from huggingface, prepare calibration and evaluation data as follows:
    cd tools
    python download_calib_dataset.py --save_path [calib data path]
    python download_eval_dataset.py --save_path [eval data path] 
    
  3. Choose a model and quantize it with TesseraQ:

    # Here's an example for the LLaMA-2-7B model with W2A16g128 quantization:
    cd scripts
    # Modify the path of llmc, ``llmc_path``, in the bash file. You can also choose a config
    # from ``llmc/configs/quantization/Awq/`` to quantize your model, or write your own config
    # based on those we provide and change the ``--config`` argument in run_awq_llama.sh.
    # (A sketch of the weight-quantization settings behind W2A16g128 is given after this list.)
    bash run_awq_llama.sh
    bash run_tesseraq_llama.sh
    

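For a rough idea of what the W2A16g128 setting above corresponds to, the weight-quantization part of such a config might look like the sketch below. This is an illustrative outline rather than a verbatim copy of the provided configs; the exact keys and method name may differ, so please check the files under ``llmc/configs/quantization/`` before use.

    # illustrative weight-quantization settings for W2A16g128
    # (activations stay in 16-bit, so only weights are quantized)
    quant:
        method: TesseraQ
        weight:
            bit: 2                   # 2-bit weights
            symmetric: False         # asymmetric uniform quantization (assumed default)
            granularity: per_group   # per-group quantization
            group_size: 128          # group size 128, i.e. the "g128" suffix
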
Running Scripts

We provide the running scripts to reproduce our experiments.

LLaMA-2 with Perplexity Evaluation

cd scripts
sh run_llama2.sh

| Model      | 7B   | 13B  | 70B  |
|------------|------|------|------|
| W2A16      | 8.05 | 6.55 | 5.26 |
| W2A16g128  | 6.82 | 5.92 | 4.73 |
| W2A16g64   | 6.67 | 5.81 | 4.60 |
| W3A16      | 5.84 | 5.16 | 3.68 |
| W3A16g128  | 5.71 | 5.11 | 3.61 |
| W4A16      | 5.56 | 4.96 | 3.40 |

(Note that the above scripts can also be used to reproduce results for the LLaMA-7B/13B/30B/65B models.)

LLaMA-3.1 with Downstream Tasks Evaluation

cd scripts
sh run_llama3_1.sh

| Model      | 8B    | 70B   |
|------------|-------|-------|
| W2A16g128  | 59.37 | 66.76 |
| W3A16g128  | 67.36 | 74.09 |

LLaMA-3.2 for Edge Device

cd scripts
sh run_llama3_2.sh

| Model        | Method   | Bit    | Wiki ppl. | Avg. Acc | Scripts |
|--------------|----------|--------|-----------|----------|---------|
| LLaMA-3.2-1B | Pretrain | FP16   | 9.75      | 56.50    | -       |
| LLaMA-3.2-1B | AWQ      | W2g128 | 5475      | 35.42    | here    |
| LLaMA-3.2-1B | TesseraQ | W2g128 | 18.61     | 43.36    | here    |
| LLaMA-3.2-1B | AWQ      | W3g128 | 16.69     | 49.85    | here    |
| LLaMA-3.2-1B | TesseraQ | W3g128 | 11.08     | 53.24    | here    |
| LLaMA-3.2-1B | AWQ      | W4g128 | 10.85     | 54.68    | here    |
| LLaMA-3.2-1B | TesseraQ | W4g128 | 10.09     | 54.98    | here    |
| LLaMA-3.2-3B | Pretrain | FP16   | 7.81      | 63.57    | -       |
| LLaMA-3.2-3B | AWQ      | W2g128 | 495.2     | 38.15    | here    |
| LLaMA-3.2-3B | TesseraQ | W2g128 | 11.94     | 51.53    | here    |
| LLaMA-3.2-3B | AWQ      | W3g128 | 10.21     | 59.94    | here    |
| LLaMA-3.2-3B | TesseraQ | W3g128 | 8.45      | 61.58    | here    |
| LLaMA-3.2-3B | AWQ      | W4g128 | 8.25      | 62.83    | here    |
| LLaMA-3.2-3B | TesseraQ | W4g128 | 7.96      | 63.63    | here    |

Quantized Checkpoint

Configuration

To help users design their own configs, we explain below some universal configuration options that appear in all the configs we provide under llmc/configs/:
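
As a rough reference, the provided configs share a common top-level layout along the lines of the sketch below. This is an illustrative outline only (paths are placeholders and some keys are assumptions), so please consult the files under llmc/configs/ for the authoritative structure and defaults.

        # common sections in the provided configs (illustrative outline, not a verbatim config)
        model:
            type: Llama
            path: /path/to/hf/model        # placeholder: local HuggingFace checkpoint
        calib:
            name: pileval                  # calibration data prepared in the Usage section
            path: [calib data path]
            n_samples: 128
            seq_len: 512
        eval:
            name: wikitext2                # evaluation data for perplexity
            path: [eval data path]
        quant:
            method: TesseraQ               # or Awq / OmniQuant / QuaRot for initialization
            weight:                        # bit / granularity / group_size as sketched in Usage
                bit: 2
                granularity: per_group
                group_size: 128
        save:
            save_path: ./save              # where transformed / quantized models are written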

Calibration Pipeline

AWQ

There are two ways to apply AWQ initialization for TesseraQ. The first is to save the AWQ transformation scales (and clipping values) and then apply them on the fly before TesseraQ calibration in each block. The second is to directly save the transformed LLM checkpoint and reload it for TesseraQ.

For the first method, set save_scale and save_clip to True and specify the save paths for them in the AWQ configuration, for example:

        save_scale: True
        clip_version: v2
        scale_path: ../cache/activations/L2_7b/awq_w2g128
        save_clip: True
        clip_path: ../cache/activations/L2_7b/awq_w2g128

Then, in the TesseraQ configuration, enable load_transform and weight_clip and specify the saved paths of the clips/scales:

        weight_clip: True
        load_transform: True
        clip_version: v2
        scale_path: ../cache/activations/L2_7b/awq_w2g128
        clip_path: ../cache/activations/L2_7b/awq_w2g128

Note that when clip_version=v2, the calib_algo of weight quantization should be set to learnable. If clip_version=v1 is chosen instead, TesseraQ performs AWQ weight clipping on the fly rather than loading the saved clips, which may achieve better perplexity results in the low-bit cases.
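
For concreteness, a minimal sketch of the corresponding setting is shown below; its placement inside the weight-quantization part of the TesseraQ config is assumed, so please double-check against the provided configs.

        # assumed placement: inside the weight-quantization settings of the TesseraQ config
        calib_algo: learnable   # required when clip_version is v2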

For the second method, make sure to use clip_version=v1 and simply enable save_transformed in the AWQ configuration. Then, point model/path in the TesseraQ configuration to the saved checkpoint, without enabling load_transform or weight_clip (a sketch is given below).
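
Below is a minimal sketch of the two configs involved in this second method, using the same flat-key convention as the snippets above; the checkpoint path is a hypothetical placeholder and the exact key placement should be checked against the provided configs.

        # AWQ configuration (second method): save the transformed model
        clip_version: v1
        save_transformed: True     # the save path is set as in the provided configs

        # TesseraQ configuration: reload the transformed checkpoint via model/path
        # (the path below is a hypothetical placeholder)
        path: ../cache/models/L2_7b_awq_transformed
        load_transform: False
        weight_clip: False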

OmniQuant

Since OmniQuant only optimizes the weight clipping values for LLaMA weight-only quantization, it is easy to reuse their pretrained values. First, download the pretrained OmniQuant clips here and then specify the parameters in the TesseraQ configuration:

        weight_clip: True
        load_transform: False      # no scale transformation in OmniQuant-LLaMA
        clip_version: v2
        clip_path: ../cache/activations/L2_7b/omniq_w2

Note that in most cases we observe that AWQ initialization is better than OmniQuant, except for W2A16 per-channel quantization.

QuaRot

We recommend saving the QuaRot checkpoint and reloading it for TesseraQ quantization, since the QuaRot transformation only needs to be done once and can be reused for all bitwidth settings. To do so, simply enable save_transformed in the QuaRot configuration. Then load the saved checkpoint for TesseraQ and enable online_rotate as well as fp32_had in the configuration (see the sketch below).
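
A minimal sketch of this workflow, with the same caveats as the snippets above (flat keys shown without their parent sections, hypothetical checkpoint path):

        # QuaRot configuration: save the rotated model once
        save_transformed: True

        # TesseraQ configuration: point the model path at the saved rotated checkpoint
        # (hypothetical placeholder) and keep the online rotation enabled
        path: ../cache/models/L2_7b_quarot
        online_rotate: True
        fp32_had: True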

Citation

If you find our TesseraQ paper useful or relevant to your research, please kindly cite our paper:

@misc{li2024tesseraq,
      title={TesseraQ: Ultra Low-Bit LLM Post-Training Quantization with Block Reconstruction},
      author={Yuhang Li and Priyadarshini Panda},
      year={2024},
      eprint={2410.19103},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Also consider citing the LLMC framework paper:

@misc{gong2024llmcbenchmarkinglargelanguage,
      title={LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit},
      author={Ruihao Gong and Yang Yong and Shiqiao Gu and Yushi Huang and Chentao Lv and Yunchen Zhang and Xianglong Liu and Dacheng Tao},
      year={2024},
      eprint={2405.06001},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2405.06001},
}