CALDERA (Calibration Aware Low-Precision DEcomposition with Low-Rank Adaptation)

CALDERA is a post-training compression method that represents the weight matrices of LLMs via a low-rank, low-precision decomposition $\mathbf{W} \approx \mathbf{Q} + \mathbf{L} \mathbf{R}$, where $\mathbf{L}$ and $\mathbf{R}$ are low-rank factors and $\mathbf{Q}$, $\mathbf{L}$, and $\mathbf{R}$ are all quantized to low-precision formats. By formulating this decomposition as an optimization problem and solving it via alternating minimization, CALDERA outperforms existing compression techniques in the regime of less than 2.5 bits per parameter.
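Concretely, "calibration aware" means that the approximation error is weighted by calibration data: with $\mathbf{X}$ a matrix of calibration activations, the decomposition approximately minimizes $\|(\mathbf{W} - \mathbf{Q} - \mathbf{L}\mathbf{R})\,\mathbf{X}^\top\|_F^2$ (a paraphrase; see the CALDERA paper for the exact objective). As a rough illustration of the alternating updates, here is a simplified sketch: quantize_stub stands in for the QuIP# lattice quantizer, the plain SVD step ignores the calibration weighting, and $\mathbf{L}$ and $\mathbf{R}$ are left unquantized.

```python
import torch

def quantize_stub(x: torch.Tensor, bits: int = 2) -> torch.Tensor:
    # Placeholder uniform quantizer; the real method uses QuIP#-style
    # lattice quantizers and also quantizes L and R to low precision.
    levels = 2 ** (bits - 1)
    scale = x.abs().max() / (levels - 1)
    return (x / scale).round().clamp(-levels, levels - 1) * scale

def caldera_sketch(W: torch.Tensor, rank: int = 64, n_iters: int = 10):
    # Alternating minimization: fix (L, R) and quantize the residual to
    # update Q; then fix Q and refit the rank-`rank` factors to W - Q.
    L = W.new_zeros(W.shape[0], rank)
    R = W.new_zeros(rank, W.shape[1])
    for _ in range(n_iters):
        Q = quantize_stub(W - L @ R)
        U, S, Vh = torch.linalg.svd(W - Q, full_matrices=False)
        L = U[:, :rank] * S[:rank]  # scale columns by singular values
        R = Vh[:rank, :]
    return Q, L, R
```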

To enhance performance on specific tasks, CALDERA also supports LoRA (Hu et al., 2021) fine-tuning of a portion of the low-rank factors.

Setup Instructions

Note on Python, CUDA, and PyTorch Versions

These setup instructions have been tested on Python 3.10 and 3.11, CUDA 12.1 and 12.2, and PyTorch 2.2.

In particular, the package fast-hadamard-transform lacks wheels for newer versions of these dependencies; the available wheels can be found here (the wheel filenames are of the form fast_hadamard_transform-1.0.4.post1+cu<CUDA_VERSION>torch<PYTORCH_VERSION>cxx11abiTRUE-cp<PYTHON_VERSION>-cp<PYTHON_VERSION>-linux_x86_64.whl).
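For example, with CUDA 12.2, PyTorch 2.2, and Python 3.10, the pattern above resolves to the filename below (a hypothetical concrete instance; verify it against the published wheels before installing):

```bash
# Filename assembled from the pattern above; confirm against the release assets.
pip install fast_hadamard_transform-1.0.4.post1+cu122torch2.2cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
```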

Instructions

  1. Install caldera as a Python package (named caldera). From the home directory of this repository, run
pip install .

This will automatically install all dependencies, except fast-hadamard-transform, which has dependency issues.

  2. Set up the QuIP# (Tseng et al., 2024) submodule:
./setup_quip_sharp.sh

This script first sets up the QuIP# Python library, and then builds the quiptools CUDA library, which provides dequantization kernels for inference.

QuIP# is used for the quantization of the $\mathbf{Q}$ matrix, and also provides useful subroutines for Hessian computation.

  3. Install fast-hadamard-transform: pip install fast-hadamard-transform.

Note: If you get the error package 'wheel' is not installed, you can install it using pip install wheel.

Repo Structure

src/caldera

This folder contains the bulk of the code for CALDERA. Via step 1 above, everything in this folder is contained in the Python package caldera.

src/caldera/utils: utility functions used throughout CALDERA.

src/caldera/decomposition: code for the CALDERA decomposition algorithm, as well as its application to transformer layers.

scripts

This folder contains Python scripts for running zero-shot, perplexity, and finetuning experiments.

Parameters for all of these scripts are specified via command-line arguments.
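Assuming the scripts use a standard argument parser such as argparse (an assumption, not confirmed by this README), each one will print its supported options when invoked with --help, e.g.:

```bash
# Assumes argparse-style parsing; if so, this lists all supported flags.
python scripts/quantize_save_llama.py --help
```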

shell_scripts

These are Bash scripts for running experiments in the scripts folder with some reasonable parameters.

Each shell script has variables at the top specifying, e.g., directories in which to save script outputs. Make sure you set those variables as appropriate.
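For instance, the top of such a script might contain something like the following (variable names here are illustrative, not the exact ones used in this repo):

```bash
# Illustrative only: open each script in shell_scripts/ for its real variables.
HESSIAN_SAVE_DIR="/path/to/hessians"   # where Hessian matrices are stored
QUANTIZED_SAVE_DIR="/path/to/layers"   # where quantized layers are written
LOG_DIR="/path/to/logs"                # where experiment outputs are logged
```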

Note: All shell scripts are meant to be run from the root directory of this repo, i.e., ./shell_scripts/run_eval_ppl.sh instead of cd shell_scripts && ./run_eval_ppl.sh.

quip_sharp

This is the quip-sharp submodule, which is initialized in step 2 of the setup instructions.
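If this directory is empty (e.g., the repository was cloned without --recurse-submodules), the standard git command below will populate it; the setup script in step 2 may already handle this.

```bash
# Generic git submodule initialization; not specific to this repo.
git submodule update --init --recursive
```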

Example Experiment Workflow

Note: edit each script before running it to make sure that the parameters are what you want.

  1. Compute the Hessians using ./shell_scripts/run_save_hessians.sh, which will store Hessian matrices for each layer to files.

  2. Quantize the full model using shell_scripts/run_quantize_save_caldera.sh. This stores each quantized transformer layer to disk. The quantized model can later be loaded using the load_quantized_model function in scripts/quantize_save_llama.py (see the sketch after this list).

  3. Run zero-shot/perplexity experiments using shell_scripts/run_eval_zeroshot.sh or shell_scripts/run_eval_ppl.sh.

  4. Finetune using, e.g., shell_scripts/run_finetune_wikitext.sh.
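As referenced in step 2, here is a minimal sketch of reloading a quantized model for evaluation. The import path and argument are assumptions; check scripts/quantize_save_llama.py for the actual signature.

```python
# Sketch only: the argument below is hypothetical; see
# scripts/quantize_save_llama.py for the authoritative signature.
# Run from the repo root (or add it to PYTHONPATH) so `scripts` is importable.
from scripts.quantize_save_llama import load_quantized_model

model = load_quantized_model("/path/to/quantized/layers")
model.eval()  # ready for zero-shot or perplexity evaluation
```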