AQLM

Official PyTorch implementation for Extreme Compression of Large Language Models via Additive Quantization

[2024.05] AQLM was accepted to ICML'2024! If you're attending, meet us around this poster.

[2024.06] We released a new paper that extends AQLM with a new finetuning algorithm called PV-Tuning. We're also releasing PV-Tuned AQLM models in this collection.

[2024.08] We have merged the PV-Tuning branch into the main branch. To reproduce results with old finetuning (before Aug 21), use commit 559a366.

Inference

Demo

Learn how to run the prequantized models using these Google Colab examples:

| Basic AQLM <br> generation | Streaming with <br> GPU/CPU | Inference with CUDA <br> graphs (3x speedup) | Fine-tuning <br> with PEFT | Serving with <br> vLLM |
|---|---|---|---|---|
| <a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/colab_example.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="AQLM In Colab"/></a> | <a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/streaming_example.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="AQLM In Colab"/></a> | <a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/aqlm_cuda_graph.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> | <a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/aqlm_2bit_training.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> | <a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/aqlm_vllm.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> |

Models

This repository is currently designed to work with models of LLaMA, Mistral and Mixtral families. The models reported below use full model fine-tuning as described in appendix A, with cross-entropy objective with teacher logits.

We provide a number of prequantized AQLM models without PV-Tuning (scroll down for PV-Tuned models):

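| Model | AQLM scheme | WikiText-2 PPL | MMLU (5-shot) FP16→AQLM | Model size, GB | Hub link |
|---|---|---|---|---|---|
| Llama-3-8b | 1x16 | - | 0.65→0.56 | 4.1 | Link |
| Llama-3-8b-Instruct | 1x16 | - | 0.66→0.59 | 4.1 | Link |
| Llama-3-70b | 1x16 | - | 0.79→0.75 | 21.9 | Link |
| Llama-3-70b-Instruct | 1x16 | - | 0.80→0.76 | 21.9 | Link |
| Command-R | 1x16 | - | 0.68→0.57 | 12.7 | Link |
| Command-R+ | 1x16 | - | 0.74→0.68 | 31.9 | Link |
| Mistral-7b | 1x16 | 5.40 | - | 2.5 | Link |
| Mistral-7B-Instruct-v0.2 | 2x8 | - | 0.59→0.44 | 2.5 | Link |
| Mixtral-8x7b | 1x16 | 3.35 | - | 12.6 | Link |
| Mixtral-8x7b-Instruct | 1x16 | - | - | 12.6 | Link |
| Llama-2-7b | 1x16 | 5.92 | 0.46→0.39 | 2.4 | Link |
| Llama-2-7b | 2x8 | 6.69 | - | 2.2 | Link |
| Llama-2-7b | 8x8 | 6.61 | - | 2.2 | Link |
| Llama-2-13b | 1x16 | 5.22 | 0.55→0.49 | 4.1 | Link |
| Llama-2-13b | 2x8 | 5.63 | - | 3.8 | Link |
| Llama-2-70b | 1x16 | 3.83 | 0.69→0.65 | 18.8 | Link |
| Llama-2-70b | 2x8 | 4.21 | - | 18.2 | Link |
| gemma-2b | 1x16 | - | - | 1.7 | Link |
| gemma-2b | 2x8 | - | - | 1.6 | Link |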

You can also download AQLM models tuned via PV-tuning:

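| Model | AQLM scheme | WikiText-2 PPL | Model size, GB | Hub link |
|---|---|---|---|---|
| Llama-2-7b | 1x16g8 | 5.68 | 2.4 | Link |
| Llama-2-7b | 2x8g8 | 5.90 | 2.2 | Link |
| Llama-2-7b | 1x16g16 | 9.21 | 1.7 | Link |
| Llama-2-13b | 1x16g8 | 5.05 | 4.1 | Link |
| Llama-2-70b | 1x16g8 | 3.78 | 18.8 | Link |
| Meta-Llama-3-8B | 1x16g8 | 6.99 | 4.1 | Link |
| Meta-Llama-3-8B | 1x16g16 | 9.43 | 3.9 | Link |
| Meta-Llama-3-70B | 1x16g8 | 4.57 | 21.9 | Link |
| Meta-Llama-3-70B | 1x16g16 | 8.67 | 13 | Link |
| Mistral-7B-v0.1 | 1x16g8 | 5.22 | 2.51 | Link |
| Phi-3-mini-4k-instruct | 1x16g8 | 6.63 | 1.4 | Link |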

Note that models with "g16" in their scheme require aqlm inference library v1.1.6 or newer:

pip install "aqlm[gpu,cpu]>=1.1.6"

The perplexities above are evaluated at 4k context length for Llama 2 models and 8k for Mistral/Mixtral and Llama 3. Note that token-level perplexity can only be compared within the same model family and should not be compared between models that use different vocabularies. Mistral has a lower perplexity than Llama 3 8B, but this does not mean that Mistral is better: Llama 3's perplexity is computed over a much larger vocabulary, which makes its per-token perplexity higher.
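To see why token-level perplexity depends on the tokenizer, recall that it is the exponential of the average negative log-likelihood per token. The sketch below uses made-up numbers (the total negative log-likelihood and both token counts are hypothetical, not measurements from any model listed above): two models assign the same total likelihood to the same text, yet the one whose tokenizer produces fewer tokens reports a higher per-token perplexity.

```python
import math


def token_perplexity(total_nll_nats: float, num_tokens: int) -> float:
    """Per-token perplexity: exp of the mean negative log-likelihood per token."""
    return math.exp(total_nll_nats / num_tokens)


# Hypothetical numbers for illustration only: both models assign the same
# total negative log-likelihood (in nats) to the same piece of text.
total_nll = 1600.0
tokens_small_vocab = 1000  # smaller vocabulary: the text splits into more tokens
tokens_large_vocab = 800   # larger vocabulary: fewer, longer tokens

ppl_small = token_perplexity(total_nll, tokens_small_vocab)
ppl_large = token_perplexity(total_nll, tokens_large_vocab)
print(f"small vocabulary: {ppl_small:.2f}, large vocabulary: {ppl_large:.2f}")
```

Both hypothetical models are equally good at predicting the text as a whole; only the normalization by token count differs.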

For more evaluation results and detailed explanations, please see our papers: Egiazarian et al. (2024) for pure AQLM and Malinovskii et al. (2024) for PV-Tuned models.

Inference kernels

AQLM quantization setups vary mainly in the number of codebooks used and in the codebook size in bits. The most popular setups, and the inference kernels that support them, are:

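| Kernel | Number of codebooks | Codebook size, bits | Scheme Notation | Accuracy | Speedup | Fast GPU inference | Fast CPU inference |
|---|---|---|---|---|---|---|---|
| Triton | K | N | KxN | - | Up to ~0.7x | | |
| CUDA | 1 | 16 | 1x16 | Best | Up to ~1.3x | | |
| CUDA | 2 | 8 | 2x8 | OK | Up to ~3.0x | | |
| Numba | K | 8 | Kx8 | Good | Up to ~4.0x | | |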
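The scheme notation can be turned into an approximate bit-width per weight. The sketch below rests on assumptions taken from the AQLM paper rather than from this README: a KxN scheme with input group size g stores K codes of N bits for every group of g weights, and the "g8"/"g16" suffix in the PV-Tuning table denotes that group size. Storage for codebooks, scales and non-quantized layers is ignored, and parse_scheme and bits_per_weight are hypothetical helper names, not part of the aqlm library.

```python
import re


def bits_per_weight(num_codebooks: int, nbits_per_codebook: int, in_group_size: int) -> float:
    """Code storage per weight, ignoring codebook and scale overhead (assumption)."""
    return num_codebooks * nbits_per_codebook / in_group_size


def parse_scheme(scheme: str) -> tuple:
    """Parse a suffixed scheme such as '1x16g8' into (K, N, g)."""
    match = re.fullmatch(r"(\d+)x(\d+)g(\d+)", scheme)
    if match is None:
        raise ValueError(f"expected a scheme like '1x16g8', got {scheme!r}")
    return tuple(int(group) for group in match.groups())


for scheme in ["1x16g8", "2x8g8", "1x16g16"]:
    print(scheme, bits_per_weight(*parse_scheme(scheme)))
```

Schemes written without a suffix (for example 8x8) do not carry their group size, so it has to be supplied separately before this estimate applies.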

Installation

To run the models, install the inference library, specifying gpu, cpu, or both as extras depending on your inference setting:

pip install "aqlm[gpu,cpu]"

Then, one can use the familiar .from_pretrained method provided by the transformers library:

from transformers import AutoModelForCausalLM

quantized_model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf",
    trust_remote_code=True, torch_dtype="auto"
).cuda()

Notice that torch_dtype should be set to either torch.float16 or "auto" on GPU and torch.float32 on CPU. After that, the model can be used exactly the same way as an unquantized model.
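As a minimal sketch of what using the model "the same way" can look like, the helper below picks the dtype according to the rule above and runs a short generation. The function names dtype_name_for and generate_text, the prompt, and the max_new_tokens default are illustrative choices, not part of this repository; the transformers calls are the standard from_pretrained, generate and decode.

```python
def dtype_name_for(device: str) -> str:
    """float16 on GPU, float32 on CPU, following the note above."""
    return "float16" if device.startswith("cuda") else "float32"


def generate_text(model_id: str, prompt: str, device: str = "cuda", max_new_tokens: int = 32) -> str:
    # Imported lazily so dtype_name_for stays usable without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,
        torch_dtype=getattr(torch, dtype_name_for(device)),
    ).to(device)

    # Tokenize the prompt and move the tensors to the same device as the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For example, generate_text("ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf", "The capital of France is") on a CUDA machine downloads the checkpoint and returns the prompt followed by the generated continuation.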

Quantization

Dependencies

Install packages from requirements.txt:

pip install -r requirements.txt

Loading / caching datasets and tokenizer

The script requires downloading and locally caching the relevant tokenizer and datasets. They will be saved in the default Hugging Face Datasets directory unless an alternative location is provided via environment variables. See the relevant Datasets documentation section.

Data

When quantizing models with AQLM, we recommend that you use a subset of the original data the model was trained on.

For Llama-2 models, the closest available dataset is RedPajama. To load a subset of RedPajama, pass "pajama" as the --dataset argument. This will process nsamples samples and tokenize them using the provided model tokenizer.

Additionally, we provide RedPajama tokenized for Llama and Solar/Mistral models at a context length of 4096, stored on Hugging Face. To load it, use:

from huggingface_hub import hf_hub_download

hf_hub_download(repo_id="Vahe1994/AQLM", filename="data/name.pth", repo_type="dataset")

To use the data downloaded from Hugging Face, place it in the data folder (optional) and set the correct path to it in the "--dataset" argument of main.py.

Warning: These subsets are already processed with the corresponding model tokenizer. If you want to quantize another model (e.g. mistral/mixtral), please re-tokenize the data with the provided script in src/datautils.

WandB logging

One can optionally log the data to the Weights and Biases service (wandb). Run pip install wandb for W&B logging. Specify the $WANDB_ENTITY, $WANDB_PROJECT and $WANDB_NAME environment variables prior to running experiments. Use the --wandb argument to enable logging.

GPU and RAM requirements

This code was developed and tested using several 80GB A100 GPUs. You can use the --offload_activations option to reduce VRAM usage. For Language Model Evaluation Harness evaluation, one needs enough memory to load the whole model plus activation tensors on one or several devices.

Quantization time

AQLM quantization takes considerably longer to calibrate than simpler quantization methods such as GPTQ. This only impacts quantization time, not inference time.

For instance, quantizing a 7B model with the default configuration takes about 1 day on a single A100 GPU. Similarly, quantizing a 70B model on a single GPU would take 10-14 days. If you have multiple GPUs with fast interconnect, you can run AQLM on multiple GPUs to speed up quantization: simply set CUDA_VISIBLE_DEVICES to several GPUs. Quantizing a 7B model on two GPUs reduces quantization time to ~14.5 hours. Similarly, quantizing a 70B model on 8 x A100 GPUs takes 3 days 18 hours.
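The timings quoted above imply that multi-GPU scaling is sublinear. The short calculation below derives speedup and parallel efficiency from those figures only; "about 1 day" is taken as 24 hours, so the results are rough estimates, and speedup_and_efficiency is a hypothetical helper name.

```python
def speedup_and_efficiency(single_gpu_hours: float, multi_gpu_hours: float, num_gpus: int):
    """Return (speedup over one GPU, fraction of ideal linear scaling)."""
    speedup = single_gpu_hours / multi_gpu_hours
    return speedup, speedup / num_gpus


# 7B: ~1 day on one GPU vs ~14.5 hours on two GPUs.
s7, e7 = speedup_and_efficiency(24.0, 14.5, 2)

# 70B: 10-14 days on one GPU vs 3 days 18 hours (90 hours) on eight GPUs.
hours_70b_8gpu = 3 * 24 + 18
s70_low, e70_low = speedup_and_efficiency(10 * 24, hours_70b_8gpu, 8)
s70_high, e70_high = speedup_and_efficiency(14 * 24, hours_70b_8gpu, 8)

print(f"7B on 2 GPUs:  {s7:.2f}x speedup, {e7:.0%} efficiency")
print(f"70B on 8 GPUs: {s70_low:.2f}x-{s70_high:.2f}x speedup, "
      f"{e70_low:.0%}-{e70_high:.0%} efficiency")
```

Under these assumptions, two GPUs give roughly a 1.7x speedup for the 7B model, while eight GPUs give somewhere between about 2.7x and 3.7x for the 70B model.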

If you need to speed up quantization without adding more GPUs, you may also increase --relative_mse_tolerance or set --init_max_points_per_centroid or limit --finetune_max_epochs. However, that usually comes at a cost of reduced model accuracy.

Model downloading

The code requires the LLaMA model to be downloaded in Hugging Face format and saved locally. The scripts below assume that the $TRANSFORMERS_CACHE variable points to the Hugging Face Transformers cache folder. To download and cache the models, run this in the same environment:

from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "meta-llama/Llama-2-7b-hf"  # or whatever else you wish to download
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")

How to quantize a model with AQLM

This script compresses the model and then tests its performance in terms of perplexity using WikiText2, C4, and Penn Treebank datasets.

The command to launch the script should look like this:

export CUDA_VISIBLE_DEVICES=0   # or e.g. 0,1,2,3
export MODEL_PATH=<PATH_TO_MODEL_ON_HUB>
export DATASET_PATH=<INSERT DATASET NAME OR PATH TO CUSTOM DATA>
export SAVE_PATH=/path/to/save/quantized/model/
export WANDB_PROJECT=MY_AQ_EXPS
export WANDB_NAME=COOL_EXP_NAME

python main.py $MODEL_PATH $DATASET_PATH \
 --nsamples=1024 \
 --val_size=128 \
 --num_codebooks=1 \
 --nbits_per_codebook=16 \
 --in_group_size=8 \
 --relative_mse_tolerance=0.01 \
 --finetune_batch_size=32 \
 --finetune_max_epochs=10 \
 --finetune_early_stop=3 \
 --finetune_keep_best \
 --local_batch_size=1 \
 --offload_activations \
 --wandb \
 --resume \
 --save $SAVE_PATH

Main CLI arguments:

There are additional hyperparameters available. Run python main.py --help for more details on command line arguments, including compression parameters.

Finetuning

Note: to reproduce results with the old finetuning (before Aug 21), use commit 559a366. The old version of finetuning produced worse results than the new one even without PV-Tuning, but was faster.

The accuracy of the quantized model can be further improved via finetuning.

To use our new PV-Tuning algorithm, the command to launch the script should look like this:

torchrun --nproc-per-node=$NUM_GPUS finetune.py \
    --base_model $MODEL_PATH \
    --quantized_model $QUANTIZED_WEIGHTS_PATH \
    --model_seqlen=$SEQLEN \
    --block_type LlamaDecoderLayer \
    --load_dtype bfloat16 \
    --amp_dtype bfloat16 \
    --code_dtype uint16 \
    --dataset_name=pajama \
    --split none \
    --seed 42 \
    --preprocessing_chunk_length 100000 \
    --cache_dir=$CACHE_DIR \
    --trust_remote_code \
    --update_codes \
    --update_codebooks_and_scales \
    --update_non_quantized_parameters \
    --lamb \
    --debias \
    --lr 3e-4 \
    --adam_beta1 0.90 \
    --adam_beta2 0.95 \
    --max_code_change_per_step 1e-2 \
    --code_lr 1e-2 \
    --code_beta1 0.0 \
    --code_beta2 0.95 \
    --beam_size 5 \
    --delta_decay 0 \
    --batch_size=128 \
    --microbatch_size=1 \
    --max_epochs 1 \
    --gradient_checkpointing \
    --print_every_steps=1 \
    --verbose_optimizer \
    --wandb \
    --eval_every_steps=10 \
    --keep_best_model \
    --save $SAVE_PATH \
    --save_every_steps 100 \
    --attn_implementation flash_attention_2
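With --batch_size=128 and --microbatch_size=1, most of each optimizer step is spent accumulating gradients. The sketch below assumes, without confirmation from this README, that --batch_size counts sequences per optimizer step across all GPUs and --microbatch_size counts sequences per GPU per forward pass; check python finetune.py --help before relying on it. grad_accumulation_steps is a hypothetical helper, not a function from this repository.

```python
def grad_accumulation_steps(batch_size: int, microbatch_size: int, num_gpus: int) -> int:
    """Forward/backward passes per GPU for one optimizer step (assumed semantics)."""
    sequences_per_pass = microbatch_size * num_gpus
    if batch_size % sequences_per_pass != 0:
        raise ValueError("batch_size must be divisible by microbatch_size * num_gpus")
    return batch_size // sequences_per_pass


# Values from the command above, for a few possible $NUM_GPUS settings.
for num_gpus in (1, 4, 8):
    steps = grad_accumulation_steps(batch_size=128, microbatch_size=1, num_gpus=num_gpus)
    print(f"{num_gpus} GPU(s): {steps} accumulation steps per optimizer step")
```

If memory allows, raising --microbatch_size reduces the number of accumulation steps without changing the effective batch size under the same assumption.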

Zero-shot benchmarks via LM Evaluation Harness

To perform zero-shot evaluation, we adopt the Language Model Evaluation Harness framework. Our code works with models in the standard transformers format and may (optionally) load the weights of a quantized model via the --aqlm_checkpoint_path argument.

The evaluation results in PV-Tuning were produced with lm-eval==0.4.0.

To run the evaluation, make sure that this version is installed, or install it via: pip install lm-eval==0.4.0.

The main script for launching the evaluation procedure is lmeval.py.

export CUDA_VISIBLE_DEVICES=0,1,2,3  # optional: select GPUs
export QUANTIZED_MODEL=<PATH_TO_SAVED_QUANTIZED_MODEL_FROM_MAIN.py>
export MODEL_PATH=<INSERT_PATH_TO_ORIGINAL_MODEL_ON_HUB>
export DATASET=<INSERT DATASET NAME OR PATH TO CUSTOM DATA>
export WANDB_PROJECT=MY_AQLM_EVAL
export WANDB_NAME=COOL_EVAL_NAME

# for 0-shot evals
python lmeval.py \
    --model hf \
    --model_args pretrained=$MODEL_PATH,dtype=float16,parallelize=True \
    --tasks winogrande,piqa,hellaswag,arc_easy,arc_challenge \
    --batch_size <EVAL_BATCH_SIZE> \
    --aqlm_checkpoint_path $QUANTIZED_MODEL # if evaluating quantized model

# for 5-shot MMLU
python lmeval.py \
    --model hf \
    --model_args pretrained=$MODEL_PATH,dtype=float16,parallelize=True \
    --tasks mmlu \
    --batch_size <EVAL_BATCH_SIZE> \
    --num_fewshot 5 \
    --aqlm_checkpoint_path $QUANTIZED_MODEL # if evaluating quantized model

Preparing models for inference

To convert a model into a Hugging Face compatible format, use convert_to_hf.py model in_path out_path with the corresponding arguments.

You may also specify flags such as --save_safetensors to control the saved model format (see --help for details).

Example command: python convert_to_hf.py meta-llama/Llama-2-7b-hf ./path/to/saved/quantization ./converted-llama2-7b-hf --save_safetensors

Instructions for QuIP# finetuning

Instructions for QuIP# finetuning can be found here.

Contributing

If you want to contribute something substantial (more than a typo), please open an issue first. We use black and isort for all pull requests. Before committing your code, run black . && isort .

Cite

If you found this work useful, please consider citing:

@misc{egiazarian2024extreme,
      title={Extreme Compression of Large Language Models via Additive Quantization}, 
      author={Vage Egiazarian and Andrei Panferov and Denis Kuznedelev and Elias Frantar and Artem Babenko and Dan Alistarh},
      year={2024},
      eprint={2401.06118},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
@misc{malinovskii2024pvtuning,
      title={PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression}, 
      author={Vladimir Malinovskii and Denis Mazur and Ivan Ilin and Denis Kuznedelev and Konstantin Burlachenko and Kai Yi and Dan Alistarh and Peter Richtarik},
      year={2024},
      eprint={2405.14852},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}