vqllm

Residual vector quantization for KV cache compression in large language models
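The name refers to residual vector quantization (RVQ): each quantization stage encodes the residual left over by the previous stage, so stacking stages refines the reconstruction. Below is a minimal NumPy sketch of the encoding step on a toy vector; it is independent of vqllm's actual implementation, and the codebook count, size, and dimension are arbitrary (in practice, each stage's codebook is trained on the residuals of the stage before it).

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical setup: 2 stages, 16 codewords per codebook, dimension 8.
codebooks = [rng.standard_normal((16, 8)) for _ in range(2)]

def rvq_encode(x, codebooks):
    """Greedy RVQ: return per-stage code indices and the reconstruction."""
    indices, recon = [], np.zeros_like(x)
    residual = x.copy()
    for cb in codebooks:
        # Pick the codeword nearest to the current residual.
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)
        recon += cb[idx]
        residual = x - recon  # the next stage quantizes what is left
    return indices, recon

x = rng.standard_normal(8)
idx, recon = rvq_encode(x, codebooks)
```

Storing only the per-stage indices (here 2 × 4 bits) instead of the full-precision vector is what yields the KV cache compression.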

Setup

mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init bash

conda create -n vqllm python=3.11
conda activate vqllm

git clone https://github.com/iankur/vqllm.git
cd vqllm
pip install -e .

For a development environment:

pip install -e .[dev]
pre-commit install

Log in to Hugging Face and Weights & Biases to download models and save results:

huggingface-cli login
wandb login

Experiments

All experiments can be launched with the commands below. Note that the VQ size and type ablations use the Llama 3 model, whereas the model ablation uses all of the models downloaded below. See more details about the models here.

tune download meta-llama/Meta-Llama-3-8B --output-dir recipes/ckpts/llama3_8b
tune download mistralai/Mistral-7B-v0.1 --output-dir recipes/ckpts/mistral_7b
tune download google/gemma-7b --output-dir recipes/ckpts/gemma_7b --ignore-patterns "gemma-7b.gguf"

bash recipes/run_vq_size_ablation.sh
bash recipes/run_vq_type_ablation.sh
bash recipes/run_vq_model_ablation.sh

Notes

Acknowledgements