QAQ: Quality Adaptive Quantization for LLM KV Cache

Introduction

This is the official repository of QAQ: Quality Adaptive Quantization for LLM KV Cache.


Brief abstract

As the need for longer context grows, a significant bottleneck in model deployment emerges due to the linear expansion of the Key-Value (KV) cache with the context length. Based on three key insights, we propose QAQ, a $\underline{\text{Q}}\text{uality}$ $\underline{\text{A}}\text{daptive}$ $\underline{\text{Q}}\text{uantization}$ scheme for the LLM KV cache. QAQ achieves nearly $10\times$ compression of the KV cache with negligible accuracy loss.

For more details, please refer to our paper.

Environment setup

Step 1: Install dependencies

# Install from requirements.txt
pip install -r requirements.txt
# Alternatively, you can directly install all the dependent libraries
pip install numpy scipy torch transformers datasets accelerate matplotlib tqdm

Step 2: Change GPU configurations

To support multi-GPU parallel evaluation, you need to modify device_configs in src/config.py according to your GPU configuration. Each entry in this list is used for a parallel evaluator. The first element of each entry is the main GPU device, where the quantization process takes place; the second element is the maximum memory allowed for each device, which is passed to accelerate.infer_auto_device_map.
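
As a concrete illustration, a device_configs for a 4-GPU machine running two parallel evaluators might look roughly like the sketch below. This is only a hedged example based on the description above; the exact entry format and memory budgets should be taken from src/config.py and adjusted to your hardware. The memory strings are passed through as the max_memory mapping of accelerate.infer_auto_device_map.

# Hypothetical example of device_configs in src/config.py (two evaluators, four GPUs).
device_configs = [
    # (main GPU where quantization runs, max_memory mapping for infer_auto_device_map)
    ("cuda:0", {0: "20GiB", 1: "20GiB", "cpu": "64GiB"}),
    ("cuda:2", {2: "20GiB", 3: "20GiB", "cpu": "64GiB"}),
]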

Step 3: Obtain access to LLAMA-2 weights

Access to the LLAMA-2 weights on Hugging Face must be granted by Meta. Please visit the Meta website and follow the instructions there. After that, you can customize the cache folder for model weights by modifying hf_cache_dir in src/config.py.
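
The snippet below is a minimal sketch (not the repository's actual loading code) of how a custom hf_cache_dir is typically consumed: transformers downloads the gated LLAMA-2 weights into the directory passed as cache_dir once access has been granted. The example path is an assumption; set the real value in src/config.py.

# Hypothetical sketch: route Hugging Face downloads to a custom cache folder.
from transformers import AutoModelForCausalLM

hf_cache_dir = "/data/hf_cache"  # example path; configure via src/config.py

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # gated repository; requires approved access from Meta
    cache_dir=hf_cache_dir,      # where the weights are downloaded and cached
)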

How to use

There are three important classes:

To run a new experiment, derive the Experiment class, override the quantizer_list and process_result functions, and call the run function of your derived class in the entry point (see the sketch below). The src/experiments folder contains the sample experiments used in our paper.
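
The following is a hedged sketch of such a derived class; the real Experiment base class, its import path, and the exact method signatures live in this repository's source, so treat every name below as an assumption rather than the definitive interface.

# Hypothetical sketch of a custom experiment built on the Experiment class.
from experiment import Experiment  # hypothetical import path; see the src folder

class MyExperiment(Experiment):
    def quantizer_list(self):
        # Return the quantizer configurations this experiment should evaluate.
        return []

    def process_result(self, results):
        # Post-process the raw evaluation results, e.g. aggregate or plot them.
        print(results)

if __name__ == "__main__":
    # Entry point: run the derived experiment.
    MyExperiment().run()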

Citation

If you use this codebase, or if QAQ inspires your work, please cite:

@misc{dong2024qaq,
      title={QAQ: Quality Adaptive Quantization for LLM KV Cache}, 
      author={Shichen Dong and Wen Cheng and Jiayu Qin and Wei Wang},
      year={2024},
      eprint={2403.04643},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}