IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact

This repository contains the PyTorch implementation of IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact.

IntactKV is a simple and orthogonal method to enhance quantized LLMs. It can be readily combined with various existing quantization approaches (e.g., AWQ, OmniQuant, GPTQ, QuaRot) and adds no inference overhead on various LLMs (LLaMA, Vicuna, OPT, Mistral, etc.). IntactKV builds on the observation that pivot tokens do exist in current LLMs, with massive activation values and highly concentrated attention scores, and that they are critical to the performance of quantized LLMs. A concurrent work, Massive Activations, also discovers such tokens and provides more detailed studies on this phenomenon.

<img src="figs/overview.png"/>

Preparations

Installation

conda create -n intactkv python=3.10 -y
conda activate intactkv
pip install -r requirements.txt

Data Preparation

Download the datasets into ./datasets.

Calibration set or PPL evaluation

| Dataset | Local Dir | URL |
| --- | --- | --- |
| WikiText2 | ./datasets/wikitext | https://huggingface.co/datasets/wikitext |
| PTB | ./datasets/ptb_text_only | https://huggingface.co/datasets/ptb_text_only |
| C4 | ./datasets/allenai/c4 | https://huggingface.co/datasets/allenai/c4 |
| Pile | ./datasets/pile-val-backup | https://huggingface.co/datasets/mit-han-lab/pile-val-backup |
| ShareGPT | ./datasets/ShareGPT_Vicuna_unfiltered | https://huggingface.co/datasets/Aeala/ShareGPT_Vicuna_unfiltered |
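
If you prefer to script the downloads, here is a minimal sketch with `huggingface_hub` (the directories mirror the table above; this is just one way to fetch the data):

```python
from huggingface_hub import snapshot_download

# Fetch each Hugging Face dataset repo into its expected local directory.
# NOTE: allenai/c4 is very large; pass allow_patterns=... to restrict files.
for repo_id, local_dir in [
    ("wikitext", "./datasets/wikitext"),
    ("ptb_text_only", "./datasets/ptb_text_only"),
    ("allenai/c4", "./datasets/allenai/c4"),
    ("mit-han-lab/pile-val-backup", "./datasets/pile-val-backup"),
    ("Aeala/ShareGPT_Vicuna_unfiltered", "./datasets/ShareGPT_Vicuna_unfiltered"),
]:
    snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=local_dir)
```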

MMLU evaluation

| Dataset | Local Dir | URL |
| --- | --- | --- |
| MMLU | ./datasets/mmlu/data | https://people.eecs.berkeley.edu/~hendrycks/data.tar |

Commonsense QA evaluation

| Dataset | Local Dir | URL |
| --- | --- | --- |
| OBQA | ./datasets/openbookqa | https://huggingface.co/datasets/openbookqa |
| WinoGrande | ./datasets/winogrande | https://huggingface.co/datasets/winogrande |
| ARC-E and ARC-C | ./datasets/ai2_arc | https://huggingface.co/datasets/ai2_arc |
| BoolQ | ./datasets/super_glue | https://huggingface.co/datasets/super_glue |
| HellaSwag | ./datasets/hellaswag | https://huggingface.co/datasets/hellaswag |
| LAMBADA | ./datasets/lambada_openai | https://huggingface.co/datasets/EleutherAI/lambada_openai |

Model Preparation

Download the models into ./modelzoo.

| Model | Local Dir | URL |
| --- | --- | --- |
| LLaMA-2-7B | ./modelzoo/llama-2/llama-2-7b | https://huggingface.co/meta-llama/Llama-2-7b |
| LLaMA-2-13B | ./modelzoo/llama-2/llama-2-13b | https://huggingface.co/meta-llama/Llama-2-13b |
| LLaMA-2-70B | ./modelzoo/llama-2/llama-2-70b | https://huggingface.co/meta-llama/Llama-2-70b |
| LLaMA-3-8B | ./modelzoo/llama-3/llama-3-8b | https://huggingface.co/meta-llama/Meta-Llama-3-8B |
| LLaMA-3-70B | ./modelzoo/llama-3/llama-3-70b | https://huggingface.co/meta-llama/Meta-Llama-3-70B |
| Vicuna-v1.3-7B | ./modelzoo/vicuna-v1.3/vicuna-v1.3-7b | https://huggingface.co/lmsys/vicuna-7b-v1.3 |
| Vicuna-v1.3-13B | ./modelzoo/vicuna-v1.3/vicuna-v1.3-13b | https://huggingface.co/lmsys/vicuna-13b-v1.3 |
| Vicuna-v1.3-33B | ./modelzoo/vicuna-v1.3/vicuna-v1.3-33b | https://huggingface.co/lmsys/vicuna-33b-v1.3 |
| Vicuna-v1.5-7B | ./modelzoo/vicuna-v1.5/vicuna-v1.5-7b | https://huggingface.co/lmsys/vicuna-7b-v1.5 |
| Vicuna-v1.5-13B | ./modelzoo/vicuna-v1.5/vicuna-v1.5-13b | https://huggingface.co/lmsys/vicuna-13b-v1.5 |
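
The models can be fetched the same way. Note that the meta-llama repositories are gated, so accept the license on the Hub and authenticate first (e.g., with `huggingface-cli login`):

```python
from huggingface_hub import snapshot_download

# Example: fetch one model into its expected local directory; repeat per row.
snapshot_download(repo_id="lmsys/vicuna-7b-v1.5",
                  local_dir="./modelzoo/vicuna-v1.5/vicuna-v1.5-7b")
```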

Weight-only Quantization

Model Quantization

GPTQ: Quantize the model with AutoGPTQ. The quantized model will be saved in ./modelzoo/autogptq.

# w3g128 quantization of Vicuna-v1.5-7B on GPU 0
bash ./scripts/quantization/auto_gptq.sh vicuna-v1.5 vicuna-v1.5-7b 3 128 0

AWQ: Download pre-computed AWQ parameters from the AWQ model zoo, or reproduce them with the following script. The search results will be saved in ./modelzoo/llm-awq.

# w3g128 quantization of Vicuna-v1.5-7B on GPU 0
bash ./scripts/quantization/llm_awq.sh vicuna-v1.5 7b 3 128 0

IntactKV_[B]

Evaluation

IntactKV_[B] can be directly integrated with various quantization methods (e.g., AWQ, GPTQ, RTN) without training or inference overhead, and can be evaluated on PPL, MMLU, and QA tasks, where the [BOS] token is prepended to the inputs.

PPL

# evaluate w3g128 AWQ-quantized LLaMA-2-7B model on GPU0, port 29500
bash ./scripts/eval/eval.sh llama-2 7b awq 3 16 ppl 29500 0

MMLU

# evaluate w3g128 AWQ-quantized Vicuna-v1.5-7B model on GPU0, port 29500
bash ./scripts/eval/eval.sh vicuna-v1.5 7b awq 3 16 mmlu 29500 0

Commonsense QA

# evaluate w3g128 AWQ-quantized Vicuna-v1.5-7B model on GPU0, port 29500
bash ./scripts/eval/eval.sh vicuna-v1.5 7b awq 3 16 qa 29500 0

IntactKV_[P]

IntactKV as Trainable Parameters

IntactKV can optionally be calibrated on a calibration set of size 128 to further compensate for the quantization error.

# calibrate IntactKV of w3g128 AWQ-quantized Vicuna-v1.5-7B model on GPU0
bash ./scripts/train/train.sh vicuna-v1.5 7b awq 3 0
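
Conceptually, the calibration treats the cached keys and values as trainable parameters and tunes them to close the gap between the quantized model and the FP16 teacher. A minimal sketch under stated assumptions: `fp_model`, `quant_model`, `calib_loader`, and a legacy tuple-format `intact_kv` are hypothetical stand-ins for the surrounding pipeline, and the loss and hyperparameters are illustrative, not the repo's exact recipe.

```python
import torch
import torch.nn.functional as F

# Sketch only: assumes `intact_kv` is a legacy tuple of per-layer (key, value)
# tensors, and that `fp_model`, `quant_model`, `calib_loader` already exist.
kv_params = torch.nn.ParameterList(
    torch.nn.Parameter(t.clone()) for layer in intact_kv for t in layer
)
optimizer = torch.optim.AdamW(kv_params, lr=1e-4)

for input_ids in calib_loader:
    # Teacher: FP16 model conditioned on its own lossless pivot-token cache.
    with torch.no_grad():
        target = fp_model(input_ids, past_key_values=intact_kv,
                          output_hidden_states=True).hidden_states[-1]
    # Student: quantized model conditioned on the trainable IntactKV.
    trainable_kv = tuple(
        (kv_params[2 * i], kv_params[2 * i + 1])
        for i in range(len(kv_params) // 2)
    )
    out = quant_model(input_ids, past_key_values=trainable_kv,
                      output_hidden_states=True).hidden_states[-1]
    loss = F.mse_loss(out, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```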

Evaluation

MT-bench

1. Generate answers to MT-bench with the following script.

# generate answers to MT-bench for w3g128 AWQ-quantized Vicuna-v1.5-7B model on GPU0
bash scripts/eval/gen_mtbench_answer.sh vicuna-v1.5 7b awq 3 0

2. Score the answers with GPT-4 using LLM Judge. Reference answers of gpt-4-0125-preview can be found in ./fastchat/data/mt_bench/reference_answer.

Weight and Activation Quantization

We integrate IntactKV with QuaRot, a SOTA INT4 weight-and-activation quantization method that uses Hadamard transformations to alleviate outlier issues. Run the following script to obtain PPL evaluation results.

# LLaMA-2-7B model on GPU0
bash ./scripts/eval/quarot.sh llama-2 7b 0
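
For intuition about why the rotation helps, here is a toy, self-contained demonstration (not QuaRot's actual code): an orthonormal Hadamard matrix H satisfies H @ H.T = I, so a linear layer is unchanged by rotating weights and activations together, while the rotated activations spread an outlier channel across all channels.

```python
import torch

# Build an 8x8 Hadamard matrix via Sylvester's recursion and orthonormalize it.
H1 = torch.tensor([[1.0, 1.0], [1.0, -1.0]])
H = H1
for _ in range(2):  # 2 -> 4 -> 8
    H = torch.kron(H, H1)
H = H / H.shape[0] ** 0.5  # now H @ H.T == I

W = torch.randn(4, 8)             # toy weight matrix
x = torch.randn(8); x[0] = 50.0   # activation vector with one outlier channel

y_ref = W @ x                # original computation
y_rot = (W @ H) @ (H.T @ x)  # rotated weights and activations, same output
assert torch.allclose(y_ref, y_rot, atol=1e-3)

# The rotated activation has a much smaller dynamic range, which is what makes
# low-bit activation quantization feasible.
print(x.abs().max(), (H.T @ x).abs().max())
```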

KV Cache Quantization

We implement a simple asymmetric per-head dynamic quantization strategy for the KV cache (a conceptual sketch follows). Run the following scripts to obtain PPL/MMLU/QA evaluation results.
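
As a reference for what "asymmetric per-head dynamic" means here, a minimal fake-quantization sketch (illustrative; the tensor layout and epsilon are assumptions, not the repo's exact kernel): scales and zero points are recomputed on the fly from each head's min/max.

```python
import torch

def quantize_kv_per_head(kv: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Asymmetric per-head dynamic fake quantization.

    kv: [batch, n_heads, seq_len, head_dim]; one (scale, zero_point) pair is
    derived on the fly from the min/max of each head.
    """
    qmax = 2 ** n_bits - 1
    vmin = kv.amin(dim=(-2, -1), keepdim=True)  # per-batch, per-head min
    vmax = kv.amax(dim=(-2, -1), keepdim=True)  # per-batch, per-head max
    scale = (vmax - vmin).clamp(min=1e-8) / qmax
    zero_point = (-vmin / scale).round()
    q = (kv / scale + zero_point).round().clamp(0, qmax)  # quantize
    return (q - zero_point) * scale                       # dequantize

k_cache = torch.randn(1, 32, 128, 128)
k_deq = quantize_kv_per_head(k_cache, n_bits=4)
print((k_cache - k_deq).abs().mean())  # mean absolute quantization error
```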

IntactKV is also available in KVQuant, another SOTA KV-cache-only quantization method, and can be evaluated with KVQuant's official code.

PPL

# evaluate w3g128kv4 AWQ-quantized LLaMA-2-7B model on GPU0, port 29500
bash ./scripts/eval/eval.sh llama-2 7b awq 3 4 ppl 29500 0

MMLU

# evaluate w3g128kv4 AWQ-quantized Vicuna-v1.5-7B model on GPU0, port 29500
bash ./scripts/eval/eval.sh vicuna-v1.5 7b awq 3 4 mmlu 29500 0

Commonsense QA

# evaluate w3g128kv4 AWQ-quantized Vicuna-v1.5-7B model on GPU0, port 29500
bash ./scripts/eval/eval.sh vicuna-v1.5 7b awq 3 4 qa 29500 0

Visualizations

Visualizations of pivot tokens: Visualize the output activations and corresponding attention maps of LLMs. Output PDFs will be saved in ./outputs/visualizations.

# LLaMA-2-7B model on GPU0
bash ./scripts/visualization/plot_act.sh llama-2 7b 0
<img src="figs/act_attn.png" style="zoom:50%" />

Plot quantization loss w.r.t. IntactKV size: Plot the line chart in Figure 2 of the paper, which demonstrates the importance of pivot tokens.

# LLaMA-2-7B model on GPU0
bash ./scripts/visualization/motivation.sh llama-2 7b 0
<img src="figs/motivation.png" style="zoom:50%" />

Results

Table 1. INT3-group128 weight-only quantization results of LLaMA and LLaMA-2 models on the C4 dataset.

| Method | LLaMA-7B | LLaMA-13B | LLaMA-30B | LLaMA-65B | LLaMA-2-7B | LLaMA-2-13B | LLaMA-2-70B |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FP16 | 7.36 | 6.82 | 6.15 | 5.83 | 7.28 | 6.75 | 5.73 |
| RTN | 9.15 | 7.89 | 6.85 | 6.33 | 8.97 | 7.60 | 6.27 |
| +IntactKV_[B] | 8.52 | 7.66 | 6.69 | 6.20 | 8.61 | 7.48 | 6.13 |
| GPTQ | 8.59 | 7.49 | 6.73 | 6.29 | 9.58 | 7.43 | 6.33 |
| +IntactKV_[B] | 8.30 | 7.42 | 6.62 | 6.23 | 9.27 | 7.36 | 6.28 |
| OmniQuant | 8.26 | 7.39 | 6.65 | 6.18 | 8.35 | 7.43 | 6.12 |
| +IntactKV_[B] | 8.25 | 7.39 | 6.64 | 6.18 | 8.33 | 7.40 | 6.11 |
| AWQ | 8.26 | 7.38 | 6.59 | 6.16 | 8.31 | 7.32 | 6.05 |
| +IntactKV_[B] | 8.12 | 7.36 | 6.54 | 6.12 | 8.18 | 7.29 | 6.04 |

Table 2. INT3-group128 weight-only quantization results of Vicuna models on 5-shot MMLU tasks.

| Vicuna Family | v1.5-7B | v1.5-13B | v1.3-7B | v1.3-13B | v1.3-33B |
| --- | --- | --- | --- | --- | --- |
| FP16 | 49.84% | 55.78% | 47.12% | 52.10% | 59.30% |
| RTN | 44.62% | 51.44% | 39.33% | 44.56% | 53.18% |
| +IntactKV_[B] | 45.93% | 51.89% | 41.74% | 46.73% | 55.20% |
| GPTQ | 43.99% | 52.95% | 40.12% | 47.83% | 55.84% |
| +IntactKV_[B] | 44.86% | 52.49% | 41.55% | 48.53% | 56.32% |
| OmniQuant | 46.62% | 52.82% | 42.95% | 48.23% | 55.21% |
| +IntactKV_[B] | 46.27% | 52.67% | 43.85% | 48.31% | 55.51% |
| AWQ | 46.45% | 52.92% | 43.08% | 48.56% | 56.09% |
| +IntactKV_[B] | 46.87% | 53.58% | 44.67% | 49.05% | 56.91% |

Table 3. INT3-group128 weight-only quantization results of Vicuna models on 0-shot QA tasks.

| Vicuna Family | v1.5-7B | v1.5-13B | v1.3-7B | v1.3-13B | v1.3-33B |
| --- | --- | --- | --- | --- | --- |
| FP16 | 65.33% | 68.38% | 64.52% | 67.22% | 69.53% |
| RTN | 61.36% | 66.12% | 59.05% | 63.43% | 67.33% |
| +IntactKV_[B] | 61.94% | 65.91% | 61.26% | 63.94% | 67.95% |
| GPTQ | 58.61% | 66.34% | 59.56% | 65.11% | 66.66% |
| +IntactKV_[B] | 59.12% | 66.53% | 60.46% | 65.13% | 67.93% |
| OmniQuant | 62.30% | 65.58% | 60.89% | 64.62% | 67.61% |
| +IntactKV_[B] | 62.01% | 65.67% | 60.66% | 64.89% | 67.61% |
| AWQ | 62.18% | 66.51% | 60.75% | 64.56% | 67.67% |
| +IntactKV_[B] | 62.49% | 66.93% | 61.93% | 65.02% | 67.90% |

Table 4. GPT-4 evaluation of INT3-group128 weight-only quantized Vicuna-v1.5 models on MT-Bench. The scores are on a scale of 10.

| Method | Vicuna-v1.5-7B | Vicuna-v1.5-13B |
| --- | --- | --- |
| FP16 | 5.31 | 5.52 |
| RTN | 4.34 | 5.13 |
| +IntactKV_[P] | 4.72 | 5.27 |
| +IntactKV_[P]+Cal | 4.73 | 5.30 |
| OmniQuant | 4.78 | 5.05 |
| +IntactKV_[P] | 4.94 | 5.10 |
| +IntactKV_[P]+Cal | 4.85 | 5.24 |
| AWQ | 4.74 | 5.17 |
| +IntactKV_[P] | 4.68 | 5.34 |
| +IntactKV_[P]+Cal | 4.84 | 5.44 |

Table 5. INT4 weight and activation quantization results of LLaMA models on the C4 dataset.

| Method | LLaMA-7B | LLaMA-13B | LLaMA-2-7B | LLaMA-2-13B | LLaMA-3-8B |
| --- | --- | --- | --- | --- | --- |
| FP16 | 7.36 | 6.82 | 7.28 | 6.75 | 9.48 |
| OmniQuant | 17.03 | 15.65 | 21.4 | 16.24 | - |
| +IntactKV_[B] | 16.24 | 13.87 | 20.01 | 15.91 | - |
| QuaRot | 8.23 | 7.4 | 8.3 | 7.51 | 13.42 |
| +IntactKV_[B] | 8.05 | 7.32 | 8.12 | 7.25 | 12.23 |

Reference

If you find IntactKV helpful, please cite our paper:

@inproceedings{liu2024intactkv,
  title={IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact},
  author={Liu, Ruikang and Bai, Haoli and Lin, Haokun and Li, Yuening and Gao, Han and Xu, Zhengzhuo and Hou, Lu and Yao, Jun and Yuan, Chun},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2024},
  year={2024}
}