IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact
This repository contains the PyTorch implementation of IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact.
IntactKV is a simple and orthogonal method to enhance quantized LLMs. It can be readily combined with various existing quantization approaches (e.g., AWQ, OmniQuant, GPTQ, QuaRot) with no inference overhead, and applies to various LLMs (LLaMA, Vicuna, OPT, Mistral, etc.). IntactKV builds on the observation that pivot tokens exist in current LLMs: they exhibit massive activation values, attract highly concentrated attention scores, and are critical to the performance of quantized LLMs. The concurrent work Massive Activations also identifies such tokens and studies this phenomenon in more detail.
<img src="figs/overview.png"/>

Preparations
Installation
```bash
conda create -n intactkv python=3.10 -y
conda activate intactkv
pip install -r requirements.txt
```
Data Preparation
Download datasets in ./datasets.
Calibration set or PPL evaluation
Dataset | Local Dir | URL |
---|---|---|
WikiText2 | ./datasets/wikitext | https://huggingface.co/datasets/wikitext |
PTB | ./datasets/ptb_text_only | https://huggingface.co/datasets/ptb_text_only |
C4 | ./datasets/allenai/c4 | https://huggingface.co/datasets/allenai/c4 |
Pile | ./datasets/pile-val-backup | https://huggingface.co/datasets/mit-han-lab/pile-val-backup |
ShareGPT | ./datasets/ShareGPT_Vicuna_unfiltered | https://huggingface.co/datasets/Aeala/ShareGPT_Vicuna_unfiltered |
MMLU evaluation
Dataset | Local Dir | URL |
---|---|---|
MMLU | ./datasets/mmlu/data | https://people.eecs.berkeley.edu/~hendrycks/data.tar |
Commonsense QA evaluation
Dataset | Local Dir | URL |
---|---|---|
OBQA | ./datasets/openbookqa | https://huggingface.co/datasets/openbookqa |
WinoGrande | ./datasets/winogrande | https://huggingface.co/datasets/winogrande |
ARC-E and ARC-C | ./datasets/ai2_arc | https://huggingface.co/datasets/ai2_arc |
BoolQ | ./datasets/super_glue | https://huggingface.co/datasets/super_glue |
HellaSwag | ./datasets/hellaswag | https://huggingface.co/datasets/hellaswag |
LAMBADA | ./datasets/lambada_openai | https://huggingface.co/datasets/EleutherAI/lambada_openai |
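If you prefer to script the downloads, the Hugging Face datasets above can be mirrored into the expected local directories with huggingface_hub; the MMLU tarball must still be downloaded and extracted separately. A minimal sketch (our own example, not part of the repo; adjust repo ids and paths to the tables above):

```python
# Hypothetical download helper (not part of this repo): mirror a few of the
# Hugging Face datasets listed above into the expected local directories.
from huggingface_hub import snapshot_download

DATASETS = {
    "wikitext": "./datasets/wikitext",
    "ptb_text_only": "./datasets/ptb_text_only",
    "mit-han-lab/pile-val-backup": "./datasets/pile-val-backup",
    "Aeala/ShareGPT_Vicuna_unfiltered": "./datasets/ShareGPT_Vicuna_unfiltered",
}

for repo_id, local_dir in DATASETS.items():
    snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=local_dir)
```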
Model Preparation
Download models in ./modelzoo.
Model | Local Dir | URL |
---|---|---|
LLaMA-2-7B | ./modelzoo/llama-2/llama-2-7b | https://huggingface.co/meta-llama/Llama-2-7b |
LLaMA-2-13B | ./modelzoo/llama-2/llama-2-13b | https://huggingface.co/meta-llama/Llama-2-13b |
LLaMA-2-70B | ./modelzoo/llama-2/llama-2-70b | https://huggingface.co/meta-llama/Llama-2-70b |
LLaMA-3-8B | ./modelzoo/llama-3/llama-3-8b | https://huggingface.co/meta-llama/Meta-Llama-3-8B |
LLaMA-3-70B | ./modelzoo/llama-3/llama-3-70b | https://huggingface.co/meta-llama/Meta-Llama-3-70B |
Vicuna-v1.3-7B | ./modelzoo/vicuna-v1.3/vicuna-v1.3-7b | https://huggingface.co/lmsys/vicuna-7b-v1.3 |
Vicuna-v1.3-13B | ./modelzoo/vicuna-v1.3/vicuna-v1.3-13b | https://huggingface.co/lmsys/vicuna-13b-v1.3 |
Vicuna-v1.3-33B | ./modelzoo/vicuna-v1.3/vicuna-v1.3-33b | https://huggingface.co/lmsys/vicuna-33b-v1.3 |
Vicuna-v1.5-7B | ./modelzoo/vicuna-v1.5/vicuna-v1.5-7b | https://huggingface.co/lmsys/vicuna-7b-v1.5 |
Vicuna-v1.5-13B | ./modelzoo/vicuna-v1.5/vicuna-v1.5-13b | https://huggingface.co/lmsys/vicuna-13b-v1.5 |
Weight-only Quantization
Model Quantization
GPTQ
Quantize the model with AutoGPTQ. The quantized model will be saved in ./modelzoo/autogptq.
```bash
# w3g128 quantization of Vicuna-v1.5-7B on GPU 0
bash ./scripts/quantization/auto_gptq.sh vicuna-v1.5 vicuna-v1.5-7b 3 128 0
```
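The script drives AutoGPTQ under the hood. For reference, a bare-bones AutoGPTQ w3g128 quantization looks roughly like the sketch below (illustrative only; the repo's script handles calibration data, paths, and saving):

```python
# Rough AutoGPTQ usage sketch (illustrative; the repo's script builds the real
# calibration set and output paths for you).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_dir = "./modelzoo/vicuna-v1.5/vicuna-v1.5-7b"
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)

quantize_config = BaseQuantizeConfig(bits=3, group_size=128)  # w3g128
model = AutoGPTQForCausalLM.from_pretrained(model_dir, quantize_config)

# `examples` is a list of dicts with "input_ids"/"attention_mask"; in practice
# it should hold ~128 sequences sampled from the calibration dataset.
examples = [tokenizer("Quantization calibration sample.", return_tensors="pt")]
model.quantize(examples)
model.save_quantized("./modelzoo/autogptq/vicuna-v1.5-7b-w3g128")
```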
AWQ
Download pre-computed AWQ parameters from the AWQ model zoo, or reproduce them with the following script. The search results will be saved in ./modelzoo/llm-awq.
```bash
# w3g128 quantization of Vicuna-v1.5-7B on GPU 0
bash ./scripts/quantization/llm_awq.sh vicuna-v1.5 7b 3 128 0
```
IntactKV_[B]
Evaluation
IntactKV_[B] can be directly integrated with various quantization methods (e.g., AWQ, GPTQ, RTN) without any training or inference overhead. It can be evaluated on PPL, MMLU, and QA tasks, where the [BOS] token is prepended to the inputs.
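Conceptually, IntactKV_[B] caches the [BOS] key/value states with the full-precision model and keeps them intact (i.e., lossless) when running the quantized model. A minimal sketch of this idea with the Hugging Face API (illustrative only, not the repo's exact implementation):

```python
# Minimal IntactKV_[B] sketch: generate the [BOS] KV cache with the full-precision
# model once, then reuse it untouched as `past_key_values` for the quantized model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "lmsys/vicuna-7b-v1.5"
tok = AutoTokenizer.from_pretrained(name)
fp_model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).cuda()
quant_model = fp_model  # stand-in: in practice this is the AWQ/GPTQ/RTN-quantized model

bos = torch.tensor([[tok.bos_token_id]], device="cuda")
with torch.no_grad():
    intact_kv = fp_model(bos, use_cache=True).past_key_values  # lossless pivot-token KV

prompt_ids = tok("What is quantization?", return_tensors="pt",
                 add_special_tokens=False).input_ids.cuda()
with torch.no_grad():
    out = quant_model(prompt_ids, past_key_values=intact_kv, use_cache=True)
```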
PPL
```bash
# evaluate w3g128 AWQ-quantized LLaMA-2-7B model on GPU0, port 29500
bash ./scripts/eval/eval.sh llama-2 7b awq 3 16 ppl 29500 0
```
MMLU
```bash
# evaluate w3g128 AWQ-quantized Vicuna-v1.5-7B model on GPU0, port 29500
bash ./scripts/eval/eval.sh vicuna-v1.5 7b awq 3 16 mmlu 29500 0
```
Commonsense QA
```bash
# evaluate w3g128 AWQ-quantized Vicuna-v1.5-7B model on GPU0, port 29500
bash ./scripts/eval/eval.sh vicuna-v1.5 7b awq 3 16 qa 29500 0
```
IntactKV_[P]
IntactKV as Trainable Parameters
IntactKV can optionally be calibrated on a calibration set of 128 samples to further compensate for the quantization error.
```bash
# calibrate IntactKV of w3g128 AWQ-quantized Vicuna-v1.5-7B model on GPU0
bash ./scripts/train/train.sh vicuna-v1.5 7b awq 3 0
```
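Roughly speaking, the calibration treats the cached pivot-token keys and values as trainable parameters and minimizes the discrepancy between the quantized model (using the trainable IntactKV) and the full-precision teacher on the calibration prompts. A toy sketch of this idea, assuming a legacy tuple-format KV cache (illustrative only; the repo's training script handles data loading, losses, and optimizer settings):

```python
# Toy IntactKV_[P] calibration sketch (not the repo's exact loop): the intact
# keys/values become nn.Parameters, and only they receive gradient updates.
import torch
import torch.nn.functional as F

def calibrate_intactkv(fp_model, quant_model, intact_kv, calib_batches, steps=160, lr=1e-4):
    # intact_kv: legacy-format KV cache, i.e. a tuple of (key, value) tensors per
    # layer, produced by the full-precision model on the pivot tokens.
    params = [torch.nn.Parameter(t.detach().clone()) for layer in intact_kv for t in layer]
    opt = torch.optim.Adam(params, lr=lr)
    quant_model.requires_grad_(False)  # only the IntactKV parameters are updated

    for step in range(steps):
        ids = calib_batches[step % len(calib_batches)]  # LongTensor of shape [1, seq_len]
        kv = tuple((params[2 * i], params[2 * i + 1]) for i in range(len(intact_kv)))
        with torch.no_grad():  # full-precision teacher with the frozen IntactKV
            target = fp_model(ids, past_key_values=intact_kv, use_cache=True,
                              output_hidden_states=True).hidden_states[-1]
        pred = quant_model(ids, past_key_values=kv, use_cache=True,
                           output_hidden_states=True).hidden_states[-1]
        loss = F.mse_loss(pred.float(), target.float())
        opt.zero_grad()
        loss.backward()
        opt.step()
    return [p.detach() for p in params]
```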
Evaluation
MT-bench
- Generate answers to MT-bench with the following script.
```bash
# generate answers to MT-bench for w3g128 AWQ-quantized Vicuna-v1.5-7B model on GPU0
bash scripts/eval/gen_mtbench_answer.sh vicuna-v1.5 7b awq 3 0
```
- Score the answers with GPT-4 using LLM Judge. Reference answers from gpt-4-0125-preview can be found in ./fastchat/data/mt_bench/reference_answer.
Weight and Activation Quantization
We integrate IntactKV with QuaRot, a SOTA INT4 weight and activation quantization method that uses Hadamard transformations to alleviate activation outliers. Run the following script to obtain PPL evaluation results.
```bash
# LLaMA-2-7B model on GPU0
bash ./scripts/eval/quarot.sh llama-2 7b 0
```
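As a toy illustration of why the Hadamard rotation helps (this is not QuaRot's actual implementation): rotating both the activations and the weights by an orthogonal Hadamard matrix leaves the layer output unchanged while spreading outlier channels across all dimensions, which makes the rotated tensors easier to quantize.

```python
# Toy demonstration of the Hadamard-rotation trick used by QuaRot-style methods.
import torch
from scipy.linalg import hadamard

d = 64
H = torch.tensor(hadamard(d), dtype=torch.float64) / d ** 0.5  # orthogonal: H @ H.T = I

x = torch.randn(8, d, dtype=torch.float64)
x[:, 0] *= 50.0                      # inject a massive-outlier channel
W = torch.randn(d, d, dtype=torch.float64)

y_ref = x @ W.T                      # original linear layer output
y_rot = (x @ H) @ (W @ H).T          # rotated activations times rotated weights
print(torch.allclose(y_ref, y_rot))  # True: the rotation does not change the output
print(x.abs().max().item(), (x @ H).abs().max().item())  # outlier magnitude drops sharply
```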
KV Cache Quantization
We implement a simple asymmetric per-head dynamic quantization strategy for the KV cache. Run the following scripts to obtain PPL/MMLU/QA evaluation results.
IntactKV is also integrated into KVQuant, a SOTA KV-cache-only quantization method, and can be evaluated with KVQuant's official code.
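For reference, the per-head dynamic scheme amounts to computing asymmetric min/max quantization parameters per attention head at runtime, along the lines of the following sketch (illustrative; see the repo's code for the actual implementation):

```python
# Sketch of asymmetric per-head dynamic (fake) quantization of a KV tensor.
import torch

def fake_quant_kv(kv: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    # kv: [batch, num_heads, seq_len, head_dim]; min/max are taken per head
    # at runtime ("dynamic"), with a zero point ("asymmetric").
    qmax = 2 ** n_bits - 1
    mn = kv.amin(dim=(-2, -1), keepdim=True)
    mx = kv.amax(dim=(-2, -1), keepdim=True)
    scale = (mx - mn).clamp(min=1e-8) / qmax
    zero = (-mn / scale).round()
    q = (kv / scale + zero).round().clamp(0, qmax)
    return (q - zero) * scale        # dequantized values actually used by attention

keys = torch.randn(1, 32, 256, 128)  # e.g., a key cache for one LLaMA-2-7B layer
print((keys - fake_quant_kv(keys)).abs().mean())  # average INT4 quantization error
```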
PPL
```bash
# evaluate w3g128kv4 AWQ-quantized LLaMA-2-7B model on GPU0, port 29500
bash ./scripts/eval/eval.sh llama-2 7b awq 3 4 ppl 29500 0
```
MMLU
```bash
# evaluate w3g128kv4 AWQ-quantized Vicuna-v1.5-7B model on GPU0, port 29500
bash ./scripts/eval/eval.sh vicuna-v1.5 7b awq 3 4 mmlu 29500 0
```
Commonsense QA
```bash
# evaluate w3g128kv4 AWQ-quantized Vicuna-v1.5-7B model on GPU0, port 29500
bash ./scripts/eval/eval.sh vicuna-v1.5 7b awq 3 4 qa 29500 0
```
Visualizations
Visualizations of pivot tokens
Visualize output activations and corresponding attention maps of LLMs. Output PDFs will be saved in ./outputs/visualizations.
```bash
# LLaMA-2-7B model on GPU0
bash ./scripts/visualization/plot_act.sh llama-2 7b 0
```
<img src="figs/act_attn.png" style="zoom:50%" />
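For a quick look at pivot tokens without the full plotting pipeline, a minimal sketch of our own (using the Hugging Face API, not the repo's script) is to print per-token hidden-state norms; the [BOS] token and delimiter-like tokens stand out with much larger norms:

```python
# Quick pivot-token check: look for tokens with massive hidden-state norms.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "./modelzoo/llama-2/llama-2-7b"  # any causal LM path works
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).cuda()

ids = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt").input_ids.cuda()
with torch.no_grad():
    hidden = model(ids, output_hidden_states=True).hidden_states  # embeddings + every layer

norms = hidden[len(hidden) // 2][0].float().norm(dim=-1)  # per-token norm at a middle layer
for token, n in zip(tok.convert_ids_to_tokens(ids[0]), norms.tolist()):
    print(f"{token:>12s}  {n:10.1f}")  # pivot tokens (e.g., [BOS]) dominate
```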
Plot quantization loss w.r.t. IntactKV size
Plot the line chart in Figure 2 of the paper, which demonstrates the importance of pivot tokens.
```bash
# LLaMA-2-7B model on GPU0
bash ./scripts/visualization/motivation.sh llama-2 7b 0
```
<img src="figs/motivation.png" style="zoom:50%" />
Results
Table 1. INT3-group128 weight-only quantization results (perplexity) of LLaMA and LLaMA-2 models on the C4 dataset.
Method | LLaMA-7B | LLaMA-13B | LLaMA-30B | LLaMA-65B | LLaMA-2-7B | LLaMA-2-13B | LLaMA-2-70B |
---|---|---|---|---|---|---|---|
FP16 | 7.36 | 6.82 | 6.15 | 5.83 | 7.28 | 6.75 | 5.73 |
RTN | 9.15 | 7.89 | 6.85 | 6.33 | 8.97 | 7.60 | 6.27 |
+IntactKV_[B] | 8.52 | 7.66 | 6.69 | 6.20 | 8.61 | 7.48 | 6.13 |
GPTQ | 8.59 | 7.49 | 6.73 | 6.29 | 9.58 | 7.43 | 6.33 |
+IntactKV_[B] | 8.30 | 7.42 | 6.62 | 6.23 | 9.27 | 7.36 | 6.28 |
OmniQuant | 8.26 | 7.39 | 6.65 | 6.18 | 8.35 | 7.43 | 6.12 |
+IntactKV_[B] | 8.25 | 7.39 | 6.64 | 6.18 | 8.33 | 7.40 | 6.11 |
AWQ | 8.26 | 7.38 | 6.59 | 6.16 | 8.31 | 7.32 | 6.05 |
+IntactKV_[B] | 8.12 | 7.36 | 6.54 | 6.12 | 8.18 | 7.29 | 6.04 |
Table 2. INT3-group128 weight-only quantization results of Vicuna models on 5-shot MMLU tasks.
Vicuna Family | v1.5-7B | v1.5-13B | v1.3-7B | v1.3-13B | v1.3-33B |
---|---|---|---|---|---|
FP16 | 49.84% | 55.78% | 47.12% | 52.10% | 59.30% |
RTN | 44.62% | 51.44% | 39.33% | 44.56% | 53.18% |
+IntactKV_[B] | 45.93% | 51.89% | 41.74% | 46.73% | 55.20% |
GPTQ | 43.99% | 52.95% | 40.12% | 47.83% | 55.84% |
+IntactKV_[B] | 44.86% | 52.49% | 41.55% | 48.53% | 56.32% |
OmniQuant | 46.62% | 52.82% | 42.95% | 48.23% | 55.21% |
+IntactKV_[B] | 46.27% | 52.67% | 43.85% | 48.31% | 55.51% |
AWQ | 46.45% | 52.92% | 43.08% | 48.56% | 56.09% |
+IntactKV_[B] | 46.87% | 53.58% | 44.67% | 49.05% | 56.91% |
Table 3. INT3-group128 weight-only quantization results of Vicuna models on 0-shot QA tasks.
Vicuna Family | v1.5-7B | v1.5-13B | v1.3-7B | v1.3-13B | v1.3-33B |
---|---|---|---|---|---|
FP16 | 65.33% | 68.38% | 64.52% | 67.22% | 69.53% |
RTN | 61.36% | 66.12% | 59.05% | 63.43% | 67.33% |
+IntactKV_[B] | 61.94% | 65.91% | 61.26% | 63.94% | 67.95% |
GPTQ | 58.61% | 66.34% | 59.56% | 65.11% | 66.66% |
+IntactKV_[B] | 59.12% | 66.53% | 60.46% | 65.13% | 67.93% |
OmniQuant | 62.30% | 65.58% | 60.89% | 64.62% | 67.61% |
+IntactKV_[B] | 62.01% | 65.67% | 60.66% | 64.89% | 67.61% |
AWQ | 62.18% | 66.51% | 60.75% | 64.56% | 67.67% |
+IntactKV_[B] | 62.49% | 66.93% | 61.93% | 65.02% | 67.90% |
Table 4. GPT-4 evaluation of INT3-group128 weight-only quantized Vicuna-v1.5 models on MT-Bench. The scores are on a scale of 10.
Method | Vicuna-v1.5-7B | Vicuna-v1.5-13B |
---|---|---|
FP16 | 5.31 | 5.52 |
RTN | 4.34 | 5.13 |
+IntactKV_[P] | 4.72 | 5.27 |
+IntactKV_[P]+Cal | 4.73 | 5.30 |
OmniQuant | 4.78 | 5.05 |
+IntactKV_[P] | 4.94 | 5.10 |
+IntactKV_[P]+Cal | 4.85 | 5.24 |
AWQ | 4.74 | 5.17 |
+IntactKV_[P] | 4.68 | 5.34 |
+IntactKV_[P]+Cal | 4.84 | 5.44 |
Table 5. INT4 weight and activation quantization results (perplexity) of LLaMA models on the C4 dataset.
Method | LLaMA-7B | LLaMA-13B | LLaMA-2-7B | LLaMA-2-13B | LLaMA-3-8B |
---|---|---|---|---|---|
FP16 | 7.36 | 6.82 | 7.28 | 6.75 | 9.48 |
OmniQuant | 17.03 | 15.65 | 21.4 | 16.24 | - |
+IntactKV_[B] | 16.24 | 13.87 | 20.01 | 15.91 | - |
QuaRot | 8.23 | 7.4 | 8.3 | 7.51 | 13.42 |
+IntactKV_[B] | 8.05 | 7.32 | 8.12 | 7.25 | 12.23 |
Reference
If you find IntactKV helpful, please cite our paper:
```bibtex
@inproceedings{liu2024intactkv,
  title={IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact},
  author={Liu, Ruikang and Bai, Haoli and Lin, Haokun and Li, Yuening and Gao, Han and Xu, Zhengzhuo and Hou, Lu and Yao, Jun and Yuan, Chun},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2024},
  year={2024}
}
```