PrefixQuant

Official PyTorch implementation of PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs.

News

[2024/10] We release PrefixQuant, the first work to make static activation quantization outperform dynamic quantization in LLMs. For now we release only the fake-quantization code; the inference kernels will be released later.
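For context, static and dynamic quantization differ only in where the activation scale comes from. Below is a minimal per-tensor sketch of the two modes; it is illustrative only, not the code in this repo:

```python
import torch

def fake_quant(x, scale, bits=4):
    # Symmetric fake quantization: round onto a signed integer grid, then dequantize.
    qmax = 2 ** (bits - 1) - 1
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

def dynamic_quant(x, bits=4):
    # Dynamic: the scale is recomputed from every incoming tensor, which costs
    # an extra reduction over the activations on each forward pass.
    scale = x.abs().max().clamp_min(1e-8) / (2 ** (bits - 1) - 1)
    return fake_quant(x, scale, bits)

def static_quant(x, calibrated_scale, bits=4):
    # Static: the scale was fixed offline on calibration data, so no runtime
    # statistics are needed. A few outlier tokens can inflate the calibrated
    # scale and hurt accuracy; isolating those outliers in prefixed tokens
    # is the fix proposed by PrefixQuant.
    return fake_quant(x, calibrated_scale, bits)
```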

Contents

- Installation
- Quantization
- Inference
- Plot Activation Distribution
- Citation

Installation

conda create -n prefixquant python==3.9

conda activate prefixquant

pip install -r requirements.txt

Quantization

We provide an example command to quantize Llama-3-8B without fine-tuning:

CUDA_VISIBLE_DEVICES=0 python main.py \
--model_path path/to/llama-3-8B  \
--model_name Llama-3-8b \
--output_dir ./log/llama-3-8b-w4a4kv4 \
--wbits 4 \
--input_bits 4 \
--input_mode static \
--v_bits 4 \
--k_bits 4 \
--kv_group_size 128 \
--kv_mode static \
--mse_init \
--pre_rotate \
--down_online_had \
--qk_online_had \
--set_prefixed_tokens \
--eval_ppl \
--eval_tasks piqa,arc_easy,arc_challenge,hellaswag,winogrande \
--save_quant_dir ./pre_quantized_models/llama-3-8b-w4a4kv4
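In the command above, --kv_group_size 128 asks for group-wise scales in the KV cache: each group of 128 consecutive channels shares one scale, so a single large channel cannot inflate the scale of the whole tensor. Below is a generic sketch of group-wise fake quantization under that interpretation, not the repo's kernel:

```python
import torch

def groupwise_fake_quant(x, group_size=128, bits=4):
    # Split the last dimension into groups of `group_size` channels and
    # compute one symmetric scale per group instead of per tensor.
    orig_shape = x.shape
    assert orig_shape[-1] % group_size == 0
    g = x.reshape(-1, group_size)
    qmax = 2 ** (bits - 1) - 1
    scale = g.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / qmax
    q = torch.clamp(torch.round(g / scale), -qmax - 1, qmax)
    return (q * scale).reshape(orig_shape)

# Example on a KV-shaped tensor: (batch, heads, seq_len, head_dim)
kv = torch.randn(1, 8, 16, 128)
kv_q = groupwise_fake_quant(kv, group_size=128, bits=4)
```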

The example above runs without fine-tuning; you can find the detailed fine-tuning settings in the paper. Some notes on the main arguments:

- --wbits, --input_bits, --k_bits, and --v_bits set the bit-widths for weights, linear-layer inputs, and the key/value caches.
- --input_mode and --kv_mode select static or dynamic quantization for the activations and the KV cache.
- --kv_group_size sets the group size for KV-cache quantization.
- --mse_init initializes the quantization parameters with an MSE-based search.
- --pre_rotate, --down_online_had, and --qk_online_had apply Hadamard rotations (the latter two online).
- --set_prefixed_tokens enables the prefixed outlier tokens proposed in our paper (see the sketch below).
- --eval_ppl and --eval_tasks run perplexity and zero-shot accuracy evaluation after quantization.
- --save_quant_dir saves the quantized model for later evaluation with eval.py.
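To make the idea behind --set_prefixed_tokens concrete: a handful of tokens carry extreme activation outliers, so prepending them once and keeping their KV cache in high precision confines the outliers to the prefix, which lets static scales fit the remaining tokens. The sketch below shows that flow with plain Hugging Face transformers; the one-token [BOS] prefix is purely illustrative (the paper selects model-specific outlier tokens), and this is not the repo's implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/llama-3-8B"  # placeholder path, as in the command above
tok = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).eval()

# Illustrative prefix: just [BOS]. PrefixQuant picks a few model-specific
# outlier tokens instead; see the paper for the selection procedure.
prefix_ids = torch.tensor([[tok.bos_token_id]])

# Run the prefix once and keep its keys/values cached in high precision.
with torch.no_grad():
    past = model(prefix_ids, use_cache=True).past_key_values

# Later tokens attend to the cached prefix; their own activations no longer
# contain the extreme outliers, so static, precomputed scales fit them well.
prompt = tok("The capital of France is", add_special_tokens=False, return_tensors="pt")
with torch.no_grad():
    out = model(prompt.input_ids, past_key_values=past, use_cache=True)
```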

Inference

We provide an example command to evaluate the quantized models:

CUDA_VISIBLE_DEVICES=0 python eval.py \
--quant_model ./pre_quantized_models/llama-3-8b-w4a4kv4 \
--eval_ppl \
--eval_tasks piqa,arc_easy,arc_challenge,hellaswag,winogrande
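For reference, --eval_ppl scores a held-out corpus in fixed-length windows. A standard perplexity loop over WikiText-2 looks roughly like the sketch below; the dataset and the 2048-token window are common conventions, not necessarily the repo's exact settings, and the saved fake-quantized checkpoint should be loaded through eval.py rather than plain transformers.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/llama-3-8B"  # placeholder; quantized checkpoints go through eval.py
tok = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).cuda().eval()

# Concatenate the WikiText-2 test split and score it in 2048-token windows.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids.cuda()

seqlen, nlls = 2048, []
for i in range(ids.size(1) // seqlen):
    chunk = ids[:, i * seqlen:(i + 1) * seqlen]
    with torch.no_grad():
        # Passing labels makes the model return the mean token cross-entropy.
        nlls.append(model(chunk, labels=chunk).loss.float() * seqlen)
print("ppl:", torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen)).item())
```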

Plot Activation Distribution

We provide an example command to visualize token-wise maximum values for linear inputs:

CUDA_VISIBLE_DEVICES=0 python draw_activation.py \
--model_path path/to/llama-2-7b \
--model_name llama-2-7b \
--plot_linear_input

You can add --pre_rotate --down_online_had --qk_online_had to apply Hadamard rotation, and add --set_prefixed_tokens to prepend the prefixed tokens proposed in our paper. You can also change --plot_linear_input to other plotting options.

More examples can be found in ./examples/plot.sh.
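The measurement behind these plots is simple to reproduce: a forward hook that records the per-token maximum magnitude of a linear layer's input. Below is a minimal sketch; the chosen layer, model path, and plotting are placeholders, not the repo's draw_activation.py (which presumably sweeps all linear layers).

```python
import torch
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/llama-2-7b"  # placeholder path, as in the command above
tok = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).eval()

records = {}

def make_hook(name):
    def hook(module, inputs, output):
        x = inputs[0]  # (batch, seq_len, hidden)
        # Per-token maximum magnitude over the hidden dimension.
        records[name] = x.abs().amax(dim=-1).squeeze(0).float().cpu()
    return hook

# Hook a single linear layer for illustration.
layer = model.model.layers[0].mlp.down_proj
handle = layer.register_forward_hook(make_hook("layer0.down_proj"))

ids = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt").input_ids
with torch.no_grad():
    model(ids)
handle.remove()

vals = records["layer0.down_proj"].numpy()
plt.bar(range(len(vals)), vals)
plt.xlabel("token position")
plt.ylabel("max |input| over hidden dim")
plt.savefig("token_max.png")
```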

Citation

If you use our PrefixQuant approach in your research, please cite our paper:

@article{prefixquant,
  title={PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs},
  author={Chen, Mengzhao and Liu, Yi and Wang, Jiahao and Bin, Yi and Shao, Wenqi and Luo, Ping},
  journal={arXiv preprint arXiv:2410.05265},
  year={2024}
}