EfficientQAT

Official PyTorch implementation of the paper "EfficientQAT: Efficient Quantization-Aware Training for Large Language Models"

News

Contents

Installation

  1. Clone this repository and navigate to the EfficientQAT folder
git clone https://github.com/OpenGVLab/EfficientQAT.git
cd EfficientQAT
  2. Install the package
conda create -n efficientqat python==3.11

conda activate efficientqat

pip install -r requirements.txt
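
To quickly confirm the environment imports correctly (this assumes requirements.txt installs a CUDA build of PyTorch):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"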

Model Zoo

We provide a number of pre-quantized EfficientQAT models as follows (wXgY denotes X-bit weight-only quantization with group size Y):

| Model | Quantization | WikiText2 PPL | Avg. Accuracy | Model Size (GB) | Hub link |
|---|---|---|---|---|---|
| Llama-2-7B | fp16 | 5.47 | 64.86 | 13.2 | - |
| Llama-2-7B | w4g128 | 5.53 | 64.27 | 3.7 | EQAT \| GPTQ \| BitBLAS |
| Llama-2-7B | w3g128 | 5.81 | 64.02 | 3.1 | EQAT |
| Llama-2-7B | w2g64 | 6.86 | 60.14 | 2.3 | EQAT \| GPTQ \| BitBLAS |
| Llama-2-7B | w2g128 | 7.17 | 59.50 | 2.2 | EQAT \| GPTQ \| BitBLAS |
| Llama-2-13B | fp16 | 4.88 | 67.81 | 25.4 | - |
| Llama-2-13B | w4g128 | 4.93 | 67.52 | 6.8 | EQAT \| GPTQ \| BitBLAS |
| Llama-2-13B | w3g128 | 5.12 | 67.28 | 5.6 | EQAT |
| Llama-2-13B | w2g64 | 5.96 | 64.88 | 4.0 | EQAT \| GPTQ \| BitBLAS |
| Llama-2-13B | w2g128 | 6.08 | 63.88 | 3.8 | EQAT \| GPTQ \| BitBLAS |
| Llama-2-70B | fp16 | 3.32 | 72.41 | 131.6 | - |
| Llama-2-70B | w4g128 | 3.39 | 72.62 | 35.8 | EQAT \| GPTQ \| BitBLAS |
| Llama-2-70B | w3g128 | 3.61 | 71.76 | 29.1 | EQAT |
| Llama-2-70B | w2g64 | 4.52 | 69.48 | 20.1 | EQAT \| GPTQ \| BitBLAS |
| Llama-2-70B | w2g128 | 4.61 | 68.93 | 18.9 | EQAT \| GPTQ \| BitBLAS |
| Llama-3-8B | fp16 | 6.14 | 68.58 | 13.0 | - |
| Llama-3-8B | w4g128 | 6.47 | 68.43 | 5.4 | EQAT \| GPTQ \| BitBLAS |
| Llama-3-8B | w3g128 | 7.09 | 67.35 | 4.7 | EQAT |
| Llama-3-8B | w2g64 | 9.41 | 60.76 | 3.9 | EQAT \| GPTQ \| BitBLAS |
| Llama-3-8B | w2g128 | 9.80 | 59.36 | 3.8 | EQAT \| GPTQ \| BitBLAS |
| Llama-3-70B | fp16 | 2.85 | 75.33 | 137.8 | - |
| Llama-3-70B | w4g128 | 3.17 | 74.57 | 38.9 | EQAT \| GPTQ \| BitBLAS |
| Llama-3-70B | w3g128 | 4.19 | 72.42 | 32.2 | EQAT |
| Llama-3-70B | w2g64 | 6.08 | 67.89 | 23.2 | EQAT \| GPTQ |
| Llama-3-70B | w2g128 | 6.38 | 67.57 | 22.0 | EQAT \| GPTQ \| BitBLAS |
| Llama-3-8B-Instruct | fp16 | 8.29 | 68.43 | 13.0 | - |
| Llama-3-8B-Instruct | w4g128 | 7.93 | 68.39 | 5.4 | EQAT \| GPTQ \| BitBLAS |
| Llama-3-8B-Instruct | w3g128 | 8.55 | 67.24 | 4.7 | EQAT |
| Llama-3-8B-Instruct | w2g64 | 11.19 | 60.66 | 3.9 | EQAT \| GPTQ \| BitBLAS |
| Llama-3-8B-Instruct | w2g128 | 11.73 | 60.16 | 3.8 | EQAT \| GPTQ \| BitBLAS |
| Llama-3-70B-Instruct | fp16 | 5.33 | 73.78 | 137.8 | - |
| Llama-3-70B-Instruct | w4g128 | 5.35 | 73.47 | 38.9 | EQAT \| GPTQ \| BitBLAS |
| Llama-3-70B-Instruct | w3g128 | 5.65 | 72.87 | 32.2 | EQAT |
| Llama-3-70B-Instruct | w2g64 | 7.86 | 67.64 | 23.2 | EQAT \| GPTQ \| BitBLAS |
| Llama-3-70B-Instruct | w2g128 | 8.14 | 67.54 | 22.0 | EQAT \| GPTQ \| BitBLAS |
| Mistral-Large-Instruct-2407 | fp16 | 2.74 | 77.76 | 228.5 | - |
| Mistral-Large-Instruct-2407 | w2g64 | 5.58 | 73.54 | 35.5 | GPTQ |

Training

EfficientQAT involves two consecutive training phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP). The detailed training scripts can be found in ./examples. Below we give example training scripts for Llama-2-7B with w2g64 quantization.

  1. Block-AP

You should modify --model in the script to point to the folder of the full-precision model before running the following command.

bash examples/block_ap/Llama-2-7b/w2g64.sh

Specifically, --weight_lr is 2e-5 for 2-bit and 1e-5 for 3-/4-bit quantization in our experiments.

Other important arguments can be found in main_block_ap.py and the example scripts.
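
For orientation, here is a minimal sketch of what a Block-AP launch can look like, assembled only from flags mentioned in this README; the paths are placeholders and the example script above remains the authoritative command.

# Minimal Block-AP sketch; paths are placeholders and the full flag set lives in examples/block_ap/Llama-2-7b/w2g64.sh.
CUDA_VISIBLE_DEVICES=0 python main_block_ap.py \
--model /path/to/Llama-2-7b \
--net Llama-2 \
--wbits 2 \
--group_size 64 \
--weight_lr 2e-5 \
--output_dir ./output/block_ap_models/Llama-2-7b-w2g64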

  2. E2E-QP

Then, you can load the quantized model produced by Block-AP for further E2E-QP. Specifically, E2E-QP can adapt to different scenarios by changing the training dataset. You should modify --quant_model_path in the script to point to the folder of the quantized model before running the following command.

1) Train on RedPajama

bash examples/e2e_qp/Llama-2-7b/w2g64-redpajama.sh

2) Train on Alpaca

bash examples/e2e_qp/Llama-2-7b/w2g128-redpajama.sh

Specifically, --learning_rate is 2e-5 for 2-bit and 1e-5 for 3-/4-bit quantization in our experiments. You can decrease --per_device_train_batch_size to reduce the memory footprint during training; make sure to increase --gradient_accumulation_steps by the same factor (for example, halving the per-device batch size while doubling the accumulation steps) so that the effective batch size stays the same.
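
For orientation, the following is a minimal sketch of an E2E-QP launch assembled from the flags mentioned above; the entry-point name main_e2e_qp.py, the paths, and the batch-size values are assumptions, and the scripts in examples/e2e_qp/ remain the authoritative commands.

# Hedged E2E-QP sketch; main_e2e_qp.py and the paths below are assumptions, see examples/e2e_qp/ for the real scripts.
# Effective batch size = per_device_train_batch_size x gradient_accumulation_steps (x number of GPUs).
CUDA_VISIBLE_DEVICES=0 python main_e2e_qp.py \
--quant_model_path ./output/block_ap_models/Llama-2-7b-w2g64 \
--learning_rate 2e-5 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 16 \
--output_dir ./output/e2e_qp_models/Llama-2-7b-w2g64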

Inference

  1. Download the pre-quantized EfficientQAT models from Hugging Face
pip install huggingface_hub

huggingface-cli download ChenMnZ/Llama-2-7b-EfficientQAT-w2g64 --local-dir ./output/pre_quantized_models/Llama-2-7b-EfficientQAT-w2g64
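
Alternatively, the same snapshot can be fetched from Python with huggingface_hub (a generic huggingface_hub usage example, not an API specific to this repository):

# Equivalent download via the huggingface_hub Python API.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ChenMnZ/Llama-2-7b-EfficientQAT-w2g64",
    local_dir="./output/pre_quantized_models/Llama-2-7b-EfficientQAT-w2g64",
)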
  2. Evaluate the pre-quantized EfficientQAT model
CUDA_VISIBLE_DEVICES=0 python main_block_ap.py \
--resume_quant ./output/pre_quantized_models/Llama-2-7b-EfficientQAT-w2g64 \
--net Llama-2 \
--wbits 2 \
--group_size 64 \
--output_dir ./output/inference_results/ \
--eval_ppl \
--eval_tasks  piqa,arc_easy,arc_challenge,hellaswag,winogrande

Model Transferring

First, install the gptqmodel package to support the GPTQ and BitBLAS quantization formats:

git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel
bash install.sh

Then, we offer three types of transfer:

  1. Transfer EfficientQAT checkpoints to GPTQ format
bash examples/model_transfer/efficientqat_to_gptq/llama-2-7b.sh
  2. Transfer EfficientQAT checkpoints to BitBLAS format
bash examples/model_transfer/efficientqat_to_bitblas/llama-2-7b.sh
  3. Transfer the fp32 data in EfficientQAT checkpoints to half-precision counterparts. Some parameters are saved as fp32 for training; you can convert them to half precision after training to further reduce the model size.
bash examples/model_transfer/fp32_to_16/llama-2-7b.sh
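
To illustrate what the fp32-to-half-precision conversion amounts to, here is a plain-PyTorch sketch; this is not the repository's actual transfer script, and the file names are placeholders.

# Illustration only: cast fp32 tensors in a saved state dict to fp16.
# Assumes the checkpoint is a flat dict of tensors; the real logic lives in examples/model_transfer/fp32_to_16.
import torch

state_dict = torch.load("checkpoint.pth", map_location="cpu")
state_dict = {
    name: tensor.half() if tensor.dtype == torch.float32 else tensor
    for name, tensor in state_dict.items()
}
torch.save(state_dict, "checkpoint_fp16.pth")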

Inference of Other Formats

Below is an example of running inference with the GPTQ or BitBLAS quantized formats.

from transformers import AutoTokenizer
from gptqmodel import GPTQModel

quant_dir = "ChenMnZ/Llama-2-7b-EfficientQAT-w2g128-GPTQ"
# quant_dir = "ChenMnZ/Llama-2-7b-EfficientQAT-w2g128-BitBLAS"
# or local path

tokenizer = AutoTokenizer.from_pretrained(quant_dir, use_fast=True)


# load quantized model to the first GPU
model = GPTQModel.from_quantized(quant_dir)

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("Model quantization is", return_tensors="pt").to(model.device))[0]))

Citation

If you find this work useful, please consider citing:

@article{efficientqat,
  title={EfficientQAT: Efficient Quantization-Aware Training for Large Language Models},
  author={Chen, Mengzhao and Shao, Wenqi and Xu, Peng and Wang, Jiahao and Gao, Peng and Zhang, Kaipeng and Qiao, Yu and Luo, Ping},
  journal={arXiv preprint arXiv:2407.11062},
  year={2024}
}