GPTQ-for-LLaMA

I am currently focusing on AutoGPTQ and recommend using AutoGPTQ instead of GPTQ-for-LLaMa.

<img src="https://user-images.githubusercontent.com/64115820/235287009-2d07bba8-9b85-4973-9e06-2a3c28777f06.png" width="50%" height="50%">

4-bit quantization of LLaMA using GPTQ

GPTQ is a SOTA one-shot weight quantization method.

It can be used universally, but it is not the fastest option and it only supports Linux.

Triton only supports Linux, so if you are a Windows user, please use WSL2.
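
To build intuition for the bit-width and group-size settings that appear in the results below, here is a minimal round-to-nearest (RTN) sketch of group-wise 4-bit weight quantization in plain PyTorch. This is not the GPTQ algorithm itself (GPTQ additionally uses second-order information to compensate for quantization error, which is why it beats RTN in the tables); it only illustrates how a weight matrix is split into groups that each share a scale and zero-point.

```python
import torch

def quantize_rtn_groupwise(weight: torch.Tensor, bits: int = 4, groupsize: int = 128):
    """Round-to-nearest group-wise quantization (illustration only, not GPTQ)."""
    out_features, in_features = weight.shape
    assert in_features % groupsize == 0
    qmax = 2 ** bits - 1

    # Split each row into groups; every group gets its own scale and zero-point,
    # which is where the small bits-per-weight overhead over `bits` comes from.
    w = weight.reshape(out_features, in_features // groupsize, groupsize)
    wmin = w.amin(dim=-1, keepdim=True)
    wmax = w.amax(dim=-1, keepdim=True)
    scale = (wmax - wmin).clamp(min=1e-8) / qmax
    zero = torch.round(-wmin / scale)

    q = torch.clamp(torch.round(w / scale + zero), 0, qmax)  # integer codes in [0, 2^bits - 1]
    dequant = (q - zero) * scale                              # what the kernel reconstructs at runtime
    return q.reshape_as(weight), dequant.reshape_as(weight)

w = torch.randn(4096, 4096)
q, w_hat = quantize_rtn_groupwise(w, bits=4, groupsize=128)
print("max abs reconstruction error:", (w - w_hat).abs().max().item())
```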

News or Update

AutoGPTQ-triton, a packaged version of GPTQ with triton, has been integrated into AutoGPTQ.

Result

<details>
<summary>LLaMA-7B (click me)</summary>

| LLaMA-7B | Bits | group-size | memory (MiB) | Wikitext2 | checkpoint size (GB) |
|----------|------|------------|--------------|-----------|----------------------|
| FP16 | 16 | - | 13940 | 5.68 | 12.5 |
| RTN | 4 | - | - | 6.29 | - |
| GPTQ | 4 | - | 4740 | 6.09 | 3.5 |
| GPTQ | 4 | 128 | 4891 | 5.85 | 3.6 |
| RTN | 3 | - | - | 25.54 | - |
| GPTQ | 3 | - | 3852 | 8.07 | 2.7 |
| GPTQ | 3 | 128 | 4116 | 6.61 | 3.0 |

</details>

<details>
<summary>LLaMA-13B</summary>

| LLaMA-13B | Bits | group-size | memory (MiB) | Wikitext2 | checkpoint size (GB) |
|-----------|------|------------|--------------|-----------|----------------------|
| FP16 | 16 | - | OOM | 5.09 | 24.2 |
| RTN | 4 | - | - | 5.53 | - |
| GPTQ | 4 | - | 8410 | 5.36 | 6.5 |
| GPTQ | 4 | 128 | 8747 | 5.20 | 6.7 |
| RTN | 3 | - | - | 11.40 | - |
| GPTQ | 3 | - | 6870 | 6.63 | 5.1 |
| GPTQ | 3 | 128 | 7277 | 5.62 | 5.4 |

</details>

<details>
<summary>LLaMA-33B</summary>

| LLaMA-33B | Bits | group-size | memory (MiB) | Wikitext2 | checkpoint size (GB) |
|-----------|------|------------|--------------|-----------|----------------------|
| FP16 | 16 | - | OOM | 4.10 | 60.5 |
| RTN | 4 | - | - | 4.54 | - |
| GPTQ | 4 | - | 19493 | 4.45 | 15.7 |
| GPTQ | 4 | 128 | 20570 | 4.23 | 16.3 |
| RTN | 3 | - | - | 14.89 | - |
| GPTQ | 3 | - | 15493 | 5.69 | 12.0 |
| GPTQ | 3 | 128 | 16566 | 4.80 | 13.0 |

</details>

<details>
<summary>LLaMA-65B</summary>

| LLaMA-65B | Bits | group-size | memory (MiB) | Wikitext2 | checkpoint size (GB) |
|-----------|------|------------|--------------|-----------|----------------------|
| FP16 | 16 | - | OOM | 3.53 | 121.0 |
| RTN | 4 | - | - | 3.92 | - |
| GPTQ | 4 | - | OOM | 3.84 | 31.1 |
| GPTQ | 4 | 128 | OOM | 3.65 | 32.3 |
| RTN | 3 | - | - | 10.59 | - |
| GPTQ | 3 | - | OOM | 5.04 | 23.6 |
| GPTQ | 3 | 128 | OOM | 4.17 | 25.6 |

</details>

Quantization requires a large amount of CPU memory. However, the memory required can be reduced by using swap memory.

Depending on the GPUs/drivers, there may be a difference in performance, which decreases as the model size increases (https://github.com/IST-DASLab/gptq/issues/1).

According to the GPTQ paper, the performance gap between FP16 and GPTQ decreases as the model size increases.
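
As a rough sanity check on the checkpoint sizes in the tables above, the on-disk size is roughly the quantized weight bits plus the per-group scales/zero-points plus whatever stays in FP16 (embeddings and output head). The parameter split and metadata layout below are assumptions for illustration, not an exact accounting of the checkpoint format:

```python
# Back-of-the-envelope checkpoint-size estimate for 4-bit, group-size-128 LLaMA-7B.
# Parameter counts and the assumed metadata layout are approximations.
quantized_params = 6.5e9   # linear-layer weights that get quantized (assumed)
fp16_params = 0.26e9       # embeddings / lm_head kept in FP16 (assumed)
bits, groupsize = 4, 128

weight_bytes = quantized_params * bits / 8
group_bytes = quantized_params / groupsize * (16 + bits) / 8  # FP16 scale + 4-bit zero per group (assumed)
fp16_bytes = fp16_params * 2

total = weight_bytes + group_bytes + fp16_bytes
print(f"~{total / 1e9:.2f} GB (~{total / 2**30:.2f} GiB)")  # roughly 3.9 GB / 3.6 GiB, close to the 3.5-3.6 reported above
```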

GPTQ vs bitsandbytes

<details>
<summary>LLaMA-7B (click me)</summary>

| LLaMA-7B (seqlen=2048) | Bits Per Weight (BPW) | memory (MiB) | c4 (ppl) |
|------------------------|-----------------------|--------------|----------|
| FP16 | 16 | 13948 | 5.22 |
| GPTQ-128g | 4.15 | 4781 | 5.30 |
| nf4-double_quant | 4.127 | 4804 | 5.30 |
| nf4 | 4.5 | 5102 | 5.30 |
| fp4 | 4.5 | 5102 | 5.33 |

</details>

<details>
<summary>LLaMA-13B</summary>

| LLaMA-13B (seqlen=2048) | Bits Per Weight (BPW) | memory (MiB) | c4 (ppl) |
|-------------------------|-----------------------|--------------|----------|
| FP16 | 16 | OOM | - |
| GPTQ-128g | 4.15 | 8589 | 5.02 |
| nf4-double_quant | 4.127 | 8581 | 5.04 |
| nf4 | 4.5 | 9170 | 5.04 |
| fp4 | 4.5 | 9170 | 5.11 |

</details>

<details>
<summary>LLaMA-33B</summary>

| LLaMA-33B (seqlen=1024) | Bits Per Weight (BPW) | memory (MiB) | c4 (ppl) |
|-------------------------|-----------------------|--------------|----------|
| FP16 | 16 | OOM | - |
| GPTQ-128g | 4.15 | 18441 | 3.71 |
| nf4-double_quant | 4.127 | 18313 | 3.76 |
| nf4 | 4.5 | 19729 | 3.75 |
| fp4 | 4.5 | 19729 | 3.75 |

</details>
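
The bits-per-weight numbers above can be reproduced approximately from the storage layouts: GPTQ with group size 128 stores an FP16 scale and a packed 4-bit zero-point per 128-weight group, while bitsandbytes' nf4 stores an FP32 absmax per 64-weight block, optionally double-quantized to 8 bits with a second-level FP32 constant per 256 blocks. The exact layouts are assumptions here, but the arithmetic lands close to the table:

```python
# Approximate bits-per-weight (BPW) for the formats compared above.
# Storage layouts are assumptions; they reproduce the table values closely.
gptq_128g = 4 + (16 + 4) / 128          # FP16 scale + 4-bit zero per 128-weight group -> ~4.16
nf4 = 4 + 32 / 64                       # FP32 absmax per 64-weight block              -> 4.5
nf4_dq = 4 + 8 / 64 + 32 / (64 * 256)   # 8-bit double-quantized absmax + FP32 per 256 blocks -> ~4.127

print(f"GPTQ-128g ~ {gptq_128g:.3f} BPW")
print(f"nf4       = {nf4:.3f} BPW")
print(f"nf4-dq    ~ {nf4_dq:.3f} BPW")
```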

Installation

If you don't have conda, install it first.

```
conda create --name gptq python=3.9 -y
conda activate gptq
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
# Or, if you're having trouble with conda, use pip with python3.9:
# pip3 install torch torchvision torchaudio
```

```
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
pip install -r requirements.txt
```
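
After installation, it is worth confirming that the CUDA build of PyTorch is actually in use, since the quantization and benchmark scripts assume a working GPU. A quick check using only standard torch calls:

```python
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```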

Dependencies

Dependencies are listed in requirements.txt and installed by the command above.

All experiments were run on a single NVIDIA RTX 3090.

Language Generation

LLaMA

```
# Convert LLaMA weights to HF format
python convert_llama_weights_to_hf.py --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir ./llama-hf

# Benchmark language generation with 4-bit LLaMA-7B:

# Save the compressed model
CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save llama7b-4bit-128g.pt

# Or save the compressed model as .safetensors
CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors llama7b-4bit-128g.safetensors

# Benchmark generating a 2048-token sequence with the saved model
CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --benchmark 2048 --check

# Benchmark the FP16 baseline; note that the model will be split across all listed GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python llama.py ${MODEL_DIR} c4 --benchmark 2048 --check

# Model inference with the saved model
CUDA_VISIBLE_DEVICES=0 python llama_inference.py ${MODEL_DIR} --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama"

# Model inference with the saved model, loading safetensors directly to the GPU
CUDA_VISIBLE_DEVICES=0 python llama_inference.py ${MODEL_DIR} --wbits 4 --groupsize 128 --load llama7b-4bit-128g.safetensors --text "this is llama" --device=0

# Model inference with the saved model with CPU offloading (this is very slow)
CUDA_VISIBLE_DEVICES=0 python llama_inference_offload.py ${MODEL_DIR} --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama" --pre_layer 16
```
With LLaMA-65B and pre_layer set to 50, it takes about 180 seconds to generate 45 tokens (5 -> 50 tokens) on a single RTX 3090.

Basically, 4-bit quantization with group size 128 is recommended.

You can also export the quantization parameters in toml+numpy format.

```
CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --true-sequential --act-order --groupsize 128 --quant-directory ${TOML_DIR}
```
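
If you saved the quantized model with --save_safetensors, you can inspect the stored tensors without instantiating the model. A minimal sketch using the safetensors package (the file name is the example from the commands above):

```python
# Peek at a quantized .safetensors checkpoint without loading the full model.
from safetensors import safe_open

with safe_open("llama7b-4bit-128g.safetensors", framework="pt", device="cpu") as f:
    for name in list(f.keys())[:10]:   # first few tensors
        t = f.get_tensor(name)
        print(name, tuple(t.shape), t.dtype)
```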

Acknowledgements

This code is based on GPTQ

Thanks to Meta AI for releasing LLaMA, a powerful LLM.

Triton GPTQ kernel code is based on GPTQ-triton