# QuIP
This repo is an adaptation of jerry-chee/QuIP with the following additions:
- support for more model architectures
- saving and loading of quantized models
- channel-wise quantization
Please install the CUDA kernel first:

```bash
pip install -r requirements.txt
python setup_cuda.py install
```
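Before building, it can help to confirm that PyTorch sees a CUDA device. The check below uses only standard PyTorch APIs; the name of the extension built by `setup_cuda.py` is repo-specific and not assumed here.

```python
# Sanity check: verify a CUDA device is visible and report its capability.
import torch

assert torch.cuda.is_available(), "A CUDA-capable GPU and a CUDA build of PyTorch are required"
print("Device:", torch.cuda.get_device_name(0))
print("Compute capability:", torch.cuda.get_device_capability(0))
print("PyTorch CUDA version:", torch.version.cuda)
```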
The following are perplexity scores for LLaMA-2-70B on the WikiText dataset with a stride of 512 and a maximum sequence length of 2048. Models are quantized with random samples from C4. A sketch of the evaluation loop follows the results.
- fp16: 4.062
- 3-bit: 4.508 (Hugging Face link)
- 2-bit: 7.150 (Hugging Face link)
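The numbers above come from the usual sliding-window perplexity evaluation. Below is a minimal sketch of that loop using standard `transformers`/`datasets` APIs; the exact evaluation script is not shown in this repo's snippets, so the `wikitext-2-raw-v1` config and the `quant_model`/`tokenizer` names (borrowed from the Usage section) are assumptions.

```python
import torch
from datasets import load_dataset

# Sliding-window perplexity sketch (stride 512, max length 2048).
# ASSUMPTIONS: the "wikitext-2-raw-v1" config and the use of the quantized
# model/tokenizer from the Usage section are illustrative, not prescribed.
def wikitext_perplexity(model, tokenizer, stride=512, max_length=2048):
    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")
    seq_len = encodings.input_ids.size(1)

    nlls, prev_end = [], 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        trg_len = end - prev_end  # number of new tokens scored in this window
        input_ids = encodings.input_ids[:, begin:end].to(model.device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100  # mask tokens already scored in the previous window
        with torch.no_grad():
            loss = model(input_ids, labels=target_ids).loss
        nlls.append(loss * trg_len)
        prev_end = end
        if end == seq_len:
            break
    return torch.exp(torch.stack(nlls).sum() / prev_end)

# e.g. ppl = wikitext_perplexity(quant_model, tokenizer)
```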
## Usage
- Quantize
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from quantizer import QuipQuantizer

model_name = "meta-llama/Llama-2-70b-hf"
quant_dir = "llama-70b_2bit_quip"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the full-precision model in fp16 before quantization.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Quantize to 2 bits using calibration samples drawn from C4, then save.
quant = QuipQuantizer(bits=2, dataset="c4")
quant_model = quant.quantize_model(model, tokenizer)
quant.save(quant_model, quant_dir)
```
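Optionally, the tokenizer can be saved into the same folder so the quantized checkpoint is self-contained; `save_pretrained` is the standard `transformers` API and not specific to this repo.

```python
# Store the tokenizer next to the quantized weights so quant_dir can be
# shared or uploaded as a complete checkpoint.
tokenizer.save_pretrained(quant_dir)
```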
- Inference
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import init_empty_weights
from quantizer import load_quantized_model

model_name = "meta-llama/Llama-2-70b-hf"
quant_dir = "llama-70b_2bit_quip"

# Build an empty (meta-device) model skeleton, then fill it with the
# quantized weights from quant_dir, dispatching layers across available devices.
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
empty_model.tie_weights()
quant_model = load_quantized_model(empty_model, save_folder=quant_dir, device_map="auto")

tokenizer = AutoTokenizer.from_pretrained(model_name)
input_ids = tokenizer.encode("The capital of France is", return_tensors="pt").cuda()
print(tokenizer.decode(quant_model.generate(input_ids, do_sample=True)[0]))
```
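The usual `generate` keyword arguments work on the loaded model as well; the values below are only an illustrative sampling configuration, not settings prescribed by this repo.

```python
# Example of a more controlled generation; all kwargs are standard
# transformers generate() parameters, values are illustrative.
output_ids = quant_model.generate(
    input_ids,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    max_new_tokens=128,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```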