<div align="center">

AutoRound

<h3> Advanced Quantization Algorithm for LLMs</h3>

python version license

<div align="left">

AutoRound is an advanced quantization algorithm for low-bit LLM/VLM inference, tailored for a wide range of models. AutoRound adopts signed gradient descent to fine-tune the rounding values and min-max clipping values of weights in just 200 steps. It competes impressively against recent methods without introducing any additional inference overhead and keeps the tuning cost low. The image below presents an overview of AutoRound. Check out our paper on arXiv for more details, and find quantized models in several Hugging Face Spaces, e.g. OPEA, Kaitchup and fbaldassarri.

<div align="center">

<div align="left">

What's New

Installation

Install from PyPI

pip install auto-round
<details> <summary>Build from Source</summary>
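# run from the root of a cloned auto-round repository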
pip install -vvv --no-build-isolation .
</details>

Model Quantization

Basic Usage (Gaudi2/CPU/GPU)

A user guide detailing the full list of supported arguments is available by running auto-round -h in the terminal. Set the desired format(s) via --format; exporting to multiple formats at once is supported. Please check out step-by-step-instruction for more details about the calibration dataset and evaluation.

auto-round \
    --model facebook/opt-125m \
    --bits 4 \
    --group_size 128 \
    --format "auto_round,auto_gptq" \
    --disable_eval \
    --output_dir ./tmp_autoround

We provide two additional recipes: one for best accuracy and one for fast tuning with low memory. Details are below.

<details> <summary>Other Recipes</summary>
## best accuracy, 3X slower, low_gpu_mem_usage could save ~20G but ~30% slower
auto-round \
  --model facebook/opt-125m \
  --bits 4 \
  --group_size 128 \
  --nsamples 512 \
  --iters 1000 \
  --low_gpu_mem_usage \
  --disable_eval 
## fast and low memory, 2-3X speedup, slight accuracy drop at W4G128
auto-round \
  --model facebook/opt-125m \
  --bits 4 \
  --group_size 128 \
  --nsamples 128 \
  --iters 200 \
  --seqlen 512 \
  --batch_size 4 \
  --disable_eval 
</details>

Formats

AutoRound Format: This format is well-suited for CPU and HPU devices, 2-bit precision, and mixed-precision inference; [2,4] bits are supported. However, it has not yet gained widespread community adoption.

AutoGPTQ Format: This format is well-suited for symmetric quantization on CUDA devices and is widely adopted by the community; [2,3,4,8] bits are supported. It also benefits from the Marlin kernel, which can boost inference performance notably. However, the asymmetric kernel has issues that can cause considerable accuracy drops, particularly with 2-bit quantization and small models. Additionally, symmetric quantization tends to perform poorly at 2-bit precision.

AutoAWQ Format: This format is well-suited for asymmetric 4-bit quantization on CUDA devices and is widely adopted within the community; only 4-bit quantization is supported. It features specialized layer fusion tailored for Llama models.
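The target format is chosen at export time. As a minimal sketch (an assumption on our side, not an official recipe: we assume save_quantized can be called more than once as long as inplace=False so the in-memory weights are not packed in place), the tuned model from the API example below could be exported to two formats:

## hypothetical multi-format export; 'autoround' is the AutoRound object from the API example below
autoround.save_quantized("./tmp_autoround", format='auto_round', inplace=False)
autoround.save_quantized("./tmp_autogptq", format='auto_gptq', inplace=False)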

API Usage (Gaudi2/CPU/GPU)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

from auto_round import AutoRound

bits, group_size, sym = 4, 128, True
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, sym=sym)

## the best accuracy, 3X slower, low_gpu_mem_usage could save ~20G but ~30% slower
# autoround = AutoRound(model, tokenizer, nsamples=512, iters=1000, low_gpu_mem_usage=True, bits=bits, group_size=group_size, sym=sym)

## fast and low memory, 2-3X speedup, slight accuracy drop at W4G128
# autoround = AutoRound(model, tokenizer, nsamples=128, iters=200, seqlen=512, batch_size=4, bits=bits, group_size=group_size, sym=sym )

autoround.quantize()
output_dir = "./tmp_autoround"
## format = 'auto_round' (default in version > 0.3.0), 'auto_gptq', 'auto_awq'
autoround.save_quantized(output_dir, format='auto_round', inplace=True) 
<details> <summary>Detailed Hyperparameters</summary> </details>

API Usage for VLMs

This feature is experimental and may be subject to change, including potential bug fixes, API modifications, or adjustments to default hyperparameters.

By default, AutoRoundMLLM only quantizes the text module of VLMs and uses NeelNanda/pile-10k for calibration. To quantize the entire model, you can enable quant_nontext_module by setting it to True, though support for this feature is limited. For more information, please refer to the AutoRoundMLLM readme.

from auto_round import AutoRoundMLLM
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, AutoTokenizer

## load the model
model_name = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

## quantize the model
bits, group_size, sym = 4, 128, True
autoround = AutoRoundMLLM(model, tokenizer, processor,
                          bits=bits, group_size=group_size, sym=sym)
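## to also quantize the non-text modules (experimental, limited support), pass quant_nontext_module=True above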
autoround.quantize()

# save the quantized model, set format='auto_gptq' to use AutoGPTQ format
output_dir = "./tmp_autoround"
autoround.save_quantized(output_dir, format='auto_round', inplace=True)

Quantization Costs

Testing was conducted on an Nvidia A100 80G using the nightly version of PyTorch 2.6.0.dev20241029+cu124. Please note that data loading and packing costs have been excluded from the evaluation. We enable torch.compile for Torch 2.6, but not for 2.5 due to issues we encountered.

To optimize GPU memory usage, in addition to activating low_gpu_mem_usage, you can set gradient_accumulate_steps=8 and batch_size=1, though this may increase tuning time.
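For example, a minimal sketch of this low-memory configuration with the Python API (assuming it accepts the same gradient_accumulate_steps and batch_size arguments as the CLI):

autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True,
                      low_gpu_mem_usage=True,
                      gradient_accumulate_steps=8,  ## accumulate gradients over 8 micro-batches
                      batch_size=1)                 ## smallest per-step batch to cap peak memory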

The 3B and 14B models are Qwen 2.5, the 8X7B model is Mixtral, and the remaining models are LLaMA 3.1.

| Torch version/Config W4G128 | 3B | 8B | 14B | 70B | 8X7B |
|---|---|---|---|---|---|
| 2.6 with torch compile | 7min<br/>10GB | 12min<br/>18GB | 23min<br/>22GB | 120min<br/>42GB | 28min<br/>46GB |
| 2.6 with torch compile <br/> low_gpu_mem_usage=True | 12min<br/>6GB | 19min<br/>10GB | 33min<br/>11GB | 140min<br/>25GB | 38min<br/>36GB |
| 2.6 with torch compile <br/> low_gpu_mem_usage=True <br/> gradient_accumulate_steps=8,bs=1 | 15min<br/>3GB | 25min<br/>6GB | 45min<br/>7GB | 187min<br/>19GB | 75min<br/>36GB |
| 2.5 w/o torch compile | 8min<br/>10GB | 16min<br/>20GB | 30min<br/>25GB | 140min<br/>49GB | 50min<br/>49GB |

Model Inference

Please run the quantization code above before running inference.

AutoRound format

CPU: pip install intel-extension-for-pytorch (much higher speed on Intel CPUs) or pip install intel-extension-for-transformers.

HPU: docker image with Gaudi Software Stack is recommended. More details can be found in Gaudi Guide.

CUDA: no extra steps are needed for symmetric quantization; for asymmetric quantization, auto-round must be installed from source.

CPU/HPU/CUDA

from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRoundConfig

backend = "auto"  ##cpu, hpu, cuda
quantization_config = AutoRoundConfig(
    backend=backend
)
quantized_model_path = "./tmp_autoround"
model = AutoModelForCausalLM.from_pretrained(quantized_model_path,
                                             device_map=backend.split(':')[0],
                                             quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
<br> <details> <summary>Evaluation</summary>
auto-round --model saved_quantized_model \
    --eval \
    --task lambada_openai \
    --eval_bs 1
</details>

AutoGPTQ/AutoAWQ format

from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_model_path = "./tmp_autoround"
model = AutoModelForCausalLM.from_pretrained(quantized_model_path,
                                             device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))

Support List

AutoRound supports basically all the major large language models.

Please note that an asterisk (*) indicates third-party quantized models, which may lack accuracy data and use a different recipe. We greatly appreciate their efforts and encourage more users to share their models, as we cannot release most of the models ourselves.

| Model | Supported |
|---|---|
| nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | model-opea-int4-sym-autoround, model-opea-int4-sym-autogptq |
| meta-llama/Llama-3.2-90B-Vision-Instruct | model-opea-int4-sym-autoround, model-opea-int4-sym-autogptq |
| Qwen/QwQ-32B-Preview | model-opea-int4-sym-autoround-mixed, model-opea-int4-sym-autoawq-mixed |
| THUDM/cogvlm2-llama3-chat-19B | model-opea-int4-sym-autoround |
| Qwen/Qwen2-VL-Instruct | model-opea-int4-sym-autoround, model-opea-int4-sym-autogptq |
| meta-llama/Llama-3.2-11B-Vision | model-opea-int4-sym-autoround, model-opea-int4-sym-autogptq |
| microsoft/Phi-3.5-vision-instruct | model-opea-int4-sym-autoround, model-opea-int4-sym-gptq |
| liuhaotian/llava-v1.5-7b | model-opea-int4-sym-autoround, model-opea-int4-sym-autogptq |
| Qwen/Qwen2.5-7B-Instruct | model-opea-int4-sym-autoround, model-opea-int4-sym-autogptq, model-kaitchup-autogptq-int4*, recipe |
| Qwen/Qwen2.5-14B-Instruct | model-opea-int4-sym-autoround, model-opea-int4-sym-autogptq |
| Qwen/Qwen2.5-32B-Instruct | model-opea-int4-sym-autoround |
| Qwen/Qwen2.5-Coder-32B-Instruct | model-kaitchup-autogptq-int4* |
| Qwen/Qwen2.5-72B-Instruct | model-opea-int4-sym-autoround, model-opea-int4-sym-autogptq, model-kaitchup-autogptq-int4*, model-kaitchup-autogptq-int2*, recipe |
| meta-llama/Meta-Llama-3.1-70B-Instruct | model-opea-int4-sym-autoround, model-opea-int4-sym-autogptq, model-opea-int4-asym-autoround |
| meta-llama/Meta-Llama-3.1-8B-Instruct | model-opea-int4-sym-autoround, model-opea-int4-sym-autogptq, model-kaitchup-autogptq-int4*, model-kaitchup-autogptq-sym-int4*, recipe |
| meta-llama/Meta-Llama-3.1-8B | model-kaitchup-autogptq-sym-int4* |
| Qwen/Qwen2-7B | model-autoround-sym-int4, model-autogptq-sym-int4 |
| THUDM/glm-4-9b-chat | model-opea-int4-sym-autoround, model-opea-int4-sym-autogptq |
| Qwen/Qwen2-57B-A14B-Instruct | model-autoround-sym-int4, model-autogptq-sym-int4 |
| 01-ai/Yi-1.5-9B | model-LnL-AI-autogptq-int4* |
| 01-ai/Yi-1.5-9B-Chat | model-LnL-AI-autogptq-int4* |
| Intel/neural-chat-7b-v3-3 | model-autogptq-int4 |
| Intel/neural-chat-7b-v3-1 | model-autogptq-int4 |
| TinyLlama-1.1B-intermediate | model-LnL-AI-autogptq-int4* |
| mistralai/Mistral-7B-v0.1 | model-autogptq-lmhead-int4, model-autogptq-int4 |
| google/gemma-2b | model-autogptq-int4 |
| tiiuae/falcon-7b | model-autogptq-int4-G64 |
| sapienzanlp/modello-italia-9b | model-fbaldassarri-autogptq-int4* |
| microsoft/phi-2 | model-autoround-sym-int4, model-autogptq-sym-int4 |
| microsoft/Phi-3.5-mini-instruct | model-kaitchup-autogptq-sym-int4* |
| mistralai/Mistral-7B-Instruct-v0.2 | outdated-recipe |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | outdated-recipe |
| mistralai/Mixtral-8x7B-v0.1 | outdated-recipe |
| meta-llama/Meta-Llama-3-8B-Instruct | outdated-recipe |
| google/gemma-7b | outdated-recipe |
| meta-llama/Llama-2-7b-chat-hf | outdated-recipe |
| baichuan-inc/Baichuan2-7B-Chat | outdated-recipe |
| 01-ai/Yi-6B-Chat | outdated-recipe |
| facebook/opt-2.7b | outdated-recipe |
| bigscience/bloom-3b | outdated-recipe |
| EleutherAI/gpt-j-6b | outdated-recipe |

Integration

AutoRound has been integrated into multiple repositories.

Intel Neural Compressor

ModelCloud/GPTQModel

pytorch/ao

Reference

If you find AutoRound useful for your research, please cite our paper:

@article{cheng2023optimize,
  title={Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}