SpinQuant

This repository contains the code for SpinQuant, introduced in our work: "SpinQuant: LLM Quantization with Learned Rotations"

In this work, we found that

  1. Rotation is a principled way to remove outliers in LLMs and assist quantization;
  2. Not all rotations help equally: random rotations produce a large variance in quantized-model accuracy;
  3. Learning the rotation with Cayley optimization greatly enhances the final performance.

As a result, SpinQuant narrows the accuracy gap of W4A4KV4 quantization (4-bit weights, activations, and KV cache) with full precision to merely 2.9 points on zero-shot reasoning tasks for the LLaMA-2 7B model, surpassing LLM-QAT by 19.1 points and SmoothQuant by 25.0 points.
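The two ideas behind findings (1) and (3) can be sketched in a few lines of NumPy. This is an illustrative toy example, not the authors' code: it shows (a) that an orthogonal rotation spreads an outlier channel across many dimensions, shrinking the dynamic range a quantizer must cover, and (b) that the Cayley transform maps any skew-symmetric matrix to an exactly orthogonal matrix, which is what makes rotations optimizable while staying valid.

```python
# Toy sketch (not the SpinQuant implementation) of rotation-based
# outlier removal and the Cayley parameterization of rotations.
import numpy as np

rng = np.random.default_rng(0)
d = 64

# (b) Cayley transform: skew-symmetric A -> orthogonal R = (I + A)^{-1}(I - A).
# Optimizing over A (unconstrained) moves R along the manifold of rotations.
A = rng.normal(size=(d, d))
A = (A - A.T) / 2                      # make A skew-symmetric
I = np.eye(d)
R = np.linalg.solve(I + A, I - A)      # R = (I + A)^{-1} (I - A)
assert np.allclose(R @ R.T, I)         # R is exactly orthogonal

# (a) a vector with one extreme outlier channel flattens after rotation
x = rng.normal(size=d)
x[0] = 100.0                           # inject an outlier channel
x_rot = x @ R
assert np.abs(x_rot).max() < np.abs(x).max()   # peak magnitude shrinks
```

Because `R` is orthogonal, rotating activations and counter-rotating the following weights leaves the network's function unchanged; only the statistics seen by the quantizer improve.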

<div align=center> <img width=80% src="./SpinQuant.png"/> </div>

News

https://github.com/user-attachments/assets/e7bebfd5-ef13-440f-8066-7bf9205ad309

Citation

If you find our code useful for your research, please consider citing:

@article{liu2024spinquant,
    title={SpinQuant--LLM quantization with learned rotations},
    author={Liu, Zechun and Zhao, Changsheng and Fedorov, Igor and Soran, Bilge and Choudhary, Dhruv and Krishnamoorthi, Raghuraman and Chandra, Vikas and Tian, Yuandong and Blankevoort, Tijmen},
    journal={arXiv preprint arXiv:2405.16406},
    year={2024}
}

Run

1. Requirements:

2. Steps to run:

For the scripts here, set `output_rotation_path`, `output_dir`, `logging_dir`, and `optimized_rotation_path` to your own locations. For gated repos such as meta-llama, set your HF token in `access_token`.

Step 1: Optimize Rotation Matrix

Step 2: Run PTQ evaluation with optimized rotation

After obtaining the optimized rotation, put the rotation matrix into `optimized_rotation_path` for evaluation.
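Conceptually, what the evaluation step does with the matrix loaded from `optimized_rotation_path` is fold it into adjacent weight matrices offline, so the rotated network computes the same function with no runtime overhead. A minimal sketch of that folding identity (illustrative names, not the repo's API):

```python
# Sketch of folding an orthogonal rotation into neighboring linear layers.
# Variable names are hypothetical; SpinQuant applies this to specific
# projection matrices inside the transformer.
import numpy as np

rng = np.random.default_rng(0)
d = 16
# stand-in for a learned rotation (any orthogonal matrix works here)
R, _ = np.linalg.qr(rng.normal(size=(d, d)))

W_a = rng.normal(size=(d, d))      # layer writing hidden activations
W_b = rng.normal(size=(d, d))      # layer reading them back

W_a_rot = W_a @ R                  # rotate the first layer's outputs...
W_b_rot = R.T @ W_b                # ...counter-rotate the next layer's inputs

x = rng.normal(size=(3, d))
# network function is preserved: R R^T = I cancels between the layers
assert np.allclose(x @ W_a @ W_b, x @ W_a_rot @ W_b_rot)
```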

3. Export to ExecuTorch

We also support exporting the quantized model to ExecuTorch, which lets us use its quantization kernels and achieve real-world speedup. For more information on kernel implementation details, please see ExecuTorch, and ExecuTorch with SpinQuant. We currently support 4-bit weight quantization (set group-size to 256 for the 8B model and to 32 for smaller models) together with 8-bit dynamic activation quantization.
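To make the group-size knob concrete, here is a minimal sketch of group-wise symmetric 4-bit weight quantization: each group of `group_size` consecutive weights shares one scale, so smaller groups track local weight statistics more tightly at the cost of more scale metadata. This is illustrative only; ExecuTorch's kernels use their own packing and layout.

```python
# Toy group-wise symmetric int4 quantizer (illustrative, not ExecuTorch code).
import numpy as np

def quantize_groupwise_4bit(w, group_size):
    """Quantize a 1-D float weight vector to int4 [-8, 7], one scale per group."""
    w = w.reshape(-1, group_size)                       # split into groups
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # per-group scale
    scale[scale == 0] = 1.0                             # avoid divide-by-zero
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
q, scale = quantize_groupwise_4bit(w, group_size=32)
w_hat = dequantize(q, scale)
# rounding error is bounded by half a quantization step per group
max_err = np.abs(w - w_hat).max()
```

With 8-bit dynamic activation quantization, activation scales are computed on the fly per inference, so only the weights need this offline grouping.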

To obtain ExecuTorch-compatible quantized models, you can use the following scripts:

We also provide an example Colab notebook to train and export ExecuTorch-compatible Llama 3.2 models.

Note

Arguments

Quantized Models

Zero-shot is the average accuracy (%) on zero-shot reasoning tasks (higher is better); Wiki2 is WikiText-2 perplexity (lower is better).

| Method | LLaMA-3 8B<br>Zero-shot / Wiki2 | LLaMA-3 70B<br>Zero-shot / Wiki2 | LLaMA-2 7B<br>Zero-shot / Wiki2 | LLaMA-2 13B<br>Zero-shot / Wiki2 | LLaMA-2 70B<br>Zero-shot / Wiki2 |
| --- | --- | --- | --- | --- | --- |
| FloatingPoint | 69.6 / 6.1 | 74.5 / 2.8 | 66.9 / 5.5 | 68.3 / 5.0 | 72.9 / 3.3 |
| **W4A16KV16** | | | | | |
| RTN | 65.4 / 7.8 | 35.5 / 1e5 | 63.6 / 7.2 | 57.9 / 6.4 | 69.2 / 4.6 |
| SmoothQuant | 61.0 / 10.7 | 66.9 / 12.0 | 59.1 / 7.5 | 63.3 / 6.1 | 70.2 / 4.1 |
| LLM-QAT | 67.7 / 7.1 | -- | 64.9 / 5.9 | -- | -- |
| GPTQ | 66.5 / 7.2 | 35.7 / 1e5 | 64.5 / 11.3 | 64.7 / 5.6 | 71.9 / 3.9 |
| QuaRot | 68.4 / 6.4 | 70.3 / 7.9 | 65.8 / 5.6 | 68.3 / 5.0 | 72.2 / 3.5 |
| SpinQuant | 68.5 / 6.4 | 71.6 / 4.8 | 65.9 / 5.6 | 68.5 / 5.0 | 72.6 / 3.5 |
| **W4A4KV16** | | | | | |
| RTN | 38.5 / 9e2 | 35.6 / 1e5 | 35.6 / 2e3 | 35.3 / 7e3 | 35.1 / 2e5 |
| SmoothQuant | 40.3 / 8e2 | 55.3 / 18.0 | 41.8 / 2e2 | 44.9 / 34.5 | 64.6 / 57.1 |
| LLM-QAT | 44.9 / 42.9 | -- | 47.8 / 12.9 | -- | -- |
| GPTQ | 37.0 / 9e2 | 35.3 / 1e5 | 36.8 / 8e3 | 35.3 / 5e3 | 35.5 / 2e6 |
| QuaRot | 63.8 / 7.9 | 65.4 / 20.4 | 63.5 / 6.1 | 66.7 / 5.4 | 70.4 / 3.9 |
| SpinQuant | 65.8 / 7.1 | 69.5 / 5.5 | 64.1 / 5.9 | 67.2 / 5.2 | 71.0 / 3.8 |
| **W4A4KV4** | | | | | |
| RTN | 38.2 / 1e3 | 35.2 / 1e5 | 37.1 / 2e3 | 35.4 / 7e3 | 35.0 / 2e5 |
| SmoothQuant | 38.7 / 1e3 | 52.4 / 22.1 | 39.0 / 6e2 | 40.5 / 56.6 | 55.9 / 10.5 |
| LLM-QAT | 43.2 / 52.5 | -- | 44.9 / 14.9 | -- | -- |
| GPTQ | 37.1 / 1e3 | 35.1 / 1e5 | 36.8 / 9e3 | 35.2 / 5e3 | 35.6 / 1e6 |
| QuaRot | 63.3 / 8.0 | 65.1 / 20.2 | 62.5 / 6.4 | 66.2 / 5.4 | 70.3 / 3.9 |
| SpinQuant | 65.2 / 7.3 | 69.3 / 5.5 | 64.0 / 5.9 | 66.9 / 5.3 | 71.2 / 3.8 |

You can download the optimized rotation matrices here.

Acknowledgement

The results reported in the paper were obtained with Meta's internal LLaMA codebase. We reproduced our experiments with the HuggingFace codebase and released the code here, which is partially based on HuggingFace transformers, QuaRot, QuIP#, and Optimization-on-Stiefel-Manifold-via-Cayley-Transform. SpinQuant is also available in LLMC, an efficient LLM compression toolkit.

Contact

Zechun Liu, Reality Labs, Meta Inc (zechunliu at meta dot com)

Changsheng Zhao, Reality Labs, Meta Inc (cszhao at meta dot com)

Relevant Projects

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases [Paper] [Code]

LLM-QAT: Data-Free Quantization Aware Training for Large Language Models [Paper] [Code]

License

SpinQuant is CC-BY-NC 4.0 licensed as of now.