SpinQuant

This repository contains the code of SpinQuant introduced in our work: "SpinQuant: LLM Quantization with Learned Rotations"

In this work, we found that

  1. Rotation is a principled way to remove outliers in LLMs and assist quantization;
  2. Not all rotations help equally, and random rotations produce a large variance in quantized-model accuracy;
  3. Learning the rotation with Cayley optimization greatly enhances the final performance.

As a result, SpinQuant narrows the accuracy gap of W4A4KV4 quantization with full precision to merely 2.9 points for the LLaMA-2 7B model on zero-shot reasoning tasks, surpassing LLM-QAT by 19.1 points and SmoothQuant by 25.0 points.

<div align=center> <img width=80% src="./SpinQuant.png"/> </div>
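The first finding rests on a simple identity: for an orthogonal matrix R, rotating the weights by R and the activations by R (so that the rotation cancels inside the matmul) leaves every layer output unchanged, while spreading outlier channels across all dimensions so they are easier to quantize. Below is a minimal PyTorch sketch of this identity, using a random QR-based rotation as a stand-in for the learned one; it is an illustration, not the repository code.

```python
import torch

torch.manual_seed(0)
d = 512
W = torch.randn(256, d)        # weight of a linear layer computing y = x @ W.T
x = torch.randn(8, d)
x[:, 0] *= 50.0                # inject an outlier channel, as seen in LLM activations

# Random orthogonal rotation via QR decomposition (SpinQuant *learns* this matrix).
R, _ = torch.linalg.qr(torch.randn(d, d))

y_ref = x @ W.T                # original layer
y_rot = (x @ R) @ (W @ R).T    # rotated activations and rotated (foldable) weight

print(torch.allclose(y_ref, y_rot, atol=1e-3))               # True: the model is unchanged
print((x.abs().max() / x.abs().mean()).item())               # large outlier ratio before rotation
print(((x @ R).abs().max() / (x @ R).abs().mean()).item())   # much smaller after rotation
```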

Citation

If you find our code useful for your research, please consider citing:

@article{liu2024spinquant,
    title={SpinQuant--LLM quantization with learned rotations},
    author={Liu, Zechun and Zhao, Changsheng and Fedorov, Igor and Soran, Bilge and Choudhary, Dhruv and Krishnamoorthi, Raghuraman and Chandra, Vikas and Tian, Yuandong and Blankevoort, Tijmen},
    journal={arXiv preprint arXiv:2405.16406},
    year={2024}
}

Run

1. Requirements:

2. Steps to run:

Step 1: Optimize Rotation Matrix
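As a rough picture of what this step optimizes (a hedged sketch under simplified assumptions, not the repository's Cayley-SGD implementation): the rotation can be kept exactly orthogonal by parameterizing it through the Cayley transform of a skew-symmetric matrix and training those parameters to reduce the error introduced by quantization. SpinQuant itself optimizes on the Stiefel manifold with Cayley SGD against a task-level loss; the layer-wise reconstruction objective, `fake_quant` helper, and calibration tensors below are illustrative only.

```python
import torch

torch.manual_seed(0)
d, d_out = 512, 1024
W = torch.randn(d_out, d)                      # frozen full-precision weight
x = torch.randn(64, d)                         # hypothetical calibration activations
x[:, 0] *= 50.0                                # outlier channel
y_ref = x @ W.T                                # full-precision reference output

def fake_quant(t, bits=4):
    # symmetric per-tensor round-to-nearest quantize-dequantize
    qmax = 2 ** (bits - 1) - 1
    scale = t.abs().max() / qmax
    return torch.clamp(torch.round(t / scale), -qmax - 1, qmax) * scale

A_raw = torch.zeros(d, d, requires_grad=True)  # free parameters behind the rotation
opt = torch.optim.Adam([A_raw], lr=1e-3)
I = torch.eye(d)

for step in range(200):
    A = A_raw - A_raw.T                        # skew-symmetric matrix
    R = torch.linalg.solve(I + A, I - A)       # Cayley transform: R is orthogonal
    y_q = (x @ R) @ fake_quant(W @ R).T        # quantized, rotated linear layer
    loss = (y_q - y_ref).pow(2).mean()         # match the full-precision output
    opt.zero_grad()
    loss.backward()
    opt.step()
```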

Step 2: Run PTQ evaluation with optimized rotation
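Conceptually, once the rotation has been optimized it is folded into the adjacent weight matrices, so the rotated model is numerically identical in full precision and a standard PTQ routine can then quantize and evaluate it. The toy example below shows only this merging identity for a single residual block; the repository applies it to the full LLaMA architecture and the additional rotations described in the paper.

```python
import torch

torch.manual_seed(0)
d, d_ff = 256, 1024
W_up = torch.randn(d_ff, d)                  # reads from the residual stream
W_down = torch.randn(d, d_ff)                # writes back to the residual stream
R, _ = torch.linalg.qr(torch.randn(d, d))    # stands in for the optimized rotation
h = torch.randn(4, d)                        # residual-stream activations

# Original block: h + W_down( relu( W_up h ) )
out_ref = h + torch.relu(h @ W_up.T) @ W_down.T

# Rotated model: the residual stream becomes h' = h @ R and the weights absorb R.
h_rot = h @ R
W_up_rot = W_up @ R                          # read side:  (W_up R) applied to h' equals W_up h
W_down_rot = R.T @ W_down                    # write side: output re-expressed in the rotated basis
out_rot = h_rot + torch.relu(h_rot @ W_up_rot.T) @ W_down_rot.T

# The rotated output equals the original output expressed in the rotated basis.
print(torch.allclose(out_rot, out_ref @ R, atol=1e-3))
```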

3. Export to ExecuTorch

We also support exporting the quantized model to ExecuTorch, which lets us run it with ExecuTorch's quantization kernels and obtain real inference speedups on device. For details on the kernel implementations, please see ExecuTorch and ExecuTorch for LLaMA. We currently support 4-bit weight quantization (group size 256) together with 8-bit dynamic activation quantization.
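For intuition about this quantization scheme, here is a schematic sketch (not the actual export code; the helper names are hypothetical): each weight row gets one symmetric 4-bit scale per group of 256 input channels, while activations are quantized to 8 bits dynamically, with one scale per token computed at runtime.

```python
import torch

def quant_weight_4bit_groupwise(w, group_size=256):
    # one symmetric scale per group of `group_size` input channels
    out_ch, in_ch = w.shape
    g = w.reshape(out_ch, in_ch // group_size, group_size)
    scale = g.abs().amax(dim=-1, keepdim=True) / 7            # int4 range [-8, 7]
    q = torch.clamp(torch.round(g / scale), -8, 7)
    return q.reshape(out_ch, in_ch).to(torch.int8), scale.squeeze(-1)

def quant_act_8bit_dynamic(x):
    # one scale per token, computed at runtime from the token's max magnitude
    scale = x.abs().amax(dim=-1, keepdim=True) / 127
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

w = torch.randn(4096, 4096)
x = torch.randn(2, 4096)
wq, w_scale = quant_weight_4bit_groupwise(w)
xq, x_scale = quant_act_8bit_dynamic(x)
```

In an actual export path the 4-bit values would typically be packed two per byte before being handed to the quantized kernels.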

To obtain ExecuTorch-compatible quantized models, you can use the following scripts:

Note

Arguments

Quantized Models

Zero-shot is the average accuracy (%) on zero-shot reasoning tasks (higher is better); Wiki2 is WikiText-2 perplexity (lower is better).

| Method | LLaMA-3 8B Zero-shot | LLaMA-3 8B Wiki2 | LLaMA-3 70B Zero-shot | LLaMA-3 70B Wiki2 | LLaMA-2 7B Zero-shot | LLaMA-2 7B Wiki2 | LLaMA-2 13B Zero-shot | LLaMA-2 13B Wiki2 | LLaMA-2 70B Zero-shot | LLaMA-2 70B Wiki2 |
|---|---|---|---|---|---|---|---|---|---|---|
| FloatingPoint | 69.6 | 6.1 | 74.5 | 2.8 | 66.9 | 5.5 | 68.3 | 5.0 | 72.9 | 3.3 |
| W4A16KV16 | | | | | | | | | | |
| RTN | 65.4 | 7.8 | 35.5 | 1e5 | 63.6 | 7.2 | 57.9 | 6.4 | 69.2 | 4.6 |
| SmoothQuant | 61.0 | 10.7 | 66.9 | 12.0 | 59.1 | 7.5 | 63.3 | 6.1 | 70.2 | 4.1 |
| LLM-QAT | 67.7 | 7.1 | -- | -- | 64.9 | 5.9 | -- | -- | -- | -- |
| GPTQ | 66.5 | 7.2 | 35.7 | 1e5 | 64.5 | 11.3 | 64.7 | 5.6 | 71.9 | 3.9 |
| QuaRot | 68.4 | 6.4 | 70.3 | 7.9 | 65.8 | 5.6 | 68.3 | 5.0 | 72.2 | 3.5 |
| SpinQuant | 68.5 | 6.4 | 71.6 | 4.8 | 65.9 | 5.6 | 68.5 | 5.0 | 72.6 | 3.5 |
| W4A4KV16 | | | | | | | | | | |
| RTN | 38.5 | 9e2 | 35.6 | 1e5 | 35.6 | 2e3 | 35.3 | 7e3 | 35.1 | 2e5 |
| SmoothQuant | 40.3 | 8e2 | 55.3 | 18.0 | 41.8 | 2e2 | 44.9 | 34.5 | 64.6 | 57.1 |
| LLM-QAT | 44.9 | 42.9 | -- | -- | 47.8 | 12.9 | -- | -- | -- | -- |
| GPTQ | 37.0 | 9e2 | 35.3 | 1e5 | 36.8 | 8e3 | 35.3 | 5e3 | 35.5 | 2e6 |
| QuaRot | 63.8 | 7.9 | 65.4 | 20.4 | 63.5 | 6.1 | 66.7 | 5.4 | 70.4 | 3.9 |
| SpinQuant | 65.8 | 7.1 | 69.5 | 5.5 | 64.1 | 5.9 | 67.2 | 5.2 | 71.0 | 3.8 |
| W4A4KV4 | | | | | | | | | | |
| RTN | 38.2 | 1e3 | 35.2 | 1e5 | 37.1 | 2e3 | 35.4 | 7e3 | 35.0 | 2e5 |
| SmoothQuant | 38.7 | 1e3 | 52.4 | 22.1 | 39.0 | 6e2 | 40.5 | 56.6 | 55.9 | 10.5 |
| LLM-QAT | 43.2 | 52.5 | -- | -- | 44.9 | 14.9 | -- | -- | -- | -- |
| GPTQ | 37.1 | 1e3 | 35.1 | 1e5 | 36.8 | 9e3 | 35.2 | 5e3 | 35.6 | 1e6 |
| QuaRot | 63.3 | 8.0 | 65.1 | 20.2 | 62.5 | 6.4 | 66.2 | 5.4 | 70.3 | 3.9 |
| SpinQuant | 65.2 | 7.3 | 69.3 | 5.5 | 64.0 | 5.9 | 66.9 | 5.3 | 71.2 | 3.8 |

You can download the optimized rotation matrices here.

Acknowledgement

The results reported in the paper were produced with Meta's internal LLaMA codebase. We reproduced our experiments with the HuggingFace codebase and release that code here; it is partially based on HuggingFace transformers, QuaRot, QuIP#, and Optimization-on-Stiefel-Manifold-via-Cayley-Transform.

Contact

Zechun Liu, Reality Labs, Meta Inc (zechunliu at meta dot com)

Changsheng Zhao, Reality Labs, Meta Inc (cszhao at meta dot com)

Relevant Projects

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases [Paper] [Code]

LLM-QAT: Data-Free Quantization Aware Training for Large Language Models [Paper] [Code]

License

SpinQuant is CC-BY-NC 4.0 licensed as of now.