Reduced-precision MX-format Framework for LLM inference
Getting Started
Our experiments were run on an NVIDIA A100 GPU with CUDA Toolkit 11.8 and PyTorch 2.1.2.
git clone https://github.com/aiha-lab/MX-QLLM.git
cd MX-QLLM && bash setup.sh
Shared Scale and Element Format
Set Shared Scale
- PoT (MX Default Setting): scale_mode=0
- PoT-R: scale_mode=3
- FP16: scale_mode=2
- FP8 (E5M2): scale_mode=152
- FP8 (E4M3): scale_mode=143
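As a rough illustration of what a power-of-two (PoT) shared scale means, the sketch below computes one scale per block using the convention from the Microscaling paper listed in References: the scale exponent is the exponent of the largest magnitude in the block minus the largest exponent of the element format. The function name `pot_shared_scale` and the constant `FP4_E2M1_EMAX` are hypothetical and not taken from this repository, and the code path behind scale_mode=0 may differ in details (for example, clamping the exponent to the 8-bit scale range is omitted here).

```python
import math

# Largest FP4 (E2M1) magnitude is 6 = 1.5 * 2**2, so its maximum exponent is 2.
FP4_E2M1_EMAX = 2  # assumption: element format is FP4 (E2M1)


def pot_shared_scale(block, elem_emax=FP4_E2M1_EMAX):
    """Return a power-of-two scale shared by every value in `block`."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0  # all-zero block: the scale value does not matter
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    _, e = math.frexp(amax)
    shared_exp = (e - 1) - elem_emax
    return 2.0 ** shared_exp


if __name__ == "__main__":
    print(pot_shared_scale([0.3, -5.0, 1.2]))  # largest magnitude is 5.0
    print(pot_shared_scale([0.1, 0.02]))       # largest magnitude is 0.1
```

Because the scale is restricted to powers of two, the scaled maximum can land slightly above the largest element value and saturate. The non-PoT shared-scale options listed above offer finer-grained scale values.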
Set Element Format
- FP8 (E4M3): fp8_e4m3
- FP6 (E3M2): fp6_e3m2
- FP4 (E2M1): fp4_e2m1
- AsymFP4 (E2M1): fp4_e2m1_asym
- INT4: int4
- AsymINT4: int4_asym
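To make the element formats more concrete, here is a minimal sketch of fake-quantizing one block to the symmetric FP4 (E2M1) grid given a shared scale. The grid values (0, 0.5, 1, 1.5, 2, 3, 4, 6) are those of the E2M1 format. The function name `quantize_fp4_e2m1` is hypothetical, and the tie-breaking rule is a simplification rather than this repository's behavior. The asymmetric variants are not covered.

```python
import math

# Non-negative magnitudes representable in FP4 (E2M1).
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_fp4_e2m1(block, scale):
    """Quantize and dequantize `block` to FP4 (E2M1) with one shared scale."""
    out = []
    for v in block:
        # Scale into the element range and saturate at the largest magnitude.
        mag = min(abs(v) / scale, FP4_E2M1_GRID[-1])
        # Nearest grid point; exact ties go to the smaller magnitude here,
        # whereas hardware-style rounding would typically round to even.
        q = min(FP4_E2M1_GRID, key=lambda g: abs(g - mag))
        out.append(math.copysign(q, v) * scale)
    return out


if __name__ == "__main__":
    print(quantize_fp4_e2m1([0.3, -4.7, 1.2, 7.5], scale=1.0))
```

With a scale of 1.0, the value 7.5 saturates to 6.0, which shows why the choice of shared scale matters for blocks that contain outliers.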
Example Usage
All arguments are set in scripts/run.sh.
bash scripts/run.sh [DEVICE_NUM] [MODEL_PATH]
# e.g. bash scripts/run.sh 0 LLMDIR/llama2-7b
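If you prefer to launch runs from Python (for example, to queue several models), the command above can be wrapped as in the sketch below. The helper names `build_run_command` and `launch` are hypothetical and not part of this repository. The sketch assumes it is executed from the repository root so that scripts/run.sh resolves.

```python
import subprocess


def build_run_command(device_num, model_path, script="scripts/run.sh"):
    """Build the argv list for: bash scripts/run.sh [DEVICE_NUM] [MODEL_PATH]."""
    return ["bash", script, str(device_num), model_path]


def launch(device_num, model_path):
    """Run the script and raise CalledProcessError if it exits non-zero."""
    return subprocess.run(build_run_command(device_num, model_path), check=True)


if __name__ == "__main__":
    # Only prints the command; call launch(...) to actually start a run.
    print(" ".join(build_run_command(0, "LLMDIR/llama2-7b")))
```

The format and scale_mode settings still have to be edited inside scripts/run.sh, since the script takes only the device number and the model path as positional arguments.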
MXFP4-PoT
...
for format in fp4_e2m1
do
...
for scale_mode in 0
do
...
AMXFP4-FP8
...
for format in fp4_e2m1_asym
do
...
for scale_mode in 152
do
...
AMXFP4-FP8 with Randomized Hadamard Rotation
...
quarot=true
rotate_mode=hadamard
rotate_kv=true
kv_quant_only=false
kv_tokenwise=false
...
for format in fp4_e2m1_asym
do
...
for scale_mode in 152
do
...
References
MX PyTorch Emulation Library (https://github.com/microsoft/microxcaling)
@misc{rouhani2023microscalingdataformatsdeep,
title={Microscaling Data Formats for Deep Learning},
author={Bita Darvish Rouhani and Ritchie Zhao and Ankit More and Mathew Hall and Alireza Khodamoradi and Summer Deng and Dhruv Choudhary and Marius Cornea and Eric Dellinger and Kristof Denolf and Stosic Dusan and Venmugil Elango and Maximilian Golub and Alexander Heinecke and Phil James-Roxby and Dharmesh Jani and Gaurav Kolhe and Martin Langhammer and Ada Li and Levi Melnick and Maral Mesmakhosroshahi and Andres Rodriguez and Michael Schulte and Rasoul Shafipour and Lei Shao and Michael Siu and Pradeep Dubey and Paulius Micikevicius and Maxim Naumov and Colin Verrilli and Ralph Wittig and Doug Burger and Eric Chung},
year={2023},
eprint={2310.10537},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2310.10537},
}
QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs (https://github.com/spcl/QuaRot)
@article{ashkboos2024quarot,
title={QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs},
author={Ashkboos, Saleh and Mohtashami, Amirkeivan and Croci, Maximilian L and Li, Bo and Jaggi, Martin and Alistarh, Dan and Hoefler, Torsten and Hensman, James},
journal={arXiv preprint arXiv:2404.00456},
year={2024}
}