Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM [Paper]
Features and milestones:
- DGQ algorithm for A8W4 models.
- Memory-efficient linear layers for fake quantization (FakeQuant) in PyTorch (see the sketch after this list).
- Efficient CUTLASS kernel implementation for fast inference.
- Edge device support. [We are working on it.]
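The fake-quant path can be illustrated with a minimal PyTorch sketch of the dual-grained idea: weights are quantized to INT4 with fine-grained group scales, and those group scales are themselves stored as small integers relative to a coarse per-output-channel scale. The function name, group layout, and bit widths below are illustrative assumptions, not the repo's API:

import torch

def dual_grained_fake_quant(w: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Fake-quantize a weight matrix [out_features, in_features]:
    INT4 values with per-group scales, where the group scales are themselves
    stored as small integers relative to a per-output-channel FP scale."""
    out_ch, in_ch = w.shape
    assert in_ch % group_size == 0, "in_features must be divisible by group_size"
    g = w.reshape(out_ch, in_ch // group_size, group_size)

    # Fine-grained: one symmetric INT4 scale per (channel, group).
    s_fine = g.abs().amax(dim=-1, keepdim=True) / 7.0                      # [out, groups, 1]

    # Coarse-grained: one FP scale per output channel; re-quantize the
    # fine-grained scales to 8-bit integers against it.
    s_coarse = (s_fine.amax(dim=1, keepdim=True) / 255.0).clamp_min(1e-8)  # [out, 1, 1]
    s_fine_int = (s_fine / s_coarse).round().clamp(1, 255)                 # integer group scales

    # Quantize weights with the reconstructed (integer * FP) scale, then dequantize.
    scale = s_fine_int * s_coarse
    q = (g / scale).round().clamp(-8, 7)
    return (q * scale).reshape(out_ch, in_ch)

In the real-quant path, the INT4 values and integer group scales stay in integer form, which is what allows an A8W4 kernel to dequantize INT4 to INT8 with integer arithmetic before the GEMM; this is our reading of the paper, and details may differ from the shipped CUTLASS kernels.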
Install
conda create -n dgq python=3.10 -y
conda activate dgq
pip install --upgrade pip # enable PEP 660 support
pip install -e .
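A quick way to confirm the editable install is importable (this only assumes the package name dgq, the same one used by the entry point below):

# Sanity check: the editable install exposes the dgq package.
import dgq
print(dgq.__path__)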
Kernel install
CUDA 12.1 needs to be installed first. We recommend using the bitsandbytes script (https://raw.githubusercontent.com/TimDettmers/bitsandbytes/main/cuda_install.sh).
source environment.sh
bash build_cutlass.sh
cd dgq/kernels/
python setup.py install
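After building, a small PyTorch check confirms that the CUDA runtime is visible; this uses only standard torch calls and does not exercise the DGQ kernels themselves:

# Confirm PyTorch sees the CUDA toolkit/runtime before running the fast kernels.
import torch
print(torch.version.cuda, torch.cuda.is_available())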
Usage
We provide a sample script to run DGQ (./llama7b.sh).
- Perform DGQ quantization and save the real quantized model:
python -m dgq.entry [your-model-path] [dataset] --wt_fun search --groupsize 128 --wbits 4 --smoothquant --w4w8 --kvquant --save_safetensors [path-to-save]
- Load and evaluate the real quantized model:
python -m dgq.entry [your-model-path] [dataset] --wt_fun search --groupsize 128 --wbits 4 --smoothquant --w4w8 --kvquant --load [path-to-save] --eval
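To peek at what --save_safetensors wrote, here is a short sketch using the safetensors library; the path placeholder and tensor names are whatever your run produced, and this inspection step is not part of DGQ's CLI:

# List the first few tensors in the saved real-quant checkpoint.
from safetensors.torch import load_file

state = load_file("[path-to-save]")  # same path passed to --save_safetensors
for name, tensor in list(state.items())[:10]:
    print(name, tuple(tensor.shape), tensor.dtype)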
Reference
If you find our work useful or relevant to your research, please kindly cite our paper:
@article{zhang2023dual,
title={Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM},
author={Zhang, Luoming and Fei, Wen and Wu, Weijia and He, Yefei and Lou, Zhenyu and Zhou, Hong},
journal={arXiv preprint arXiv:2310.04836},
year={2023}
}
Acknowledgements
Our code refers to the following projects: GPTQ, GPTQ-for-LLaMA, AWQ, SmoothQuant, torch-int, FasterTransformer.