<h1 align="center">ABQ-LLM</h1>
<p align="center">ABQ-LLM is a novel arbitrary-bit quantization scheme that achieves excellent performance under various quantization settings while enabling efficient arbitrary-bit computation at the inference level.</p>

The current release of ABQ-LLM supports the features described in the sections below.

Contents

Install

Installation of the algorithmic runtime environment

conda create -n abq-llm python=3.10.0 -y
conda activate abq-llm
git clone https://github.com/bytedance/ABQ-LLM.git
cd ./ABQ-LLM/algorithm
pip install --upgrade pip 
pip install -r requirements.txt

Installation of the inference engine environment

You can compile and test our quantized inference kernels, but you first need to install the CUDA Toolkit.

  1. Install the CUDA Toolkit (11.8 or 12.1, Linux or Windows). Use the Express Installation option. Installation may require a restart (Windows).
  2. Clone CUTLASS (used only for speed comparison):
git submodule init 
git submodule update

ABQ-LLM Model

We provide a pre-trained ABQ-LLM model zoo for multiple model families, including LLaMA-1&2 and OPT. In the tables below, WxAy denotes x-bit weights and y-bit activations, and gN denotes a quantization group size of N. The detailed support list:

| Models  | Sizes  | W4A16 | W3A16 | W2A16 | W2A16g128 | W2A16g64 |
|---------|--------|:-:|:-:|:-:|:-:|:-:|
| LLaMA   | 7B/13B | ✅ | ✅ | ✅ | ✅ | ✅ |
| LLaMA-2 | 7B/13B | ✅ | ✅ | ✅ | ✅ | ✅ |

| Models  | Sizes  | W8A8 | W4A8 | W6A6 | W4A6 | W4A4 | W3A8 | W3A6 | W2A8 | W2A6 |
|---------|--------|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| LLaMA   | 7B/13B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| LLaMA-2 | 7B/13B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |

Usage

Algorithm Testing

We provide the pre-trained ABQ-LLM model weights on Hugging Face; you can verify model performance with the following commands.

CUDA_VISIBLE_DEVICES=0 python run_pretrain_abq_model.py \
--model /PATH/TO/LLaMA/llama-7b-ABQ \
--wbits 4 --abits 4

We also provide full scripts for running ABQ-LLM in ./algorithm/scripts/. We use LLaMA-7B as an example here:

  1. Obtain the channel-wise scales and shifts required for initialization:
python generate_act_scale_shift.py --model /PATH/TO/LLaMA/llama-7b
  2. Weight-only quantization
# W3A16
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/LLaMA/llama-7b  \
--epochs 20 --output_dir ./log/llama-7b-w3a16 \
--eval_ppl --wbits 3 --abits 16  --lwc --let

# W3A16g128
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/LLaMA/llama-7b  \
--epochs 20 --output_dir ./log/llama-7b-w3a16g128 \
--eval_ppl --wbits 3 --abits 16 --group_size 128 --lwc --let
  3. Weight-activation quantization
# W4A4
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/LLaMA/llama-7b  \
--epochs 20 --output_dir ./log/llama-7b-w4a4 \
--eval_ppl --wbits 4 --abits 4 --lwc --let \
--tasks piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande
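
The per-channel scales and shifts obtained in step 1 can be illustrated with a small sketch. This is a hedged illustration only, not the actual logic of generate_act_scale_shift.py: it shows the kind of per-channel activation statistics (a scale from the dynamic range, a shift from the mean) typically used to initialize quantization.

```python
import numpy as np

# Hedged illustration -- NOT the actual generate_act_scale_shift.py code.
def act_scale_shift(X):
    """X: activations with shape (tokens, channels)."""
    scale = np.abs(X).max(axis=0)   # per-channel dynamic range
    shift = X.mean(axis=0)          # per-channel offset
    return scale, shift

X = np.array([[1.0, -4.0],
              [3.0,  2.0]])
scale, shift = act_scale_shift(X)
# scale -> [3. 4.], shift -> [ 2. -1.]
```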

More detailed and optional arguments:

Kernel Benchmark

  1. Compile Kernels.

By default, the w2a2, w3a3, w4a4, w5a5, w6a6, w7a7, and w8a8 kernels are compiled, as well as the kernels for the w2a4, w2a6, w2a8, and w4a8 quantization combinations. Each quantization scheme corresponds to dozens of kernel implementations that together form its search space.

# linux
cd engine
bash build.sh

# windows
cd engine
build.bat
  2. Comprehensive benchmark.

For the typical GEMM operations of the LLaMA model, the different quantization combinations (w2a2, w3a3, w4a4, w5a5, w6a6, w7a7, w8a8, w2a4, w2a6, w2a8, w4a8) are benchmarked to find the optimal configuration within each combination's search space.

# linux
bash test.sh
# windows
test.bat
  3. Add new quantization combinations (optional).

We restructured quantized matrix multiplication by decomposing it into a series of binary matrix multiplications, with a high degree of template and compute-model abstraction.

Based on these abstractions, you can quickly extend our code to support new quantization combinations such as WpAq: you only need to add the WpAq instantiation definition and declaration files under engine/mma_any/aq_wmma_impl and recompile.

The attainable performance depends on how the search space is defined (i.e., which function configurations are instantiated). For guidance, refer to the paper or to the existing implementations in that directory.
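
The decomposition described above can be sketched in a few lines of NumPy. This is a hedged illustration for unsigned operands, not the actual CUDA kernels: a p-bit by q-bit integer GEMM is rewritten as p·q binary (1-bit) matrix multiplications, each weighted by a power of two.

```python
import numpy as np

def binary_decomposed_matmul(W, A, p, q):
    """Compute W @ A as sum_{i,j} 2^(i+j) * (bit_i(W) @ bit_j(A))."""
    acc = np.zeros((W.shape[0], A.shape[1]), dtype=np.int64)
    for i in range(p):                # weight bit-planes
        Wi = (W >> i) & 1
        for j in range(q):            # activation bit-planes
            Aj = (A >> j) & 1
            acc += (1 << (i + j)) * (Wi @ Aj)
    return acc

rng = np.random.default_rng(0)
p, q = 2, 4                                   # e.g. a w2a4 combination
W = rng.integers(0, 2**p, size=(4, 8))        # 2-bit weight matrix
A = rng.integers(0, 2**q, size=(8, 3))        # 4-bit activation matrix
assert np.array_equal(binary_decomposed_matmul(W, A, p, q), W @ A)
```

The binary products map naturally onto 1-bit Tensor Core operations, which is what makes arbitrary bit widths practical at the kernel level.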

E2E Benchmark

  1. Compile FasterTransformer:
cd fastertransformer
bash build.sh
  2. Configure llama (change the precision in examples/cpp/llama/llama_config.ini):
fp16:  int8_mode=0
w8a16: int8_mode=1
w8a8:  int8_mode=2
w4a16: int8_mode=4
w2a8:  int8_mode=5
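
For example, to run the w2a8 engine, the relevant lines in llama_config.ini would look like the excerpt below. This is illustrative only; consult the file in your checkout for the exact section and key layout (the int8_mode values come from the mapping above):

```ini
; examples/cpp/llama/llama_config.ini (illustrative excerpt)
int8_mode=5          ; w2a8, per the mapping above
tensor_para_size=1   ; set to 2 for the multi-GPU run in step 4
```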
  3. Run llama on a single GPU:
cd build_release
./bin/llama_example
  4. (Optional) Run on multiple GPUs. Set tensor_para_size=2 in examples/cpp/llama/llama_config.ini:
cd build_release
mpirun -n 2 ./bin/llama_example

Results

Related Projects

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers

RPTQ: Reorder-Based Post-Training Quantization for Large Language Models

OmniQuant: A simple and powerful quantization technique for LLMs

Citation

If you use our ABQ-LLM approach in your research, please cite our paper:

@article{zeng2024abq,
  title={ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models},
  author={Zeng, Chao and Liu, Songwei and Xie, Yusheng and Liu, Hong and Wang, Xiaojian and Wei, Miao and Yang, Shu and Chen, Fangmin and Mei, Xing},
  journal={arXiv preprint arXiv:2408.08554},
  year={2024}
}

Star History

Star History Chart