Home

Awesome

Quamba: A Post-Training Quantization Recipe for Selective State Space Models

Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Diana Marculescu

arXiv Project Page

⚡8-bit quantization (W8A8) for mamba blocks 🚀1.7 $\times$ speedup on Orin Nano 8G 🔻 2 $\times$ memory reduction Quamba

Real-time Generation on a NVIDIA Orin Nano 8G

Quamba

Setup

Hardware Requirements

Software Requirements

Clone Quamba

git clone --recurse-submodules git@github.com:enyac-group/Quamba.git

To build the docker image with customized kernels, run the following commands:

cd docker
./build_docker.sh
./run.sh # launch the container

Or Pull the pre-built docker image by

docker image pull hychiang/quamba-cuda-12.1:latest
cd Quamba
conda create -n quamba python=3.10
conda activate quamba
pip install -r requirements.txt

Build 3rd-party Libraries

# set force build to include 12N, 40N from the newer commit
export FAST_HADAMARD_TRANSFORM_FORCE_BUILD=TRUE
pip install 3rdparty/fast-hadamard-transform
# lm_eval-0.4.2 word2number-1.1
pip install 3rdparty/lm-evaluation-harness
# set force build to use the commit for Quamba
export MAMBA_FORCE_BUILD=TRUE
pip install 3rdparty/mamba
# cmake version >= 3.22.1
bash build_cutlass.sh

Build Quamba

pip install .

Generate

To generate the sentence from Mamba (FP16) given an input prompt:

python generate.py state-spaces/mamba-130m --prompt "My cat wrote all this CUDA code for a new language model and" --topp 0.9 --temperature 0.7 --repetition_penalty 1.2

To generate the sentence from Qamba (Int8) given an input prompt:

python generate.py state-spaces/mamba-130m --prompt "My cat wrote all this CUDA code for a new language model and" --topp 0.9 --temperature 0.7 --repetition_penalty 1.2 --quantize --act_scales_cache mamba-130m_scales.pt

Chat

To chat with Mamba (FP16), use the command:

python chat.py  --cache_graph

To chat with Quamba (Int8), use the command:

python chat.py  --cache_graph --act_scales_cache mamba-2.8b_scales_chat.pt  --quantize

Profile latency and memory

python profile_mamba.py state-spaces/mamba-2.8b  --act_scales_cache mamba-2.8b_scales.pt --prompt_len 512 --ttft
python profile_mamba.py state-spaces/mamba-2.8b  --act_scales_cache mamba-2.8b_scales.pt --tpot
python profile_mamba.py state-spaces/mamba-2.8b  --act_scales_cache mamba-2.8b_scales.pt --prompt_len 512 --gen_len 512 --ttlt
python profile_mamba.py state-spaces/mamba-2.8b  --act_scales_cache mamba-2.8b_scales.pt --prompt_len 512 --gen_len 512 --size

Fake Quantization Evaluation

To evaluate the simulated quantization:

python main.py state-spaces/mamba-130m fake \
--do_hadamard \
--do_percentile_u \
--batch_size 16 \
--task_list lambada_openai \
--eval_zero_shot \
--log_dir logs

Real Quantization Evaluation

To evaluate the end-to-end quantization:

python main.py state-spaces/mamba-130m real \
--act_scales_cache mamba-130m_scales.pt \
--batch_size 1 \
--task_list lambada_openai \
--eval_zero_shot \
--log_dir logs

Citation

@article{chiang2024quamba,
  title={Quamba: A Post-Training Quantization Recipe for Selective State Space Models},
  author={Chiang, Hung-Yueh and Chang, Chi-Chih and Frumkin, Natalia and Wu, Kai-Chiang and Marculescu, Diana},
  journal={arXiv preprint arXiv:2410.13229},
  year={2024}
}