Outlier Suppression+

Official PyTorch implementation of <a href="https://arxiv.org/abs/2304.09145">Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling</a>

Overview

Outlier Suppression+ (OS+) suppresses outliers in large language models to improve quantization performance without adding any inference overhead. It first observes that outliers are asymmetric across channels and proposes channel-wise shifting, with a migration pattern, to remove the asymmetry. It then targets the concentration of outliers in certain channels and scales those channels down, with the scaling values chosen by optimizing a dedicated objective.

<p align="center"> <img src="figure/outlier_suppression_plus.png"> </p>
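
The shifting and scaling described above leave the floating-point output unchanged because the inverse transformation can be folded into the weights and bias of the next linear layer. The following is a minimal pure-Python sketch of that equivalence, not the code in this repository: all function names are hypothetical, the midpoint shift is an assumed choice for centering each channel, and the scale values are hand-picked here, whereas OS+ obtains them by optimization.

```python
# Illustrative sketch only; function names are hypothetical and this is
# not the implementation in quant_transformer.

def matmul(x, w):
    """(tokens x in_features) times (in_features x out_features)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def linear(x, w, b):
    """y = x @ w + b for a bias vector b."""
    return [[v + bj for v, bj in zip(row, b)] for row in matmul(x, w)]

def midpoint_shift(x):
    """Assumed shift: per-channel (max + min) / 2, which centers each channel."""
    return [(max(col) + min(col)) / 2 for col in zip(*x)]

def shift_scale_and_migrate(x, w, b, shift, scale):
    """Transform the activation and fold the inverse into the next layer."""
    # Activation side: x_new = (x - shift) / scale, channel by channel.
    x_new = [[(v - z) / s for v, z, s in zip(row, shift, scale)] for row in x]
    # Weight side: row i of w is multiplied by scale[i].
    w_new = [[s * wij for wij in w_row] for s, w_row in zip(scale, w)]
    # Bias side: the shift passes through the original weights.
    shift_through_w = matmul([shift], w)[0]
    b_new = [bj + zj for bj, zj in zip(b, shift_through_w)]
    return x_new, w_new, b_new

def max_abs(m):
    return max(abs(v) for row in m for v in row)

# Channel 1 is an asymmetric outlier channel (values between 60 and 100).
x = [[0.5, 60.0, -1.0], [-0.5, 100.0, 1.0]]
w = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
b = [0.1, -0.2]

shift = midpoint_shift(x)   # [0.0, 80.0, 0.0]
scale = [1.0, 4.0, 1.0]     # hypothetical hand-picked values
x_new, w_new, b_new = shift_scale_and_migrate(x, w, b, shift, scale)

y_ref = linear(x, w, b)
y_new = linear(x_new, w_new, b_new)
print(max_abs(x), max_abs(x_new))  # activation range before and after
print(y_ref, y_new)                # outputs agree up to rounding
```

In this toy case the activation range shrinks from 100 to 5 while the layer output is preserved, which is why only the weights and biases of the FP model need to change.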

We assess the efficacy of our approach under both standard and fine-grained quantization settings. Under the standard setting, OS+ achieves near-floating-point performance on 8-bit and 6-bit BERT, OPTs, BLOOM, and BLOOMZ. Under the fine-grained setting, OS+ surpasses other methods by 9.41% on 4-bit LLaMA with per-token quantization and obtains lossless results on 4-bit OPT with per-group quantization.

In the following sections, Support lists the supported models and quantization schemes; [Getting Started](#getting-started) walks through the whole procedure for running this project, including data preparation, quantization, evaluation, and export of the updated model; and Evaluation lists the configs for each table in the paper so that other researchers can reproduce the results.

Support

Getting Started

Preparation

Quantization

Run quantization for a specific model and quantization scheme. Here is a simple example for OPT-66B with 8-bit per-tensor symmetric quantization.

# opt.sh
model_size=66b
task_name=winogrande
model_path=model_path
q_config=exp/opt/int8.yaml # q_config.yaml path
is_quant=True # True: quantization; False: fp16

export CUDA_VISIBLE_DEVICES=0,1,2,3
python main.py \
    --model opt \
    --model_args pretrained=$model_path \
    --tasks $task_name \
    --batch_size 16 \
    --no_cache \
    --dtype 'float16' \
    --is_quant ${is_quant} \
    --config ${q_config} \
    2>&1 | tee experiment/opt_${model_size}_${task_name}.log

The model is selected with the --model and --model_args pretrained arguments. The quantization config is set through q_config (passed to --config), and the task is set through task_name (passed to --tasks). Running the command above should give an accuracy of 69.0.

Export for deployment

Because the method only updates the weights and biases of the floating-point model, we can export a new FP model with weaker outliers, which is convenient for further development. Here is an example that exports OPT-6.7B.

# opt.sh
model_size=6.7b
task_name=winogrande
model_path=model_path
q_config=exp/opt/int8.yaml # q_config.yaml path
is_quant=True # True: quantization; False: fp16

export CUDA_VISIBLE_DEVICES=0
python main.py \
    --model opt \
    --model_args pretrained=$model_path \
    --tasks $task_name \
    --batch_size 16 \
    --no_cache \
    --dtype 'float16' \
    --is_quant ${is_quant} \
    --is_export \
    --config ${q_config} \
    2>&1 | tee experiment/opt_${model_size}_${task_name}.log

# export.sh
model_size=6.7b   # model size
model_type=opt    # model type
model_path=model_path   # original model path
output_path=output_path # new FP model path
scale_shift_list=exp/opt/scale_shift_list.pth # scaling and shifting values path

export CUDA_VISIBLE_DEVICES=0
python quant_transformer/solver/export.py \
    --model_path $model_path \
    --scale_shift_list $scale_shift_list \
    --model_type $model_type \
    --output_path $output_path

Evaluation

Introduction of config.yaml

quant: 
    a_qconfig: # quantization details for activation
        quantizer: FixedFakeQuantize  # quantizer type
        observer: AvgMinMaxObserver  # calibration methods
        bit: 8 # bit selection
        symmetric: True  # True: symmetric quantization, False: asymmetric one
        ch_axis: -1  # -1: per-layer quantization
    w_qconfig: # quantization details for weight
        quantizer: FixedQuantize # Quantizer type
        observer: MinMaxObserver # calibration methods
        bit: 8 # bit selection
        symmetric: True # True: symmetric quantization, False: asymmetric one
        ch_axis: -1  # 0: per-channel quantization, -1: per-layer one
    calibrate: 128 # calibration size
    calibrate_path: /mnt/lustre/weixiuying.vendor/datasets/nlp_datasets/pile_cali # calibration dataset path, make sure there is _cali in the name
    except_quantizer: null
    is_remove_padding: True # True: remove [PAD] during calibration
    migrate: True # True: shifting and scaling operations, False: no shifting and scaling operations.
model:
    max_length: 512 # For PIQA, Winogrande tasks, 512 is enough. For WikiText2, a longer one can improve FP16 results.
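
To make the bit, symmetric, and ch_axis fields above concrete, here is a minimal pure-Python sketch of symmetric fake quantization with a min-max style scale. It is a generic illustration under our own assumptions, not the FixedFakeQuantize or observer classes of this repository, and the function names are hypothetical.

```python
# Illustrative sketch only; names are hypothetical and this is not the
# quantizer or observer code used by the project.

def symmetric_scale(values, bit=8):
    """Scale for symmetric quantization from the largest magnitude observed."""
    qmax = 2 ** (bit - 1) - 1                 # 127 for bit=8
    largest = max(abs(v) for v in values)
    return largest / qmax if largest > 0 else 1.0

def fake_quantize(values, scale, bit=8):
    """Round to the integer grid, clamp, then map back to floating point."""
    qmin, qmax = -(2 ** (bit - 1)), 2 ** (bit - 1) - 1
    return [max(qmin, min(qmax, round(v / scale))) * scale for v in values]

# ch_axis: -1 (per-layer): one scale for the whole tensor.
# A single outlier (100.0) sets the scale, so small values collapse to zero.
activation = [0.02, -0.03, 0.5, 100.0]
act_scale = symmetric_scale(activation)
act_dequantized = fake_quantize(activation, act_scale)

# ch_axis: 0 (per-channel): one scale per weight row.
weight = [[0.01, -0.02, 0.03], [1.0, -2.0, 3.0]]
flat = [v for row in weight for v in row]
per_layer = fake_quantize(weight[0], symmetric_scale(flat))
per_channel = fake_quantize(weight[0], symmetric_scale(weight[0]))

def max_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

print(act_dequantized)
print(max_error(per_layer, weight[0]), max_error(per_channel, weight[0]))
```

The first example shows why outliers hurt per-layer activation quantization; the second shows how a finer granularity reduces error for rows with small magnitudes.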

Standard quantization with OS+

Fine-grained quantization with OS+

Here, OS+ is combined with fine-grained quantization to validate its broad applicability and to reach extremely low-bit settings such as 4-bit quantization.
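
As a rough illustration of what "fine-grained" means here, the sketch below compares a single 4-bit scale for a whole row with per-group scales, and shows per-token quantization as one scale per activation row. It is a generic sketch under our own assumptions (symmetric quantization, hypothetical function names, a toy group size of 4), not the scheme implemented in this repository.

```python
# Illustrative sketch only; names and the group size are hypothetical.

def quantize_block(values, bit=4):
    """Symmetric fake quantization of one block that shares a single scale."""
    qmin, qmax = -(2 ** (bit - 1)), 2 ** (bit - 1) - 1
    largest = max(abs(v) for v in values)
    scale = largest / qmax if largest > 0 else 1.0
    return [max(qmin, min(qmax, round(v / scale))) * scale for v in values]

def quantize_per_group(row, group_size, bit=4):
    """Per-group: each run of group_size consecutive values gets its own scale."""
    out = []
    for start in range(0, len(row), group_size):
        out.extend(quantize_block(row[start:start + group_size], bit))
    return out

def quantize_per_token(x, bit=4):
    """Per-token: each activation row (one token) gets its own scale."""
    return [quantize_block(row, bit) for row in x]

def max_error(a, b):
    return max(abs(p - q) for p, q in zip(a, b))

row = [0.1, -0.2, 0.3, 0.25, 7.0, -6.0, 5.0, 4.0]
coarse = quantize_block(row)            # one scale for the whole row
fine = quantize_per_group(row, 4)       # one scale per group of 4

tokens = [[0.1, -0.2, 0.3], [7.0, -6.0, 5.0]]
per_token = quantize_per_token(tokens)

print(max_error(coarse, row), max_error(fine, row))
```

With one shared scale, the small values in the first half of the row are rounded to zero; with per-group scales they keep most of their precision.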

Reference

If you find this repo useful for your research, please consider citing the paper:

@article{wei2023outlier,
    title={Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling},
    author={Wei, Xiuying and Zhang, Yunchen and Li, Yuhang and Zhang, Xiangguo and Gong, Ruihao and Guo, Jinyang and Liu, Xianglong},
    journal={arXiv preprint arXiv:2304.09145},
    year={2023}
}