<div align="center"> <h1><img src="docs/_static/logo.png" height="28px" /> BMCook</h1>

Model Compression for Big Models

</div> <p align="center"> <a href="#overview">Overview</a> • <a href="#documentation">Documentation</a> • <a href="#install">Installation</a> • <a href="#usage">Usage</a> • <a href="#quick-start">Quick Start</a> • <a href="./README-ZH.md" target="_blank">简体中文</a> <br> </p> <p align="center"> <a href='https://bmcook.readthedocs.io/en/main/'> <img src='https://readthedocs.org/projects/bmcook/badge/?version=main' alt='doc' /> </a> <a href="https://github.com/OpenBMB/BMCook/blob/main/LICENSE"> <img alt="github" src="https://img.shields.io/github/license/OpenBMB/BMCook"> </a> <a> <img alt="version" src="https://img.shields.io/badge/version-0.1.0-blue"> </a> </p>

What's New

<div id="overview"></div>

Overview

BMCook is a model compression toolkit for large-scale pre-trained language models (PLMs) that integrates multiple model compression methods. You can combine them in any way to achieve the desired speedup. Specifically, we implement the following four model compression methods: knowledge distillation, model pruning, model quantization, and model MoEfication. It has the following features:

<div id="documentation"></div>

Documentation

Our documentation provides more information about the package.

<div id="install"></div>

Installation

To use BMCook, first install BMTrain.

From PyPI (Recommended)

$ pip install bmtrain

From Source

$ git clone https://github.com/OpenBMB/BMTrain.git
$ cd BMTrain
$ python3 setup.py install

Please refer to the installation guide of BMTrain for more details.

Then, install BMCook.

From PyPI (Recommended)

$ pip install bmcook

From source

$ git clone git@github.com:OpenBMB/BMCook.git
$ cd BMCook
$ python3 setup.py install

<div id="usage"></div>

Usage

1. Design your BMCook config.

Provide a JSON file that specifies your compression strategy.

{
  "distillation": {
    "ce_scale": 0,
    "ce_temp": 1,

    "mse_hidn_scale": 0,
    "mse_hidn_module": ["[placehold]"],
    "mse_hidn_proj": false,

    "mse_att_scale": 0,
    "mse_att_module": ["[placehold]"]
  },

  "pruning": {
    "is_pruning": false,
    "pruning_mask_path": null,
    "pruned_module": ["[placehold]"],
    "mask_method": "m4n2_1d/m4n2_2d/sprune",
    "sprune": {
        "criterion": "l0",
        "training_mask": ["[placehold]"],
        "fixed_mask_path": "",
        "mask_mode": "train_mask",
        "target_sparsity": 0.5
    }
  },

  "quantization": {
    "is_quant": false,
    "quantized_module": []
  },

  "MoEfication": {
    "is_moefy": false,
    "first_FFN_module": ["[placehold]"]
  }
}
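Because the file must be valid JSON, one way to avoid syntax slips (Python-style `None`, single quotes, trailing commas) is to build the strategy as a Python dict and serialize it with the standard `json` module. The sketch below is only an illustration: the keys mirror the template above, while the helper name `write_cook_config` and the output path are hypothetical and not part of BMCook.

```python
import json
import tempfile
from pathlib import Path

# Keys mirror the template above; "[placehold]" entries are placeholders
# that you would replace with module names from your own model.
cook_config = {
    "distillation": {
        "ce_scale": 0,
        "ce_temp": 1,
        "mse_hidn_scale": 0,
        "mse_hidn_module": ["[placehold]"],
        "mse_hidn_proj": False,
        "mse_att_scale": 0,
        "mse_att_module": ["[placehold]"],
    },
    "pruning": {
        "is_pruning": False,
        "pruning_mask_path": None,
        "pruned_module": ["[placehold]"],
        # Template value kept verbatim; it appears to list the available options.
        "mask_method": "m4n2_1d/m4n2_2d/sprune",
        "sprune": {
            "criterion": "l0",
            "training_mask": ["[placehold]"],
            "fixed_mask_path": "",
            "mask_mode": "train_mask",
            "target_sparsity": 0.5,
        },
    },
    "quantization": {"is_quant": False, "quantized_module": []},
    "MoEfication": {"is_moefy": False, "first_FFN_module": ["[placehold]"]},
}


def write_cook_config(config, path):
    """Serialize the config dict to a JSON file and return its path."""
    path = Path(path)
    # json.dumps maps Python None/False/True to JSON null/false/true.
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return path


# Hypothetical output location; a temporary directory keeps the example side-effect free.
config_path = write_cook_config(cook_config, Path(tempfile.mkdtemp()) / "cook_config.json")

# Round-trip check: the file parses as JSON and matches the original dict.
assert json.loads(config_path.read_text(encoding="utf-8")) == cook_config
```

The resulting file path is what you would later pass via `--cook-config`.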

To notice:

2. Basic usage in your code.

BMCook provides a unified interface, CookTrainer. Distillation, pruning, and MoEfication may add extra terms to the model outputs; CookTrainer manages your model together with these modifications.

from bmcook import CookTrainer
from bmcook.utils.config import ConfigParser

# prepare your model, dataloader and optimizer...
...

# set up your BMCook strategy
CookTrainer.set_compression(cookconfig, model, optimizer, model_distill)

# train
for data in dataloader:
    targets = ...
    ...
    outputs = CookTrainer.forward(model, loss_func, targets, *your_model_inputs, **your_model_kwinputs)

    [loss, model_outputs, lag_loss, sparsity, distill_loss] = outputs

The returned loss is the sum of model_loss, lag_loss, and distill_loss, so to measure the model's own performance, subtract the extra terms. Note that if sprune is not enabled, lag_loss will be None, and the same applies to distill_loss when distillation is not used.

model_loss = loss - lag_loss - distill_loss # both sprune and distillation enabled
model_loss = loss - distill_loss # only distillation enabled
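The subtraction above can be wrapped in a small helper that tolerates the `None` case. This is a minimal sketch: `recover_model_loss` is a hypothetical name, not a BMCook function, and it assumes the loss terms support ordinary subtraction (plain floats are used here as stand-ins).

```python
def recover_model_loss(loss, lag_loss=None, distill_loss=None):
    """Subtract the compression-related terms from the total loss.

    Terms that are None (the corresponding feature is not enabled) are skipped.
    """
    model_loss = loss
    for extra in (lag_loss, distill_loss):
        if extra is not None:
            model_loss = model_loss - extra
    return model_loss


# Plain floats stand in for the values unpacked from the trainer's outputs.
print(recover_model_loss(2.5, lag_loss=0.5, distill_loss=1.0))  # sprune and distillation
print(recover_model_loss(2.5, distill_loss=1.0))                # distillation only
```

Skipping `None` terms in one place avoids maintaining a separate formula for each combination of enabled methods.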

BMCook also provides discrete interfaces for initializing individual compression settings. If you want to design a Trainer for your own needs, you can use these interfaces directly. Note that the output format of your own Trainer should match that of CookTrainer. For details on extending CookTrainer, refer to CPMAntTrainer.

from bmcook import BMDistill

# Define your own Trainer. 
Trainer = ...

# Set up the distillation
Trainer.forward = BMDistill.set_forward(model, teacher, Trainer.forward, cook_config)

3. How to run your code

Run your code as usual, and pass the path to your cook config:

    torchrun --nnodes=1 --nproc_per_node=1 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost train.py \
     --save-dir ... \
     --model ... \
     --start-lr ... \
     --cook-config  ...  # give your cook config path
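
On the script side, `train.py` needs to accept these flags. The following is a minimal sketch using Python's standard `argparse`; the flag names come from the command above, but the types, defaults, and help strings are assumptions, and an actual training script may define them differently.

```python
import argparse


def build_parser():
    """Build a parser for the flags shown in the launch command (assumed types)."""
    parser = argparse.ArgumentParser(description="Training script with a BMCook config")
    parser.add_argument("--save-dir", type=str, required=True,
                        help="directory for checkpoints and logs")
    parser.add_argument("--model", type=str, required=True,
                        help="name or path of the model to compress")
    parser.add_argument("--start-lr", type=float, default=1e-4,
                        help="initial learning rate (default is an assumption)")
    parser.add_argument("--cook-config", type=str, required=True,
                        help="path to the BMCook JSON config")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args([
    "--save-dir", "results/gpt2-int8",
    "--model", "gpt2-base",
    "--start-lr", "1e-4",
    "--cook-config", "configs/gpt2-int8.json",
])
print(args.cook_config)
```

`argparse` converts dashes in flag names to underscores, so `--cook-config` is read back as `args.cook_config`.
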
<div id="quick-start"></div>

Quick Start

The examples folder provides pruning examples based on CPM-Live, GPT2-Base, and T5-large. Please check the examples for more details.

Take GPT2 as an example:

Quantization-aware training:

    torchrun --nnodes=1 --nproc_per_node=1 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost train.py \
     --save-dir results/gpt2-int8 \
     --model gpt2-base \
     --start-lr 1e-4 \
     --cook-config configs/gpt2-int8.json

Quantization-aware training with knowledge distillation:

    torchrun --nnodes=1 --nproc_per_node=1 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost train.py \
     --save-dir results/gpt2-int8-kd \
     --model gpt2-base \
     --start-lr 1e-4 \
     --cook-config configs/gpt2-int8-kd.json

Model pruning:

    torchrun --nnodes=1 --nproc_per_node=1 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost train.py \
     --save-dir results/gpt2-prune \
     --model gpt2-base \
     --start-lr 1e-4 \
     --cook-config configs/gpt2-prune.json

In this case, we only prune the input embedding layer. You can include more modules by changing the pruned_module field in the config file.

MoEfication (save the hidden states and then use the MoEfication toolkit):

    torchrun --nnodes=1 --nproc_per_node=1 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost train.py \
     --save-dir results/gpt2-moe \
     --model gpt2-base \
     --start-lr 1e-4 \
     --cook-config configs/gpt2-moe.json

Combine quantization, pruning and knowledge distillation:

    torchrun --nnodes=1 --nproc_per_node=1 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost train.py \
     --save-dir results/gpt2-combine \
     --model gpt2-base \
     --start-lr 1e-4 \
     --cook-config configs/gpt2-combine.json

Performances

Based on T5-3B, we evaluate different combinations of compression techniques. The corpus for compression is the Pile. The evaluation datasets include SST-2, MNLI, and SQuAD. Specifically, we freeze the compressed models and adopt adapter-tuning.

| | Average Performance | Relative Performance | Speedup |
|---|---|---|---|
| T5-3B | 0.9258 | - | 1x |
| T5-Base | 0.8796 | 95.0% | 14x |
| T5-3B (P+D) | 0.9150 | 98.8% | 2x |
| T5-3B (P+D+Q) | 0.9126 | 98.6% | 8x |
| T5-3B (P+D+Q+M) | 0.9017 | 97.4% | 12x |

D denotes knowledge distillation. P denotes pruning. Q denotes quantization. M denotes MoEfication.
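
The Relative Performance column appears to be each model's average performance divided by that of the uncompressed T5-3B; this definition is inferred from the numbers rather than stated explicitly. A quick check in Python, using the values from the table above:

```python
# Average performance values copied from the table above.
average_performance = {
    "T5-3B": 0.9258,
    "T5-Base": 0.8796,
    "T5-3B (P+D)": 0.9150,
    "T5-3B (P+D+Q)": 0.9126,
    "T5-3B (P+D+Q+M)": 0.9017,
}

# Uncompressed T5-3B is taken as the baseline (assumed definition).
baseline = average_performance["T5-3B"]


def relative_performance(score, baseline_score):
    """Return score as a percentage of the baseline, rounded to one decimal."""
    return round(100.0 * score / baseline_score, 1)


for name, score in average_performance.items():
    print(f"{name}: {relative_performance(score, baseline):.1f}%")
```

Under this assumed definition, the computed percentages agree with the reported column for every compressed model.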

Comparisons

Model QuantizationModel PruningKnowledge DistillationModel MoEfication
TextPruner---
TensorFlow Lite--
PyTorch--
TextBrewer--
BMCook

Community

We welcome everyone to contribute code following our contributing guidelines.

You can also find us on other platforms:

License

The package is released under the Apache 2.0 License.

Contributors

We thank Zhengyan Zhang, Baitao Gong, Yingfa Chen, Guoyang Zeng, Jie Zhou, and Zhi Zheng for their contributions. More contributors are welcome!