# LoSparse

<img src="asset/LoSparse_logo.png" alt="LoSparse_logo" style="zoom:10%;" />

This PyTorch package implements LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation (ICML 2023).

## Overview

A highly efficient compression method combining structured pruning and low-rank approximation.

We approximate each weight matrix during downstream-task fine-tuning by the sum of a low-rank matrix, which can be decomposed into two small matrices, and a structured sparse matrix. The approach is illustrated in the diagram below:

<img src="asset/LoSparse_diagram.png" alt="LoSparse_diagram" style="zoom:10%;" />

## Main Results

### DeBERTa-v3-base on GLUE w/o knowledge distillation

| Ratio | MNLI | RTE | QNLI | MRPC | QQP | SST-2 | CoLA | STS-B |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100%* | 90.5/90.6 | 82.0 | 94.0 | 89.5/93.3 | 92.4/89.8 | 95.3 | 69.2 | 91.6/91.1 |
| 20% | 84.5/83.8 | 68.0 | 88.6 | 85.0/89.4 | 90.6/87.2 | 91.7 | 50.0 | 88.8/88.5 |
| 15% | 83.3/82.9 | 66.9 | 87.6 | 83.6/88.0 | 90.3/87.0 | 90.4 | 46.8 | 87.7/87.3 |
| 10% | 81.7/81.8 | 66.0 | 86.1 | 82.3/87.4 | 89.5/86.0 | 89.2 | 40.0 | 87.2/87.0 |

*: full model fine-tuning

### DeBERTa-v3-base on SQuAD v1.1 w/o knowledge distillation

| Ratio | 5% | 10% | 20% | 30% | 40% | 50% |
| --- | --- | --- | --- | --- | --- | --- |
| EM/F1 | 69.3/79.1 | 72.9/82.8 | 76.8/85.8 | 80.2/88.0 | 82.1/89.4 | 82.3/90.3 |

### BART-large on CNN/DailyMail and XSum w/o knowledge distillation

| Ratio | XSum | CNN/DailyMail |
| --- | --- | --- |
| Lead-3* | 16.30/1.60/11.95 | 40.42/17.62/36.67 |
| 100%** | 45.14/22.27/37.25 | 44.16/21.28/40.90 |
| 50% | 39.18/16.91/31.62 | 41.54/19.04/38.58 |
| 40% | 38.30/16.02/30.72 | 41.42/19.00/38.47 |
| 30% | 37.41/15.42/30.02 | 41.21/18.84/38.21 |

*: Using the first 3 sentences in the document as the summary

**: full model fine-tuning

## Train

Our training scripts are based on the Hugging Face 🤗 examples. See examples here.

### Requirements

```bash
pip install -r requirements.txt
```

### Training Files

An example command for training on a GLUE dataset is:

```bash
python run_glue.py \
  --dataset_name \
  ...
```

We provide sample training scripts in `train_glue.sh`, `train_qa.sh`, and `train_summarization.sh`. Additionally, we provide a distillation script for the GLUE experiments in `train_glue_distil.sh`. You can also run a quick evaluation on the sample checkpoints we release by substituting the path for `eval_checkpoint` in `eval_glue.sh` and `eval_glue_distil.sh`.

The checkpoints are listed below.

### DeBERTa-v3-base and BERT-base on GLUE w/o knowledge distillation

| Model Name | Task | Parameter Ratio (%) | Performance |
| --- | --- | --- | --- |
| deberta_mnli_20 | MNLI | 20 | 84.6 |
| deberta_mnli_15 | MNLI | 15 | 83.3 |
| deberta_mnli_10 | MNLI | 10 | 81.6 |
| deberta_rte_20 | RTE | 20 | 69.0 |
| deberta_rte_15 | RTE | 15 | 67.1 |
| deberta_rte_10 | RTE | 10 | 66.8 |
| deberta_cola_20 | CoLA | 20 | 50.7 |
| deberta_cola_15 | CoLA | 15 | 46.6 |
| deberta_cola_10 | CoLA | 10 | 40.6 |
| deberta_stsb_20 | STS-B | 20 | 89.0/88.6 |
| deberta_stsb_15 | STS-B | 15 | 87.9/87.5 |
| deberta_stsb_10 | STS-B | 10 | 87.2/86.8 |
| bert_rte_20 | RTE | 20 | 66.1.7 |
| bert_rte_15 | RTE | 15 | 64.6 |
| bert_rte_10 | RTE | 10 | 63.2 |

### BERT-base on GLUE with knowledge distillation

| Model Name | Task | Parameter Ratio (%) | Performance |
| --- | --- | --- | --- |
| bert_mnli_25_distil | MNLI | 25 | 84.6 |
| bert_mnli_50_distil | MNLI | 50 | 85.1 |
| bert_rte_50_distil | RTE | 50 | 75.8 |

You can download the checkpoints with the `wget` command.

### Arguments

#### Main experiment arguments

#### Other experiment arguments

## Plug into your code!

Three steps to apply our method to your own code. Make sure you have `import utils` first.

### Step 1: Replace Matrices

Insert the following code after loading the pre-trained model:

```python
# Substitute weights with low-rank matrices and a sparse matrix
allow_name = ['query', 'key', 'value', 'q_proj', 'k_proj', 'v_proj', 'out_proj', 'dense', 'attention', 'fc1', 'fc2']
block_name = ['pooler', 'classifier', 'LayerNorm', 'embeddings']

utils.substitute_layer_weights(module=model,
                               allow_name=allow_name,
                               block_name=block_name,
                               parameter_ratio=args.low_rank_parameter_ratio,
                               do_svd=True)
```

### Step 2: Set Up the Pruner

Insert the following code anywhere before the training loop:

```python
pruner = utils.Pruner(model=model,
                      args=args,
                      total_step=args.max_train_steps,
                      mask_param_name=['sparse'],
                      pruner_name='PLATON',
                      structured_method=args.structured_method,
                      structured_direction=args.structured_direction)
```

### Step 3: Prune During Training

Insert `threshold, mask_threshold = pruner.update_and_pruning(model, completed_steps)` after `loss.backward()` but before `optimizer.zero_grad()`. For example:

```python
for batch in dataloader:
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()
    optimizer.step()

    # Prune the sparse matrix
    threshold, mask_threshold = pruner.update_and_pruning(model, completed_steps)

    lr_scheduler.step()
    optimizer.zero_grad()
    completed_steps += 1  # step counter passed to the pruner
```
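
Putting the three steps together, a minimal fine-tuning skeleton might look like the sketch below. It assumes `model`, `dataloader`, and `args` are set up as in the snippets above; the optimizer, scheduler, and the extra `args` fields (`learning_rate`, `num_train_epochs`) are placeholders for whatever your training code already uses, not part of the package API.

```python
import torch
import utils

def losparse_finetune(model, dataloader, args):
    # Step 1: replace weight matrices with low-rank factors plus a sparse matrix
    allow_name = ['query', 'key', 'value', 'q_proj', 'k_proj', 'v_proj',
                  'out_proj', 'dense', 'attention', 'fc1', 'fc2']
    block_name = ['pooler', 'classifier', 'LayerNorm', 'embeddings']
    utils.substitute_layer_weights(module=model,
                                   allow_name=allow_name,
                                   block_name=block_name,
                                   parameter_ratio=args.low_rank_parameter_ratio,
                                   do_svd=True)

    # Placeholder optimizer/scheduler: reuse whatever your code already defines
    optimizer = torch.optim.AdamW(model.parameters(), lr=args.learning_rate)
    lr_scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, total_iters=args.max_train_steps)

    # Step 2: set up the pruner before the training loop
    pruner = utils.Pruner(model=model,
                          args=args,
                          total_step=args.max_train_steps,
                          mask_param_name=['sparse'],
                          pruner_name='PLATON',
                          structured_method=args.structured_method,
                          structured_direction=args.structured_direction)

    completed_steps = 0
    for epoch in range(args.num_train_epochs):
        for batch in dataloader:
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()

            # Step 3: prune the sparse matrices as training progresses
            threshold, mask_threshold = pruner.update_and_pruning(model, completed_steps)

            lr_scheduler.step()
            optimizer.zero_grad()
            completed_steps += 1
```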