The LLM Surgeon

Abstract


Code for the paper The LLM Surgeon. The code is based on the GitHub repo for the 2023 ICML paper SparseGPT and uses the same data and evaluation pipeline for a fair comparison.

State-of-the-art language models are becoming increasingly large in an effort to achieve the highest performance on large corpora of available textual data. However, the sheer size of the Transformer architectures makes it difficult to deploy models within computational, environmental or device-specific constraints. We explore data-driven compression of existing pretrained models as an alternative to training smaller models from scratch. To do so, we scale Kronecker-factored curvature approximations of the target loss landscape to large language models. In doing so, we can compute both the dynamic allocation of structures that can be removed as well as updates of remaining weights that account for the removal. We provide a general framework for unstructured, semi-structured and structured pruning and improve upon weight updates to capture more correlations between weights, while remaining computationally efficient. Experimentally, our method can prune rows and columns from a range of OPT models and Llamav2-7B by 20%-30%, with a negligible loss in performance, and achieve state-of-the-art results in unstructured and semi-structured pruning of large language models.
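
For orientation, the approach builds on the classic Optimal Brain Surgeon equations for the cost of removing a weight and the compensating update of the remaining weights. A standard statement of the single-weight case (our notation; the paper generalizes this to structured and multi-weight removals):

% Local quadratic model of the loss with curvature H; removing weight w_q:
\[
  \mathcal{L}_q = \frac{w_q^2}{2\,[\mathbf{H}^{-1}]_{qq}},
  \qquad
  \delta\mathbf{w} = -\frac{w_q}{[\mathbf{H}^{-1}]_{qq}}\,\mathbf{H}^{-1}\mathbf{e}_q
\]
% The paper scales this to LLMs with a Kronecker-factored curvature
% H \approx A \otimes G built from input-activation (A) and output-derivative
% (G) covariances (factor order depends on the vectorization convention).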

Getting started

Installation

First, let's define the path where we want the repository to be downloaded:

REPO_PATH=<path/to/repo>

Now we can clone the repository:

git clone git@github.com:Qualcomm-AI-research/<...>.git $REPO_PATH
cd $REPO_PATH

Next, create a virtual environment. The minimum required Python version is 3.6; this codebase has been tested with Python 3.8.1.

python3 -m venv env
source env/bin/activate
pip install --upgrade --no-deps pip

Finally, install the dependencies using pip from requirements.txt:

pip install -r requirements.txt

Getting access to the models

The OPT models used in the examples below are specified by their Hugging Face identifiers (e.g. facebook/opt-125m). For the Llama-v2 experiments, the weights must be obtained separately and made available under the $LLAMA_V2_ROOT path used by the scripts (see Reproducibility below).

We are now ready to run the experiments!

Example Usage

Structured LLM Surgeon

python surgeon.py --model_str facebook/opt-125m --sparsity 0.5 --curvature kfac --structures row column

Semi-structured LLM Surgeon

python surgeon.py --model_str facebook/opt-125m  --sparsity 0.5 --curvature kfac --structures 2:4 --shots 5 --use_iad --max_correlate 2000 --addupdate --damp_g 0.1

Unstructured LLM Surgeon

python surgeon.py --model_str facebook/opt-125m --sparsity 0.5 --curvature kfac --structures element

Examples of the different structure types are sketched below (illustration only; this snippet is not part of the repo):
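
# Illustration only: which entries each --structures option removes
# from a 4x8 weight matrix at 50% sparsity.
import torch

W = torch.randn(4, 8)

# element: unstructured, any individual weight can be removed.
element_mask = (torch.rand_like(W) > 0.5).float()

# row column: structured, entire rows and/or columns are removed.
struct_mask = torch.ones_like(W)
struct_mask[1, :] = 0  # an entire row
struct_mask[:, 3] = 0  # an entire column

# 2:4: semi-structured, keep the 2 largest-magnitude weights in every
# block of 4 consecutive weights (4:8 is analogous).
blocks = W.abs().reshape(-1, 4)
keep = blocks.argsort(dim=1, descending=True)[:, :2]
semi_mask = torch.zeros_like(blocks).scatter_(1, keep, 1.0).reshape_as(W)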

Main settings

| Argument | Default | Information |
| --- | --- | --- |
| --model_str | facebook/opt-125m | Hugging Face identifier of the model to prune. |
| --sparsity | 0.5 | Target sparsity level: the fraction of weights that are removed. |
| --curvature | kfac | Type of loss-curvature approximation. Choose from: identity, activations, or kfac (recommended). |
| --structures | row column | Structured (row column), semi-structured (2:4 or 4:8), or unstructured (element) pruning. |
| --obd | None | Assume all weights are independent, ignoring off-diagonal terms; only removes weights without updating the remainder (not recommended). |
| --use_iad | None | Assume independence of activations and derivatives, as in classic KFAC (recommended). |
| --shots | 40 | Number of shots in the pruning schedule. |
| --fisher_samples | 0 | Number of samples for the Fisher estimate; at setting 0, the empirical Fisher (EF) is used. |
| --max_correlate | 0 | Maximum number of weights correlated within a layer per shot; 0 correlates everything (recommended), -1 correlates within rows. |
| --krank | 1 | Number of Kronecker factors summed in the kfac curvature estimate; default is 1 (recommended). |
| --diagonal | None | Additionally use a diagonal curvature estimate (recommended). |

For a full overview of available settings, run python surgeon.py --help or check out surgeon.py.

Reproducibility

Reproduce results from paper

Scripts to reproduce all results in the paper can be found in the scripts/ directory.

In these scripts, make sure to set the $LLAMA_V2_ROOT, $PROJECT_ROOT and $LOG_ROOT variables, or ensure that they are set in the environment. Alternatively, instead of setting $LOG_ROOT, the root log directory can be passed to surgeon.py as --log_root.

Run other methods

Our code offers a general framework for pruning algorithms. A few examples of pruning methods:

# Magnitude pruning
python surgeon.py --model_str facebook/opt-125m --sparsity 0.5 --curvature identity --obd --structures row column

# L-OBD
python surgeon.py --model_str facebook/opt-125m --sparsity 0.5 --curvature activations --obd --structures row column

# Kron-OBD
python surgeon.py --model_str facebook/opt-125m --sparsity 0.5 --curvature kfac --obd --structures row column

# Structured LLM Surgeon
python surgeon.py --model_str facebook/opt-125m --sparsity 0.5 --curvature kfac --structures row column

# Unstructured LLM Surgeon
python surgeon.py --model_str facebook/opt-125m --sparsity 0.5 --curvature kfac --structures element
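
These baselines differ mainly in the curvature estimate used to score weights. As a rough sketch (hypothetical helper, not the repo's implementation) of diagonal, OBD-style removal costs:

import torch

def removal_costs(W, curvature="identity", A=None, G=None):
    # Hypothetical illustration: diagonal (OBD-style) removal cost
    # L_ij ~ 1/2 * H_(ij),(ij) * W_ij^2 for weights W of shape (out, in),
    # with input-activation covariance A (in, in) and derivative
    # covariance G (out, out).
    if curvature == "identity":      # magnitude pruning: H = I
        h_diag = torch.ones_like(W)
    elif curvature == "activations": # L-OBD: activation statistics only
        h_diag = torch.diag(A).expand_as(W)
    elif curvature == "kfac":        # Kron-OBD: Kronecker-factored H
        h_diag = torch.outer(torch.diag(G), torch.diag(A))
    else:
        raise ValueError(f"unknown curvature: {curvature}")
    return 0.5 * h_diag * W.pow(2)  # lowest-cost weights are pruned first

The LLM Surgeon variants additionally use off-diagonal curvature to correlate weights and update the remaining weights after removal.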

Extendability

The codebase is designed to be easily extendable to other curvature estimates and pruning algorithms (see the sketch after the table below).

| File | Contains |
| --- | --- |
| surgeon.py | Main script. |
| sparsify.py | General outer loop to prune a large language model. |
| curvature.py | Curvature types inheriting from the Curvature class. |
| pruner.py | Computation of losses and weight updates from curvatures. |
| threshold.py | Efficient computation of a global threshold from local layer-wise losses to match the target sparsity level; takes into account overlap between rows and columns as well as layer sizes. |
| eval.py | Code for the evaluation pass. |
| lora.py | Code for LoRA finetuning. |
| utils.py | Helper functions; most importantly, sub_to_full and full_to_sub to perform operations only on relevant sub-matrices. |
| datautils.py | Helper functions for data loading. |
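
For example, a new curvature estimate could be added by subclassing Curvature in curvature.py. The sketch below is hypothetical; the method names and base-class interface are illustrative only, so check curvature.py for the actual signatures:

import torch
from curvature import Curvature  # assumed import; see curvature.py

class DiagonalEmpiricalFisher(Curvature):
    # Hypothetical toy curvature: a running diagonal empirical-Fisher
    # estimate accumulated from per-weight squared gradients.
    def __init__(self, layer):
        super().__init__(layer)  # assumed base-class constructor
        self.h_diag = torch.zeros_like(layer.weight)

    def accumulate(self, grad):
        # Assumed hook called with the layer's weight gradient per batch.
        self.h_diag += grad.pow(2)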

References

If you find this code useful, please cite:

@inproceedings{van2023llm,
 title={The LLM Surgeon},
 author={van der Ouderaa, Tycho FA and Nagel, Markus and van Baalen, Mart and Asano, Yuki M and Blankevoort, Tijmen},
 booktitle={International Conference on Learning Representations (ICLR)},
 year={2024}
}