Relative Importance and Activations (RIA)

PLUG-AND-PLAY: AN EFFICIENT POST-TRAINING PRUNING METHOD FOR LARGE LANGUAGE MODELS

Yingtao Zhang<sup>1,2</sup>, Haoli Bai<sup>4</sup>, Haokun Lin<sup>5</sup>, Jialin Zhao<sup>1,2</sup>, Lu Hou<sup>4</sup>, & Carlo Vittorio Cannistraci<sup>1,2,3</sup>

<sup>1</sup> Center for Complex Network Intelligence, Tsinghua Laboratory of Brain and Intelligence
<sup>2</sup> Department of Computer Science, Tsinghua University
<sup>3</sup> Department of Biomedical Engineering, Tsinghua University
<sup>4</sup> Huawei Noah’s Ark Lab
<sup>5</sup> Institute of Automation, Chinese Academy of Sciences

Correspondence to: {zhangyingtao1024, kalokagathos.agon}@gmail.com

Setup

Step 1: Create a new conda environment:

conda create -n ria python=3.10
conda activate ria

Step 2: Install relevant packages

pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/nightly/cu121

Step 3: Install lm-evaluation-harness (needed only if you use --eval_zero_shot)

Follow the installation instructions here: https://github.com/EleutherAI/lm-evaluation-harness
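A typical from-source installation (check the lm-evaluation-harness README for the up-to-date instructions) looks roughly like this:

git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .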

Usage

RIA with unstructured 50% sparsity

python main.py \
	--model YOUR_MODEL_NAME \
	--prune_method ria \
	--sparsity_ratio 0.5 \
	--sparsity_type unstructured \
	--save

Here, --prune_method can be replaced with wanda, sparsegpt, ri, or magnitude.
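For intuition, here is a rough, unofficial Python sketch of an RIA-style score: a weight's magnitude normalized by its row and column sums (its relative importance), multiplied by the norm of the corresponding input activations. The exact formulation and exponent used in the paper and in main.py may differ; treat this purely as an illustration.

import torch

def ria_style_score(W, x_norm, a=0.5):
    """Illustrative RIA-style pruning score (not the reference implementation).

    W:      weight matrix, shape (out_features, in_features)
    x_norm: per-input-channel L2 norm of calibration activations, shape (in_features,)
    a:      exponent on the activation term (assumed 0.5 here)
    """
    absW = W.abs()
    # Relative importance: magnitude normalized by the weight's row sum and column sum.
    rel_importance = absW / absW.sum(dim=1, keepdim=True) + absW / absW.sum(dim=0, keepdim=True)
    # Combine with the activation norm of the matching input channel (broadcast over rows).
    return rel_importance * x_norm.pow(a)

# Weights with the lowest scores would be removed to reach the target sparsity ratio.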

RIA with semi-structured sparsity

python main.py \
	--model YOUR_MODEL_NAME \
	--prune_method ria \
	--sparsity_ratio 0.5 \
	--sparsity_type 2:4 \
	--save

--sparsity_type can be any semi-structured (N:M) sparsity pattern, for instance 1:4 or 2:4.
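In an N:M pattern, at most N nonzero weights are kept in every group of M consecutive weights along the input dimension. A small illustrative sketch (not taken from main.py) that builds such a mask by keeping the top-N scores per group:

import torch

def n_m_mask(scores, n=2, m=4):
    """Keep the n highest-scoring entries in every group of m consecutive input channels."""
    out_features, in_features = scores.shape
    groups = scores.reshape(out_features, in_features // m, m)
    topk = groups.topk(n, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, topk, 1.0).bool()
    return mask.reshape(out_features, in_features)

# Example: a 2:4 mask keeps exactly two weights in every block of four.
mask = n_m_mask(torch.rand(8, 16), n=2, m=4)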

Enable --reallocation if you want to use heuristic channel reallocation.

Enable --lsa if you want to further finetune the channels after reallocation with linear sum assignment.

Enable --fast if you want to use a fast version of linear sum assignment.
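For example, to run 2:4 pruning with heuristic channel reallocation followed by the fast linear-sum-assignment refinement (combining the flags listed above; this assumes they can be passed together in one run):

python main.py \
	--model YOUR_MODEL_NAME \
	--prune_method ria \
	--sparsity_ratio 0.5 \
	--sparsity_type 2:4 \
	--reallocation \
	--lsa \
	--fast \
	--save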

End-to-End inference speedup with semi-structured sparsity


Currently, this repo only supports acceleration of direct N:M sparsity; acceleration of N:M sparsity after channel permutation is still under testing.

python main.py \
	--model YOUR_MODEL_NAME \
	--prune_method ria \
	--sparsity_ratio 0.5 \
	--sparsity_type 2:4 \
	--semi_sparse_acc \
	--save

Requirements:

Make sure that your GPU supports cuSPARSELt; otherwise, set

SparseSemiStructuredTensor._FORCE_CUTLASS = True

which forces PyTorch to fall back to the CUTLASS kernels.
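For reference, a minimal PyTorch sketch of forcing the CUTLASS backend and converting a 2:4-pruned weight to the semi-structured sparse format might look like the following. The torch.sparse API shown here is upstream PyTorch (not part of this repo) and is still a prototype feature, so details may change between releases.

import torch
from torch.sparse import SparseSemiStructuredTensor, to_sparse_semi_structured

# Fall back to the CUTLASS kernels when cuSPARSELt is not available on your GPU.
SparseSemiStructuredTensor._FORCE_CUTLASS = True

# Toy example: a half-precision linear layer with a fixed 2:4 pattern (2 of every 4 weights kept).
linear = torch.nn.Linear(1024, 1024).half().cuda().eval()
mask = torch.tensor([0, 0, 1, 1], dtype=torch.bool, device="cuda").tile(1024, 256)
linear.weight = torch.nn.Parameter(to_sparse_semi_structured(linear.weight.detach() * mask))

x = torch.rand(64, 1024, dtype=torch.half, device="cuda")
with torch.inference_mode():
    y = linear(x)  # the matmul now runs on the semi-structured sparse kernels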

Additional Experimental Results on LLaMA3


LLaMA3-8B on WikiText-2 (dense model PPL: 6.14)

| Method | 50% unstructured | 2:4 | 2:4 + Channel Permutation | 4:8 |
|---|---|---|---|---|
| Magnitude | 2499.39 | | | |
| Relative Importance | 135.77 | | | |
| Wanda | 10.82 | 24.18 | 22.03 | |
| SparseGPT | 9.40 | 16.26 | 12.13 | |
| RIA | 9.34 | 23.08 | 20.05 | |

LLaMA3-70B on WikiText-2 (dense model PPL: 2.85)

| Method | 50% unstructured | 2:4 | 2:4 + Channel Permutation | 4:8 |
|---|---|---|---|---|
| Magnitude | 19.11 | | | |
| Relative Importance | 6.09 | | | |
| Wanda | 6.56 | 9.28 | | |
| SparseGPT | 5.79 | | | |
| RIA | 5.49 | 8.35 | | |

Acknowledgment


This repository is built upon the SparseGPT and Wanda repositories.

Citation


If you use our code, please consider citing:

@inproceedings{zhang2024plugandplay,
  title={Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models},
  author={Yingtao Zhang and Haoli Bai and Haokun Lin and Jialin Zhao and Lu Hou and Carlo Vittorio Cannistraci},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=Tr0lPx9woF}
}