EvoPress
Code for the paper "EvoPress: Towards Optimal Dynamic Model Compression via Evolutionary Search".
Usage
Repository structure
scripts/ - bash scripts with the required arguments to run the method
src/ - helper methods and utility functions
evo_drop_search.py - evolutionary depth pruning
drop_scoring.py - scoring-based baseline methods for depth pruning
brute_force_drop.py - brute-force depth pruning
evo_prune_search.py - evolutionary unstructured sparsity allocation
prune.py - SparseGPT unstructured pruning (preparation of database for EvoPress)
owl_prune.py - SparseGPT unstructured pruning (preparation of database for OWL)
evo_quant_search.py - evolutionary quantization bitwidth allocation
quant.py - GPTQ quantization (preparation of database for EvoPress)
compute_layer_errors.py - compute NMSE for the Dynamic Programming (DP) solver
dp_search.py - run the DP solver on top of the configuration produced by compute_layer_errors.py
lmeval.py - LM Eval Harness evaluation script
eval_ppl.py - perplexity evaluation script
Calibration data
We provide 3 options for calibration data: wikitext2, c4, and fineweb_edu.
We recommend fineweb_edu for the best results. In our experiments we used 8M tokens for calibration. To prepare a specific amount of calibration data, specify --calibration_tokens. By default, we trim the calibration sequence length to the model's maximal context length. However, for some models the context length may be too long to fit into memory, so we set --calibration_sequence_length to 8k for models with context length >= 8k.
In our experiments we used --calibration_tokens=2^23 (i.e. 8,388,608 tokens) and --calibration_sequence_length=8192 for Llama-3-8B, Llama-3.1-8B, and Phi-3-medium-128k-instruct, and --calibration_sequence_length=4096 for Llama-2-7b.
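To make the relationship between the two settings concrete, the sketch below computes how many calibration sequences a given token budget yields. It assumes the token budget is split into fixed-length sequences, which is an assumption about the data preparation and not something confirmed by the scripts; the function name is hypothetical.

```python
def num_calibration_sequences(calibration_tokens: int, sequence_length: int) -> int:
    """Number of full sequences obtained by cutting a token budget into
    fixed-length chunks (assumed behavior, not taken from the repository)."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    return calibration_tokens // sequence_length


# Settings quoted above: 2^23 tokens with 8192- or 4096-token sequences.
long_context = num_calibration_sequences(2**23, 8192)
short_context = num_calibration_sequences(2**23, 4096)
print(long_context, short_context)
```

Under this assumption, halving the sequence length doubles the number of sequences while the total number of calibration tokens stays the same.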
Multi-GPU
Some of the scripts (Unstructured Sparsity, Quantization) can run in distributed mode for faster execution. We recommend launching them with torchrun:
torchrun --nnodes=1 --nproc-per-node=<NUM_GPU> <name_of_the_script.py> <args...>
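For readers unfamiliar with torchrun: it starts one process per GPU and exports RANK and WORLD_SIZE as environment variables to each of them. The sketch below shows how a script could read these values and split work across processes. The round-robin layer assignment and both function names are hypothetical illustrations, not the scheme used by the scripts in this repository.

```python
import os


def rank_and_world_size() -> tuple[int, int]:
    """Read the process rank and world size exported by torchrun.
    Falls back to a single-process setup when the variables are absent."""
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    return rank, world_size


def layers_for_rank(num_layers: int, rank: int, world_size: int) -> list[int]:
    """Hypothetical round-robin assignment of layer indices to one process."""
    return list(range(rank, num_layers, world_size))


rank, world_size = rank_and_world_size()
print(f"rank {rank} of {world_size} handles layers {layers_for_rank(32, rank, world_size)}")
```

Run without torchrun, the fallback values make the single process handle every layer, so the same code path works in both modes.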
Depth pruning
We provide 3 versions of depth pruning:
evo_drop_search.py - depth pruning via EvoPress
drop_scoring.py - depth pruning via scoring methods
brute_force_drop.py - depth pruning via brute force
To run EvoPress for depth pruning, execute run_drop_search.sh in the scripts folder.
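As a rough intuition for what an evolutionary search over dropped blocks looks like, here is a generic elitist search on a toy problem. It is a minimal sketch and not the EvoPress algorithm implemented in evo_drop_search.py: the swap mutation, the selection rule, the toy fitness, and all names are assumptions made for illustration. In practice the fitness would be measured on calibration data rather than read from a fixed list.

```python
import random


def mutate(mask, rng):
    """Swap one dropped block with one kept block, so the number of
    dropped blocks stays constant."""
    dropped = [i for i, d in enumerate(mask) if d]
    kept = [i for i, d in enumerate(mask) if not d]
    child = list(mask)
    if dropped and kept:
        child[rng.choice(dropped)] = False
        child[rng.choice(kept)] = True
    return child


def evolve_drop_mask(num_blocks, num_dropped, fitness,
                     generations=50, offspring=8, seed=0):
    """Elitist search: keep the parent unless an offspring has lower fitness."""
    rng = random.Random(seed)
    parent = [i < num_dropped for i in range(num_blocks)]
    rng.shuffle(parent)
    best = fitness(parent)
    for _ in range(generations):
        children = [mutate(parent, rng) for _ in range(offspring)]
        for child in children:
            score = fitness(child)
            if score < best:
                parent, best = child, score
    return parent, best


# Toy fitness: total "importance" of the dropped blocks (lower is better).
importance = [5, 1, 4, 1, 3, 1, 2, 1]


def toy_fitness(mask):
    return sum(w for w, d in zip(importance, mask) if d)


mask, best = evolve_drop_mask(len(importance), 3, toy_fitness)
print(mask, best)
```

Because the mutation only swaps a dropped block with a kept one, every candidate respects the same compression budget, and the search never has to repair infeasible configurations.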
Unstructured Sparsity
We provide 2 scripts for unstructured pruning:
prune.py - SparseGPT unstructured pruning (preparation of database for EvoPress)
owl_prune.py - SparseGPT unstructured pruning (preparation of database for OWL)
To run EvoPress for non-uniform unstructured pruning, first execute run_sparse_gpt.sh to generate the database, then run_prune_search.sh for the search.
Quantization
We provide quant.py for producing the GPTQ database for EvoPress.
To run EvoPress for non-uniform quantization, first execute run_gptq.sh to generate the database, then run_quant_search.sh for the search.
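A non-uniform allocation assigns a different bitwidth to each layer while keeping the overall budget fixed. The sketch below shows one way to check such a budget, assuming it is measured as the parameter-weighted average bitwidth. That definition, the layer sizes, and the function name are illustrative assumptions and are not taken from evo_quant_search.py.

```python
def average_bitwidth(layer_sizes, layer_bits):
    """Parameter-weighted average bitwidth of a per-layer allocation."""
    if len(layer_sizes) != len(layer_bits):
        raise ValueError("layer_sizes and layer_bits must have the same length")
    total = sum(layer_sizes)
    return sum(n * b for n, b in zip(layer_sizes, layer_bits)) / total


# Hypothetical model with three layers of 100, 100, and 200 parameters.
sizes = [100, 100, 200]
uniform = average_bitwidth(sizes, [3, 3, 3])
non_uniform = average_bitwidth(sizes, [2, 4, 3])
print(uniform, non_uniform)
```

Both allocations use the same average number of bits per parameter, so a search can compare them on quality alone.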
Evaluation
We provide the lmeval.py and eval_ppl.py scripts for evaluation on Language Model Evaluation Harness benchmarks and for perplexity measurements, respectively. The interface of lmeval.py mostly follows the instructions of the original harness. In addition, specify the path to the sparse or quantized weights via the --sparse-weights-path / --quant-weights-path argument, and the path to the .txt file with the chosen compression levels via the --sparse-config-path / --quant-config-path argument. We used lm-eval==0.4.0 for evaluation.
Environment
This code was tested on the following environment:
pytorch        2.4.0    py3.10_cuda12.1_cudnn9.1.0_0  pytorch
pytorch-cuda   12.1     ha16c6d3_5                    pytorch
cuda           12.1.0   0                             nvidia/label/cuda-12.1.0
transformers   4.43.4   pypi_0                        pypi
datasets       2.21.0   pypi_0                        pypi
lm-eval        0.4.0    pypi_0                        pypi
Notes
The prune.py, owl_prune.py, and quant.py scripts produce several compressed versions of each weight (100-200 GB). Make sure you have sufficient free disk space before running them. Additionally, when using KL-Divergence as the fitness function for the search, ensure you have enough RAM to store the logits, particularly for models with a 128K vocabulary. Alternatively, we implemented TopK-KL-Divergence in evo_quant_search.py, which significantly reduces memory requirements. Preliminary experiments have shown this method to be comparably effective to KL-Divergence for $K \geq 512$.
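To illustrate why a top-K variant saves memory, the sketch below computes a KL divergence restricted to the K largest reference logits, so only K values and K indices per token need to be stored instead of the full vocabulary. This is one plausible formulation written in plain Python; the exact definition in evo_quant_search.py may differ (for example, in how the remaining probability mass is handled), and the function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def topk_kl(ref_logits, test_logits, k):
    """KL(ref || test) restricted to the k highest reference logits,
    with both distributions renormalized over that subset (assumed variant)."""
    top = sorted(range(len(ref_logits)),
                 key=lambda i: ref_logits[i], reverse=True)[:k]
    ref_lp = log_softmax([ref_logits[i] for i in top])
    test_lp = log_softmax([test_logits[i] for i in top])
    return sum(math.exp(p) * (p - q) for p, q in zip(ref_lp, test_lp))


ref = [2.0, 1.0, 0.0, -1.0]
test = [1.0, 2.0, 0.0, -1.0]
print(topk_kl(ref, test, k=2), topk_kl(ref, test, k=len(ref)))
```

With k equal to the vocabulary size, this reduces to the ordinary KL divergence between the two softmax distributions.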