LLMCompression
Official codebase for the EMNLP 2023 findings paper titled "The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models".
We also presented this work at ENLSP 2023 (a workshop at NeurIPS).
Resources: Paper, Twitter Thread (for a short summary), Poster, Presentation Slides
Experiments are broadly divided into Encoder-only models, Decoder-only models, and Encoder-Decoder models. Compression techniques include Pruning (Sec 4.1), Quantization (Sec 4.2), Pruning+Quantization (Sec 4.3), and Final Dense Layer Pruning (Sec 4.4).
Encoder-only models
Dataset
LAMA is from Facebook (https://github.com/facebookresearch/LAMA), and the Huggingface version has some issues with T-REx and Google-RE, so we downloaded the data from the original source and processed it ourselves. We followed the BERTNesia setup (https://github.com/jwallat/knowledge-probing).
General Idea
- Load the model and its corresponding tokenizer (and tokenize the dataset)
- Select the layers that you would like to compress (Attention layers, Feedforward layers or both)
- Pass these layers to the compression technique (pruning, quantization, or both) and save the compressed model instance (a sketch follows the example command below)
- Run evaluation metrics on the compressed model
Example script: python bert_prune.py 'bert-base-uncased' 'overall_global_pruning' > bert_overall_gp.log
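For illustration, here is a minimal sketch of these steps. It is not the exact contents of bert_prune.py; the layer selection, pruning percentage, and output path are assumptions.

```python
# Minimal sketch of the general idea above (not the repo's exact bert_prune.py).
# Loads BERT, globally prunes the attention and feed-forward Linear layers with
# torch.nn.utils.prune, then saves the compressed model for evaluation.
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Select the layers to compress: attention, feed-forward, or both.
parameters_to_prune = [
    (module, "weight")
    for name, module in model.named_modules()
    if isinstance(module, torch.nn.Linear)
    and ("attention" in name or "intermediate" in name or "output" in name)
]

# Globally prune 30% of the selected weights by L1 magnitude (percentage is illustrative).
prune.global_unstructured(
    parameters_to_prune, pruning_method=prune.L1Unstructured, amount=0.3
)

# Make the pruning masks permanent and save the compressed model instance.
for module, param_name in parameters_to_prune:
    prune.remove(module, param_name)
model.save_pretrained("bert-base-uncased-overall-gp-0.3")
tokenizer.save_pretrained("bert-base-uncased-overall-gp-0.3")
```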
The same idea applies to RoBERTa, ALBERT, and DistilBERT. Files are named model-name_compression.py and contain the steps discussed above with model-specific tokenizers.
Encoder-Decoder and Decoder-only models
Dataset
We used the BoolQ, PIQA, and Winogrande datasets (available in lm-evaluation-harness, https://github.com/EleutherAI/lm-evaluation-harness) to evaluate these models.
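For reference, a plain (uncompressed) evaluation on these tasks with the harness's main.py entry point looks roughly like the command below; the checkpoint name is only a placeholder.

```bash
python main.py \
    --model hf-causal \
    --model_args pretrained=gpt2 \
    --tasks boolq,piqa,winogrande \
    --device cuda:0
```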
Initial setup
The evaluation harness codebase doesn't natively support PyTorch's dynamic quantization (see https://github.com/EleutherAI/lm-evaluation-harness/issues/535), so a few steps are needed to replicate the experiments:
General Idea
- Clone the evaluation-harness repository (https://github.com/EleutherAI/lm-evaluation-harness)
- Because all the models we work with are hf-causal, the changes go in models/huggingface.py
- Add a get_quantized_layers() function; the idea is to select the layers that you would like to compress
- Inside the init, before loading, check whether quantization_flag is None or one of 'all_layers', 'attention_only', or 'output_only', and pass the model and the selected layers from the previous step to torch.quantization.quantize_dynamic() (see the sketch below)
- Refer to siloed_assets/huggingface.py, which contains the above changes as a standalone file for reference.
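Below is a minimal sketch of these two changes. The helper body and the module-name patterns are assumptions based on a LLaMA-style layout, not the exact code in siloed_assets/huggingface.py; only torch.quantization.quantize_dynamic() is the real PyTorch API.

```python
# Sketch of a get_quantized_layers()-style helper plus the quantize_dynamic() call.
# Module-name patterns ("self_attn", "mlp") assume a LLaMA-style architecture.
import torch
from transformers import AutoModelForCausalLM

def get_quantized_layers(model, quantization_flag):
    """Return fully-qualified names of nn.Linear submodules to quantize."""
    selected = set()
    for name, module in model.named_modules():
        if not isinstance(module, torch.nn.Linear):
            continue  # dynamic quantization targets nn.Linear layers
        in_attention = "self_attn" in name  # attention projections
        in_ffn = "mlp" in name              # feed-forward blocks
        if (quantization_flag == "all_layers" and (in_attention or in_ffn)) \
                or (quantization_flag == "attention_only" and in_attention) \
                or (quantization_flag == "output_only" and in_ffn):
            selected.add(name)
    return selected

model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # illustrative checkpoint
quantization_flag = "attention_only"  # None, 'all_layers', 'attention_only', or 'output_only'
if quantization_flag is not None:
    layers = get_quantized_layers(model, quantization_flag)
    # qconfig_spec accepts a set of fully-qualified submodule names.
    model = torch.quantization.quantize_dynamic(model, qconfig_spec=layers, dtype=torch.qint8)
```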
The current codebase only has commands for pruning. To run quantization-only experiments, go to any of the pruning files and do the following:
1. Comment out the global_pruning_quantize() function
2. For Feedforward networks: --model_args pretrained={model_name}-{prune_type}-{prune_percentage},quantization_flag=output_only,no_of_layers={no_of_layers}
3. For Attention modules: --model_args pretrained={model_name}-{prune_type}-{prune_percentage},quantization_flag=attention_only,no_of_layers={no_of_layers}
4. For All modules: --model_args pretrained={model_name}-{prune_type}-{prune_percentage},quantization_flag=all_layers,no_of_layers={no_of_layers}
To run pruning+quantization experiments, go to any of the pruning files and use:
1. For Feedforward networks: --model_args pretrained={model_name}-{prune_type}-{prune_percentage},quantization_flag=output_only,no_of_layers={no_of_layers}
2. For Attention modules: --model_args pretrained={model_name}-{prune_type}-{prune_percentage},quantization_flag=attention_only,no_of_layers={no_of_layers}
3. For All modules: --model_args pretrained={model_name}-{prune_type}-{prune_percentage},quantization_flag=all_layers,no_of_layers={no_of_layers}
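Putting it together, a full invocation for one of these settings might look like the command below. Fill in the braces with your pruned checkpoint and settings; the task list and device flag follow the harness's usual CLI.

```bash
python main.py \
    --model hf-causal \
    --model_args pretrained={model_name}-{prune_type}-{prune_percentage},quantization_flag=output_only,no_of_layers={no_of_layers} \
    --tasks boolq,piqa,winogrande \
    --device cuda:0
```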
Note: Vicuna-7B is not completely open-sourced, so the models can't be shared. But the idea should be clear by inspecting wizardlm_prune.py; just change the model filename once you have downloaded and formatted Vicuna-7B from Huggingface.
Please cite our work if it's useful in your research:
@article{namburi2023cost,
title={The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models},
author={Namburi, Satya Sai Srinath and Sreedhar, Makesh and Srinivasan, Srinath and Sala, Frederic},
journal={arXiv preprint arXiv:2312.00960},
year={2023}
}
Feel free to email sgnamburi@wisc.edu for more details and questions. You can also open a GitHub issue.
Optional TODOs
- Clean up the code and improve documentation
- Add scripts and organize the codebase better!