Awesome ML Model Compression

An awesome-style list that curates the best machine learning model compression and acceleration research papers, articles, tutorials, libraries, tools, and more. PRs are welcome!

Contents


Papers

General

Architecture

Quantization

Binarization

Pruning

Distillation

Low Rank Approximation

Offloading

Recent years have witnessed the emergence of systems specialized for LLM inference, such as FasterTransformer (NVIDIA, 2022), PaLM inference (Pope et al., 2022), DeepSpeed-Inference (Aminabadi et al., 2022), Accelerate (HuggingFace, 2022), LightSeq (Wang et al., 2021), and TurboTransformers (Fang et al., 2021).

To enable LLM inference on easily accessible hardware, offloading is an essential technique: weights (and, in some systems, activations or the KV cache) that do not fit in GPU memory are kept in CPU RAM or on disk and streamed to the GPU on demand. To our knowledge, among current systems only DeepSpeed-Inference and Hugging Face Accelerate include such functionality.
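As a concrete illustration, here is a minimal sketch of offloading with Hugging Face Accelerate via `transformers`' `device_map` support. The checkpoint name and memory budgets are placeholders; adjust them to your hardware. Layers that exceed the GPU budget are placed in CPU RAM, and anything beyond that is spilled to the `offload_folder` on disk.

```python
# Minimal offloading sketch using Hugging Face transformers + Accelerate.
# The model name and memory limits below are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",                        # let Accelerate place each layer
    max_memory={0: "4GiB", "cpu": "16GiB"},   # per-device memory budgets
    offload_folder="offload",                 # spill remaining weights to disk
)

# Accelerate's dispatch hooks move inputs to each layer's device as needed.
inputs = tokenizer("Model compression is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The trade-off is throughput: every offloaded layer must cross the PCIe bus (or worse, the disk) on each forward pass, so offloading favors memory capacity over latency.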

Parallelism

Papers on model parallelism and related parallelization techniques for model acceleration (a minimal sketch of the core idea follows this list):
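To make the core idea concrete, below is a minimal sketch of tensor (model) parallelism in plain PyTorch: a linear layer's weight matrix is sharded along its output dimension across devices, each device computes a partial output from its shard, and the shards are concatenated. The device names are placeholders; production systems (e.g., Megatron-LM) do this with communication collectives and overlap compute with communication.

```python
# Minimal tensor-parallelism sketch: shard a linear layer across devices.
# Device names are placeholders; use e.g. "cuda:0", "cuda:1" on real GPUs.
import torch

devices = ["cpu", "cpu"]

in_features, out_features = 8, 6
full_weight = torch.randn(out_features, in_features)

# Shard the weight along the output dimension (one chunk per device).
shards = [
    chunk.to(dev)
    for chunk, dev in zip(full_weight.chunk(len(devices), dim=0), devices)
]

x = torch.randn(2, in_features)  # a batch of activations

# Each device holds only its shard and computes a partial output.
partials = [x.to(dev) @ w.t() for w, dev in zip(shards, devices)]

# Gather the partial outputs to reassemble the full result.
y = torch.cat([p.to(devices[0]) for p in partials], dim=-1)
assert y.shape == (2, out_features)
```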

Articles

Content published on the Web.

Howtos

Assorted

Reference

Blogs

Tools

Libraries

Frameworks

Paper Implementations

Videos

Talks

Training & Tutorials

License

I am providing code and resources in this repository to you under an open source license. Because this is my personal repository, the license you receive to my code and resources is from me and not my employer.