# Awesome

Just helping myself keep track of LLM papers that I'm reading, with an emphasis on inference and model compression.
## Transformer Architectures
- Attention Is All You Need
- Fast Transformer Decoding: One Write-Head is All You Need - Multi-Query Attention (sketched after this list)
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
- Augmenting Self-attention with Persistent Memory (Meta 2019)
- MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers (Meta 2023)
- Hyena Hierarchy: Towards Larger Convolutional Language Models
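
The multi-query attention sketch referenced above (toy NumPy, my own illustration, not the paper's code): all query heads share a single key/value head, which shrinks the KV cache by a factor of the head count.

```python
# Multi-query attention: n_heads query heads, one shared key/value head.
# Shapes and names are illustrative only.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(x, Wq, Wk, Wv, Wo, n_heads):
    T, d_model = x.shape
    d_head = d_model // n_heads
    q = (x @ Wq).reshape(T, n_heads, d_head)  # per-head queries
    k = x @ Wk                                # single shared key head, (T, d_head)
    v = x @ Wv                                # single shared value head, (T, d_head)
    scores = np.einsum("thd,sd->hts", q, k) / np.sqrt(d_head)
    causal = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(causal, -1e9, scores)   # mask out future positions
    ctx = softmax(scores) @ v                 # (n_heads, T, d_head)
    return ctx.transpose(1, 0, 2).reshape(T, d_model) @ Wo

rng = np.random.default_rng(0)
d, h = 64, 8
x = rng.normal(size=(10, d))
Wq, Wo = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Wk, Wv = rng.normal(size=(d, d // h)), rng.normal(size=(d, d // h))
print(multi_query_attention(x, Wq, Wk, Wv, Wo, h).shape)  # (10, 64)
```
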
## Foundation Models
- LLaMA: Open and Efficient Foundation Language Models
- PaLM: Scaling Language Modeling with Pathways
- GPT-NeoX-20B: An Open-Source Autoregressive Language Model
- Language Models are Unsupervised Multitask Learners (OpenAI) - GPT-2
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
- OpenLLaMA: An Open Reproduction of LLaMA
- Llama 2: Open Foundation and Fine-Tuned Chat Models
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
## Position Encoding
- Self-Attention with Relative Position Representations
- RoFormer: Enhanced Transformer with Rotary Position Embedding - RoPE (sketched after this list)
- Transformer Language Models without Positional Encodings Still Learn Positional Information - NoPE
- Rectified Rotary Position Embeddings - ReRoPE
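
The RoPE sketch referenced above (toy NumPy, using the half-split pairing common in open-source implementations; purely illustrative): each pair of query/key dimensions is rotated by an angle proportional to the token position, so dot products depend only on relative offsets.

```python
# Rotary position embedding (RoPE), half-split pairing convention.
import numpy as np

def rope(x, base=10000.0):
    """x: (seq_len, dim) queries or keys, dim even; returns the rotated array."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)      # one frequency per dimension pair
    angles = np.outer(np.arange(seq_len), freqs)   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q = np.ones((4, 8))
print(rope(q).shape)  # (4, 8); apply to both queries and keys before attention
```
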
## KV Cache
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models (Jun. 2023)
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
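
For reference, a toy single-head NumPy sketch of what a KV cache buys during autoregressive decoding: keys and values of past tokens are stored once and reused, so each step only projects the newest token. Real servers (vLLM above) manage this memory in paged blocks.

```python
# Per-token decoding with a growing KV cache (single head, no batching).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(x_new, Wq, Wk, Wv, cache):
    """x_new: (d_model,) embedding of the newest token; cache holds past K/V rows."""
    cache["k"].append(x_new @ Wk)       # compute and store K/V for the new token only
    cache["v"].append(x_new @ Wv)
    q = x_new @ Wq
    K = np.stack(cache["k"])            # (t, d_head), one row per generated token
    V = np.stack(cache["v"])
    attn = softmax(K @ q / np.sqrt(K.shape[-1]))
    return attn @ V                     # context vector for the new token

rng = np.random.default_rng(0)
d = 16
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
cache = {"k": [], "v": []}
for _ in range(5):                      # five decoding steps
    out = decode_step(rng.normal(size=d), Wq, Wk, Wv, cache)
print(out.shape, len(cache["k"]))       # (16,) 5
```
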
## Activation
- Searching for Activation Functions
- GLU Variants Improve Transformer - e.g. SwiGLU (sketched after this list)
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
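
The SwiGLU sketch referenced above (toy NumPy; names and sizes are mine, not the paper's): a swish-gated feed-forward block of the kind most recent LLMs use.

```python
# SwiGLU feed-forward block: swish(x W_gate) elementwise-times (x W_up), then W_down.
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))       # swish / SiLU with beta = 1

def swiglu_ffn(x, W_gate, W_up, W_down):
    return (swish(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d, d_ff = 32, 128
x = rng.normal(size=(4, d))
out = swiglu_ffn(x,
                 rng.normal(size=(d, d_ff)),
                 rng.normal(size=(d, d_ff)),
                 rng.normal(size=(d_ff, d)))
print(out.shape)                        # (4, 32)
```
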
## Pruning
- Optimal Brain Damage (1990)
- Optimal Brain Surgeon (1993)
- Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning (Jan. 2023) - Introduces Optimal Brain Quantization, building on the Optimal Brain Surgeon
- Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
- A Simple and Effective Pruning Approach for Large Language Models - Introduces Wanda (pruning with Weights and Activations)
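
A toy NumPy sketch of the Wanda score from the last entry: rank weights by |W_ij| * ||X_j||_2 using a small calibration set and zero the lowest-scoring ones within each output row. Unstructured 50% sparsity here; the paper also covers N:M patterns.

```python
# Wanda-style pruning score: |weight| times the L2 norm of the matching input feature.
import numpy as np

def wanda_prune(W, X, sparsity=0.5):
    """W: (out, in) weight matrix; X: (n_samples, in) calibration activations."""
    score = np.abs(W) * np.linalg.norm(X, axis=0)   # broadcast feature norms over rows
    k = int(W.shape[1] * sparsity)                  # weights to drop per output row
    drop = np.argsort(score, axis=1)[:, :k]         # k lowest-scoring weights per row
    W_pruned = W.copy()
    np.put_along_axis(W_pruned, drop, 0.0, axis=1)
    return W_pruned

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X = rng.normal(size=(64, 16))
print((wanda_prune(W, X) == 0).mean())              # 0.5 of the weights zeroed
```
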
## Quantization
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale - Quantization with outlier handling. Might be solving the wrong problem - see "Quantizable Transformers" below.
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models - Another approach to quantization with outliers
- Up or Down? Adaptive Rounding for Post-Training Quantization (Qualcomm 2020) - Introduces AdaRound
- Understanding and Overcoming the Challenges of Efficient Transformer Quantization (Qualcomm 2021)
- QuIP: 2-Bit Quantization of Large Language Models With Guarantees (Cornell Jul. 2023) - Introduces incoherence processing
- SqueezeLLM: Dense-and-Sparse Quantization (Berkeley Jun. 2023)
- Intriguing Properties of Quantization at Scale (Cohere May 2023)
- Pruning vs Quantization: Which is Better? (Qualcomm Jul. 2023)
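
For context, a minimal round-to-nearest symmetric int8 weight quantizer in NumPy (one scale per output channel). The papers above are mostly about beating this baseline: outlier handling (LLM.int8(), SmoothQuant), learned rounding (AdaRound), Hessian-aware weight updates (OBC), and so on.

```python
# Round-to-nearest symmetric int8 quantization with one scale per output channel.
import numpy as np

def quantize_int8(W):
    """W: (out, in) float weights. Returns int8 weights and per-row scales."""
    scale = np.maximum(np.abs(W).max(axis=1, keepdims=True), 1e-8) / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64)).astype(np.float32)
q, scale = quantize_int8(W)
print(q.dtype, np.abs(W - dequantize(q, scale)).mean())  # int8, small mean error
```
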
## Normalization
- Root Mean Square Layer Normalization - RMSNorm (sketched after this list)
- Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing - Introduces gated attention and argues that outliers are a consequence of normalization
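
The RMSNorm sketch referenced above: scale by the root mean square of the activations (no mean subtraction, unlike LayerNorm), then apply a learned gain.

```python
# RMSNorm: x / rms(x) * gain, with a small epsilon for stability.
import numpy as np

def rmsnorm(x, gain, eps=1e-6):
    """x: (..., d) activations; gain: (d,) learned scale."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * gain

x = np.array([[1.0, 2.0, 3.0, 4.0]])
print(rmsnorm(x, np.ones(4)))
```
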
## Sparsity and rank compression
- Compressing Pre-trained Language Models by Decomposition - vanilla SVD decomposition to reduce matrix sizes (sketched after this list)
- Language model compression with weighted low-rank factorization - Fisher information-weighted SVD
- Numerical Optimizations for Weighted Low-rank Estimation on Language Model - Iterative implementation of the above
- Weighted Low-Rank Approximation (2003)
- Transformers learn through gradual rank increase
- Pixelated Butterfly: Simple and Efficient Sparse Training for Neural Network Models
- Scatterbrain: Unifying Sparse and Low-rank Attention Approximation
- LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation
- LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression
- KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation
- TRP: Trained Rank Pruning for Efficient Deep Neural Networks - Introduces energy-pruning ratio
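
The truncated-SVD sketch referenced above (toy NumPy): replace a weight matrix W with two thin rank-r factors, which saves parameters whenever r * (out + in) < out * in. The weighted variants in this section change how the truncation error is measured, not this basic shape.

```python
# Truncated SVD: W (out, in) is approximated by A (out, r) @ B (r, in).
import numpy as np

def svd_compress(W, rank):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # absorb the singular values into A
    B = Vt[:rank, :]
    return A, B                         # the layer's forward pass becomes (x @ B.T) @ A.T

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 1024))
A, B = svd_compress(W, rank=64)
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(A.shape, B.shape, round(rel_err, 3))
```
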
## Fine-tuning
- LoRA: Low-Rank Adaptation of Large Language Models (sketched after this list)
- QLoRA: Efficient Finetuning of Quantized LLMs
- DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation - works over a range of ranks
- Full Parameter Fine-tuning for Large Language Models with Limited Resources
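
The LoRA sketch referenced above (toy NumPy, no training loop): the pretrained weight stays frozen while a low-rank update (alpha / r) * B @ A is learned on top; B starts at zero so the adapter is initially a no-op. Names follow common usage, not any particular library.

```python
# LoRA adapter around a frozen linear layer.
import numpy as np

class LoRALinear:
    def __init__(self, W, r=8, alpha=16, rng=None):
        rng = rng or np.random.default_rng(0)
        out_dim, in_dim = W.shape
        self.W = W                                       # frozen pretrained weight
        self.A = rng.normal(0, 0.01, size=(r, in_dim))   # trainable
        self.B = np.zeros((out_dim, r))                  # trainable, zero-initialized
        self.scale = alpha / r

    def __call__(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(np.random.default_rng(1).normal(size=(32, 64)))
print(layer(np.ones((4, 64))).shape)                     # (4, 32)
```
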
## Sampling
- Mirostat: A Neural Text Decoding Algorithm that Directly Controls Perplexity
- Stay on topic with Classifier-Free Guidance
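
A toy sketch of classifier-free guidance for next-token sampling, as I read the paper above: run the model with and without the conditioning text and mix the logits before sampling. `model` is a hypothetical callable returning next-token logits.

```python
# Classifier-free guidance over next-token logits; scale 1.0 recovers plain sampling.
import numpy as np

def cfg_sample(model, cond_ids, uncond_ids, guidance_scale=1.5, rng=None):
    rng = rng or np.random.default_rng(0)
    logits_cond = model(cond_ids)        # logits given the full prompt
    logits_uncond = model(uncond_ids)    # logits without the conditioning text
    logits = logits_uncond + guidance_scale * (logits_cond - logits_uncond)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Toy "model": fixed logits over a 5-token vocabulary.
toy = lambda ids: np.array([0.1, 0.2, 1.0, 0.3, 0.05]) + 0.01 * len(ids)
print(cfg_sample(toy, cond_ids=[1, 2, 3], uncond_ids=[3]))
```
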
## Scaling
- Efficiently Scaling Transformer Inference (Google Nov. 2022) - Pipeline and tensor parallelization for inference
- Megatron-LM (Nvidia Mar. 2020) - Intra-layer (tensor) parallelism for training
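
A single-process NumPy toy of the Megatron-style intra-layer split for a two-layer MLP: the first weight is partitioned by columns, the second by rows, each shard computes a partial result, and a plain sum stands in for the all-reduce. No real communication here, purely illustrative.

```python
# Simulated tensor parallelism: column-parallel W1, row-parallel W2, one "all-reduce".
import numpy as np

rng = np.random.default_rng(0)
d, d_ff, n_dev = 16, 64, 4
x = rng.normal(size=(2, d))
W1, W2 = rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, d))

relu = lambda a: np.maximum(a, 0.0)
reference = relu(x @ W1) @ W2                 # single-device result

W1_shards = np.split(W1, n_dev, axis=1)       # each "device" gets d_ff / n_dev columns
W2_shards = np.split(W2, n_dev, axis=0)       # and the matching rows of W2
partials = [relu(x @ w1) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]
parallel = sum(partials)                      # stands in for the all-reduce

print(np.allclose(reference, parallel))       # True
```
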
## Mixture of Experts
- Adaptive Mixtures of Local Experts (1991, remastered PDF)
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (Google 2017)
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (Google 2022)
- Go Wider Instead of Deeper
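
A toy top-1 (switch-style) routing sketch in NumPy: a learned gate sends each token to one expert, so only a fraction of the parameters is active per token. Real systems add load-balancing losses and capacity limits, which this skips.

```python
# Top-1 mixture-of-experts routing over a batch of token vectors.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(x, W_gate, experts):
    """x: (tokens, d); W_gate: (d, n_experts); experts: list of (d, d) matrices."""
    gate = softmax(x @ W_gate)                   # routing probabilities per token
    choice = gate.argmax(axis=-1)                # index of the chosen expert
    out = np.zeros_like(x)
    for e, W_e in enumerate(experts):
        routed = choice == e
        # weight each routed token's output by its gate probability
        out[routed] = (x[routed] @ W_e) * gate[routed, e:e + 1]
    return out

rng = np.random.default_rng(0)
d, n_exp = 16, 4
x = rng.normal(size=(8, d))
W_gate = rng.normal(size=(d, n_exp))
experts = [rng.normal(size=(d, d)) for _ in range(n_exp)]
print(moe_layer(x, W_gate, experts).shape)       # (8, 16)
```
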
## Watermarking

## More