
<div align='center'> <img src=https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg > <img src=https://img.shields.io/github/downloads/DefTruth/Awesome-LLM-Inference/total?color=ccf&label=downloads&logo=github&logoColor=lightgrey > <img src=https://img.shields.io/github/forks/DefTruth/Awesome-LLM-Inference.svg?style=social > <img src=https://img.shields.io/github/stars/DefTruth/Awesome-LLM-Inference.svg?style=social > <img src=https://img.shields.io/github/watchers/DefTruth/Awesome-LLM-Inference.svg?style=social > <img src=https://img.shields.io/badge/Release-v2.6-brightgreen.svg > <img src=https://img.shields.io/badge/License-GPLv3.0-turquoise.svg > </div>

📒Introduction

Awesome-LLM-Inference: A curated list of 📙Awesome LLM Inference Papers with Codes. For Awesome Diffusion Inference, please check 📖Awesome-Diffusion-Inference. For CUDA learning notes, please check 📖CUDA-Learn-Notes.

©️Citations

@misc{Awesome-LLM-Inference@2024,
  title={Awesome-LLM-Inference: A curated list of Awesome LLM Inference Papers with codes},
  url={https://github.com/DefTruth/Awesome-LLM-Inference},
  note={Open-source software available at https://github.com/DefTruth/Awesome-LLM-Inference},
  author={DefTruth, liyucheng09 etc},
  year={2024}
}

🎉Awesome LLM Inference Papers with Codes

Awesome LLM Inference for Beginners.pdf: 500 pages, FastServe, FlashAttention 1/2, FlexGen, FP8, LLM.int8(), PagedAttention, RoPE, SmoothQuant, WINT8/4, Continuous Batching, ZeroQuant 1/2/FP, AWQ etc.

<div align='center'> <img src=https://github.com/DefTruth/Awesome-LLM-Inference/assets/31974251/0ed77e9d-a1eb-4095-9a82-bad624964e55 > </div> <div id="paperlist"></div>

📖Contents

📖Trending LLM/VLM Topics (©️back👆🏻)

<div id="Trending-LLM-VLM-Topics"></div>

|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2024.04|🔥🔥🔥[Open-Sora] Open-Sora: Democratizing Efficient Video Production for All(@hpcaitech)|[docs]|[Open-Sora]|⭐️⭐️|
|2024.04|🔥🔥🔥[Open-Sora Plan] Open-Sora Plan: This project aim to reproduce Sora (Open AI T2V model)(@PKU)|[report]|[Open-Sora-Plan]|⭐️⭐️|
|2024.05|🔥🔥🔥[DeepSeek-V2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model(@DeepSeek-AI)|[pdf]|[DeepSeek-V2]|⭐️⭐️|
|2024.05|🔥🔥[YOCO] You Only Cache Once: Decoder-Decoder Architectures for Language Models(@Microsoft)|[pdf]|[unilm-YOCO]|⭐️⭐️|
|2024.06|🔥[Mooncake] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving(@Moonshot AI)|[pdf]|[Mooncake]|⭐️⭐️|
|2024.07|🔥🔥[FlashAttention-3] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision(@TriDao etc)|[pdf]|[flash-attention]|⭐️⭐️|
|2024.07|🔥🔥[MInference 1.0] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention(@Microsoft)|[pdf]|[MInference 1.0]|⭐️⭐️|
|2024.11|🔥🔥🔥[Star-Attention: 11x~ speedup] Star Attention: Efficient LLM Inference over Long Sequences(@NVIDIA)|[pdf]|[Star-Attention]|⭐️⭐️|

📖DP/MP/PP/TP/SP/CP Parallelism (©️back👆🏻)

<div id="DP-MP-PP-TP-SP-CP"></div>

|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2019.10|🔥🔥[MP: ZeRO] DeepSpeed-ZeRO: Memory Optimizations Toward Training Trillion Parameter Models(@microsoft.com)|[pdf]|[deepspeed]|⭐️⭐️|
|2020.05|🔥🔥[TP: Megatron-LM] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism(@NVIDIA)|[pdf]|[Megatron-LM]|⭐️⭐️|
|2022.05|🔥🔥[SP: Megatron-LM] Megatron-LM: Reducing Activation Recomputation in Large Transformer Models(@NVIDIA)|[pdf]|[Megatron-LM]|⭐️⭐️|
|2023.05|🔥🔥[SP: BPT] Blockwise Parallel Transformer for Large Context Models(@UC Berkeley)|[pdf]|[RingAttention]|⭐️⭐️|
|2023.10|🔥🔥[SP: Ring Attention] Ring Attention with Blockwise Transformers for Near-Infinite Context(@UC Berkeley)|[pdf]|[RingAttention]|⭐️⭐️|
|2023.11|🔥🔥[SP: STRIPED ATTENTION] STRIPED ATTENTION: FASTER RING ATTENTION FOR CAUSAL TRANSFORMERS(@MIT etc)|[pdf]|[striped_attention]|⭐️⭐️|
|2023.10|🔥🔥[SP: DEEPSPEED ULYSSES] DEEPSPEED ULYSSES: SYSTEM OPTIMIZATIONS FOR ENABLING TRAINING OF EXTREME LONG SEQUENCE TRANSFORMER MODELS(@microsoft.com)|[pdf]|[deepspeed]|⭐️⭐️|
|2024.03|🔥🔥[CP: Megatron-LM] Megatron-LM: Context parallelism overview(@NVIDIA)|[docs]|[Megatron-LM]|⭐️⭐️|
|2024.05|🔥🔥[SP: Unified Sequence Parallel (USP)] YunChang: A Unified Sequence Parallel (USP) Attention for Long Context LLM Model Training and Inference(@Tencent)|[pdf]|[long-context-attention]|⭐️⭐️|
|2024.11|🔥🔥[CP: Meta] Context Parallelism for Scalable Million-Token Inference(@Meta Platforms, Inc)|[pdf]|⚠️|⭐️⭐️|
|2024.11|🔥🔥[TP: Comm Compression] Communication Compression for Tensor Parallel LLM Inference(@recogni.com)|[pdf]|⚠️|⭐️⭐️|
|2024.11|🔥🔥🔥[SP: Star-Attention, 11x~ speedup] Star Attention: Efficient LLM Inference over Long Sequences(@NVIDIA)|[pdf]|[Star-Attention]|⭐️⭐️|

📖LLM Algorithmic/Eval Survey (©️back👆🏻)

<div id="LLM-Algorithmic-Eval-Survey"></div>

|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2023.10|[Evaluating] Evaluating Large Language Models: A Comprehensive Survey(@tju.edu.cn)|[pdf]|[Awesome-LLMs-Evaluation]|⭐️|
|2023.11|🔥[Runtime Performance] Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models(@hkust-gz.edu.cn)|[pdf]|⚠️|⭐️⭐️|
|2023.11|[ChatGPT Anniversary] ChatGPT’s One-year Anniversary: Are Open-Source Large Language Models Catching up?(@e.ntu.edu.sg)|[pdf]|⚠️|⭐️|
|2023.12|[Algorithmic Survey] The Efficiency Spectrum of Large Language Models: An Algorithmic Survey(@Microsoft)|[pdf]|⚠️|⭐️|
|2023.12|[Security and Privacy] A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly(@Drexel University)|[pdf]|⚠️|⭐️|
|2023.12|🔥[LLMCompass] A Hardware Evaluation Framework for Large Language Model Inference(@princeton.edu)|[pdf]|⚠️|⭐️⭐️|
|2023.12|🔥[Efficient LLMs] Efficient Large Language Models: A Survey(@Ohio State University etc)|[pdf]|[Efficient-LLMs-Survey]|⭐️⭐️|
|2023.12|[Serving Survey] Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems(@Carnegie Mellon University)|[pdf]|⚠️|⭐️⭐️|
|2024.01|[Understanding LLMs] Understanding LLMs: A Comprehensive Overview from Training to Inference(@Shaanxi Normal University etc)|[pdf]|⚠️|⭐️⭐️|
|2024.02|[LLM-Viewer] LLM Inference Unveiled: Survey and Roofline Model Insights(@Zhihang Yuan etc)|[pdf]|[LLM-Viewer]|⭐️⭐️|
|2024.07|[Internal Consistency & Self-Feedback] Internal Consistency and Self-Feedback in Large Language Models: A Survey|[pdf]|[ICSF-Survey]|⭐️⭐️|
|2024.09|[Low-bit] A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms(@Beihang etc)|[pdf]|⚠️|⭐️⭐️|
|2024.10|[LLM Inference] LARGE LANGUAGE MODEL INFERENCE ACCELERATION: A COMPREHENSIVE HARDWARE PERSPECTIVE(@SJTU etc)|[pdf]|⚠️|⭐️⭐️|

📖LLM Train/Inference Framework/Design (©️back👆🏻)

<div id="LLM-Train-Inference-Framework"></div>

|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2020.05|🔥[Megatron-LM] Training Multi-Billion Parameter Language Models Using Model Parallelism(@NVIDIA)|[pdf]|[Megatron-LM]|⭐️⭐️|
|2023.03|[FlexGen] High-Throughput Generative Inference of Large Language Models with a Single GPU(@Stanford University etc)|[pdf]|[FlexGen]|⭐️|
|2023.05|[SpecInfer] Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification(@Peking University etc)|[pdf]|[FlexFlow]|⭐️|
|2023.05|[FastServe] Fast Distributed Inference Serving for Large Language Models(@Peking University etc)|[pdf]|⚠️|⭐️|
|2023.09|🔥[vLLM] Efficient Memory Management for Large Language Model Serving with PagedAttention(@UC Berkeley etc)|[pdf]|[vllm]|⭐️⭐️|
|2023.09|[StreamingLLM] EFFICIENT STREAMING LANGUAGE MODELS WITH ATTENTION SINKS(@Meta AI etc)|[pdf]|[streaming-llm]|⭐️|
|2023.09|[Medusa] Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads(@Tianle Cai etc)|[blog]|[Medusa]|⭐️|
|2023.10|🔥[TensorRT-LLM] NVIDIA TensorRT LLM(@NVIDIA)|[docs]|[TensorRT-LLM]|⭐️⭐️|
|2023.11|🔥[DeepSpeed-FastGen 2x vLLM?] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference(@Microsoft)|[pdf]|[deepspeed-fastgen]|⭐️⭐️|
|2023.12|🔥[PETALS] Distributed Inference and Fine-tuning of Large Language Models Over The Internet(@HSE University etc)|[pdf]|[petals]|⭐️⭐️|
|2023.10|[LightSeq] LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers(@UC Berkeley etc)|[pdf]|[LightSeq]|⭐️|
|2023.12|[PowerInfer] PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU(@SJTU)|[pdf]|[PowerInfer]|⭐️|
|2024.01|[inferflow] INFERFLOW: AN EFFICIENT AND HIGHLY CONFIGURABLE INFERENCE ENGINE FOR LARGE LANGUAGE MODELS(@Tencent AI Lab)|[pdf]|[inferflow]|⭐️|
|2024.06|🔥[Mooncake] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving(@Moonshot AI)|[pdf]|[Mooncake]|⭐️⭐️|
|2023.06|🔥[LMDeploy] LMDeploy: LMDeploy is a toolkit for compressing, deploying, and serving LLMs(@InternLM)|[docs]|[lmdeploy]|⭐️⭐️|
|2023.05|🔥[MLC-LLM] Universal LLM Deployment Engine with ML Compilation(@mlc-ai)|[docs]|[mlc-llm]|⭐️⭐️|
|2023.08|🔥[LightLLM] LightLLM is a Python-based LLM (Large Language Model) inference and serving framework(@ModelTC)|[docs]|[lightllm]|⭐️⭐️|
|2023.03|🔥[llama.cpp] llama.cpp: Inference of Meta's LLaMA model (and others) in pure C/C++(@ggerganov)|[docs]|[llama.cpp]|⭐️⭐️|
|2024.02|🔥[flashinfer] FlashInfer: Kernel Library for LLM Serving(@flashinfer-ai)|[docs]|[flashinfer]|⭐️⭐️|
|2024.07|🔥[DynamoLLM] DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency(@Microsoft Azure Research)|[pdf]|⚠️|⭐️|
|2024.08|🔥[NanoFlow] NanoFlow: Towards Optimal Large Language Model Serving Throughput(@University of Washington)|[pdf]|[Nanoflow]|⭐️⭐️|
|2024.08|🔥[Decentralized LLM] Decentralized LLM Inference over Edge Networks with Energy Harvesting(@Padova)|[pdf]|⚠️|⭐️|
|2024.11|🔥[SparseInfer] SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference(@University of Seoul, etc)|[pdf]|⚠️|⭐️|

📖Continuous/In-flight Batching (©️back👆🏻)

<div id="Continuous-In-flight-Batching"></div>

|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2022.07|🔥[Continuous Batching] Orca: A Distributed Serving System for Transformer-Based Generative Models(@Seoul National University etc)|[pdf]|⚠️|⭐️⭐️|
|2023.10|🔥[In-flight Batching] NVIDIA TensorRT LLM Batch Manager(@NVIDIA)|[docs]|[TensorRT-LLM]|⭐️⭐️|
|2023.11|🔥[DeepSpeed-FastGen 2x vLLM?] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference(@Microsoft)|[blog]|[deepspeed-fastgen]|⭐️⭐️|
|2023.11|[Splitwise] Splitwise: Efficient Generative LLM Inference Using Phase Splitting(@Microsoft etc)|[pdf]|⚠️|⭐️|
|2023.12|[SpotServe] SpotServe: Serving Generative Large Language Models on Preemptible Instances(@cmu.edu etc)|[pdf]|[SpotServe]|⭐️|
|2023.10|[LightSeq] LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers(@UC Berkeley etc)|[pdf]|[LightSeq]|⭐️|
|2024.05|🔥[vAttention] vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention(@Microsoft Research India)|[pdf]|[vAttention]|⭐️⭐️|
|2024.07|🔥🔥[vTensor] vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving(@Shanghai Jiao Tong University etc)|[pdf]|[vTensor]|⭐️⭐️|
|2024.08|🔥[Automatic Inference Engine Tuning] Towards SLO-Optimized LLM Serving via Automatic Inference Engine Tuning(@Nanjing University etc)|[pdf]|⚠️|⭐️⭐️|
|2024.08|🔥[SJF Scheduling] Efficient LLM Scheduling by Learning to Rank(@UCSD etc)|[pdf]|⚠️|⭐️⭐️|
|2024.12|🔥[BatchLLM] BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching(@Microsoft)|[pdf]|⚠️|⭐️⭐️|

📖Weight/Activation Quantize/Compress (©️back👆🏻)

<div id="Weight-Activation-Quantize-Compress"></div>

|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2022.06|🔥[ZeroQuant] Efficient and Affordable Post-Training Quantization for Large-Scale Transformers(@Microsoft)|[pdf]|[DeepSpeed]|⭐️⭐️|
|2022.08|[FP8-Quantization] FP8 Quantization: The Power of the Exponent(@Qualcomm AI Research)|[pdf]|[FP8-quantization]|⭐️|
|2022.08|[LLM.int8()] 8-bit Matrix Multiplication for Transformers at Scale(@Facebook AI Research etc)|[pdf]|[bitsandbytes]|⭐️|
|2022.10|🔥[GPTQ] GPTQ: ACCURATE POST-TRAINING QUANTIZATION FOR GENERATIVE PRE-TRAINED TRANSFORMERS(@IST Austria etc)|[pdf]|[gptq]|⭐️⭐️|
|2022.11|🔥[WINT8/4] Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production(@NVIDIA&Microsoft)|[pdf]|[FasterTransformer]|⭐️⭐️|
|2022.11|🔥[SmoothQuant] Accurate and Efficient Post-Training Quantization for Large Language Models(@MIT etc)|[pdf]|[smoothquant]|⭐️⭐️|
|2023.03|[ZeroQuant-V2] Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation(@Microsoft)|[pdf]|[DeepSpeed]|⭐️|
|2023.06|🔥[AWQ] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration(@MIT etc)|[pdf]|[llm-awq]|⭐️⭐️|
|2023.06|[SpQR] SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression(@University of Washington etc)|[pdf]|[SpQR]|⭐️|
|2023.06|[SqueezeLLM] SQUEEZELLM: DENSE-AND-SPARSE QUANTIZATION(@berkeley.edu)|[pdf]|[SqueezeLLM]|⭐️|
|2023.07|[ZeroQuant-FP] A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats(@Microsoft)|[pdf]|[DeepSpeed]|⭐️|
|2023.09|[KV Cache FP8 + WINT4] Exploration on LLM inference performance optimization(@HPC4AI)|[blog]|⚠️|⭐️|
|2023.10|[FP8-LM] FP8-LM: Training FP8 Large Language Models(@Microsoft etc)|[pdf]|[MS-AMP]|⭐️|
|2023.10|[LLM-Shearing] SHEARED LLAMA: ACCELERATING LANGUAGE MODEL PRE-TRAINING VIA STRUCTURED PRUNING(@cs.princeton.edu etc)|[pdf]|[LLM-Shearing]|⭐️|
|2023.10|[LLM-FP4] LLM-FP4: 4-Bit Floating-Point Quantized Transformers(@ust.hk&meta etc)|[pdf]|[LLM-FP4]|⭐️|
|2023.11|[2-bit LLM] Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization(@Shanghai Jiao Tong University etc)|[pdf]|⚠️|⭐️|
|2023.12|[SmoothQuant+] SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM(@ZTE Corporation)|[pdf]|[smoothquantplus]|⭐️|
|2023.11|[OdysseyLLM W4A8] A Speed Odyssey for Deployable Quantization of LLMs(@meituan.com)|[pdf]|⚠️|⭐️|
|2023.12|🔥[SparQ] SPARQ ATTENTION: BANDWIDTH-EFFICIENT LLM INFERENCE(@graphcore.ai)|[pdf]|⚠️|⭐️⭐️|
|2023.12|[Agile-Quant] Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge(@Northeastern University&Oracle)|[pdf]|⚠️|⭐️|
|2023.12|[CBQ] CBQ: Cross-Block Quantization for Large Language Models(@ustc.edu.cn)|[pdf]|⚠️|⭐️|
|2023.10|[QLLM] QLLM: ACCURATE AND EFFICIENT LOW-BITWIDTH QUANTIZATION FOR LARGE LANGUAGE MODELS(@ZIP Lab&SenseTime Research etc)|[pdf]|⚠️|⭐️|
|2024.01|[FP6-LLM] FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design(@Microsoft etc)|[pdf]|⚠️|⭐️|
|2024.05|🔥🔥[W4A8KV4] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving(@MIT&NVIDIA)|[pdf]|[qserve]|⭐️⭐️|
|2024.05|🔥[SpinQuant] SpinQuant: LLM Quantization with Learned Rotations(@Meta)|[pdf]|⚠️|⭐️|
|2024.05|🔥[I-LLM] I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models(@Houmo AI)|[pdf]|⚠️|⭐️|
|2024.06|🔥[OutlierTune] OutlierTune: Efficient Channel-Wise Quantization for Large Language Models(@Beijing University)|[pdf]|⚠️|⭐️|
|2024.06|🔥[GPTQT] GPTQT: Quantize Large Language Models Twice to Push the Efficiency(@zju)|[pdf]|⚠️|⭐️|
|2024.08|🔥[ABQ-LLM] ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models(@ByteDance)|[pdf]|[ABQ-LLM]|⭐️|
|2024.08|🔥[1-bit LLMs] Matmul or No Matmal in the Era of 1-bit LLMs(@University of South Carolina)|[pdf]|⚠️|⭐️|
|2024.08|🔥[ACTIVATION SPARSITY] TRAINING-FREE ACTIVATION SPARSITY IN LARGE LANGUAGE MODELS(@MIT etc)|[pdf]|[TEAL]|⭐️|
|2024.09|🔥[VPTQ] VPTQ: EXTREME LOW-BIT VECTOR POST-TRAINING QUANTIZATION FOR LARGE LANGUAGE MODELS(@Microsoft)|[pdf]|[VPTQ]|⭐️|
|2024.11|🔥[BitNet] BitNet a4.8: 4-bit Activations for 1-bit LLMs(@Microsoft)|[pdf]|[bitnet]|⭐️|
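
For readers new to this section, the sketch below shows the round-to-nearest baseline that many weight-only schemes (for example the WINT8 entries) are usually compared against: symmetric absmax quantization of one weight row to int8. It is a minimal illustration in plain Python, not the method of any paper listed above; the function names `quantize_row_int8` and `dequantize_row` are hypothetical.

```python
from typing import List, Tuple


def quantize_row_int8(row: List[float]) -> Tuple[List[int], float]:
    """Symmetric absmax quantization of one weight row to the int8 range."""
    absmax = max((abs(w) for w in row), default=0.0)
    # One scale per row; fall back to 1.0 for an all-zero row to avoid 0/0.
    scale = absmax / 127.0 if absmax > 0 else 1.0
    # Round to nearest integer and clamp to [-127, 127].
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q: List[int], scale: float) -> List[float]:
    """Map int8 codes back to approximate floating-point weights."""
    return [v * scale for v in q]


weights = [0.02, -1.27, 0.64, 0.0]
q, scale = quantize_row_int8(weights)
restored = dequantize_row(q, scale)
# Rounding to nearest bounds the per-weight error by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

The methods in the table differ mainly in how they choose scales and groups, handle outliers, or use calibration data to reduce this rounding error.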

📖IO/FLOPs-Aware/Sparse Attention (©️back👆🏻)

<div id="IO-FLOPs-Aware-Attention-Sparse"></div>

|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2018.05|[Online Softmax] Online normalizer calculation for softmax(@NVIDIA)|[pdf]|⚠️|⭐️|
|2019.11|🔥[MQA] Fast Transformer Decoding: One Write-Head is All You Need(@Google)|[pdf]|⚠️|⭐️⭐️|
|2020.10|[Hash Attention] REFORMER: THE EFFICIENT TRANSFORMER(@Google)|[pdf]|[reformer]|⭐️⭐️|
|2022.05|🔥[FlashAttention] Fast and Memory-Efficient Exact Attention with IO-Awareness(@Stanford University etc)|[pdf]|[flash-attention]|⭐️⭐️|
|2022.10|[Online Softmax] SELF-ATTENTION DOES NOT NEED O(n^2) MEMORY(@Google)|[pdf]|⚠️|⭐️|
|2023.05|[FlashAttention] From Online Softmax to FlashAttention(@cs.washington.edu)|[pdf]|⚠️|⭐️⭐️|
|2023.05|[FLOP, I/O] Dissecting Batching Effects in GPT Inference(@Lequn Chen)|[blog]|⚠️|⭐️|
|2023.05|🔥🔥[GQA] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints(@Google)|[pdf]|[flaxformer]|⭐️⭐️|
|2023.06|[Sparse FlashAttention] Faster Causal Attention Over Large Sequences Through Sparse Flash Attention(@EPFL etc)|[pdf]|[dynamic-sparse-flash-attention]|⭐️|
|2023.07|🔥[FlashAttention-2] Faster Attention with Better Parallelism and Work Partitioning(@Stanford University etc)|[pdf]|[flash-attention]|⭐️⭐️|
|2023.10|🔥[Flash-Decoding] Flash-Decoding for long-context inference(@Stanford University etc)|[blog]|[flash-attention]|⭐️⭐️|
|2023.11|[Flash-Decoding++] FLASHDECODING++: FASTER LARGE LANGUAGE MODEL INFERENCE ON GPUS(@Tsinghua University&Infinigence-AI)|[pdf]|⚠️|⭐️|
|2023.01|[SparseGPT] SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot(@ISTA etc)|[pdf]|[sparsegpt]|⭐️|
|2023.12|🔥[GLA] Gated Linear Attention Transformers with Hardware-Efficient Training(@MIT-IBM Watson AI)|[pdf]|gated_linear_attention|⭐️⭐️|
|2023.12|[SCCA] SCCA: Shifted Cross Chunk Attention for long contextual semantic expansion(@Beihang University)|[pdf]|⚠️|⭐️|
|2023.12|🔥[FlashLLM] LLM in a flash: Efficient Large Language Model Inference with Limited Memory(@Apple)|[pdf]|⚠️|⭐️⭐️|
|2024.03|🔥🔥[CHAI] CHAI: Clustered Head Attention for Efficient LLM Inference(@cs.wisc.edu etc)|[pdf]|⚠️|⭐️⭐️|
|2024.04|🔥🔥[DeFT] DeFT: Decoding with Flash Tree-Attention for Efficient Tree-structured LLM Inference(@Westlake University etc)|[pdf]|⚠️|⭐️⭐️|
|2024.04|[MoA] MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression(@thu et al.)|[pdf]|[MoA]|⭐️|
|2024.07|🔥🔥[FlashAttention-3] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision(@TriDao etc)|[pdf]|[flash-attention]|⭐️⭐️|
|2024.07|🔥🔥[MInference 1.0] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention(@Microsoft)|[pdf]|[MInference 1.0]|⭐️⭐️|
|2024.07|🔥🔥[Shared Attention] Beyond KV Caching: Shared Attention for Efficient LLMs(@Kyushu University etc)|[pdf]|[shareAtt]|⭐️|
|2024.09|🔥🔥[CHESS] CHESS : Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification(@Wuhan University)|[pdf]|⚠️|⭐️⭐️|
|2024.09|🔥🔥[INT-FLASHATTENTION] INT-FLASHATTENTION: ENABLING FLASH ATTENTION FOR INT8 QUANTIZATION(@PKU etc)|[pdf]|[INT-FlashAttention]|⭐️|
|2024.10|🔥🔥[SageAttention] SAGEATTENTION: ACCURATE 8-BIT ATTENTION FOR PLUG-AND-PLAY INFERENCE ACCELERATION(@thu-ml)|[pdf]|[SageAttention]|⭐️⭐️|
|2024.11|🔥🔥[SageAttention-2] SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration(@thu-ml)|[pdf]|[SageAttention]|⭐️⭐️|
|2024.11|🔥🔥[Squeezed Attention] SQUEEZED ATTENTION: Accelerating Long Context Length LLM Inference(@UC Berkeley)|[pdf]|[SqueezedAttention]|⭐️⭐️|
|2024.12|🔥🔥[TurboAttention] TURBOATTENTION: EFFICIENT ATTENTION APPROXIMATION FOR HIGH THROUGHPUTS LLMS(@Microsoft)|[pdf]|⚠️|⭐️⭐️|
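
Several entries in this section (Online Softmax, the FlashAttention family) rest on one idea: the softmax normalizer can be accumulated in a single pass by rescaling the running sum whenever a new maximum appears. The sketch below illustrates that recurrence in plain Python for a single vector of finite scores. It is a simplified illustration, not a reproduction of any listed kernel; the name `online_softmax` is hypothetical.

```python
import math
from typing import List


def online_softmax(xs: List[float]) -> List[float]:
    """Softmax with the max and the normalizer computed in one pass."""
    m = float("-inf")  # running maximum seen so far
    d = 0.0            # running sum of exp(x - m) for the elements seen so far
    for x in xs:
        m_new = max(m, x)
        # Rescale the old sum to the new maximum, then add the new term.
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    # Second loop only normalizes; no extra pass is needed to find the max.
    return [math.exp(x - m) / d for x in xs]


probs = online_softmax([1.0, 2.0, 3.0])
# Large scores do not overflow because every exponent is <= 0.
stable = online_softmax([1000.0, 1001.0])
```

Blockwise attention kernels apply the same rescaling to partial output tiles, which is why they never have to materialize the full attention matrix.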

📖KV Cache Scheduling/Quantize/Dropping (©️back👆🏻)

<div id="KV-Cache-Scheduling-Quantize-Dropping"></div>

|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2019.11|🔥[MQA] Fast Transformer Decoding: One Write-Head is All You Need(@Google)|[pdf]|⚠️|⭐️⭐️|
|2022.06|[LTP] Learned Token Pruning for Transformers(@UC Berkeley etc)|[pdf]|[LTP]|⭐️|
|2023.05|🔥🔥[GQA] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints(@Google)|[pdf]|[flaxformer]|⭐️⭐️|
|2023.05|[KV Cache Compress] Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time(@)|[pdf]|⚠️|⭐️⭐️|
|2023.06|[H2O] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models(@Rice University etc)|[pdf]|[H2O]|⭐️|
|2023.06|[QK-Sparse/Dropping Attention] Faster Causal Attention Over Large Sequences Through Sparse Flash Attention(@EPFL etc)|[pdf]|[dynamic-sparse-flash-attention]|⭐️|
|2023.08|🔥🔥[Chunked Prefills] SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills(@Microsoft etc)|[pdf]|⚠️|⭐️⭐️|
|2023.09|🔥🔥[PagedAttention] Efficient Memory Management for Large Language Model Serving with PagedAttention(@UC Berkeley etc)|[pdf]|[vllm]|⭐️⭐️|
|2023.09|[KV Cache FP8 + WINT4] Exploration on LLM inference performance optimization(@HPC4AI)|[blog]|⚠️|⭐️|
|2023.10|🔥[TensorRT-LLM KV Cache FP8] NVIDIA TensorRT LLM(@NVIDIA)|[docs]|[TensorRT-LLM]|⭐️⭐️|
|2023.10|🔥[Adaptive KV Cache Compress] MODEL TELLS YOU WHAT TO DISCARD: ADAPTIVE KV CACHE COMPRESSION FOR LLMS(@illinois.edu&microsoft)|[pdf]|⚠️|⭐️⭐️|
|2023.10|[CacheGen] CacheGen: Fast Context Loading for Language Model Applications(@Chicago University&Microsoft)|[pdf]|[LMCache]|⭐️|
|2023.12|[KV-Cache Optimizations] Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO(@Haim Barad etc)|[pdf]|⚠️|⭐️|
|2023.12|[KV Cache Compress with LoRA] Compressed Context Memory for Online Language Model Interaction (@SNU & NAVER AI)|[pdf]|[Compressed-Context-Memory]|⭐️⭐️|
|2023.12|🔥🔥[RadixAttention] Efficiently Programming Large Language Models using SGLang(@Stanford University etc)|[pdf]|[sglang]|⭐️⭐️|
|2024.01|🔥🔥[DistKV-LLM] Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache(@Alibaba etc)|[pdf]|⚠️|⭐️⭐️|
|2024.02|🔥🔥[Prompt Caching] Efficient Prompt Caching via Embedding Similarity(@UC Berkeley)|[pdf]|⚠️|⭐️⭐️|
|2024.02|🔥🔥[Less] Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference(@CMU etc)|[pdf]|⚠️|⭐️|
|2024.02|🔥🔥[MiKV] No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization(@KAIST)|[pdf]|⚠️|⭐️|
|2024.02|🔥🔥[Shared Prefixes] Hydragen: High-Throughput LLM Inference with Shared Prefixes|[pdf]|⚠️|⭐️⭐️|
|2024.02|🔥🔥[ChunkAttention] ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition(@microsoft.com)|[pdf]|[chunk-attention]|⭐️⭐️|
|2024.03|🔥[QAQ] QAQ: Quality Adaptive Quantization for LLM KV Cache(@smail.nju.edu.cn)|[pdf]|[QAQ-KVCacheQuantization]|⭐️⭐️|
|2024.03|🔥🔥[DMC] Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference(@NVIDIA etc)|[pdf]|⚠️|⭐️⭐️|
|2024.03|🔥🔥[Keyformer] Keyformer: KV Cache reduction through key tokens selection for Efficient Generative Inference(@ece.ubc.ca etc)|[pdf]|[Keyformer]|⭐️⭐️|
|2024.03|[FASTDECODE] FASTDECODE: High-Throughput GPU-Efficient LLM Serving using Heterogeneous(@Tsinghua University)|[pdf]|⚠️|⭐️⭐️|
|2024.03|[Sparsity-Aware KV Caching] ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching(@ucf.edu)|[pdf]|⚠️|⭐️⭐️|
|2024.03|🔥[GEAR] GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM(@gatech.edu)|[pdf]|[GEAR]|⭐️|
|2024.04|[SqueezeAttention] SQUEEZEATTENTION: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget(@lzu.edu.cn etc)|[pdf]|[SqueezeAttention]|⭐️⭐️|
|2024.04|[SnapKV] SnapKV: LLM Knows What You are Looking for Before Generation(@UIUC)|[pdf]|[SnapKV]|⭐️|
|2024.05|🔥[vAttention] vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention(@Microsoft Research India)|[pdf]|[vAttention]|⭐️⭐️|
|2024.05|🔥[KVCache-1Bit] KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization(@Rice University)|[pdf]|⚠️|⭐️⭐️|
|2024.05|🔥[KV-Runahead] KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation(@Apple etc)|[pdf]|⚠️|⭐️⭐️|
|2024.05|🔥[ZipCache] ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification(@Zhejiang University etc)|[pdf]|⚠️|⭐️⭐️|
|2024.05|🔥[MiniCache] MiniCache: KV Cache Compression in Depth Dimension for Large Language Models(@ZIP Lab)|[pdf]|⚠️|⭐️⭐️|
|2024.05|🔥[CacheBlend] CacheBlend: Fast Large Language Model Serving with Cached Knowledge Fusion(@University of Chicago)|[pdf]|[LMCache]|⭐️⭐️|
|2024.06|🔥[CompressKV] Effectively Compress KV Heads for LLM(@alibaba etc)|[pdf]|⚠️|⭐️⭐️|
|2024.06|🔥[MemServe] MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool(@Huawei Cloud etc)|[pdf]|⚠️|⭐️⭐️|
|2024.07|🔥[MLKV] MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding(@Institut Teknologi Bandung)|[pdf]|[pythia-mlkv]|⭐️|
|2024.07|🔥[ThinK] ThinK: Thinner Key Cache by Query-Driven Pruning(@Salesforce AI Research etc)|[pdf]|⚠️|⭐️⭐️|
|2024.07|🔥[Palu] Palu: Compressing KV-Cache with Low-Rank Projection(@nycu.edu.tw)|[pdf]|[Palu]|⭐️⭐️|
|2024.08|🔥[Zero-Delay QKV Compression] Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference(@University of Virginia)|[pdf]|⚠️|⭐️⭐️|
|2024.09|🔥[AlignedKV] AlignedKV: Reducing Memory Access of KV-Cache with Precision-Aligned Quantization(@Tsinghua University)|[pdf]|[AlignedKV]|⭐️|
|2024.10|🔥[LayerKV] Optimizing Large Language Model Serving with Layer-wise KV Cache Management(@Ant Group)|[pdf]|⚠️|⭐️⭐️|
|2024.10|🔥[AdaKV] Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference (@USTC)|[pdf]|[AdaKV]|⭐️⭐️|
|2024.11|🔥[KV Cache Recomputation] Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation(@University of Southern California)|[pdf]|⚠️|⭐️⭐️|
|2024.12|🔥[ClusterKV] ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression(@sjtu)|[pdf]|⚠️|⭐️⭐️|
|2024.12|🔥[DynamicKV] DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs(@xiabinzhou0625 etc)|[pdf]|⚠️|⭐️⭐️|

📖Prompt/Context/KV Compression (©️back👆🏻)

<div id="Context-Compression"></div>

|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2023.04|🔥[Selective-Context] Compressing Context to Enhance Inference Efficiency of Large Language Models(@Surrey)|[pdf]|Selective-Context|⭐️⭐️|
|2023.05|[AutoCompressor] Adapting Language Models to Compress Contexts(@Princeton)|[pdf]|AutoCompressor|⭐️|
|2023.10|🔥[LLMLingua] LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models(@Microsoft)|[pdf]|LLMLingua|⭐️⭐️|
|2023.10|🔥🔥[LongLLMLingua] LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression(@Microsoft)|[pdf]|LLMLingua|⭐️⭐️|
|2024.03|🔥[LLMLingua-2] LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression(@Microsoft)|[pdf]|LLMLingua series|⭐️|
|2024.08|🔥🔥[500xCompressor] 500xCompressor: Generalized Prompt Compression for Large Language Models(@University of Cambridge)|[pdf]|⚠️|⭐️⭐️|
|2024.08|🔥🔥[Eigen Attention] Eigen Attention: Attention in Low-Rank Space for KV Cache Compression(@purdue.edu)|[pdf]|⚠️|⭐️⭐️|
|2024.09|🔥🔥[Prompt Compression] Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference(@Alterra AI)|[pdf]|⚠️|⭐️⭐️|
|2024.09|🔥🔥[Context Distillation] Efficient LLM Context Distillation(@gatech.edu)|[pdf]|⚠️|⭐️⭐️|
|2024.09|🔥🔥[CRITIPREFILL] CRITIPREFILL: A SEGMENT-WISE CRITICALITY-BASED APPROACH FOR PREFILLING ACCELERATION IN LLMS(@OPPO)|[pdf]|CritiPrefill|⭐️|
|2024.10|🔥🔥[KV-COMPRESS] PAGED KV-CACHE COMPRESSION WITH VARIABLE COMPRESSION RATES PER ATTENTION HEAD(@Cloudflare, inc.)|[pdf]|vllm-kvcompress|⭐️⭐️|
|2024.10|🔥🔥[LORC] Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy(@gatech.edu)|[pdf]|⚠️|⭐️⭐️|

📖Long Context Attention/KV Cache Optimization (©️back👆🏻)

<div id="Long-Context-Attention-KVCache"></div>

|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2023.05|🔥🔥[Blockwise Attention] Blockwise Parallel Transformer for Large Context Models(@UC Berkeley)|[pdf]|⚠️|⭐️⭐️|
|2023.05|🔥[Landmark Attention] Random-Access Infinite Context Length for Transformers(@epfl.ch)|[pdf]|landmark-attention|⭐️⭐️|
|2023.07|🔥[LightningAttention-1] TRANSNORMERLLM: A FASTER AND BETTER LARGE LANGUAGE MODEL WITH IMPROVED TRANSNORMER(@OpenNLPLab)|[pdf]|TransnormerLLM|⭐️⭐️|
|2023.07|🔥[LightningAttention-2] Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models(@OpenNLPLab)|[pdf]|lightning-attention|⭐️⭐️|
|2023.10|🔥🔥[RingAttention] Ring Attention with Blockwise Transformers for Near-Infinite Context(@UC Berkeley)|[pdf]|[RingAttention]|⭐️⭐️|
|2023.11|🔥[HyperAttention] HyperAttention: Long-context Attention in Near-Linear Time(@yale&Google)|[pdf]|hyper-attn|⭐️⭐️|
|2023.11|[Streaming Attention] One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space(@Adobe Research etc)|[pdf]|⚠️|⭐️|
|2023.11|🔥[Prompt Cache] PROMPT CACHE: MODULAR ATTENTION REUSE FOR LOW-LATENCY INFERENCE(@Yale University etc)|[pdf]|⚠️|⭐️⭐️|
|2023.11|🔥🔥[StripedAttention] STRIPED ATTENTION: FASTER RING ATTENTION FOR CAUSAL TRANSFORMERS(@MIT etc)|[pdf]|[striped_attention]|⭐️⭐️|
|2024.01|🔥🔥[KVQuant] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization(@UC Berkeley)|[pdf]|[KVQuant]|⭐️⭐️|
|2024.02|🔥[RelayAttention] RelayAttention for Efficient Large Language Model Serving with Long System Prompts(@sensetime.com etc)|[pdf]|⚠️|⭐️⭐️|
|2024.04|🔥🔥[Infini-attention] Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention(@Google)|[pdf]|⚠️|⭐️⭐️|
|2024.04|🔥🔥[RAGCache] RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation(@Peking University&ByteDance Inc)|[pdf]|⚠️|⭐️⭐️|
|2024.04|🔥🔥[KCache] EFFICIENT LLM INFERENCE WITH KCACHE(@Qiaozhi He, Zhihua Wu)|[pdf]|⚠️|⭐️⭐️|
|2024.05|🔥🔥[YOCO] You Only Cache Once: Decoder-Decoder Architectures for Language Models(@Microsoft)|[pdf]|[unilm-YOCO]|⭐️⭐️|
|2024.05|🔥🔥[SKVQ] SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models(@Shanghai AI Laboratory)|[pdf]|⚠️|⭐️⭐️|
|2024.05|🔥🔥[CLA] Reducing Transformer Key-Value Cache Size with Cross-Layer Attention(@MIT-IBM)|[pdf]|⚠️|⭐️⭐️|
|2024.06|🔥[LOOK-M] LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference(@osu.edu etc)|[pdf]|[LOOK-M]|⭐️⭐️|
|2024.06|🔥🔥[MInference] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention(@Microsoft etc)|[pdf]|[MInference]|⭐️⭐️|
|2024.06|🔥🔥[InfiniGen] InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management(@snu)|[pdf]|⚠️|⭐️⭐️|
|2024.06|🔥🔥[Quest] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference(@mit-han-lab etc)|[pdf]|[Quest]|⭐️⭐️|
|2024.07|🔥[PQCache] PQCache: Product Quantization-based KVCache for Long Context LLM Inference(@PKU etc)|[pdf]|⚠️|⭐️⭐️|
|2024.08|🔥[SentenceVAE] SentenceVAE: Faster, Longer and More Accurate Inference with Next-sentence Prediction for Large Language Models(@TeleAI)|[pdf]|⚠️|⭐️⭐️|
|2024.09|🔥[InstInfer] InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference(@PKU etc)|[pdf]|⚠️|⭐️⭐️|
|2024.09|🔥[RetrievalAttention] RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval(@microsoft.com)|[pdf]|⚠️|⭐️⭐️|
|2024.10|🔥[ShadowKV] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference(@CMU & bytedance)|[pdf]|[ShadowKV]|⭐️⭐️|

📖Early-Exit/Intermediate Layer Decoding (©️back👆🏻)

<div id="Early-Exit"></div>

|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2020.04|[DeeBERT] DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference(@uwaterloo.ca)|[pdf]|⚠️|⭐️|
|2020.04|[FastBERT] FastBERT: a Self-distilling BERT with Adaptive Inference Time(@PKU)|[pdf]|[FastBERT]|⭐️|
|2021.06|[BERxiT] BERxiT: Early Exiting for BERT with Better Fine-Tuning and Extension to Regression(@uwaterloo.ca)|[pdf]|[berxit]|⭐️|
|2023.06|🔥[SkipDecode] SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference(@Microsoft)|[pdf]|⚠️|⭐️|
|2023.10|🔥[LITE] Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE(@Arizona State University)|[pdf]|⚠️|⭐️⭐️|
|2023.12|🔥🔥[EE-LLM] EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism(@alibaba-inc.com)|[pdf]|[EE-LLM]|⭐️⭐️|
|2023.10|🔥[FREE] Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding(@KAIST AI&AWS AI)|[pdf]|[fast_robust_early_exit]|⭐️⭐️|
|2024.02|🔥[EE-Tuning] EE-Tuning: An Economical yet Scalable Solution for Tuning Early-Exit Large Language Models(@alibaba-inc.com)|[pdf]|[EE-Tuning]|⭐️⭐️|
|2024.07|[Skip Attention] Attention Is All You Need But You Don’t Need All Of It For Inference of Large Language Models(@University College London)|[pdf]|⚠️|⭐️⭐️|
|2024.08|[KOALA] KOALA: Enhancing Speculative Decoding for LLM via Multi-Layer Draft Heads with Adversarial Learning(@Dalian University)|[pdf]|⚠️|⭐️⭐️|

📖Parallel Decoding/Sampling (©️back👆🏻)

<div id="Parallel-Decoding-Sampling"></div>
|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2018.11|🔥[Parallel Decoding] Blockwise Parallel Decoding for Deep Autoregressive Models(@Berkeley&Google)|[pdf]|⚠️|⭐️⭐️|
|2023.02|🔥[Speculative Sampling] Accelerating Large Language Model Decoding with Speculative Sampling(@DeepMind)|[pdf]|⚠️|⭐️⭐️|
|2023.05|🔥[Speculative Sampling] Fast Inference from Transformers via Speculative Decoding(@Google Research etc)|[pdf]|[LLMSpeculativeSampling]|⭐️⭐️|
|2023.09|🔥[Medusa] Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads(@Tianle Cai etc)|[pdf]|[Medusa]|⭐️⭐️|
|2023.10|[OSD] Online Speculative Decoding(@UC Berkeley etc)|[pdf]|⚠️|⭐️⭐️|
|2023.12|[Cascade Speculative] Cascade Speculative Drafting for Even Faster LLM Inference(@illinois.edu)|[pdf]|⚠️|⭐️|
|2024.02|🔥[LookaheadDecoding] Break the Sequential Dependency of LLM Inference Using LOOKAHEAD DECODING(@UCSD&Google&UC Berkeley)|[pdf]|[LookaheadDecoding]|⭐️⭐️|
|2024.02|🔥🔥[Speculative Decoding] Decoding Speculative Decoding(@cs.wisc.edu)|[pdf]|[Decoding Speculative Decoding]|⭐️|
|2024.04|🔥🔥[TriForce] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding(@cmu.edu&Meta AI)|[pdf]|[TriForce]|⭐️⭐️|
|2024.04|🔥🔥[Hidden Transfer] Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration(@pku.edu.cn etc)|[pdf]|⚠️|⭐️|
|2024.05|🔥[Instructive Decoding] INSTRUCTIVE DECODING: INSTRUCTION-TUNED LARGE LANGUAGE MODELS ARE SELF-REFINER FROM NOISY INSTRUCTIONS(@KAIST AI)|[pdf]|[Instructive-Decoding]|⭐️|
|2024.05|🔥[S3D] S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs(@lge.com)|[pdf]|⚠️|⭐️|
|2024.06|🔥[Parallel Decoding] Exploring and Improving Drafts in Blockwise Parallel Decoding(@KAIST&Google Research)|[pdf]|⚠️|⭐️⭐️|
|2024.07|🔥[Multi-Token Speculative Decoding] Multi-Token Joint Speculative Decoding for Accelerating Large Language Model Inference(@University of California, etc)|[pdf]|⚠️|⭐️⭐️|
|2024.08|🔥[Token Recycling] Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling(@ir.hit.edu.cn etc)|[pdf]|⚠️|⭐️⭐️|
|2024.08|🔥[Speculative Decoding] Parallel Speculative Decoding with Adaptive Draft Length(@USTC etc)|[pdf]|[PEARL]|⭐️⭐️|
|2024.08|🔥[FocusLLM] FocusLLM: Scaling LLM’s Context by Parallel Decoding(@Tsinghua University etc)|[pdf]|[FocusLLM]|⭐️|
|2024.08|🔥[MagicDec] MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding(@CMU etc)|[pdf]|[MagicDec]|⭐️|
|2024.08|🔥[Speculative Decoding] Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation(@BIT)|[pdf]|⚠️|⭐️⭐️|
|2024.09|🔥[Hybrid Inference] Efficient Hybrid Inference for LLMs: Reward-Based Token Modelling with Selective Cloud Assistance|[pdf]|⚠️|⭐️⭐️|
|2024.10|🔥[PARALLELSPEC] PARALLELSPEC: PARALLEL DRAFTER FOR EFFICIENT SPECULATIVE DECODING(@Tencent AI Lab etc)|[pdf]|⚠️|⭐️⭐️|
|2024.10|🔥[Fast Best-of-N] Fast Best-of-N Decoding via Speculative Rejection(@CMU etc)|[pdf]|⚠️|⭐️⭐️|

📖Structured Prune/KD/Weight Sparse (©️back👆🏻)

<div id="Structured_Pruning_KD_Weight_Sparse"></div>
|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2023.12|[FLAP] Fluctuation-based Adaptive Structured Pruning for Large Language Models(@Chinese Academy of Sciences etc)|[pdf]|[FLAP]|⭐️⭐️|
|2023.12|🔥[LASER] The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction(@mit.edu)|[pdf]|[laser]|⭐️⭐️|
|2023.12|[PowerInfer] PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU(@SJTU)|[pdf]|[PowerInfer]|⭐️|
|2024.01|[Admm Pruning] Fast and Optimal Weight Update for Pruned Large Language Models(@fmph.uniba.sk)|[pdf]|[admm-pruning]|⭐️|
|2024.01|[FFSplit] FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference(@Rice University etc)|[pdf]|⚠️|⭐️|

📖Mixture-of-Experts(MoE) LLM Inference (©️back👆🏻)

<div id="Mixture_of_Experts_LLM_Inference"></div>
|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2022.11|🔥[WINT8/4] Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production(@NVIDIA&Microsoft)|[pdf]|[FasterTransformer]|⭐️⭐️|
|2023.12|🔥[Mixtral Offloading] Fast Inference of Mixture-of-Experts Language Models with Offloading(@Moscow Institute of Physics and Technology etc)|[pdf]|[mixtral-offloading]|⭐️⭐️|
|2024.01|[MoE-Mamba] MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts(@uw.edu.pl)|[pdf]|⚠️|⭐️|
|2024.04|[MoE Inference] Toward Inference-optimal Mixture-of-Expert Large Language Models(@UC San Diego etc)|[pdf]|⚠️|⭐️|
|2024.05|🔥🔥🔥[DeepSeek-V2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model(@DeepSeek-AI)|[pdf]|[DeepSeek-V2]|⭐️⭐️|
|2024.06|[MoE] A Survey on Mixture of Experts(@HKU)|[pdf]|⚠️|⭐️|

📖CPU/Single GPU/FPGA/NPU/Mobile Inference (©️back👆🏻)

<div id="CPU-Single-GPU-Inference"></div>
|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2023.03|[FlexGen] High-Throughput Generative Inference of Large Language Models with a Single GPU(@Stanford University etc)|[pdf]|[FlexGen]|⭐️|
|2023.11|[LLM CPU Inference] Efficient LLM Inference on CPUs(@intel)|[pdf]|[intel-extension-for-transformers]|⭐️|
|2023.12|[LinguaLinked] LinguaLinked: A Distributed Large Language Model Inference System for Mobile Devices(@University of California Irvine)|[pdf]|⚠️|⭐️|
|2023.12|[OpenVINO] Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO(@Haim Barad etc)|[pdf]|⚠️|⭐️|
|2024.03|[FlightLLM] FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs(@Infinigence-AI)|[pdf]|⚠️|⭐️|
|2024.03|[Transformer-Lite] Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs(@OPPO)|[pdf]|⚠️|⭐️|
|2024.07|🔥🔥[xFasterTransformer] Inference Performance Optimization for Large Language Models on CPUs(@Intel)|[pdf]|[xFasterTransformer]|⭐️|
|2024.07|[Summary] Inference Optimization of Foundation Models on AI Accelerators(@AWS AI)|[pdf]|⚠️|⭐️|
|2024.10|Large Language Model Performance Benchmarking on Mobile Platforms: A Thorough Evaluation(@SYSU)|[pdf]|⚠️|⭐️|
|2024.10|🔥🔥[FastAttention] FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs for Efficient Inference(@huawei etc)|[pdf]|⚠️|⭐️|
|2024.12|🔥🔥[NITRO] NITRO: LLM INFERENCE ON INTEL® LAPTOP NPUS(@cornell.edu)|[pdf]|[nitro]|⭐️|

📖Non Transformer Architecture (©️back👆🏻)

<div id="Non-Transformer-Architecture"></div>
|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2023.05|🔥🔥[RWKV] RWKV: Reinventing RNNs for the Transformer Era(@Bo Peng etc)|[pdf]|[RWKV-LM]|⭐️⭐️|
|2023.12|🔥🔥[Mamba] Mamba: Linear-Time Sequence Modeling with Selective State Spaces(@cs.cmu.edu etc)|[pdf]|[mamba]|⭐️⭐️|
|2024.06|🔥🔥[RWKV-CLIP] RWKV-CLIP: A Robust Vision-Language Representation Learner(@DeepGlint etc)|[pdf]|[RWKV-CLIP]|⭐️⭐️|
|2024.08|🔥🔥[Kraken] Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference(@Princeton)|[pdf]|⚠️|⭐️|
|2024.08|🔥🔥[FLA] FLA: A Triton-Based Library for Hardware-Efficient Implementations of Linear Attention Mechanism(@sustcsonglin)|[docs]|[flash-linear-attention]|⭐️⭐️|

📖GEMM/Tensor Cores/MMA/Parallel (©️back👆🏻)

<div id="GEMM-Tensor-Cores-WMMA"></div>
|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2018.03|🔥🔥[Tensor Core] NVIDIA Tensor Core Programmability, Performance & Precision(@KTH Royal etc)|[pdf]|⚠️|⭐️|
|2021.05|🔥[Intra-SM Parallelism] Exploiting Intra-SM Parallelism in GPUs via Persistent and Elastic Blocks(@sjtu.edu.cn)|[pdf]|⚠️|⭐️|
|2022.06|[Microbenchmark] Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors(@tue.nl etc)|[pdf]|[DissectingTensorCores]|⭐️|
|2022.09|🔥🔥[FP8] FP8 FORMATS FOR DEEP LEARNING(@NVIDIA)|[pdf]|⚠️|⭐️|
|2023.08|🔥[Tensor Cores] Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library(@Tokyo Institute etc)|[pdf]|[wmma_extension]|⭐️|
|2023.03|🔥🔥[cutlass/cute] Graphene: An IR for Optimized Tensor Computations on GPUs(@NVIDIA)|[pdf]|[cutlass]|⭐️|
|2024.02|[QUICK] QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference(@SqueezeBits Inc)|[pdf]|[QUICK]|⭐️⭐️|
|2024.02|[Tensor Parallel] TP-AWARE DEQUANTIZATION(@IBM T.J. Watson Research Center)|[pdf]|⚠️|⭐️|
|2024.07|🔥🔥[flute] Fast Matrix Multiplications for Lookup Table-Quantized LLMs(@mit.edu etc)|[pdf]|[flute]|⭐️⭐️|
|2024.08|🔥🔥[LUT TENSOR CORE] LUT TENSOR CORE: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration(@SJTU&PKU etc)|[pdf]|⚠️|⭐️|
|2024.08|🔥🔥[MARLIN] MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models(@ISTA)|[pdf]|[marlin]|⭐️⭐️|
|2024.08|🔥🔥[SpMM] High Performance Unstructured SpMM Computation Using Tensor Cores(@ETH Zurich)|[pdf]|⚠️|⭐️|
|2024.09|🔥🔥[TEE] Confidential Computing on nVIDIA H100 GPU: A Performance Benchmark Study(@phala.network)|[pdf]|⚠️|⭐️|
|2024.09|🔥🔥[HiFloat8] Ascend HiFloat8 Format for Deep Learning(@Huawei)|[pdf]|⚠️|⭐️|
|2024.09|🔥🔥[Tensor Cores] Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores(@nju.edu.cn)|[pdf]|⚠️|⭐️|
|2024.07|🔥🔥[Tensor Product] Acceleration of Tensor-Product Operations with Tensor Cores(@Heidelberg University)|[pdf]|⚠️|⭐️|
|2024.12|🔥🔥[HADACORE] HADACORE: TENSOR CORE ACCELERATED HADAMARD TRANSFORM KERNEL(@Meta)|[pdf]|[hadamard_transform]|⭐️|

📖VLM/Position Embed/Others (©️back👆🏻)

<div id="Others"></div>
|Date|Title|Paper|Code|Recom|
|:---:|:---:|:---:|:---:|:---:|
|2021.04|🔥[RoPE] ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING(@Zhuiyi Technology Co., Ltd.)|[pdf]|[transformers]|⭐️|
|2022.10|[ByteTransformer] A High-Performance Transformer Boosted for Variable-Length Inputs(@ByteDance&NVIDIA)|[pdf]|[ByteTransformer]|⭐️|
|2024.09|🔥[Inf-MLLM] Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU(@sjtu)|[pdf]|⚠️|⭐️|
|2024.11|🔥[VL-CACHE] VL-CACHE: SPARSITY AND MODALITY-AWARE KV CACHE COMPRESSION FOR VISION-LANGUAGE MODEL INFERENCE ACCELERATION(@g.ucla.edu etc)|[pdf]|⚠️|⭐️|

©️License

GNU General Public License v3.0

🎉Contribute

Welcome to star & submit a PR to this repo!
