Awesome-LLM-KV-Cache


📒Introduction

Awesome-LLM-KV-Cache: A curated list of 📙awesome LLM KV cache papers with code. This repository is a personal collection for learning about and organizing the rapidly growing body of KV-cache-related papers!

©️Citations

If this list is useful for your research, please cite it with the BibTeX entry under 🎉Contribute at the bottom of this page.

📖Contents

<div id="paperlist"></div>

* [📖Trending Inference Topics](#Trending-Inference-Topics)
* [LLM KV Cache Compression](#KV-Cache-Compression)
* [KV Cache Merge](#KV-Cache-Merge)
* [Budget Allocation](#Budget-Allocation)
* [Cross-Layer KV Cache Utilization](#Cross-Layer-KV-Cache-Utilization)
* [KV Cache Quantization](#KV-Cache-Quantization)
* [Evaluation](#Evaluation)
* [Low Rank KV Cache Decomposition](#Low-Rank-KV-Cache-Decomposition)
* [Observation](#Observation)
* [Systems](#Systems)
* [Others](#Others)

📖Trending Inference Topics (©️back👆🏻)

<div id="Trending-Inference-Topics"></div>
|Date|Title|Paper|Code|Recom|Comment|
|:---:|:---|:---:|:---:|:---:|:---|
|2024.05|🔥🔥🔥[DeepSeek-V2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (@DeepSeek-AI)|[pdf]|[DeepSeek-V2]|⭐️⭐️⭐️| |
|2024.05|🔥🔥🔥[YOCO] You Only Cache Once: Decoder-Decoder Architectures for Language Models (@Microsoft)|[pdf]|[unilm-YOCO]|⭐️⭐️⭐️| |
|2024.06|🔥🔥[Mooncake] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving (@Moonshot AI)|[pdf]|[Mooncake]|⭐️⭐️⭐️| |
|2024.07|🔥🔥🔥[FlashAttention-3] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision (@Tri Dao et al.)|[pdf]|[flash-attention]|⭐️⭐️⭐️| |
|2024.07|🔥🔥🔥[MInference 1.0] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention (@Microsoft)|[pdf]|[MInference 1.0]|⭐️⭐️⭐️| |

LLM KV Cache Compression (©️back👆🏻)

<div id="#KV-Cache-Compression"></div>
|Date|Title|Paper|Code|Recom|Comment|
|:---:|:---|:---:|:---:|:---:|:---|
|2023.05|[Scissorhands] Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time|[pdf]| |⭐️| |
|2023.06|🔥🔥[H2O] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models|[pdf]|[H2O]|⭐️⭐️⭐️|Attention-score-based token selection|
|2023.09|🔥🔥🔥[StreamingLLM] Efficient Streaming Language Models with Attention Sinks|[pdf]|[streaming-llm]|⭐️⭐️⭐️|Retains the first few tokens as attention sinks|
|2023.10|🔥[FastGen] Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs|[pdf]| |⭐️⭐️|Head-specific compression strategies|
|2023.10|🔥🔥[CacheGen] KV Cache Compression and Streaming for Fast Large Language Model Serving|[pdf]|[LMCache]|⭐️⭐️⭐️|Compresses the KV cache into bitstreams for storage and sharing|
|2024.03|[ALISA] ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching|[pdf]| |⭐️| |
|2024.03|🔥🔥🔥[FastV] An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models|[pdf]|[EasyKV]|⭐️⭐️⭐️| |
|2024.03|[Keyformer] Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference|[pdf]|[keyformer-llm]|⭐️⭐️| |
|2024.04|🔥🔥[SnapKV] SnapKV: LLM Knows What You are Looking for Before Generation|[pdf]|[SnapKV]|⭐️⭐️⭐️|Attention pooling before selection|
|2024.06|🔥A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression|[pdf]| |⭐️|The L2 norm is a better importance metric than attention scores|
|2024.06|CORM: Cache Optimization with Recent Message for Large Language Model Inference|[pdf]| |⭐️| |
|2024.06|Effectively Compress KV Heads for LLM|[pdf]| |⭐️| |
|2024.06|🔥Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters|[pdf]| |⭐️| |
|2024.06|On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference|[pdf]|[EasyKV]|⭐️| |
|2024.07|Efficient Sparse Attention needs Adaptive Token Release|[pdf]| |⭐️| |
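Most of the eviction methods in this table (H2O, SnapKV, Scissorhands, Keyformer) share one core step: score each cached token by how much attention it has received, then keep only the top-scoring entries plus a recent window. Below is a minimal PyTorch sketch of that step; the tensor shapes and the `budget`/`keep_recent` parameters are illustrative assumptions, not the API of any particular paper.

```python
import torch

def evict_kv_by_attention(keys, values, attn_weights, budget=256, keep_recent=32):
    """Keep the `budget` most-attended KV entries plus the most recent window.

    keys, values:  [seq_len, num_heads, head_dim]    cached K/V for one layer
    attn_weights:  [num_queries, num_heads, seq_len]  attention probs from recent queries
    Returns compressed (keys, values) along the sequence dimension.
    """
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values

    # Accumulate the attention mass each cached position received (H2O-style "heavy hitters").
    scores = attn_weights.sum(dim=(0, 1))            # [seq_len]

    # Always keep the most recent tokens regardless of their score.
    scores[-keep_recent:] = float("inf")

    # Select the highest-scoring positions and restore temporal order.
    keep_idx = torch.topk(scores, k=budget).indices.sort().values
    return keys[keep_idx], values[keep_idx]

# Toy usage: 1024 cached tokens, 8 heads, 64-dim heads, scored by the last 16 queries.
k = torch.randn(1024, 8, 64)
v = torch.randn(1024, 8, 64)
attn = torch.softmax(torch.randn(16, 8, 1024), dim=-1)
k_small, v_small = evict_kv_by_attention(k, v, attn, budget=256)
print(k_small.shape)  # torch.Size([256, 8, 64])
```

SnapKV's pooling step and Scissorhands' persistence hypothesis refine how `scores` is computed, but the keep-top-k-plus-recent skeleton is the same.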

KV Cache Merge (©️back👆🏻)

<div id="KV-Cache-Merge"></div>
|Date|Title|Paper|Code|Recom|Comment|
|:---:|:---|:---:|:---:|:---:|:---|
|2023.10|🔥🔥[CacheBlend] Fast Large Language Model Serving for RAG with Cached Knowledge Fusion|[pdf]|[LMCache]|⭐️⭐️⭐️|Selective update when merging KV caches|
|2023.12|🔥Compressed Context Memory For Online Language Model Interaction|[pdf]|[ContextMemory]|⭐️⭐️⭐️|Fine-tunes LLMs to recurrently compress KV caches|
|2024.01|[CaM] CaM: Cache Merging for Memory-efficient LLMs Inference|[pdf]|[cam]|⭐️⭐️| |
|2024.05|🔥🔥You Only Cache Once: Decoder-Decoder Architectures for Language Models|[pdf]|[unilm]|⭐️⭐️| |
|2024.06|🔥🔥[D2O] D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models|[pdf]| |⭐️⭐️⭐️| |
|2024.07|🔥[KVMerger] Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks|[pdf]| |⭐️⭐️⭐️| |
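Merging methods differ from pure eviction in that entries which would be dropped are folded into the entries that stay (CaM and KVMerger do this with similarity- or attention-weighted averaging). The sketch below shows the simplest version of that idea for a single head, assigning each evicted key to its most similar kept key by cosine similarity and averaging; the function name, uniform weighting, and per-head layout are assumptions for illustration, not any paper's exact rule.

```python
import torch
import torch.nn.functional as F

def merge_evicted_kv(keys, values, keep_idx):
    """Merge evicted KV entries into their most similar kept entry instead of discarding them.

    keys, values: [seq_len, head_dim] cache for a single head
    keep_idx:     1-D LongTensor of positions to retain
    Returns merged (keys, values) of length len(keep_idx).
    """
    seq_len = keys.shape[0]
    mask = torch.zeros(seq_len, dtype=torch.bool)
    mask[keep_idx] = True
    evict_idx = torch.nonzero(~mask, as_tuple=False).squeeze(-1)

    kept_k, kept_v = keys[keep_idx].clone(), values[keep_idx].clone()
    if evict_idx.numel() == 0:
        return kept_k, kept_v

    # Assign each evicted key to its most similar kept key (cosine similarity).
    sim = F.cosine_similarity(keys[evict_idx].unsqueeze(1), kept_k.unsqueeze(0), dim=-1)
    target = sim.argmax(dim=1)                        # [num_evicted]

    # Average the evicted states into their targets (uniform weights for simplicity).
    counts = torch.ones(len(keep_idx))
    counts.scatter_add_(0, target, torch.ones(len(evict_idx)))
    kept_k.scatter_add_(0, target.unsqueeze(-1).expand_as(keys[evict_idx]), keys[evict_idx])
    kept_v.scatter_add_(0, target.unsqueeze(-1).expand_as(values[evict_idx]), values[evict_idx])
    return kept_k / counts.unsqueeze(-1), kept_v / counts.unsqueeze(-1)

# Toy usage: keep the most recent half, merge the older half into it.
k, v = torch.randn(512, 64), torch.randn(512, 64)
mk, mv = merge_evicted_kv(k, v, keep_idx=torch.arange(256, 512))
print(mk.shape)  # torch.Size([256, 64])
```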

Budget Allocation (©️back👆🏻)

<div id="Budget-Allocation"></div>
|Date|Title|Paper|Code|Recom|Comment|
|:---:|:---|:---:|:---:|:---:|:---|
|2024.05|🔥[PyramidInfer] PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference|[pdf]|[PyramidInfer]|⭐️⭐️⭐️|Layer-wise budget allocation|
|2024.06|🔥[PyramidKV] PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling|[pdf]|[PyramidKV]|⭐️⭐️⭐️|Layer-wise budget allocation|
|2024.07|🔥[Ada-KV] Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference|[pdf]| |⭐️⭐️⭐️|Head-wise budget allocation|
|2024.07|RazorAttention: Efficient KV Cache Compression Through Retrieval Heads|[pdf]| |⭐️| |
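The common thread here is that the token budget is no longer uniform: PyramidInfer and PyramidKV give lower layers a larger share of the cache and upper layers a smaller one, while Ada-KV reallocates the budget across heads. Below is a minimal sketch of a linearly decaying layer-wise schedule; the schedule shape, `total_budget`, and `min_ratio` are illustrative assumptions (the papers derive the allocation from attention statistics rather than a fixed formula).

```python
def pyramid_budgets(num_layers, total_budget, min_ratio=0.2):
    """Split a total KV-cache token budget across layers, giving lower layers more.

    Returns a list of per-layer budgets that sums (approximately) to total_budget.
    The linearly decaying schedule is purely illustrative; PyramidKV-style methods
    derive the shape from observed attention patterns instead.
    """
    # Linear weights from 1.0 (layer 0) down to min_ratio (last layer).
    weights = [1.0 - (1.0 - min_ratio) * i / max(num_layers - 1, 1) for i in range(num_layers)]
    total_w = sum(weights)
    return [max(1, round(total_budget * w / total_w)) for w in weights]

print(pyramid_budgets(num_layers=8, total_budget=2048))
# [427, 378, 329, 280, 232, 183, 134, 85] -> lower layers keep more tokens
```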

Cross-Layer KV Cache Utilization (©️back👆🏻)

<div id="Cross-Layer-KV-Cache-Utilization"></div>
|Date|Title|Paper|Code|Recom|Comment|
|:---:|:---|:---:|:---:|:---:|:---|
|2024.05|🔥Reducing Transformer Key-Value Cache Size with Cross-Layer Attention|[pdf]| |⭐️| |
|2024.05|🔥Layer-Condensed KV Cache for Efficient Inference of Large Language Models|[pdf]|[LCKV]|⭐️⭐️| |
|2024.05|🔥🔥🔥[MiniCache] MiniCache: KV Cache Compression in Depth Dimension for Large Language Models|[pdf]| |⭐️⭐️⭐️| |
|2024.06|🔥[MLKV] MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding|[pdf]|[pythia-mlkv]|⭐️⭐️| |
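These works save memory along the depth dimension: several adjacent layers read (and in some designs write) the same K/V tensors instead of each layer keeping its own cache. The sketch below only shows the bookkeeping of mapping layers to shared cache groups, assuming the first layer of each group produces the K/V that the rest reuse; it omits the model-side changes these papers actually require, and all names are illustrative.

```python
class SharedKVCache:
    """Cross-layer KV sharing: every `share_factor` consecutive layers read one KV slot.

    A minimal sketch of the bookkeeping behind CLA / Layer-Condensed-style designs;
    the real methods also change which layers compute K/V, which is omitted here.
    """
    def __init__(self, num_layers, share_factor=2):
        self.share_factor = share_factor
        self.layer_to_group = [l // share_factor for l in range(num_layers)]
        self.cache = {}                        # group id -> (K, V) tensors

    def kv_for_layer(self, layer, k_new=None, v_new=None):
        group = self.layer_to_group[layer]
        if layer % self.share_factor == 0:     # producer layer: store fresh K/V
            self.cache[group] = (k_new, v_new)
        return self.cache[group]               # consumer layers: reuse the stored pair

cache = SharedKVCache(num_layers=8, share_factor=2)
k0, v0 = [1.0], [2.0]                          # stand-ins for real K/V tensors
cache.kv_for_layer(0, k0, v0)                  # layer 0 writes the group-0 cache
print(cache.kv_for_layer(1))                   # layer 1 reuses it instead of storing its own
print(cache.layer_to_group)                    # [0, 0, 1, 1, 2, 2, 3, 3] -> 2x fewer cached layers
```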

KV Cache Quantization (©️back👆🏻)

<div id="KV-Cache-Quantization"></div>
|Date|Title|Paper|Code|Recom|Comment|
|:---:|:---|:---:|:---:|:---:|:---|
|2024.01|🔥🔥[KVQuant] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization|[pdf]|[KVQuant]|⭐️⭐️|Quantizes the entire KV cache to low bit-widths|
|2024.02|[No Token Left Behind] No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization|[pdf]| |⭐️⭐️⭐️| |
|2024.02|[KIVI] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache|[pdf]|[KIVI]|⭐️⭐️| |
|2024.02|[WKVQuant] WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More|[pdf]| | | |
|2024.03|🔥[GEAR] GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM|[pdf]|[GEAR]|⭐️⭐️| |
|2024.03|[QAQ] QAQ: Quality Adaptive Quantization for LLM KV Cache|[pdf]|[QAQ-KVCacheQuantization]|⭐️|Attention-guided adaptive-precision KV cache quantization|
|2024.05|[ZipCache] ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification|[pdf]| |⭐️| |
|2024.05|Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression|[pdf]| |⭐️| |
|2024.05|[SKVQ] SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models|[pdf]|[SKVQ]|⭐️| |
|2024.07|[PQCache] PQCache: Product Quantization-based KVCache for Long Context LLM Inference|[pdf]| |⭐️| |
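Most of the entries above store K and V in 2-4 bits with an asymmetric (scale plus zero-point) scheme, and several observe that keys quantize better per channel while values quantize better per token. The sketch below implements plain asymmetric n-bit quantization along a chosen axis; the grouping, outlier handling, and bit packing that the real systems rely on are omitted, and the function names are assumptions for illustration.

```python
import torch

def quantize_asym(x, n_bits=2, dim=-1):
    """Asymmetric (zero-point) quantization of a KV tensor along `dim`.

    Returns integer codes plus the per-slice scale and zero-point needed to dequantize.
    Real systems (KIVI, KVQuant, ...) add grouping, outlier channels, and bit packing.
    """
    qmax = 2 ** n_bits - 1
    xmin = x.amin(dim=dim, keepdim=True)
    xmax = x.amax(dim=dim, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / qmax
    codes = torch.round((x - xmin) / scale).clamp(0, qmax).to(torch.uint8)
    return codes, scale, xmin

def dequantize(codes, scale, xmin):
    return codes.float() * scale + xmin

# Toy check: 2-bit keys with per-channel statistics (min/max taken over the token axis,
# dim=0, so each channel gets its own scale and zero-point).
k = torch.randn(128, 64)                       # [seq_len, head_dim]
codes, scale, zero_point = quantize_asym(k, n_bits=2, dim=0)
k_hat = dequantize(codes, scale, zero_point)
print(codes.dtype, (k - k_hat).abs().mean().item())
```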


Evaluation (©️back👆🏻)

<div id="Evaluation"></div>
|Date|Title|Paper|Code|Recom|Comment|
|:---:|:---|:---:|:---:|:---:|:---|
|2024.07|🔥[Benchmark] KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches|[pdf]| |⭐️| |

Low Rank KV Cache Decomposition (©️back👆🏻)

<div id="Low-Rank-KV-Cache-Decomposition"></div>
|Date|Title|Paper|Code|Recom|Comment|
|:---:|:---|:---:|:---:|:---:|:---|
|2024.02|Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference|[pdf]|[LESS]|⭐️⭐️⭐️|Fine-tunes the model to make the KV cache low-rank|
|2024.05|🔥🔥🔥[DeepSeek-V2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model|[pdf]|[DeepSeek-V2]|⭐️⭐️⭐️|Trains a low-rank (latent) KV cache from scratch|
|2024.06|[Loki] Loki: Low-Rank Keys for Efficient Sparse Attention|[pdf]| |⭐️| |
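The shared idea is that keys and values occupy a low-rank subspace, so it suffices to cache a small latent vector per token and expand it back to full K/V at attention time (LESS learns this by fine-tuning; DeepSeek-V2's MLA trains it from scratch). Below is a minimal sketch of that cache-the-latent pattern; the dimensions and layer names are assumptions, and it ignores RoPE, which MLA handles with a separate decoupled key path.

```python
import torch
import torch.nn as nn

class LowRankKV(nn.Module):
    """Cache a low-rank latent per token and reconstruct K/V on demand.

    Illustrative sketch of the idea behind MLA / LESS-style low-rank KV caches;
    dimensions and layer names are assumptions, not DeepSeek-V2's actual layout.
    """
    def __init__(self, d_model=1024, kv_rank=64, num_heads=8, head_dim=128):
        super().__init__()
        self.down = nn.Linear(d_model, kv_rank, bias=False)                # compress: cache this
        self.up_k = nn.Linear(kv_rank, num_heads * head_dim, bias=False)   # expand to keys
        self.up_v = nn.Linear(kv_rank, num_heads * head_dim, bias=False)   # expand to values
        self.num_heads, self.head_dim = num_heads, head_dim

    def forward(self, hidden):                  # hidden: [seq_len, d_model]
        latent = self.down(hidden)              # [seq_len, kv_rank]; the only thing cached
        k = self.up_k(latent).view(-1, self.num_heads, self.head_dim)
        v = self.up_v(latent).view(-1, self.num_heads, self.head_dim)
        return latent, k, v

m = LowRankKV()
latent, k, v = m(torch.randn(16, 1024))
print(latent.shape, k.shape)   # cache 64 floats/token instead of 2*8*128 = 2048 for full K+V
```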

Observation (©️back👆🏻)

<div id="Observation"></div>
|Date|Title|Paper|Code|Recom|Comment|
|:---:|:---|:---:|:---:|:---:|:---|
|2022.09|In-context Learning and Induction Heads|[pdf]| |⭐️⭐️| |
|2024.01|🔥Transformers are Multi-State RNNs|[pdf]|[TOVA]|⭐️⭐️| |
|2024.04|🔥[Retrieval Head] Retrieval Head Mechanistically Explains Long-Context Factuality|[pdf]|[Retrieval_Head]|⭐️⭐️⭐️| |
|2024.04|🔥[Massive Activations] Massive Activations in Large Language Models|[pdf]|[Massive Activation]|⭐️⭐️⭐️| |
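These observation papers rest on fairly simple measurements of where attention mass actually goes (for example, StreamingLLM's attention sinks on the first tokens, or retrieval heads that copy from the context). A small sketch of one such measurement is below: given a layer's attention maps, it reports the fraction of each head's attention absorbed by the first few tokens. The tensor shapes and the `num_sink_tokens` default are illustrative assumptions.

```python
import torch

def attention_sink_share(attn, num_sink_tokens=4):
    """Fraction of each head's attention mass that lands on the first few tokens.

    attn: [num_heads, num_queries, seq_len] attention probabilities for one layer.
    Heads whose share stays close to 1 behave like the "attention sinks" described
    in the StreamingLLM observation.
    """
    sink_mass = attn[..., :num_sink_tokens].sum(dim=-1)   # [num_heads, num_queries]
    return sink_mass.mean(dim=-1)                          # average over queries, per head

# Toy example with random attention; real maps would come from a forward pass that
# returns attention probabilities (e.g., output_attentions=True in Transformers-style APIs).
attn = torch.softmax(torch.randn(8, 32, 512), dim=-1)
print(attention_sink_share(attn))    # one value in [0, 1] per head
```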

Systems (©️back👆🏻)

<div id="Systems"></div>
|Date|Title|Paper|Code|Recom|Comment|
|:---:|:---|:---:|:---:|:---:|:---|
|2024.06|🔥🔥[Mooncake] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving (@Moonshot AI)|[pdf]|[Mooncake]|⭐️⭐️⭐️| |
|2024.02|MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool|[pdf]| |⭐️| |
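Mooncake and MemServe both treat the KV cache, rather than the request, as the unit that moves through the system: prefill and decode run on different workers, and the prefill worker's KV pages are shipped to (or pooled for) the decode worker. The toy sketch below only illustrates that hand-off with in-process Python objects; every class and field name is made up for illustration, and real systems add networking, paging, and cache reuse across requests.

```python
from dataclasses import dataclass, field

@dataclass
class KVCacheBlock:
    """Unit of transfer in a KVCache-centric design: prompt hash -> cached K/V pages."""
    prompt_hash: int
    pages: list = field(default_factory=list)

class PrefillWorker:
    def prefill(self, prompt: str) -> KVCacheBlock:
        # Stand-in for running the model over the prompt and materializing its KV pages.
        num_pages = len(prompt) // 8 + 1
        return KVCacheBlock(prompt_hash=hash(prompt), pages=[f"page{i}" for i in range(num_pages)])

class DecodeWorker:
    def __init__(self):
        self.kv_pool = {}                        # memory pool keyed by prompt hash

    def admit(self, block: KVCacheBlock):
        self.kv_pool[block.prompt_hash] = block.pages   # "transfer" = hand over the pages

    def decode(self, prompt: str, steps: int = 3) -> str:
        assert hash(prompt) in self.kv_pool, "KV cache must arrive before decoding starts"
        return prompt + " ..." * steps           # placeholder for token-by-token generation

prefill, decode = PrefillWorker(), DecodeWorker()
block = prefill.prefill("Explain KV cache reuse.")
decode.admit(block)                              # in Mooncake this step crosses the network
print(decode.decode("Explain KV cache reuse."))
```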

Others (©️back👆🏻)

<div id="Others"></div>
|Date|Title|Paper|Code|Recom|Comment|
|:---:|:---|:---:|:---:|:---:|:---|
|2024.02|Effectively Compress KV Heads for LLM|[pdf]| |⭐️| |
|2024.07|🔥🔥Q-Sparse: All Large Language Models can be Fully Sparsely-Activated|[pdf]|[GeneralAI]|⭐️⭐️⭐️| |

©️License

GNU General Public License v3.0

🎉Contribute

Welcome to star & submit a PR to this repo!

@misc{Awesome-LLM-KV-Cache@2024,
  title={Awesome-LLM-KV-Cache: A curated list of Awesome LLM KV Cache Papers with codes},
  url={https://github.com/Zefan-Cai/Awesome-LLM-KV-Cache},
  note={Open-source software available at https://github.com/Zefan-Cai/Awesome-LLM-KV-Cache},
  author={Zefan-Cai, etc},
  year={2024}
}