# Awesome KV Cache Compression

<div align="center"> <img src="./logo.png" width="90%"> </div>

## 📢 News
🎉 [2024-07-23] Project launched 🥳
## 📜 Notice
This repository is constantly being updated 🤗 ...

You can click a paper title to jump directly to the corresponding PDF.
## ⚙️ Project
- kvpress. NVIDIA.
  - This repository implements multiple KV cache pruning methods and benchmarks using 🤗 transformers.
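
The sketch below shows how kvpress is typically used. The pipeline task name (`kv-press-text-generation`), the press class, and its arguments are taken from the upstream README at the time of writing and should be treated as assumptions; check the kvpress repository for the current API.

```python
# Sketch only: pipeline/class names follow the upstream kvpress README and may change.
from transformers import pipeline
from kvpress import ExpectedAttentionPress  # importing kvpress registers the custom pipeline

pipe = pipeline(
    "kv-press-text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    device="cuda",
)
press = ExpectedAttentionPress(compression_ratio=0.5)  # drop roughly half of the KV cache

context = "A long document to answer questions about..."
question = "What is this document about?"
answer = pipe(context, question=question, press=press)["answer"]
print(answer)
```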
## 📷 Survey
- Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption. Shi Luohe, Zhang Hongyi, Yao Yao, Li Zuchao, Zhao Hai. COLM 2024.
- Prompt Compression for Large Language Models: A Survey. Zongqian Li, Yinhong Liu, Yixuan Su, Nigel Collier. arXiv 2024.
## 🔍 Method

### 1️⃣ Pruning / Evicting / Sparse
- Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time. Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, Anshumali Shrivastava. NeurIPS 2023.
- SnapKV: LLM Knows What You are Looking for Before Generation. Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen. arXiv 2024.
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, Beidi Chen. NeurIPS 2023.
- Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs. Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao. ICLR 2024.
- PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference. Dongjie Yang, XiaoDong Han, Yan Gao, Yao Hu, Shilin Zhang, Hai Zhao. ACL 2024.
- PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling. Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, Wen Xiao. arXiv 2024.
- Transformers are Multi-State RNNs. Matanel Oren, Michael Hassid, Nir Yarden, Yossi Adi, Roy Schwartz. arXiv 2024.
- Efficient Streaming Language Models with Attention Sinks. Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis. ICLR 2024.
- A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression. Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini. EMNLP 2024.
- Retrieval Head Mechanistically Explains Long-Context Factuality. Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, Yao Fu. arXiv 2024.
- Efficient Sparse Attention needs Adaptive Token Release. Chaoran Zhang, Lixin Zou, Dan Luo, Min Tang, Xiangyang Luo, Zihao Li, Chenliang Li. ACL 2024.
- Loki: Low-Rank Keys for Efficient Sparse Attention. Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele. arXiv 2024.
- Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference. Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, Beidi Chen. arXiv 2024.
- ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching. Youpeng Zhao, Di Wu, Jun Wang. ISCA 2024.
- Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference. Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath. arXiv 2024.
- Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference. Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou. arXiv 2024.
- Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters. Zhiyu Guo, Hidetaka Kamigaito, Taro Watanabe. arXiv 2024.
- On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference. Siyu Ren, Kenny Q. Zhu. arXiv 2024.
- CORM: Cache Optimization with Recent Message for Large Language Model Inference. Jincheng Dai, Zhuowei Huang, Haiyun Jiang, Chen Chen, Deng Cai, Wei Bi, Shuming Shi. arXiv 2024.
- RazorAttention: Efficient KV Cache Compression Through Retrieval Heads. Hanlin Tang, Yang Lin, Jing Lin, Qingsen Han, Shikuan Hong, Yiwu Yao, Gongyi Wang. arXiv 2024.
- ThinK: Thinner Key Cache by Query-Driven Pruning. Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo. arXiv 2024.
- A2SF: Accumulative Attention Scoring with Forgetting Factor for Token Pruning in Transformer Decoder. Hyun Rae Jo, Dong Kun Shin. arXiv 2024.
- Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference. Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han. ICML 2024.
- LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference. Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi. arXiv 2024.
- NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time. Yilong Chen, Guoxia Wang, Junyuan Shang, Shiyao Cui, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun, Dianhai Yu, Hua Wu. ACL 2024.
- Post-Training Sparse Attention with Double Sparsity. Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng. arXiv 2024.
- Farewell to Length Extrapolation, a Training-Free Infinite Context with Finite Attention Scope. Xiaoran Liu, Qipeng Guo, Yuerong Song, Zhigeng Liu, Kai Lv, Hang Yan, Linlin Li, Qun Liu, Xipeng Qiu. arXiv 2024.
- Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference. Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti. ICML 2024.
- MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention. Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu. NeurIPS 2024.
- Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers. Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, Thomas Hofmann. NeurIPS 2023.
- RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval. Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, Chen Chen, Fan Yang, Yuqing Yang, Lili Qiu. arXiv 2024.
- Sirius: Contextual Sparsity with Correction for Efficient LLMs. Yang Zhou, Zhuoming Chen, Zhaozhuo Xu, Victoria Lin, Beidi Chen. arXiv 2024.
- Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU. Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo. arXiv 2024.
- Training-Free Activation Sparsity in Large Language Models. James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, Ben Athiwaratkun. arXiv 2024.
- KVPruner: Structural Pruning for Faster and Memory-Efficient Large Language Models. Bo Lv, Quan Zhou, Xuanang Ding, Yan Wang, Zeming Ma. arXiv 2024.
- CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs. Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, Guiming Xie. arXiv 2024.
- Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction. Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, Shafiq Joty. arXiv 2024.
- KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head. Isaac Rehg. arXiv 2024.
- InfiniPot: Infinite Context Processing on Memory-Constrained LLMs. Minsoo Kim, Kyuhong Shim, Jungwook Choi, Simyung Chang. EMNLP 2024.
- Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads. Yuxiang Huang, Binhang Yuan, Xu Han, Chaojun Xiao, Zhiyuan Liu. arXiv 2024.
- SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference. Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang. arXiv 2024.
- LoCoCo: Dropping In Convolutions for Long Context Compression. Ruisi Cai, Yuandong Tian, Zhangyang Wang, Beidi Chen. ICML 2024.
- DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads. Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han. arXiv 2024.
- SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction. Xuan Zhang, Cunxiao Du, Chao Du, Tianyu Pang, Wei Gao, Min Lin. arXiv 2024.
- In-context KV-Cache Eviction for LLMs via Attention-Gate. Zihao Zeng, Bokai Lin, Tianqi Hou, Hao Zhang, Zhijie Deng. arXiv 2024.
- CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving. Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, Junchen Jiang. ACM SIGCOMM 2024.
- MagicPIG: LSH Sampling for Efficient LLM Generation. Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, Beidi Chen. arXiv 2024.
- TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention. Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, Zhihao Jia. arXiv 2024.
- ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference. Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen. arXiv 2024.
- BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference. Junqi Zhao, Zhijin Fang, Shu Li, Shaohui Yang, Shichao He. arXiv 2024.
- CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling. Yu Bai, Xiyuan Zou, Heyan Huang, Sanxing Chen, Marc-Antoine Rondeau, Yang Gao, Jackie Chi Kit Cheung. EMNLP 2024.
- TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection. Wei Wu, Zhuoshi Pan, Chao Wang, Liyi Chen, Yunchu Bai, Kun Fu, Zheng Wang, Hui Xiong. arXiv 2024.
- Recycled Attention: Efficient inference for long-context language models. Fangyuan Xu, Tanya Goyal, Eunsol Choi. arXiv 2024.
- VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration. Dezhan Tu, Danylo Vashchilenko, Yuzhe Lu, Panpan Xu. arXiv 2024.
- Squeezed Attention: Accelerating Long Context Length LLM Inference. Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Monishwaran Maheswaran, June Paik, Michael W. Mahoney, Kurt Keutzer, Amir Gholami. arXiv 2024.
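
The papers above differ in how they score and select tokens, but many share a common skeleton: score each cached position (for example by accumulated attention), always keep a recent window, and evict the rest down to a fixed budget. The minimal PyTorch sketch below illustrates only that shared pattern; it is not the implementation of any specific method, and all names and shapes are illustrative.

```python
import torch

def evict_kv(keys, values, attn_weights, budget=256, recent=32):
    """Toy heavy-hitter-style eviction (illustrative only).

    keys, values: [batch, heads, seq_len, head_dim]
    attn_weights: [batch, heads, q_len, seq_len] attention probabilities
    Keeps the `recent` most recent positions plus the `budget` positions
    with the largest accumulated attention mass.
    """
    seq_len = keys.shape[2]
    if seq_len <= budget + recent:
        return keys, values
    # Accumulated attention received by each cached position, summed over
    # batch, heads, and queries. Real methods usually score per head.
    scores = attn_weights.sum(dim=(0, 1, 2))      # [seq_len]
    scores[-recent:] = float("inf")               # always keep the recent window
    keep = torch.topk(scores, budget + recent).indices.sort().values
    return keys[:, :, keep, :], values[:, :, keep, :]

# Example with random tensors:
k = torch.randn(1, 8, 1024, 64)
v = torch.randn_like(k)
attn = torch.softmax(torch.randn(1, 8, 4, 1024), dim=-1)
k2, v2 = evict_kv(k, v, attn)
print(k2.shape)  # torch.Size([1, 8, 288, 64])
```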
### 2️⃣ Merging
- D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models. Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Mi Zhang. arXiv 2024.
- Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks. Zheng Wang, Boxiao Jin, Zhongzhi Yu, Minjia Zhang. arXiv 2024.
- CaM: Cache Merging for Memory-efficient LLMs Inference. Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, Rongrong Ji. ICML 2024.
- Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs. Woomin Song, Seunghyuk Oh, Sangwoo Mo, Jaehyung Kim, Sukmin Yun, Jung-Woo Ha, Jinwoo Shin. ICLR 2024.
- Token Merging: Your ViT But Faster. Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman. ICLR 2023.
- LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference. Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan. EMNLP 2024.
- Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention. Tsendsuren Munkhdalai, Manaal Faruqui, Siddharth Gopal. arXiv 2024.
- Compressed Context Memory for Online Language Model Interaction. Jang-Hyun Kim, Junyoung Yeom, Sangdoo Yun, Hyun Oh Song. ICLR 2024.
- CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion. Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang. EuroSys 2025.
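
Rather than discarding entries outright, merging methods fold the information from dropped positions into the positions that are kept. The snippet below is a purely illustrative sketch of that idea (not any paper's algorithm): each evicted KV pair is averaged into its most similar retained neighbor.

```python
import torch
import torch.nn.functional as F

def merge_evicted(keys, values, keep_idx):
    """Toy KV merging (illustrative only): fold each evicted position into the
    most similar retained position via a running average.

    keys, values: [seq_len, head_dim]; keep_idx: 1-D LongTensor of retained positions.
    """
    seq_len = keys.shape[0]
    kept = set(keep_idx.tolist())
    evict_idx = torch.tensor([i for i in range(seq_len) if i not in kept], dtype=torch.long)
    k_keep, v_keep = keys[keep_idx].clone(), values[keep_idx].clone()
    counts = torch.ones(len(keep_idx), 1)
    # Cosine similarity between every evicted key and every retained key.
    sim = F.normalize(keys[evict_idx], dim=-1) @ F.normalize(k_keep, dim=-1).T
    target = sim.argmax(dim=-1)  # nearest retained slot for each evicted token
    for e, t in zip(evict_idx.tolist(), target.tolist()):
        counts[t] += 1
        k_keep[t] += (keys[e] - k_keep[t]) / counts[t]   # running-mean update
        v_keep[t] += (values[e] - v_keep[t]) / counts[t]
    return k_keep, v_keep

k, v = torch.randn(1024, 64), torch.randn(1024, 64)
keep = torch.topk(torch.rand(1024), 256).indices.sort().values
k_m, v_m = merge_evicted(k, v, keep)
print(k_m.shape)  # torch.Size([256, 64])
```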
### 3️⃣ Cross-Layer
- You Only Cache Once: Decoder-Decoder Architectures for Language Models. Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei. NeurIPS 2024.
- Reducing Transformer Key-Value Cache Size with Cross-Layer Attention. William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan-Kelley. arXiv 2024.
- Layer-Condensed KV Cache for Efficient Inference of Large Language Models. Haoyi Wu, Kewei Tu. ACL 2024.
- MiniCache: KV Cache Compression in Depth Dimension for Large Language Models. Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang. arXiv 2024.
- MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding. Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji. arXiv 2024.
- A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference. You Wu, Haoyi Wu, Kewei Tu. arXiv 2024.
- KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing. Yifei Yang, Zouying Cao, Qiguang Chen, Libo Qin, Dongjie Yang, Hai Zhao, Zhi Chen. arXiv 2024.
- SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation. Aurick Qiao, Zhewei Yao, Samyam Rajbhandari, Yuxiong He. arXiv 2024.
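
The common idea in this group is to let several layers read from the same KV tensors instead of each layer keeping its own. The sketch below is a minimal, hypothetical illustration of grouped layer-to-cache sharing, not the mechanism of any particular paper; all class and method names are made up for the example.

```python
import torch

class GroupedLayerKVCache:
    """Toy cross-layer KV sharing (illustrative only): every `group_size`
    consecutive layers share the cache written by the first layer in the group."""

    def __init__(self, num_layers, group_size):
        self.owner = [(layer // group_size) * group_size for layer in range(num_layers)]
        self.cache = {}  # owner layer -> (keys, values), each [batch, heads, seq, dim]

    def update_and_get(self, layer, k, v):
        owner = self.owner[layer]
        if layer == owner:  # only the owning layer appends new KV
            if owner in self.cache:
                ks, vs = self.cache[owner]
                self.cache[owner] = (torch.cat([ks, k], dim=2), torch.cat([vs, v], dim=2))
            else:
                self.cache[owner] = (k, v)
        return self.cache[owner]  # the other layers in the group reuse it

cache = GroupedLayerKVCache(num_layers=32, group_size=4)
k = torch.randn(1, 8, 16, 64)
v = torch.randn_like(k)
print(cache.update_and_get(0, k, v)[0].shape)  # layer 0 writes its own KV
print(cache.update_and_get(3, k, v)[0].shape)  # layer 3 reuses layer 0's cache
```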
### 4️⃣ Low-Rank
- Fast Transformer Decoding: One Write-Head is All You Need. Noam Shazeer. arXiv 2019.
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai. EMNLP 2023.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. DeepSeek-AI. arXiv 2024.
- Effectively Compress KV Heads for LLM. Hao Yu, Zelan Yang, Shen Li, Yong Li, Jianxin Wu. arXiv 2024.
- Palu: Compressing KV-Cache with Low-Rank Projection. Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Kai-Chiang Wu. arXiv 2024.
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy. Rongzhi Zhang, Kuang Wang, Liyuan Liu, Shuohang Wang, Hao Cheng, Chao Zhang, Yelong Shen. arXiv 2024.
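
Several of these works store keys and/or values in a low-rank factorization rather than at full head width, so only a small per-token code grows with sequence length. A minimal sketch of that idea using a truncated SVD (illustrative only, not any paper's exact procedure):

```python
import torch

def low_rank_factorize(kv, rank=16):
    """kv: [seq_len, head_dim]. Returns A [seq_len, rank] and B [rank, head_dim]
    such that A @ B approximates kv; only A grows with sequence length."""
    U, S, Vh = torch.linalg.svd(kv, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # per-token low-rank codes (cached)
    B = Vh[:rank]                # shared projection back to head_dim
    return A, B

k = torch.randn(4096, 128)
A, B = low_rank_factorize(k, rank=16)
rel_err = (k - A @ B).norm() / k.norm()
print(A.shape, B.shape, f"relative error: {rel_err:.3f}")
```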
### 5️⃣ Quantization
- ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification. Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, Bohan Zhuang. arXiv 2024.
- No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization. June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, Dongsoo Lee. arXiv 2024.
- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu. ICML 2024.
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM. Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao. arXiv 2024.
- PQCache: Product Quantization-based KVCache for Long Context LLM Inference. Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, Bin Cui. arXiv 2024.
- Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression. Peiyu Liu, Ze-Feng Gao, Wayne Xin Zhao, Yipeng Ma, Tao Wang, Ji-Rong Wen. arXiv 2024.
- SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models. Haojie Duanmu, Zhihang Yuan, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, Dahua Lin. arXiv 2024.
- QAQ: Quality Adaptive Quantization for LLM KV Cache. Shichen Dong, Wen Cheng, Jiayu Qin, Wei Wang. arXiv 2024.
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami. NeurIPS 2024.
- WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More. Yuxuan Yue, Zhihang Yuan, Haojie Duanmu, Sifan Zhou, Jianlong Wu, Liqiang Nie. arXiv 2024.
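
Most KV quantization schemes store each cached tensor with a few bits per element plus per-group scales and zero-points. The snippet below is a generic round-to-nearest asymmetric quantizer over the last axis, meant only to illustrate the basic mechanism; the methods above add grouping choices (per-channel keys, per-token values), outlier handling, and fused kernels on top of it.

```python
import torch

def quantize(x, n_bits=2):
    """Asymmetric min-max quantization along the last axis (illustrative only).
    x: [..., head_dim]. Returns uint8 codes plus per-group scale and zero-point."""
    qmax = 2 ** n_bits - 1
    xmin = x.amin(dim=-1, keepdim=True)
    xmax = x.amax(dim=-1, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / qmax
    codes = ((x - xmin) / scale).round().clamp(0, qmax).to(torch.uint8)
    return codes, scale, xmin

def dequantize(codes, scale, xmin):
    return codes.to(scale.dtype) * scale + xmin

v = torch.randn(1, 8, 1024, 64)          # a value cache: [batch, heads, seq, dim]
codes, scale, zero = quantize(v, n_bits=2)
v_hat = dequantize(codes, scale, zero)
print("mean abs error:", (v - v_hat).abs().mean().item())
```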
### 6️⃣ Prompt Compression
- LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu. EMNLP 2023.
- LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression. Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, Dongmei Zhang. ACL 2024.
- LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression. Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu. ACL 2024.
- TACO-RL: Task Aware Prompt Compression Optimization with Reinforcement Learning. Shivam Shandilya, Menglin Xia, Supriyo Ghosh, Huiqiang Jiang, Jue Zhang, Qianhui Wu, Victor Rühle. arXiv 2024.
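
A simple flavor of these methods: use a small language model to score how informative each prompt token is and keep only the most surprising ones, shrinking the prompt before it reaches the target model. The sketch below is a toy illustration of that idea, not the LLMLingua algorithm; the choice of gpt2 as the scorer and the keep ratio are arbitrary assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Toy token-level prompt compression: keep the tokens a small LM finds most
# surprising (highest negative log-likelihood) and drop the rest.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def compress(prompt, keep_ratio=0.5):
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    # Negative log-likelihood of each token given its prefix (the first token has no prefix).
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    nll = -logprobs.gather(-1, ids[0, 1:, None]).squeeze(-1)
    k = max(1, int(keep_ratio * nll.numel()))
    keep = torch.topk(nll, k).indices.sort().values + 1  # +1: positions in the original ids
    kept_ids = torch.cat([ids[0, :1], ids[0, keep]])     # always keep the first token
    return tok.decode(kept_ids)

print(compress("The quick brown fox jumps over the lazy dog near the river bank.", 0.5))
```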
## 📊 Evaluation
- KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches. Jiayi Yuan, Hongyi Liu, Shaochen (Henry) Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu. EMNLP 2024.