Awesome-Efficient-LLM
A curated list for Efficient Large Language Models
Full List
- Network Pruning / Sparsity
- Knowledge Distillation
- Quantization
- Inference Acceleration
- Efficient MOE
- Efficient Architecture of LLM
- KV Cache Compression
- Text Compression
- Low-Rank Decomposition
- Hardware / System
- Tuning
- Survey
- Leaderboard
Please check out all the papers by selecting the sub-area you're interested in. On this main page, we're showing papers released in the past 90 days.
🚀 Updates
- May 29, 2024: We've had this awesome list for a year now :smiling_face_with_three_hearts:! It has grown quite long, so we're reorganizing it, splitting the list into separate READMEs by area.
- Sep 27, 2023: Added tags for papers accepted at NeurIPS'23.
- Sep 6, 2023: Added a new subdirectory project/ to organize projects designed for developing lightweight LLMs.
- July 11, 2023: Since many current publications run experiments on PLMs (such as BERT and BART), we created a new subdirectory efficient_plm/ for papers that apply to PLMs but whose effectiveness on LLMs has not yet been verified (which is not to imply they are unsuitable for LLMs).
💮 Contributing
If you'd like to include your paper, or need to update any details such as conference information or code URLs, please feel free to submit a pull request. You can generate the required markdown format for each paper by filling in the information in `generate_item.py` and executing `python generate_item.py`. We warmly appreciate your contributions to this list. Alternatively, you can email me the links to your paper and code, and I will add your paper to the list at my earliest convenience.
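For illustration, here is a minimal, hypothetical sketch of the kind of table row a script like `generate_item.py` could emit; the real script's field names and output may differ. The example values are taken from the SparseGPT entry in this list:

```python
# Hypothetical sketch of a generate_item.py-style helper; the real script in
# this repository may use different field names and options.
title = "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot"
authors = "Elias Frantar, Dan Alistarh"
figure = "figures/sparsegpt.png"  # repo-relative path or an external image URL
github = "https://github.com/IST-DASLab/sparsegpt"  # optional; leave empty if none
paper = "https://arxiv.org/abs/2301.00774"

# Assemble one markdown row in the "Title & Authors | Introduction | Links"
# format used throughout this README.
links = f"[Github]({github}) <br> [Paper]({paper})" if github else f"[Paper]({paper})"
row = (
    f"<br> {title} <br> *{authors}* "
    f'| <img width="1002" alt="image" src="{figure}"> '
    f"| {links} |"
)
print(row)  # copy the printed row into the table of the matching section
```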
:star: Recommended Papers
For each topic, we have curated a list of recommended papers that have garnered relatively high GitHub stars or citations.
Papers from June 13, 2024 to now (see the Full List, covering papers since May 22, 2023, here)
Quick Link
- Network Pruning / Sparsity
- Knowledge Distillation
- Quantization
- Inference Acceleration
- Efficient MOE
- Efficient Architecture of LLM
- KV Cache Compression
- Text Compression
- Low-Rank Decomposition
- Hardware / System
- Tuning
- Survey
Network Pruning / Sparsity
Title & Authors | Introduction | Links |
---|---|---|
<br> :star: SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot <br> Elias Frantar, Dan Alistarh | <img width="522" alt="image" src="figures/sparsegpt.png"> | Github <br> Paper |
<br> :star: LLM-Pruner: On the Structural Pruning of Large Language Models <br> Xinyin Ma, Gongfan Fang, Xinchao Wang | <img width="561" alt="image" src="figures/llm_pruner.png"> | Github paper |
<br> :star: A Simple and Effective Pruning Approach for Large Language Models <br> Mingjie Sun, Zhuang Liu, Anna Bair, J. Zico Kolter | <img width="1002" alt="image" src="https://user-images.githubusercontent.com/20168304/245999360-f951de47-269d-491d-826a-8e6d85627849.png"> | Github <br> Paper |
<br> :star: Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning <br> Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen | <img width="1002" alt="image" src="figures/LLM-shearing.png"> | Github <br> Paper |
STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning <br> Jaeseong Lee, Seung-won Hwang, Aurick Qiao, Daniel F Campos, Zhewei Yao, Yuxiong He | <img width="1002" alt="image" src="https://arxiv.org/html/2409.06211v1/x1.png"> | Paper |
<br>PAT: Pruning-Aware Tuning for Large Language Models <br> Yijiang Liu, Huanrui Yang, Youxin Chen, Rongyu Zhang, Miao Wang, Yuan Du, Li Du | <img width="1002" alt="image" src="figures/PAT.png"> | Github <br> Paper |
LLM Pruning and Distillation in Practice: The Minitron Approach <br> Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov | <img width="1002" alt="image" src="https://arxiv.org/html/2408.11796v2/x1.png"> | Paper |
Language-specific Calibration for Pruning Multilingual Language Models <br> Simon Kurz, Zhixue Zhao, Jian-Jia Chen, Lucie Flek | | Paper |
<br>LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models <br> Yupeng Su, Ziyi Guan, Xiaoqun Liu, Tianlai Jin, Dongkuan Wu, Graziano Chesi, Ngai Wong, Hao Yu | <img width="1002" alt="image" src="https://github.com/YupengSu/LLM-Barber/raw/main/img/figure1a.png"> | Github <br> Paper |
Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism <br> Guanchen Li, Xiandong Zhao, Lian Liu, Zeping Li, Dong Li, Lu Tian, Jie He, Ashish Sirasao, Emad Barsoum | <img width="1002" alt="image" src="https://arxiv.org/html/2408.10473v1/x1.png"> | Paper |
A Convex-optimization-based Layer-wise Post-training Pruner for Large Language Models <br> Pengxiang Zhao, Hanyu Hu, Ping Li, Yi Zheng, Zhefeng Wang, Xiaoming Yuan | <img width="1002" alt="image" src="https://arxiv.org/html/2408.03728v1/x1.png"> | Paper |
Pruning Large Language Models with Semi-Structural Adaptive Sparse Training <br> Weiyu Huang, Guohao Jian, Yuezhou Hu, Jun Zhu, Jianfei Chen | <img width="1002" alt="image" src="https://arxiv.org/html/2407.20584v1/extracted/5756562/4.png"> | Paper |
Greedy Output Approximation: Towards Efficient Structured Pruning for LLMs Without Retraining <br> Jianwei Li, Yijun Dong, Qi Lei | <img width="1002" alt="image" src="https://arxiv.org/html/2407.19126v1/x2.png"> | Paper |
<br>Compact Language Models via Pruning and Knowledge Distillation <br> Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov | <img width="1002" alt="image" src="https://arxiv.org/html/2407.14679v1/x2.png"> | Github <br> Paper |
MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models <br> Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi | <img width="1002" alt="image" src="figures/minillm.png"> | Paper |
Reconstruct the Pruned Model without Any Retraining <br> Pingjie Wang, Ziqing Fan, Shengchao Hu, Zhe Chen, Yanfeng Wang, Yu Wang | <img width="1002" alt="image" src="https://arxiv.org/html/2407.13331v1/x3.png"> | Paper |
Q-Sparse: All Large Language Models can be Fully Sparsely-Activated <br> Hongyu Wang, Shuming Ma, Ruiping Wang, Furu Wei | <img width="1002" alt="image" src="https://arxiv.org/html/2407.10969v1/x3.png"> | Paper |
<br>Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations <br> Bowen Shen, Zheng Lin, Daren Zha, Wei Liu, Jian Luan, Bin Wang, Weiping Wang | <img width="1002" alt="image" src="https://arxiv.org/html/2407.05690v1/x2.png"> | Github <br> Paper |
<br>Beyond Perplexity: Multi-dimensional Safety Evaluation of LLM Compression <br> Zhichao Xu, Ashim Gupta, Tao Li, Oliver Bentham, Vivek Srikumar | | Github <br> Paper |
<br>Flextron: Many-in-One Flexible Large Language Model <br> Ruisi Cai, Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Zhangyang Wang, Jan Kautz, Pavlo Molchanov | <img width="1002" alt="image" src="https://arxiv.org/html/2406.10260v1/x1.png"> | Paper |
<br>BlockPruner: Fine-grained Pruning for Large Language Models <br> Longguang Zhong, Fanqi Wan, Ruijun Chen, Xiaojun Quan, Liangzhi Li | <img width="1002" alt="image" src="https://arxiv.org/html/2406.10594v2/x3.png"> | Github <br> Paper |
<br>Structured Pruning for Large Language Models Using Coupled Components Elimination and Minor Fine-tuning <br> Honghe Zhang, Xiaolong Shi, Jingwei Sun, Guangzhong Sun | <img width="1002" alt="image" src="figures/CCEMF.png"> | Paper |
FoldGPT: Simple and Effective Large Language Model Compression Scheme <br> Songwei Liu, Chao Zeng, Lianqiang Li, Chenqian Yan, Lean Fu, Xing Mei, Fangmin Chen | <img width="1002" alt="image" src="https://arxiv.org/html/2407.00928v1/extracted/5701554/flodGPT.png"> | Paper |
<br>Learning Neural Networks with Sparse Activations <br> Pranjal Awasthi, Nishanth Dikkala, Pritish Kamath, Raghu Meka | | Paper |
Rethinking Pruning Large Language Models: Benefits and Pitfalls of Reconstruction Error Minimization <br> Sungbin Shin, Wonpyo Park, Jaeho Lee, Namhoon Lee | <img width="1002" alt="image" src="https://arxiv.org/html/2406.15524v1/x3.png"> | Paper |
<br>ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models <br> Yash Akhauri, Ahmed F AbouElhamayed, Jordan Dotzel, Zhiru Zhang, Alexander M Rush, Safeen Huda, Mohamed S Abdelfattah | <img width="1002" alt="image" src="https://arxiv.org/html/2406.16635v1/x4.png"> | Github <br> Paper |
Optimization-based Structural Pruning for Large Language Models without Back-Propagation <br> Yuan Gao, Zujing Liu, Weizhong Zhang, Bo Du, Gui-Song Xia | <img width="1002" alt="image" src="https://arxiv.org/html/2406.10576v1/extracted/5669159/imgs/overview5.png"> | Paper |
ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models <br> Xiang Meng, Kayhan Behdin, Haoyue Wang, Rahul Mazumder | <img width="1002" alt="image" src="figures/ALPS.png"> | Paper |
Knowledge Distillation
Title & Authors | Introduction | Links |
---|---|---|
:star: Knowledge Distillation of Large Language Models <br> Yuxian Gu, Li Dong, Furu Wei, Minlie Huang | <img width="1002" alt="image" src="https://github.com/microsoft/LMOps/blob/main/minillm/figures/method.png"> | Github <br> Paper |
<br>The Mamba in the Llama: Distilling and Accelerating Hybrid Models <br> Junxiong Wang, Daniele Paliotta, Avner May, Alexander M. Rush, Tri Dao | <img width="1002" alt="image" src="https://arxiv.org/html/2408.15237v1/x1.png"> | Github <br> Paper |
FIRST: Teach A Reliable Large Language Model Through Efficient Trustworthy Distillation <br> KaShun Shum, Minrui Xu, Jianshu Zhang, Zixin Chen, Shizhe Diao, Hanze Dong, Jipeng Zhang, Muhammad Omer Raza | <img width="1002" alt="image" src="https://arxiv.org/html/2408.12168v1/extracted/5806746/Figures/trustworthy.png"> | Paper |
Interactive DualChecker for Mitigating Hallucinations in Distilling Large Language Models <br> Meiyun Wang, Masahiro Suzuki, Hiroki Sakaji, Kiyoshi Izumi | <img width="1002" alt="image" src="https://arxiv.org/html/2408.12326v1/extracted/5806761/figs/intro.jpg"> | Paper |
Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models <br> Aviv Bick, Kevin Y. Li, Eric P. Xing, J. Zico Kolter, Albert Gu | <img width="1002" alt="image" src="https://arxiv.org/html/2408.10189v1/x1.png"> | Paper |
Concept Distillation from Strong to Weak Models via Hypotheses-to-Theories Prompting <br> Emmanuel Aboah Boateng, Cassiano O. Becker, Nabiha Asghar, Kabir Walia, Ashwin Srinivasan, Ehi Nosakhare, Victor Dibia, Soundar Srinivasan | <img width="1002" alt="image" src="https://arxiv.org/html/2408.09365v1/x2.png"> | Paper |
LaDiMo: Layer-wise Distillation Inspired MoEfier <br> Sungyoon Kim, Youngjun Kim, Kihyo Moon, Minsung Jang | <img width="1002" alt="image" src="https://arxiv.org/html/2408.04278v1/extracted/5780689/figures/moefier.png"> | Paper |
BOND: Aligning LLMs with Best-of-N Distillation <br> Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Nino Vieillard et al | <img width="1002" alt="image" src="figures/BOND.png"> | Paper |
Enhancing Data-Limited Graph Neural Networks by Actively Distilling Knowledge from Large Language Models <br> Quan Li, Tianxiang Zhao, Lingwei Chen, Junjie Xu, Suhang Wang | <img width="1002" alt="image" src="https://arxiv.org/html/2407.13989v1/x2.png"> | Paper |
DDK: Distilling Domain Knowledge for Efficient Large Language Models <br> Jiaheng Liu, Chenchen Zhang, Jinyang Guo, Yuanxing Zhang, Haoran Que, Ken Deng, Zhiqi Bai, Jie Liu, Ge Zhang, Jiakai Wang, Yanan Wu, Congnan Liu, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng | <img width="1002" alt="image" src="https://arxiv.org/html/2407.16154v1/x2.png"> | Paper |
Key-Point-Driven Mathematical Reasoning Distillation of Large Language Model <br> Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang | <img width="1002" alt="image" src="https://arxiv.org/html/2407.10167v1/x2.png"> | Paper |
Don't Throw Away Data: Better Sequence Knowledge Distillation <br> Jun Wang, Eleftheria Briakou, Hamid Dadkhahi, Rishabh Agarwal, Colin Cherry, Trevor Cohn | | Paper |
Multi-Granularity Semantic Revision for Large Language Model Distillation <br> Xiaoyu Liu, Yun Zhang, Wei Li, Simiao Li, Xudong Huang, Hanting Chen, Yehui Tang, Jie Hu, Zhiwei Xiong, Yunhe Wang | <img width="1002" alt="image" src="https://arxiv.org/html/2407.10068v1/x1.png"> | Paper |
BiLD: Bi-directional Logits Difference Loss for Large Language Model Distillation <br> Minchong Li, Feng Zhou, Xiaohui Song | <img width="1002" alt="image" src="https://arxiv.org/html/2406.13555v1/extracted/5678562/images/bild.jpg"> | Paper |
Quantization
Title & Authors | Introduction | Links |
---|---|---|
<br> :star: GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers <br> Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh | <img width="202" alt="image" src="figures/GPTQ.png"> | Github <br> Paper |
<br> :star: SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models <br> Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han | <img width="1002" alt="image" src="https://github.com/mit-han-lab/smoothquant/blob/main/figures/intuition.png"> | Github <br> Paper |
<br> :star: AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration <br> Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, Song Han | <img width="1002" alt="image" src="https://github.com/mit-han-lab/llm-awq/blob/main/figures/overview.png"> | Github <br> Paper |
<br> :star: OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models <br> Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo | <img width="1002" alt="image" src="figures/omniquant.png"> | Github <br> Paper |
<br> :star: SqueezeLLM: Dense-and-Sparse Quantization <br>Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer | <img width="1102" alt="image" src="figures/SqueezeLLM.png"> | Github <br> Paper |
<br> :star: Extreme Compression of Large Language Models via Additive Quantization <br> Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh | <img width="1002" alt="image" src="figures/MCQ.png"> | Github <br> Paper |
The Uniqueness of LLaMA3-70B with Per-Channel Quantization: An Empirical Study <br> Minghai Qin | <img width="1002" alt="image" src="https://arxiv.org/html/2408.15301v1/extracted/5797059/LaTeX/figures/llama3-70b-series-accuracy.png"> | Paper |
Matmul or No Matmul in the Era of 1-bit LLMs <br> Jinendra Malekar, Mohammed E. Elbtity, Ramtin Zand | <img width="1002" alt="image" src="https://arxiv.org/html/2408.11939v1/extracted/5805924/figures/matmul.png"> | Paper |
<br>MobileQuant: Mobile-friendly Quantization for On-device Language Models <br> Fuwen Tan, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, Brais Martinez | <img width="1002" alt="image" src="https://arxiv.org/html/2408.13933v1/x1.png"> | Github <br> Paper |
<br>ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models <br> Chao Zeng, Songwei Liu, Yusheng Xie, Hong Liu, Xiaojian Wang, Miao Wei, Shu Yang, Fangmin Chen, Xing Mei | <img width="1002" alt="image" src="figures/abq-llm.png"> | Github <br> Paper |
STBLLM: Breaking the 1-Bit Barrier with Structured Binary LLMs <br> Peijie Dong, Lujun Li, Dayou Du, Yuhan Chen, Zhenheng Tang, Qiang Wang, Wei Xue, Wenhan Luo, Qifeng Liu, Yike Guo, Xiaowen Chu | <img width="1002" alt="image" src="https://arxiv.org/html/2408.01803v1/extracted/5772020/pic/basic_block.png"> | Paper |
<br>Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance <br> Ao Shen, Qiang Wang, Zhiquan Lai, Xionglve Li, Dongsheng Li | <img width="1002" alt="image" src="figures/Q-BaRA.png"> | Github <br> Paper |
<br>Scalify: scale propagation for efficient low-precision LLM training <br> Paul Balança, Sam Hosegood, Carlo Luschi, Andrew Fitzgibbon | | Github <br> Paper |
<br>EfficientQAT: Efficient Quantization-Aware Training for Large Language Models <br> Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, Yu Qiao, Ping Luo | <img width="1002" alt="image" src="https://arxiv.org/html/2407.11062v1/x5.png"> | Github <br> Paper |
<br>LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices <br> Jung Hyun Lee, Jeonghoon Kim, June Yong Yang, Se Jung Kwon, Eunho Yang, Kang Min Yoo, Dongsoo Lee | <img width="1002" alt="image" src="https://arxiv.org/html/2407.11534v1/extracted/5734567/Figures/Fig_ablation_samplesize_flexround.png"> | Github <br> Paper |
<br>Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models <br> Ayush Kaushal, Tejas Pandey, Tejas Vaidhya, Aaryan Bhagat, Irina Rish | <img width="1002" alt="image" src="https://arxiv.org/html/2407.11722v1/x1.png"> | Github <br> Paper |
<br>Fast Matrix Multiplications for Lookup Table-Quantized LLMs <br> Han Guo, William Brandon, Radostin Cholakov, Jonathan Ragan-Kelley, Eric P. Xing, Yoon Kim | <img width="302" alt="image" src="https://arxiv.org/html/2407.10960v1/x1.png"> | Github <br> Paper |
LeanQuant: Accurate Large Language Model Quantization with Loss-Error-Aware Grid <br> Tianyi Zhang, Anshumali Shrivastava | <img width="1002" alt="image" src="https://arxiv.org/html/2407.10032v1/x2.png"> | Paper |
Prefixing Attention Sinks can Mitigate Activation Outliers for Large Language Model Quantization <br> Seungwoo Son, Wonpyo Park, Woohyun Han, Kyuyeun Kim, Jaeho Lee | <img width="1002" alt="image" src="https://arxiv.org/html/2406.12016v1/extracted/5669665/figures/mainfig.png"> | Paper |
<br>RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization <br> Xijie Huang, Zechun Liu, Shih-Yang Liu, Kwang-Ting Cheng | <img width="1002" alt="image" src="https://arxiv.org/html/2407.08044v1/x1.png"> | Github <br> Paper |
<br>FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation <br> Liqun Ma, Mingjie Sun, Zhiqiang Shen | <img width="1002" alt="image" src="https://github.com/LiqunMa/FBI-LLM/blob/main/figures/structure_and_training_procedure.png"> | Github <br> Paper |
<br>GPTQT: Quantize Large Language Models Twice to Push the Efficiency <br> Yipin Guo, Yilin Lang, Qinyuan Ren | <img width="1002" alt="image" src="https://arxiv.org/html/2407.02891v1/x1.png"> | Paper |
<br>T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge <br> Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, Mao Yang | <img width="1002" alt="image" src="https://arxiv.org/html/2407.00088v1/x2.png"> | Github <br> Paper |
<br>Variable Layer-Wise Quantization: A Simple and Effective Approach to Quantize LLMs <br> Razvan-Gabriel Dumitru, Vikas Yadav, Rishabh Maheshwary, Paul-Ioan Clotan, Sathwik Tejaswi Madhusudhan, Mihai Surdeanu | <img width="202" alt="image" src="https://arxiv.org/html/2406.17415v1/x1.png"> | Github <br> Paper |
CDQuant: Accurate Post-training Weight Quantization of Large Pre-trained Models using Greedy Coordinate Descent <br> Pranav Ajit Nair, Arun Sai Suggala | <img width="1002" alt="image" src="figures/CD.png"> | Paper |
SDQ: Sparse Decomposed Quantization for LLM Inference <br> Geonhwa Jeong, Po-An Tsai, Stephen W. Keckler, Tushar Krishna | <img width="1002" alt="image" src="https://arxiv.org/html/2406.13868v1/x3.png"> | Paper |
Attention-aware Post-training Quantization without Backpropagation <br> Junhan Kim, Ho-young Kim, Eulrang Cho, Chungman Lee, Joonyoung Kim, Yongkweon Jeon | <img width="1002" alt="image" src="https://arxiv.org/html/2406.13474v1/x1.png"> | Paper |
Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models <br> Dongwon Jo, Taesu Kim, Yulhwa Kim, Jae-Joon Kim | <img width="1002" alt="image" src="https://arxiv.org/html/2406.12311v1/x2.png"> | Paper |
<br>QQQ: Quality Quattuor-Bit Quantization for Large Language Models <br> Ying Zhang, Peng Zhang, Mincong Huang, Jingyang Xiang, Yujie Wang, Chao Wang, Yineng Zhang, Lei Yu, Chuan Liu, Wei Lin | <img width="202" alt="image" src="https://arxiv.org/html/2406.09904v1/x1.png"> | Github <br> Paper |
Inference Acceleration
Title & Authors | Introduction | Links |
---|---|---|
<br> :star: Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time <br> Zichang Liu, Jue WANG, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, Beidi Chen | <img width="202" alt="image" src="figures/DajeVu.png"> | Github <br> Paper |
<br> :star: SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification <br> Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia | <img width="600" alt="image" src="https://github.com/flexflow/FlexFlow/blob/inference/img/overview.png"> | Github <br> Paper |
<br> :star: Efficient Streaming Language Models with Attention Sinks <br> Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis | <img width="1002" alt="image" src="https://github.com/mit-han-lab/streaming-llm/blob/main/figures/schemes.png"> | Github <br> Paper |
<br>:star: EAGLE: Lossless Acceleration of LLM Decoding by Feature Extrapolation <br> Yuhui Li, Chao Zhang, and Hongyang Zhang | <img width="302" alt="image" src="https://github.com/SafeAILab/EAGLE/blob/main/figs/fig1.png"> | Github <br> Blog |
<br> :star: Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads <br> Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao | <img width="1002" alt="image" src="https://arxiv.org/html/2401.10774v1/x1.png"> | Github <br> Paper |
<br>Sirius: Contextual Sparsity with Correction for Efficient LLMs <br> Yang Zhou, Zhuoming Chen, Zhaozhuo Xu, Victoria Lin, Beidi Chen | <img width="1002" alt="image" src="https://infini-ai-lab.github.io/Sirius/static/images/methodsillustration.png"> | Github <br> Paper |
<br>OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs <br> Jintian Zhang, Cheng Peng, Mengshu Sun, Xiang Chen, Lei Liang, Zhiqiang Zhang, Jun Zhou, Huajun Chen, Ningyu Zhang | <img width="1002" alt="image" src="https://github.com/zjunlp/OneGen/blob/main/assets/train.jpg"> | Github <br> Paper |
Path-Consistency: Prefix Enhancement for Efficient Inference in LLM <br> Jiace Zhu, Yingtao Shen, Jie Zhao, An Zou | <img width="1002" alt="image" src="https://arxiv.org/html/2409.01281v1/x1.png"> | Paper |
Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation <br> Lujun Gui, Bin Xiao, Lei Su, Weipeng Chen | <img width="1002" alt="image" src="https://arxiv.org/html/2408.15562v1/extracted/5818109/structure_0.png"> | Paper |
Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling <br> Xianzhen Luo, Yixuan Wang, Qingfu Zhu, Zhiming Zhang, Xuanyu Zhang, Qing Yang, Dongliang Xu, Wanxiang Che | <img width="202" alt="image" src="https://arxiv.org/html/2408.08696v1/x1.png"> | Paper |
Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion <br> Jacob K Christopher, Brian R Bartoldson, Bhavya Kailkhura, Ferdinando Fioretto | <img width="1002" alt="image" src="https://arxiv.org/html/2408.05636v1/x1.png"> | Paper |
<br>Clover-2: Accurate Inference for Regressive Lightweight Speculative Decoding <br> Bin Xiao, Lujun Gui, Lei Su, Weipeng Chen | <img width="1002" alt="image" src="https://github.com/XiaoBin1992/clover/raw/v1/figs/structure.png"> | Github <br> Paper |
Accelerating Large Language Model Inference with Self-Supervised Early Exits <br> Florian Valade | | Paper |
An Efficient Inference Framework for Early-exit Large Language Models <br> Ruijie Miao, Yihan Yan, Xinshuo Yao, Tong Yang | | Paper |
<br>Inference acceleration for large language models using "stairs" assisted greedy generation <br> Domas Grigaliūnas, Mantas Lukoševičius | <img width="1002" alt="image" src="https://arxiv.org/html/2407.19947v1/extracted/5761251/assist_inf_2.png"> | Paper |
LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference <br> Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi | <img width="1002" alt="image" src="https://arxiv.org/html/2407.14057v1/x1.png"> | Paper |
Adaptive Draft-Verification for Efficient Large Language Model Decoding <br> Xukun Liu, Bowen Lei, Ruqi Zhang, Dongkuan Xu | <img width="1002" alt="image" src="https://arxiv.org/html/2407.12021v1/x1.png"> | Paper |
Multi-Token Joint Speculative Decoding for Accelerating Large Language Model Inference <br> Zongyue Qin, Ziniu Hu, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun | <img width="1002" alt="image" src="https://arxiv.org/html/2407.09722v1/x1.png"> | Paper |
<br>LiveMind: Low-latency Large Language Models with Simultaneous Inference <br> Chuangtao Chen, Grace Li Zhang, Xunzhao Yin, Cheng Zhuo, Ulf Schlichtmann, Bing Li | <img width="1002" alt="image" src="https://arxiv.org/html/2406.14319v1/x1.png"> | Github <br> Paper |
S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models <br> Parsa Kavehzadeh, Mohammadreza Pourreza, Mojtaba Valipour, Tinashu Zhu, Haoli Bai, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh | <img width="1002" alt="image" src="https://arxiv.org/html/2407.01955v1/x1.png"> | Paper |
Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers <br> Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu | <img width="1002" alt="image" src="https://arxiv.org/html/2406.16747v1/x1.png"> | Paper |
<br>EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees <br> Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang | <img width="1002" alt="image" src="https://arxiv.org/html/2406.16858v1/x4.png"> | Github <br> Paper |
Interpreting Attention Layer Outputs with Sparse Autoencoders <br> Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda | <img width="1002" alt="image" src="https://arxiv.org/html/2406.17759v1/x1.png"> | Paper |
Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention <br> Qianchao Zhu, Jiangfei Duan, Chang Chen, Siran Liu, Xiuhong Li, Guanyu Feng, Xin Lv, Huanqi Cao, Xiao Chuanfu, Xingcheng Zhang, Dahua Lin, Chao Yang | <img width="1002" alt="image" src="https://arxiv.org/html/2406.15486v1/x1.png"> | Paper |
<br>MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression <br> Tianyu Fu, Haofeng Huang, Xuefei Ning, Genghan Zhang, Boju Chen et al | <img width="1002" alt="image" src="https://github.com/thu-nics/MoA/blob/master/assets/workflow.png"> | Github <br> Paper |
Optimized Speculative Sampling for GPU Hardware Accelerators <br> Dominik Wagner, Seanie Lee, Ilja Baumann, Philipp Seeberger, Korbinian Riedhammer, Tobias Bocklet | <img width="1002" alt="image" src="https://arxiv.org/html/2406.11016v1/x1.png"> | Paper |
HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning <br> Heejun Lee, Geon Park, Youngwan Lee, Jina Kim, Wonyoung Jeong, Myeongjae Jeon, Sung Ju Hwang | <img width="1002" alt="image" src="https://arxiv.org/html/2406.09827v1/x1.png"> | Paper |
Efficient MOE
Title & Authors | Introduction | Links |
---|---|---|
<br>:star: Fast Inference of Mixture-of-Experts Language Models with Offloading <br> Artyom Eliseev, Denis Mazur | <img width="1002" alt="image" src="figures/mixtral_offloading.png"> | Github <br> Paper |
Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts <br> Zeliang Zhang, Xiaodong Liu, Hao Cheng, Chenliang Xu, Jianfeng Gao | <img width="1002" alt="image" src="https://arxiv.org/html/2407.09590v1/x3.png"> | Paper |
<br>Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs <br> Enshu Liu, Junyi Zhu, Zinan Lin, Xuefei Ning, Matthew B. Blaschko, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang | <img width="1002" alt="image" src="https://arxiv.org/html/2407.00945v1/extracted/5697370/Figures/use_case.png"> | Github <br> Paper |
<br>Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark <br> Pingzhi Li, Xiaolong Jin, Yu Cheng, Tianlong Chen | <img width="1002" alt="image" src="https://arxiv.org/html/2406.08155v1/x1.png"> | Github <br> Paper |
ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models <br> Jing Liu, Ruihao Gong, Mingyang Zhang, Yefei He, Jianfei Cai, Bohan Zhuang | <img width="1002" alt="image" src="https://arxiv.org/html/2406.09041v1/x1.png"> | Paper |
Efficient Architecture of LLM
Title & Authors | Introduction | Links |
---|---|---|
<br>:star: MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT <br> Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan | <img width="402" alt="image" src="https://github.com/mbzuai-oryx/MobiLlama/raw/main/images/mobillama_generation.gif"> | Github <br> Paper <br>Model |
<br>:star: Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length <br> Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, Chunting Zhou | <img width="1002" alt="image" src="figures/megalodon.png"> | Github <br> Paper |
SentenceVAE: Enable Next-sentence Prediction for Large Language Models with Faster Speed, Higher Accuracy and Longer Context <br> Hongjun An, Yifan Chen, Zhe Sun, Xuelong Li | <img width="1002" alt="image" src="https://arxiv.org/html/2408.00655v4/x2.png"> | Paper |
<br>Efficient LLM Training and Serving with Heterogeneous Context Sharding among Attention Heads <br> Xihui Lin, Yunan Zhang, Suyu Ge, Barun Patra, Vishrav Chaudhary, Xia Song | <img width="1002" alt="image" src="https://github.com/linxihui/dkernel/raw/main/assets/localstride.png"> | Github <br> Paper |
<br>Beyond KV Caching: Shared Attention for Efficient LLMs <br> Bingli Liao, Danilo Vasconcellos Vargas | <img width="1002" alt="image" src="https://arxiv.org/html/2407.12866v1/x1.png"> | Github <br> Paper |
KV Cache Compression
Title & Authors | Introduction | Links |
---|---|---|
:star: Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs <br> Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao | <img width="1002" alt="image" src="figures/FastGen.png"> | Paper |
A First Look At Efficient And Secure On-Device LLM Inference Against KV Leakage <br> Huan Yang, Deyu Zhang, Yudong Zhao, Yuanchun Li, Yunxin Liu | <img width="1002" alt="image" src="https://arxiv.org/html/2409.04040v1/x3.png"> | Paper |
<br>Post-Training Sparse Attention with Double Sparsity <br> Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng | <img width="302" alt="image" src="https://github.com/andy-yang-1/DoubleSparse/raw/main/assets/double-sparsity-gif-v2.gif"> | Github <br> Paper |
<br>Eigen Attention: Attention in Low-Rank Space for KV Cache Compression <br> Utkarsh Saxena, Gobinda Saha, Sakshi Choudhary, Kaushik Roy | <img width="1002" alt="image" src="https://arxiv.org/html/2408.05646v1/x1.png"> | Github <br> Paper |
Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference <br> Zeyu Zhang, Haiying Shen | <img width="1002" alt="image" src="https://arxiv.org/html/2408.04107v1/x15.png"> | Paper |
Finch: Prompt-guided Key-Value Cache Compression <br> Giulio Corallo, Paolo Papotti | <img width="1002" alt="image" src="https://arxiv.org/html/2408.00167v1/extracted/5763688/assets/diagram_finch.png"> | Paper |
<br>Palu: Compressing KV-Cache with Low-Rank Projection <br> Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Kai-Chiang Wu | <img width="1002" alt="image" src="https://github.com/shadowpa0327/Palu/blob/master/img/palu_idea.png"> | Github <br> Paper |
ThinK: Thinner Key Cache by Query-Driven Pruning <br> Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo | <img width="1002" alt="image" src="https://arxiv.org/html/2407.21018v1/x1.png"> | Paper |
RazorAttention: Efficient KV Cache Compression Through Retrieval Heads <br> Hanlin Tang, Yang Lin, Jing Lin, Qingsen Han, Shikuan Hong, Yiwu Yao, Gongyi Wang | <img width="1002" alt="image" src="https://arxiv.org/html/2407.15891v1/x3.png"> | Paper |
PQCache: Product Quantization-based KVCache for Long Context LLM Inference <br> Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, Bin Cui | <img width="1002" alt="image" src="https://arxiv.org/html/2407.12820v1/extracted/5702744/Figures/transformer.png"> | Paper |
<br>GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression <br> Daniel Goldstein, Fares Obeid, Eric Alcaide, Guangyu Song, Eugene Cheah | <img width="202" alt="image" src="https://github.com/recursal/GoldFinch-paper/raw/main/assets/architecture.png"> | Github <br> Paper |
<br>Efficient Sparse Attention needs Adaptive Token Release <br> Chaoran Zhang, Lixin Zou, Dan Luo, Min Tang, Xiangyang Luo, Zihao Li, Chenliang Li | <img width="1002" alt="image" src="https://arxiv.org/html/2407.02328v1/x1.png"> | Github <br> Paper |
<br>KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches <br> Jiayi Yuan, Hongyi Liu, Shaochen (Henry) Zhong, Yu-Neng Chuang, Songchen Li et al | <img width="1002" alt="image" src="figures/longctx_bench.png"> | Github <br> Paper |
<br>MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding <br> Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji | <img width="1002" alt="image" src="https://arxiv.org/html/2406.09297v1/extracted/5665367/resources/mlkv-All_KV.png"> | Github <br> Paper |
Text Compression
Title & Authors | Introduction | Links |
---|---|---|
<br>:star: LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models <br> Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu | <img width="1002" alt="image" src="https://github.com/microsoft/LLMLingua/blob/main/images/LLMLingua_framework.png"> | Github <br> Paper |
<br>:star: LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression <br> Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu | <img width="1002" alt="image" src="figures/longllmlingua.png"> | Github <br> Paper |
Efficient LLM Context Distillation <br> Rajesh Upadhayayaya, Zachary Smith, Christopher Kottmyer, Manish Raj Osti | | Paper |
<br>Enhancing and Accelerating Large Language Models via Instruction-Aware Contextual Compression <br> Haowen Hou, Fei Ma, Binwen Bai, Xinxin Zhu, Fei Yu | <img width="1002" alt="image" src="https://arxiv.org/html/2408.15491v1/extracted/5817813/arch.png"> | Github <br> Paper |
<br>500xCompressor: Generalized Prompt Compression for Large Language Models <br> Zongqian Li, Yixuan Su, Nigel Collier | <img width="1002" alt="image" src="https://arxiv.org/html/2408.03094v1/extracted/5776907/Figures/0-1.png"> | Github <br> Paper |
<br>QUITO: Accelerating Long-Context Reasoning through Query-Guided Context Compression <br> Wenshan Wang, Yihang Wang, Yixing Fan, Huaming Liao, Jiafeng Guo | <img width="1002" alt="image" src="https://github.com/Wenshansilvia/attention_compressor/blob/main/assets/method.png"> | Github <br> Paper |
<br>Characterizing Prompt Compression Methods for Long Context Inference <br> Siddharth Jha, Lutfi Eren Erdogan, Sehoon Kim, Kurt Keutzer, Amir Gholami | <img width="1002" alt="image" src="https://arxiv.org/html/2407.08892v1/x3.png"> | Paper |
Entropy Law: The Story Behind Data Compression and LLM Performance <br> Mingjia Yin, Chuhan Wu, Yufei Wang, Hao Wang, Wei Guo, Yasheng Wang, Yong Liu, Ruiming Tang, Defu Lian, Enhong Chen | <img width="1002" alt="image" src="https://arxiv.org/html/2407.06645v1/x1.png"> | Paper |
PromptIntern: Saving Inference Costs by Internalizing Recurrent Prompt during Large Language Model Fine-tuning <br> Jiaru Zou, Mengyu Zhou, Tao Li, Shi Han, Dongmei Zhang | <img width="1002" alt="image" src="https://arxiv.org/html/2407.02211v1/x2.png"> | Paper |
Brevity is the soul of wit: Pruning long files for code generation <br> Aaditya K. Singh, Yu Yang, Kushal Tirumala, Mostafa Elhoushi, Ari S. Morcos | <img width="1002" alt="image" src="https://arxiv.org/html/2407.00434v1/x1.png"> | Paper |
Low-Rank Decomposition
Title & Authors | Introduction | Links |
---|---|---|
MoDeGPT: Modular Decomposition for Large Language Model Compression <br> Chi-Heng Lin, Shangqian Gao, James Seale Smith, Abhishek Patel, Shikhar Tuli, Yilin Shen, Hongxia Jin, Yen-Chang Hsu | <img width="1002" alt="image" src="https://arxiv.org/html/2408.09632v1/x2.png"> | Paper |
MCNC: Manifold Constrained Network Compression <br> Chayne Thrash, Ali Abbasi, Parsa Nooralinejad, Soroush Abbasi Koohpayegani, Reed Andreas, Hamed Pirsiavash, Soheil Kolouri | <img width="1002" alt="image" src="https://arxiv.org/html/2406.19301v1/x1.png"> | Paper |
Hardware / System
Title & Authors | Introduction | Links |
---|---|---|
<br>OPAL: Outlier-Preserved Microscaling Quantization Accelerator for Generative Large Language Models <br> Jahyun Koo, Dahoon Park, Sangwoo Jung, Jaeha Kung | <img width="1002" alt="image" src="https://arxiv.org/html/2409.05902v1/x5.png"> | Paper |
Accelerating Large Language Model Training with Hybrid GPU-based Compression <br> Lang Xu, Quentin Anthony, Qinghua Zhou, Nawras Alnaasan, Radha R. Gulhane, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda | <img width="1002" alt="image" src="https://arxiv.org/html/2409.02423v1/extracted/5832005/Figures/mzhybrid-3d-rev.png"> | Paper |
LUT Tensor Core: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration <br> Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, Mao Yang | <img width="1002" alt="image" src="https://arxiv.org/html/2408.06003v1/x5.png"> | Paper |
Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference <br> Rohan Baskar Prabhakar, Hengrui Zhang, David Wentzlaff | <img width="1002" alt="image" src="https://arxiv.org/html/2408.07802v2/x2.png"> | Paper |
SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving <br> Andreas Kosmas Kakolyris, Dimosthenis Masouros, Petros Vavaroutsos, Sotirios Xydis, Dimitrios Soudris | <img width="1002" alt="image" src="https://arxiv.org/html/2408.05235v1/x16.png"> | Paper |
Designing Efficient LLM Accelerators for Edge Devices <br> Jude Haris, Rappy Saha, Wenhao Hu, José Cano | <img width="1002" alt="image" src="https://arxiv.org/html/2408.00462v1/extracted/5768368/files/SECDA_meth.png"> | Paper |
PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation <br> Branden Butler, Sixing Yu, Arya Mazaheri, Ali Jannesari | <img width="1002" alt="image" src="https://arxiv.org/html/2407.11798v1/x1.png"> | Paper |
<br>FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision <br> Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao | <img width="1002" alt="image" src="figures/flashattention3.png"> | Github <br> Paper <br> Blog |
Preble: Efficient Distributed Prompt Scheduling for LLM Serving <br> Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang | <img width="1002" alt="image" src="figures/preble.png"> | Paper |
<br>EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting <br> Zhongzhi Yu, Zheng Wang, Yuhan Li, Haoran You, Ruijie Gao, Xiaoya Zhou, Sreenidhi Reedy Bommu, Yang Katie Zhao, Yingyan Celine Lin | <img width="1002" alt="image" src="https://github.com/GATECH-EIC/Edge-LLM/blob/main/images/Edge-LLM-overview.png"> | Github <br> Paper |
<br>Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization <br> Jungi Lee, Wonbeom Lee, Jaewoong Sim | <img width="1002" alt="image" src="https://arxiv.org/html/2406.12930v1/x4.png"> | Paper |
Tuning
Title & Authors | Introduction | Links |
---|---|---|
Tensor Train Low-rank Approximation (TT-LoRA): Democratizing AI with Accelerated LLMs <br> Afia Anjum, Maksim E. Eren, Ismael Boureima, Boian Alexandrov, Manish Bhattarai | <img width="1002" alt="image" src="https://arxiv.org/html/2408.01008v1/x7.png"> | Paper |
Code Less, Align More: Efficient LLM Fine-tuning for Code Generation with Data Pruning <br> Yun-Da Tsai, Mingjie Liu, Haoxing Ren | <img width="1002" alt="image" src="https://arxiv.org/html/2407.05040v1/x1.png"> | Paper |
<br>PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs <br> Dan Peng, Zhihui Fu, Jun Wang | | Paper |
<br>Increasing Model Capacity for Free: A Simple Strategy for Parameter Efficient Fine-tuning <br> Haobo Song, Hao Zhao, Soumajit Majumder, Tao Lin | <img width="1002" alt="image" src="https://arxiv.org/html/2407.01320v1/x2.png"> | Github <br> Paper |
Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead <br> Rickard Brüel-Gabrielsson, Jiacheng Zhu, Onkar Bhardwaj, Leshem Choshen et al | <img width="1002" alt="image" src="https://arxiv.org/html/2407.00066v1/x1.png"> | Paper |
BlockLLM: Memory-Efficient Adaptation of LLMs by Selecting and Optimizing the Right Coordinate Blocks <br> Amrutha Varshini Ramesh, Vignesh Ganapathiraman, Issam H. Laradji, Mark Schmidt | <img width="1002" alt="image" src="https://arxiv.org/html/2406.17296v1/x3.png"> | Paper |
Survey
Title & Authors | Introduction | Links |
---|---|---|
Hardware Acceleration of LLMs: A comprehensive survey and comparison <br> Nikoletta Koilia, Christoforos Kachris | | Paper |
A Survey on Symbolic Knowledge Distillation of Large Language Models <br> Kamal Acharya, Alvaro Velasquez, Houbing Herbert Song | <img width="1002" alt="image" src="https://arxiv.org/html/2408.10210v1/extracted/5727556/Images/DirectDistillation.png"> | Paper |
<br>Inference Optimization of Foundation Models on AI Accelerators <br> Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis | | Paper |
Survey on Knowledge Distillation for Large Language Models: Methods, Evaluation, and Application <br> Chuanpeng Yang, Wang Lu, Yao Zhu, Yidong Wang, Qian Chen, Chenlong Gao, Bingjie Yan, Yiqiang Chen | <img width="1002" alt="image" src="https://arxiv.org/html/2407.01885v1/extracted/5702255/1.png"> | Paper |
Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference <br> Christopher Wolters, Xiaoxuan Yang, Ulf Schlichtmann, Toyotaro Suzumura | <img width="1002" alt="image" src="figures/CIM.png"> | Paper |