
<div align="center"> <h2><img src="assets/logo.png" height="28px"/><i>Unlocking Efficiency in Large Language Model Inference:</i><br>A Comprehensive Survey of Speculative Decoding</h2> </div> <div align="center"> <b>Heming Xia</b><sup>1</sup>, <b>Zhe Yang</b><sup>2</sup>, <b>Qingxiu Dong</b><sup>2</sup>, <b>Peiyi Wang</b><sup>2</sup>, <b>Yongqi Li</b><sup>1</sup>, <b>Tao Ge</b><sup>3</sup>, <b>Tianyu Liu</b><sup>4</sup>, <b>Wenjie Li</b><sup>1</sup>, <b>Zhifang Sui</b><sup>2</sup> </div> <div align="center"> <sup>1</sup>Department of Computing, The Hong Kong Polytechnic University </div> <div align="center"> <sup>2</sup>National Key Laboratory for Multimedia Information Processing, Peking University </div> <div align="center"> <sup>3</sup>Microsoft Research Asia <sup>4</sup>Alibaba Group </div>

*Figure: timeline of speculative decoding research.*

This repository contains a regularly updated list of papers on speculative decoding.
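
As context for the list, nearly all of these methods follow the same draft-then-verify loop: a cheap drafter proposes several future tokens, and the target LLM checks them all in one forward pass, keeping the longest accepted prefix plus one bonus token. The sketch below is a minimal greedy-acceptance toy, not any particular paper's algorithm; `target` and `draft` stand for any two Hugging Face-style causal LMs that return `.logits`.

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, ids, k=4):
    """One draft-then-verify step with greedy (exact-match) acceptance.

    ids: (1, n) prompt/prefix token ids. Returns ids extended by the
    accepted draft tokens plus one token from the target model.
    """
    n = ids.shape[1]
    # 1) Draft: the small model proposes k tokens autoregressively.
    draft_ids = ids
    for _ in range(k):
        next_tok = draft(draft_ids).logits[:, -1].argmax(-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_tok], dim=-1)
    proposed = draft_ids[:, n:]                              # (1, k)
    # 2) Verify: one target forward pass scores every draft position.
    logits = target(draft_ids).logits                        # (1, n + k, V)
    target_choice = logits[:, n - 1 : n + k - 1].argmax(-1)  # (1, k)
    # 3) Accept the longest prefix where draft and target agree.
    m = int((proposed == target_choice).long().cumprod(-1).sum())
    # 4) The target's own prediction after the accepted prefix comes free.
    bonus = logits[:, n + m - 1].argmax(-1, keepdim=True)    # (1, 1)
    return torch.cat([ids, proposed[:, :m], bonus], dim=-1)
```

With greedy acceptance the result is identical to decoding the target alone; the sampling-based methods surveyed here replace the exact-match test with a probabilistic accept/reject rule that preserves the target distribution.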


## Content

- Keywords Convention
  - Abbreviation
  - Conference
  - Drafting Methods in Speculative Decoding
  - Main Features
- Papers
  - Survey
  - Speculative Decoding for Seq2Seq
  - Speculative Decoding for LLMs
  - Multimodal Speculative Decoding
  - Long-Context Speculative Decoding
  - Alignment
  - Benchmarks
  - Applications
  - Analysis
  - Other Techniques
- Blog & Project
- Contributors
- Contributing to this paper list
- Citation

## Blog & Project

- **Assisted Generation: a new direction toward low-latency text generation.** Hugging Face. 2023.05. [Blog] [Code] (usage sketch after this list)

- **Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads.** Princeton, UIUC. 2023.09. [Blog] [Code]

- **An Optimal Lossy Variant of Speculative Decoding.** Unsupervised Thoughts (blog). 2023.09. [Blog] [Code]

- **Break the Sequential Dependency of LLM Inference Using Lookahead Decoding.** LMSYS. 2023.11. [Blog] [Code]

- **Accelerating Generative AI with PyTorch II: GPT, Fast.** PyTorch. 2023.11. [Blog] [Code]

- **Prompt Lookup Decoding.** Apoorv Saxena. 2023.11. [Code] [Colab] (usage sketch after this list)

- **REST: Retrieval-Based Speculative Decoding.** Peking University, Princeton University. 2023.11. [Blog] [Code]

- **EAGLE: Lossless Acceleration of LLM Decoding by Feature Extrapolation.** Vector Institute, University of Waterloo, Peking University. 2023.12. [Blog] [Code]

- **SEQUOIA: Serving exact Llama2-70B on an RTX4090 with half-second per token latency.** Carnegie Mellon University, Together AI, Yandex, Meta AI. 2024.02. [Blog] [Code]

- **The Mamba in the Llama: Distilling and Accelerating Hybrid Models.** Together AI. 2024.09. [Blog] [Code]

- **How Speculative Decoding Boosts vLLM Performance by up to 2.8x.** vLLM Team. 2024.10. [Blog]
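
The Assisted Generation entry above is a one-argument change in `transformers`: pass a smaller compatible model as `assistant_model` to `generate`. A minimal usage sketch; the OPT checkpoints are illustrative stand-ins for any large/small pair that shares a tokenizer.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
target = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
draft = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # drafter

inputs = tok("Speculative decoding speeds up inference by", return_tensors="pt")
# `assistant_model` switches generate() to assisted (speculative) generation.
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
```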
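
Prompt Lookup Decoding drops the draft model entirely: the most recent n-gram of the generated text is searched for in the prompt, and the tokens that followed it there are proposed as the draft. `transformers` integrates this via the `prompt_lookup_num_tokens` argument of `generate`; the model and prompt below are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

# Most effective on input-grounded tasks (summarization, editing, document QA)
# where the output copies long spans of the prompt.
prompt = "Repeat the sentence exactly: speculative decoding accelerates inference."
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, prompt_lookup_num_tokens=10, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
```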

## Contributors

<a href="https://github.com/hemingkx/SpeculativeDecodingPapers/graphs/contributors"> <img src="https://contrib.rocks/image?repo=hemingkx/SpeculativeDecodingPapers" /> </a>

## Contributing to this paper list

Contributions are welcome: feel free to open a pull request or an issue to add a paper, fix an error, or suggest a new category.

## Citation

If you find the resources in this repository useful, please cite our paper:

@inproceedings{xia-etal-2024-unlocking,
    title = "Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding",
    author = "Xia, Heming and Yang, Zhe and Dong, Qingxiu and Wang, Peiyi and Li, Yongqi  and Ge, Tao and Liu, Tianyu and Li, Wenjie and Sui, Zhifang",
    editor = "Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.456",
    doi = "10.18653/v1/2024.findings-acl.456",
    pages = "7655--7671",
}