Awesome
<div align="center"><h1> PEARL: Parallel Speculative Decoding with Adaptive Draft Length</h1></div> <p align="center"> | <a href="https://arxiv.org/pdf/2408.11850"><b>Paper </b></a> | <a href="https://pearl-code.github.io/"><b>Blog</b></a> | </p> --- <center> <img style="border-radius: 0.3125em; box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);" src="https://s2.loli.net/2024/08/13/u3tc4FAwxQG126y.png" width = "100%" alt=""/> <br> <div style="color:orange; border-bottom: 1px solid #d9d9d9; display: inline-block; color: #999; padding: 2px;"> Figure 1. Speedup on HumanEval. All the experiments are conducted with H100 80G GPUs. The part results of <a href="https://github.com/thunlp/Ouroboros">Ouroboros</a> and <a href="https://github.com/hao-ai-lab/LookaheadDecoding">Lookahead Decoding</a> are reproduced with their official codes. </div> </center> <br><br> <!-- Using HTML to center the abstract -->TL; DR: we introduce PEARL (Parallel spEculative decoding with Adaptive dRaft Length) to further reduce the inference latency of Large Language Models (LLMs). PEARL is a parallel inference framework based on speculative decoding which utilizes pre-verify and post-verify to achieve adaptive draft length. In summary, our PEARL is:
- 🔥 up to 3.87$\times$, 3.81$\times$, 3.59$\times$ and 3.95$\times$ on HumanEval, GSM8K, MT-bench and MGSM, respectively.
- provably lossless
- training-free, and does not need additional memory
- 🔥 can be applied to any algorithms based on draft-then-verify framework, such as EAGLE and Medusa
- 🔥 Eliminating the burden of searching the optimal draft length, together with a larger expectation of accepted tokens.
<br> <div class="columns is-centered has-text-centered"> <div class="column is-four-fifths"> <h2>Demo</h2> <div class="content has-text-justified"> </div> </div> </div> <p align="center" style="color:gray;">Figure 2. Generation speed of Llama 2 chat 70B using PEARL and auto-regressive decoding, with inference conducted on A100 80G GPUs at bf16 precision. </p>
<br> <div class="columns is-centered has-text-centered"> <div class="column is-four-fifths"> <h2>Overview of PEARL</h2> <div class="content has-text-justified"> </div> </div> </div>
Our PEARL framework consists of a draft model, a target model and two strategies to decode tokens. The two strategies are switched according to the verification results in the last decoding step.
<center> <img style="border-radius: 0.3125em; box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);" src="https://s2.loli.net/2024/08/13/aoCAybN7S2KWsXd.png" width = "100%" alt=""/> <br> <div style="color:orange; border-bottom: 1px solid #d9d9d9; display: inline-block; color: #999; padding: 2px;"> Figure 3. Overview of PEARL. PEARL achieves parallelism through adaptively using pre-verify and post-verify. </div> </center>preparation
Follow the instructions below to prepare for reproducing the results in the paper.
- experimental environment:
sh install.sh
will install the necessary packages in the project. - code changes: changes the code
src/util.py
line 31-38 and line 49, to fill in your model paths and data paths.
reproduction
All the running scripts, including scripts for auto-regress decoding, vanilla speculative decoding, parallel speculative decoding, comparison, ablation studies and case studies. These scripts can be directly executed for reproduction.
sh scripts/run_para_sd.sh
Examples
You can try this code with a simple command:
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --num_processes 2 benchmark/eval_humaneval.py --eval_mode para_sd --gamma 5 -n 1 -e H_PSD_codellama_7_70b --draft_model codellama-7b --target_model codellama-70b --max_tokens 1024 --temp 0
With UI
We have provided a suggested web interface, which you can use by running the following command.
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --num_processes 2 applications --eval_mode para_sd --gamma 5 -n 1 -e applications --draft_model codellama-7b --target_model codellama-70b --max_tokens 1024 --temp 0
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2>Citation</h2>
<div class="content has-text-justified">
</div>
</div>
</div>
If you find our work useful your research, please cite our paper:
@misc{liu2024parallelspeculativedecodingadaptive,
title={Parallel Speculative Decoding with Adaptive Draft Length},
author={Tianyu Liu and Yun Li and Qitan Lv and Kai Liu and Jianchen Zhu and Winston Hu},
year={2024},
eprint={2408.11850},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2408.11850},
}