<img src="assets/logo.png" alt="SWIFT" width="100" align="left"><div align="center"><h1> SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration</h1></div>

<p align="center"> <a href="https://arxiv.org/abs/2410.06916"> <img src="https://img.shields.io/badge/Arxiv-2410.06916-orange.svg"></a> <a href="https://opensource.org/licenses/Apache-2.0"> <img src="https://img.shields.io/badge/License-Apache_2.0-green.svg"></a> <a href="https://github.com/hemingkx/SWIFT/pulls"> <img src="https://img.shields.io/badge/Contributions-welcome-blue.svg?style=flat"></a> </p>

Introduction

SWIFT is an on-the-fly self-speculative decoding algorithm that adaptively selects intermediate layers of LLMs to skip during inference. This method does not require auxiliary models or additional training, making it a plug-and-play and cost-effective solution for accelerating LLM inference.
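As a rough illustration of the general draft-then-verify idea behind self-speculative decoding, the sketch below drafts tokens with some layers skipped and checks them against the full model. It is a toy under stated assumptions, not the SWIFT implementation: the "model" is a hypothetical stack of per-token logit tables, and `build_toy_layers`, `next_token`, `vanilla_decode`, and `self_speculative_decode` are illustrative names that do not come from this repository.

```python
import random

VOCAB, NUM_LAYERS = 16, 8

def build_toy_layers(seed=0):
    """Hypothetical stand-in for an LLM: each layer adds per-token logits.
    Odd-numbered layers contribute very little, so skipping them is cheap."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0 if i % 2 == 0 else 0.05) for _ in range(VOCAB)]
         for _ in range(VOCAB)]
        for i in range(NUM_LAYERS)
    ]

def next_token(layers, token, skipped=frozenset()):
    """Greedy next token, summing logits over the layers that are not skipped."""
    logits = [0.0] * VOCAB
    for i, layer in enumerate(layers):
        if i not in skipped:
            for v, x in enumerate(layer[token]):
                logits[v] += x
    return max(range(VOCAB), key=logits.__getitem__)

def vanilla_decode(layers, start, n):
    """Plain autoregressive greedy decoding with the full model."""
    out = [start]
    while len(out) < n + 1:
        out.append(next_token(layers, out[-1]))
    return out

def self_speculative_decode(layers, start, n, skipped, draft_len=4):
    """Draft with skipped layers, verify with the full model (greedy)."""
    out, accepted, drafted = [start], 0, 0
    while len(out) < n + 1:
        # Draft phase: cheap passes that skip the chosen layers.
        draft, tok = [], out[-1]
        for _ in range(draft_len):
            tok = next_token(layers, tok, skipped)
            draft.append(tok)
        drafted += len(draft)
        # Verify phase: a real LLM checks all drafted tokens in one parallel
        # forward pass; this toy checks them one by one.
        prev = out[-1]
        for d in draft:
            target = next_token(layers, prev)
            out.append(target)  # equals d when accepted, else the correction
            if target != d:
                break
            accepted += 1
            prev = target
    return out[: n + 1], accepted / drafted

layers = build_toy_layers()
tokens, acceptance = self_speculative_decode(layers, 0, 32, frozenset({1, 3}))
print(tokens == vanilla_decode(layers, 0, 32), round(acceptance, 2))
```

Because every emitted token is confirmed or corrected by the full model, the output matches plain greedy decoding; the skipped layers only affect how many drafted tokens are accepted, and therefore the speed.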

SWIFT divides LLM inference into two distinct phases:

During the optimization stage, SWIFT performs an optimization step prior to each LLM decoding step to adjust the skipped layer set, which involves: a) Efficient layer set optimization. SWIFT integrates random search with interval Bayesian optimization to propose layer set candidates efficiently; b) Parallel candidate evaluation. SWIFT uses LLM-generated tokens as ground truth, enabling simultaneous validation of the proposed candidates. The best-performing layer set is selected to accelerate the current decoding step.
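The optimization step can be pictured with a minimal sketch, assuming a toy model and plain random search only; the interval Bayesian optimization component is omitted, and candidates are scored sequentially rather than in parallel. All names here (`build_toy_layers`, `next_token`, `matchness`, `random_search`) are hypothetical and are not part of the SWIFT codebase.

```python
import random

VOCAB, NUM_LAYERS = 16, 8

def build_toy_layers(seed=0):
    """Hypothetical stand-in for an LLM: each layer adds per-token logits.
    Odd-numbered layers contribute very little to the final prediction."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0 if i % 2 == 0 else 0.05) for _ in range(VOCAB)]
         for _ in range(VOCAB)]
        for i in range(NUM_LAYERS)
    ]

def next_token(layers, token, skipped=frozenset()):
    """Greedy next token, summing logits over the layers that are not skipped."""
    logits = [0.0] * VOCAB
    for i, layer in enumerate(layers):
        if i not in skipped:
            for v, x in enumerate(layer[token]):
                logits[v] += x
    return max(range(VOCAB), key=logits.__getitem__)

def matchness(layers, reference, skipped):
    """Fraction of reference tokens that the layer-skipped model reproduces.
    The reference tokens were generated by the full model, so they serve
    as ground truth and no labeled data is needed."""
    pairs = list(zip(reference, reference[1:]))
    hits = sum(next_token(layers, prev, skipped) == nxt for prev, nxt in pairs)
    return hits / len(pairs)

def random_search(layers, reference, num_skip, trials=20, seed=0):
    """Propose random layer sets of a fixed size and keep the best-scoring one."""
    rng = random.Random(seed)
    best_set, best_score = frozenset(), -1.0
    for _ in range(trials):
        candidate = frozenset(rng.sample(range(NUM_LAYERS), num_skip))
        score = matchness(layers, reference, candidate)
        if score > best_score:
            best_set, best_score = candidate, score
    return best_set, best_score

# Tokens already generated by the full model act as the reference sequence.
layers = build_toy_layers()
reference = [0]
for _ in range(32):
    reference.append(next_token(layers, reference[-1]))

best_set, best_score = random_search(layers, reference, num_skip=3)
print(sorted(best_set), round(best_score, 2))
```

The design point this illustrates is that scoring a candidate needs only tokens the full model has already produced, which is why the search can run on the fly during inference without extra training data.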

Installation

conda create -n swift python=3.9
conda activate swift
cd SWIFT
pip install -r requirements.txt

Inference

Run the commands in eval_llama.sh; the results will be stored in outputs/.../model_answer/.

./eval_llama.sh

For a quick start with a cached layer configuration, uncomment --cache-hit in eval_llama.sh.

Speedup Report

Compute the speedup of SWIFT relative to vanilla autoregressive decoding:

python evaluation_llama/speed.py --file-path /your_own_path/swift.jsonl --base-path /your_own_path/llama_vanilla.jsonl
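For intuition about what a speedup figure means, here is a minimal sketch that assumes speedup is the ratio of decoding throughput (generated tokens per second) between the two runs. The helper names and the numbers are made up for illustration; this does not reproduce the logic of speed.py or the layout of its .jsonl inputs.

```python
def tokens_per_second(runs):
    """runs: list of (new_tokens, wall_time_seconds) pairs, one per sample."""
    total_tokens = sum(tokens for tokens, _ in runs)
    total_time = sum(seconds for _, seconds in runs)
    return total_tokens / total_time

def speedup(method_runs, baseline_runs):
    """Throughput of the accelerated method divided by the baseline's."""
    return tokens_per_second(method_runs) / tokens_per_second(baseline_runs)

# Made-up measurements: same number of tokens, different wall-clock time.
swift_runs = [(128, 2.0), (256, 4.0)]
vanilla_runs = [(128, 4.0), (256, 8.0)]
print(speedup(swift_runs, vanilla_runs))
```

A value above 1 means the accelerated run produced tokens faster than the baseline. Averaging per-sample speeds instead of pooling totals is another reasonable convention and can give slightly different numbers.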

Acknowledgments

This codebase is built on Self-SD and EAGLE. The logo was designed by GPT-4.

Citation

If you find the resources in this repository useful, please cite our paper:

@misc{xia2024swiftontheflyselfspeculativedecoding,
      title={SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration}, 
      author={Heming Xia and Yongqi Li and Jun Zhang and Cunxiao Du and Wenjie Li},
      year={2024},
      eprint={2410.06916},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.06916}, 
}