

Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding

This is the official repository for Smart Parallel Auto-Correct Decoding (SPACE), a novel approach that accelerates LLM inference by integrating semi-autoregressive inference and draft-then-verify capabilities.

Fig. 1: A visual comparison between conventional autoregressive (AR) inference (left) and SPACE inference (right). In AR inference, tokens are generated sequentially, with only one token output per decoding step. In SPACE inference, the input token sequence (i.e., "LLMs are") is augmented with k+1 groups of mask tokens and k candidate tokens (i.e., "auto" and "model"). After a single model invocation, the candidate tokens are verified to obtain the accepted tokens (i.e., "auto" and "regressive"), and k new candidate tokens (i.e., "model" and "<s>") are generated from one of the mask groups. SPACE therefore generates a variable number of tokens per step, from a minimum of 1 to a maximum of k+1.
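The verification step in the caption can be illustrated with a minimal sketch. It assumes greedy decoding and a simple longest-matching-prefix acceptance rule; the function name `verify_candidates` and the token ids are hypothetical and are not taken from this repository's code.

```python
from typing import List, Sequence


def verify_candidates(candidates: Sequence[int], predictions: Sequence[int]) -> List[int]:
    """Greedy draft-then-verify acceptance (illustrative assumption, not the repo's code).

    candidates:  the k draft tokens proposed in the previous step.
    predictions: k+1 tokens predicted in the current step, where predictions[i]
                 is the model's next token given the prefix followed by
                 candidates[:i].
    Returns the accepted tokens, between 1 and k+1 of them.
    """
    if len(predictions) != len(candidates) + 1:
        raise ValueError("expected k candidates and k+1 predictions")

    accepted: List[int] = []
    for i, cand in enumerate(candidates):
        # The model's own prediction at this position is always valid output.
        accepted.append(predictions[i])
        if predictions[i] != cand:
            # The draft diverged here, so later predictions were conditioned
            # on a wrong prefix and must be discarded.
            return accepted
    # Every candidate matched: the final prediction is a bonus token.
    accepted.append(predictions[-1])
    return accepted


# Made-up token ids mirroring Fig. 1: 1 = "auto", 2 = "model", 3 = "regressive".
print(verify_candidates([1, 2], [1, 3, 9]))  # [1, 3]
```

With k = 2 candidates, the first candidate is confirmed and the second is corrected, so two tokens are accepted in one step; a full match would yield k+1 = 3 tokens and an immediate mismatch would yield 1.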

Main Result: result

Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding
Hanling Yi, Feng Lin, Hongbin Li, Peiyang Ning, Xiaotian Yu, Rong Xiao

Setting Up Environment

Install Dependencies

pip install -r requirements.txt


We use LLaMa-2-7B as the base model for SPACE training in the example.

Download the LLaMa-2-7B checkpoint, then set model_name_or_path in run_sft_multi_node.sh to its location. Run the following command to start training on one machine with 8 GPUs:

bash run_sft_multi_node.sh


For evaluation, change llm_dir in run_eval.sh to the training output directory and run:

bash run_eval.sh

[2024/5/19] We have released a Vicuna-7B model trained with SPACE. Download the checkpoint from HF and run the following for evaluation:

python tests/eval_infer.py --llm_dir=path/to/model --mask_id=32002 --dataset="human_eval" --mask_num=5 --do_sample=false --use_cache=true --model_type=llama --mask_diff=false
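For scripted runs, the same command can be assembled programmatically. The sketch below only builds and prints the command line; the helper name `build_eval_command` is hypothetical, and the flag names and values are copied from the command above. Whether a `mask_num` other than 5 suits the released checkpoint is not stated here, so treat that parameter as an assumption.

```python
import shlex
from typing import Dict, List


def build_eval_command(llm_dir: str, mask_num: int = 5) -> List[str]:
    """Build the argument list for tests/eval_infer.py (hypothetical helper)."""
    # Flag names and default values mirror the evaluation command shown above.
    flags: Dict[str, str] = {
        "llm_dir": llm_dir,
        "mask_id": "32002",
        "dataset": "human_eval",
        "mask_num": str(mask_num),
        "do_sample": "false",
        "use_cache": "true",
        "model_type": "llama",
        "mask_diff": "false",
    }
    return ["python", "tests/eval_infer.py"] + [f"--{k}={v}" for k, v in flags.items()]


# Print a shell-ready command; nothing is executed here.
print(shlex.join(build_eval_command("path/to/model")))
```

The returned list can be passed directly to `subprocess.run` from the repository root if the evaluation should be launched from Python.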


This repository is licensed under the Apache-2.0 License.


If this work is helpful, please kindly cite as:

@article{yi2024generation,
  title={Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding},
  author={Yi, Hanling and Lin, Feng and Li, Hongbin and Ning, Peiyang and Yu, Xiaotian and Xiao, Rong},
  journal={arXiv preprint arXiv:2402.11809},
  year={2024}
}


This repo benefits from LLaMA Factory and FastChat. Thanks for their wonderful work.