# BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models
This repository implements Bi-directional Tuning for lossless Acceleration (BiTA), an innovative method expediting LLMs via streamlined semi-autoregressive generation and draft verification.
<p align="left"> <img src='docs/headline.png' align="center" height="320px"> </p>

[BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models](https://arxiv.org/abs/2401.12522),
Feng Lin, Hanling Yi, Hongbin Li, Yifan Yang, Xiaotian Yu, Guangming Lu, Rong Xiao
## Setting Up Environment

Setup mainly consists of the following three steps.
- **Install Dependencies**

      pip install -r requirements.txt

- **Download the LLM Intended for Acceleration as the Base Model**

- **Create Symbolic Links from the Source Base Models to the Checkpoint Directory**

      ln -s SOURCE_CHECKPOINT_PATH checkpoints/TARGET_CHECKPOINT_NAME
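As a hypothetical end-to-end example of the last two steps for LLaMA-2-7B-Chat (the download location and the link name below are illustrative, not mandated by this repository; use whatever checkpoint name your training and evaluation scripts reference):

```bash
# Hypothetical paths/names; downloading the gated LLaMA-2 weights requires approved access.
huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir /data/models/Llama-2-7b-chat-hf
# Expose the downloaded weights under checkpoints/ for the training and evaluation scripts.
ln -s /data/models/Llama-2-7b-chat-hf checkpoints/llama-2-7b-chat-hf
```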
## Preparing Data
We describe separately how to prepare the training datasets and the test datasets.
- **Training Datasets**

  We provide a small set of training samples for LLaMA-2-7B-Chat in this repository. The complete training data for LLaMA-2-7B-Chat can be found here.

  For experiments with other base models, such as Falcon, additional preparation of the training data is required:
  - Start the TGI service for executing Falcon inference:

        text-generation-launcher --model-id checkpoints/falcon-40b-instruct-hf --trust-remote-code --max-input-length 2048 --max-total-tokens 4096 --sharded true --num-shard 8
  - Generate prompts (also referred to as queries or questions) using the predefined Falcon templates:

        python3 test/gen_prompt.py --model_type falcon --output_path data/assembled_v2/falcon-40b/alpaca_lima_cip-50k_code_platypus_v2-prompt2.jsonl
  - Generate Falcon outputs based on greedy sampling, forming prompt-response (question-answer) pairs as the training samples:

        # NUM_PROCESS denotes the number of processes executing simultaneously through TGI.
        # IP denotes the IP address providing the TGI service.
        python3 test/gen_llm_output.py data/assembled_v2/falcon-40b/alpaca_lima_cip-50k_code_platypus_v2-prompt2-output.jsonl data/assembled_v2/falcon-40b/tmp NUM_PROCESS IP
  - Merge all `.jsonl` files in the directory `data/assembled_v2/falcon-40b/tmp` into one file `alpaca_lima_cip-50k_code_platypus_v2-prompt2-output.jsonl` and place it in the directory `data/assembled_v2/falcon-40b` (a one-line sketch follows this list).
- **Test Datasets**

  We offer the MT-Bench dataset in this repository, while other datasets for evaluation (XSum, CIP-test, and HumanEval-X) can be found here.
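For the merge step in the training-data preparation above, a minimal sketch is a single shell command, assuming every file under `tmp/` is well-formed JSONL (one JSON object per line):

```bash
# Concatenate the per-process JSONL outputs into the merged training file.
cat data/assembled_v2/falcon-40b/tmp/*.jsonl \
  > data/assembled_v2/falcon-40b/alpaca_lima_cip-50k_code_platypus_v2-prompt2-output.jsonl
```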
## Training

We use LLaMA-2-7B-Chat as the base model for BiTA training in this example.
- **Single-Node**

  Run the script:

      sh scripts/run_sft-pt2_llama2-7b-chat.sh
- **Multi-Node**

  We employ the DeepSpeed library for multi-node training (32 NVIDIA A800-80GB GPUs are utilized in our implementation); see the hostfile sketch after these commands:

      # remove any existing hostfile
      rm -rf hostfile
      # generate a new hostfile
      sh gen_openpai_hostfile.sh > hostfile
      # run the training script
      sh scripts/run_deepspeed_sft-pt2_llama2-70b-chat.sh
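For reference, a DeepSpeed hostfile lists one node per line together with its GPU slot count. A hypothetical hostfile for 32 GPUs spread across 4 nodes might look as follows; the actual hostnames are produced by `gen_openpai_hostfile.sh`:

```
node-0 slots=8
node-1 slots=8
node-2 slots=8
node-3 slots=8
```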
## Evaluation
We provide scripts for both single-GPU and multi-GPU testing. The accelerated LLaMA-2-7B-Chat is evaluated using the following scripts. For other base models, simply adjust the path `TEST_DIR` and the related hyperparameters (`MODEL_TYPE`, `MASK_ID`, etc.) in the scripts (a hypothetical sketch follows the scripts below).
- **Single-GPU**

      sh scripts/run_eval_mt-bench.sh

- **Multi-GPU**

      sh scripts/run_multigpu_eval_mt-bench.sh
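As a purely hypothetical illustration of that adjustment (the variable names follow the description above; the values are placeholders and have not been checked against the scripts), switching to a Falcon base model might involve edits along these lines inside the evaluation script:

```bash
# Placeholder values only; set them to match your base model and tokenizer.
TEST_DIR=PATH_TO_YOUR_TEST_DIR   # test directory for the accelerated model
MODEL_TYPE=falcon                # base-model family
MASK_ID=YOUR_MASK_TOKEN_ID       # mask token id for the chosen tokenizer
```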
## Performance
We present concise speedup results of model acceleration below; for more detailed results, please refer to our paper.
| Model       | XSum | MT-Bench | CIP  | HumanEval-X |
|-------------|------|----------|------|-------------|
| LLaMA-2-7B  | 2.19 | 2.38     | 2.29 | 2.73        |
| LLaMA-2-13B | 2.29 | 2.41     | 2.39 | 2.88        |
| Vicuna-33B  | 2.20 | 2.47     | 2.10 | 3.00        |
| Falcon-40B  | 2.28 | 2.75     | 2.32 | 3.07        |
| LLaMA-2-70B | 2.55 | 2.72     | 2.58 | 3.31        |
## License
This repository is licensed under the Apache-2.0 License.
Please follow the model licenses to use the corresponding model weights: LLaMA-2 / Vicuna / Falcon
## Citation
If you find this project useful in your research, please kindly cite:
    @article{lin2024bita,
      title={BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models},
      author={Lin, Feng and Yi, Hanling and Li, Hongbin and Yang, Yifan and Yu, Xiaotian and Lu, Guangming and Xiao, Rong},
      journal={arXiv preprint arXiv:2401.12522},
      year={2024}
    }
## Acknowledgement
This repository greatly benefits from LLaMA-Factory. We extend our gratitude for their outstanding contributions.
## Contact
Please feel free to reach out if you have any questions! Email: lin1993@mail.ustc.edu.cn