# AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment
Yonggan Fu, Zhongzhi Yu, Junwei Li, Jiayi Qian, Yongan Zhang, Xiangchi Yuan, Dachuan Shi, Roman Yakunin, and Yingyan (Celine) Lin
Accepted at NeurIPS 2024 [Paper | Slide].
## AmoebaLLM: Overview
- **How to Train Once and Derive Many Efficient LLMs?** We introduce AmoebaLLM, a novel framework that instantly derives LLM subnets of arbitrary shapes which achieve the accuracy-efficiency frontier and can be extracted after only a one-time fine-tuning. In this way, AmoebaLLM facilitates rapid deployment tailored to different platforms and application-driven specifications. Specifically, AmoebaLLM achieves this goal by strategically selecting high-performing subnets and fine-tuning them jointly while mitigating conflicts among them.
- **Experimental Results:** AmoebaLLM not only sets new standards in LLM adaptability but also delivers subnets that achieve SOTA trade-offs between accuracy and efficiency.
## Code Usage
### Environment Setup

Use conda to set up the environment based on the provided `env.yml`:

```
conda env create -f env.yml
```
### Stage 1: Knowledge-preserving subset selection

- Step 1: Derive the layer selection strategy using dynamic programming:

```
CUDA_VISIBLE_DEVICES=0 python main.py --model_name_or_path meta-llama/Llama-2-7b-hf --fp16 --output_dir ./output/calib_dp --do_train False --do_eval False --no_eval_orig --layer_calib_dp --calib_dataset mmlu --enable_shrinking --num_calib_sample 40 --calib_metric acc --min_num_layer 20 --dp_keep_last_layer 1
```
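For intuition only, below is a minimal sketch of a dynamic program over which layers to keep, under the simplifying assumption that a subnet's quality is the sum of precomputed per-layer calibration scores; the actual strategy in this repo is driven by calibration accuracy (e.g., MMLU with `--calib_metric acc`) inside `main.py`, and `dp_layer_selection` below is purely illustrative.

```python
import numpy as np

def dp_layer_selection(layer_scores, target_depth, keep_last_layer=True):
    """Toy dynamic program over which layers to keep, assuming a subnet's quality
    is the sum of per-layer calibration scores (the real pipeline scores candidate
    subnets by calibration accuracy, e.g., on MMLU)."""
    n = len(layer_scores)
    NEG = -1e18
    # dp[i][k]: best score achievable using the first i layers with k of them kept.
    dp = np.full((n + 1, target_depth + 1), NEG)
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for k in range(0, min(i, target_depth) + 1):
            dp[i][k] = dp[i - 1][k]                      # option: drop layer i-1
            if k > 0 and dp[i - 1][k - 1] > NEG:         # option: keep layer i-1
                dp[i][k] = max(dp[i][k], dp[i - 1][k - 1] + layer_scores[i - 1])
    # Backtrack to recover the kept layer indices.
    keep, k = [], target_depth
    for i in range(n, 0, -1):
        if k > 0 and dp[i][k] == dp[i - 1][k - 1] + layer_scores[i - 1]:
            keep.append(i - 1)
            k -= 1
    keep = sorted(keep)
    if keep_last_layer and (n - 1) not in keep:          # cf. --dp_keep_last_layer 1
        keep = keep[:-1] + [n - 1]
    return keep

scores = np.random.rand(32)                              # hypothetical per-layer scores for LLaMA2-7B
print(dp_layer_selection(scores, target_depth=24))
```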
- Step 2: Derive the neuron (width) selection strategy using the importance metric from FLAP:

```
CUDA_VISIBLE_DEVICES=0 python main.py --model_name_or_path meta-llama/Llama-2-7b-hf --fp16 --output_dir ./output/width_calib --do_train False --do_eval False --use_auth_token --no_eval_orig --width_calib --num_calib_sample 512 --prune_width_method flap
```
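FLAP's exact fluctuation metric and bias-compensation details live in the FLAP paper and this repo's implementation; the snippet below is only a simplified stand-in that scores hidden channels by their activation variance over calibration samples, weighted by the squared norm of the weight column that consumes them, and then records the surviving channels for each width ratio.

```python
import torch

def width_importance(acts, weight):
    """Simplified fluctuation-style channel importance in the spirit of FLAP:
    per-channel activation variance over calibration samples times the squared
    norm of the corresponding weight column. Not the exact FLAP formulation."""
    fluctuation = acts.var(dim=0)                        # [hidden]: per-channel variance
    col_norm_sq = weight.pow(2).sum(dim=0)               # [hidden]: ||W[:, j]||^2
    return fluctuation * col_norm_sq

def width_strategy(acts, weight, width_choices=(1.0, 7/8, 3/4, 5/8)):
    """Rank channels once, then record which channels survive at each width ratio."""
    score = width_importance(acts, weight)
    order = torch.argsort(score, descending=True)
    n = score.numel()
    return {r: torch.sort(order[: max(1, int(round(r * n)))]).values
            for r in width_choices}

# Hypothetical calibration activations (512 samples) and a down-projection weight
# with LLaMA2-7B's hidden / intermediate sizes (4096 / 11008).
acts = torch.randn(512, 11008)
weight = torch.randn(4096, 11008)
strategy = width_strategy(acts, weight)
print({r: v.numel() for r, v in strategy.items()})       # surviving channels per ratio
```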
- Step 3: Merge the layer and neuron selection strategies into a single file, `dp_selection_strategy.npy` (we have also provided this file for LLaMA2-7B in the repo):

```
python utils/merge_depth_width.py
```
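The authoritative file layout is whatever `utils/merge_depth_width.py` writes; purely as a hypothetical illustration, the merge amounts to bundling the depth and width strategies into one NumPy-saved object:

```python
import numpy as np

# Hypothetical structure only; the real format is produced by utils/merge_depth_width.py
# and consumed via --shrinking_file. Toy placeholder contents are used here.
depth_strategy = {d: list(range(d)) for d in range(20, 33)}                      # depth -> kept layer ids
width_strategy = {r: list(range(int(r * 4096))) for r in (1.0, 7/8, 3/4, 5/8)}   # ratio -> kept channels
np.save("dp_selection_strategy_toy.npy",                                         # toy filename to avoid clobbering the real file
        {"depth": depth_strategy, "width": width_strategy},
        allow_pickle=True)
merged = np.load("dp_selection_strategy_toy.npy", allow_pickle=True).item()
print(sorted(merged.keys()))
```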
### Stage 2: One-for-all fine-tuning

- Enable one-for-all fine-tuning using `--do_train True` and `--enable_shrinking`, and specify the subset selection strategy produced by Stage 1 with `--shrinking_file dp_selection_strategy.npy`:

```
CUDA_VISIBLE_DEVICES=0 python main.py --model_name_or_path meta-llama/Llama-2-7b-hf --output_dir ./output/ft --dataset alpaca-gpt4 --use_auth_token --do_train True --do_eval True --do_mmlu_eval True --do_eval_wikitext2 True --lora_modules all --fp16 --source_max_len 384 --target_max_len 128 --gradient_accumulation_steps 4 --logging_steps 10 --max_steps 10000 --save_strategy steps --data_seed 42 --save_steps 1000 --save_total_limit 1 --evaluation_strategy steps --eval_dataset_size 1024 --max_eval_samples 1000 --eval_steps 1000 --optim paged_adamw_32bit --ddp_find_unused_parameters --enable_shrinking --kd_weight 1 --min_num_layer 20 --random_sample_num_layer 2 --distill_method sp --shrinking_method calib_dp --shrinking_file dp_selection_strategy.npy --shrinkable_width --width_choice [1,7/8,3/4,5/8] --prune_width_method flap --use_moe_lora --moe_num_expert 5 --moe_topk 2
```
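As a conceptual sketch only (the real loop in `main.py` uses LoRA/MoE-LoRA adapters and the `sp` distillation method), each training step samples a few subnet shapes from the Stage 1 strategy and combines their task loss with a distillation term from the full model weighted by `--kd_weight`. `model.set_subnet` below is a hypothetical stand-in for applying the selection strategy, and KL-divergence distillation is used merely as a placeholder.

```python
import random
import torch
import torch.nn.functional as F

WIDTH_CHOICES = [1.0, 7/8, 3/4, 5/8]           # mirrors --width_choice
MIN_LAYERS, MAX_LAYERS = 20, 32                # mirrors --min_num_layer and LLaMA2-7B's depth
KD_WEIGHT = 1.0                                # mirrors --kd_weight

def train_step(model, batch, optimizer, num_sampled_shapes=2):
    """Toy one-for-all step: the full model acts as the teacher, then a few randomly
    sampled (depth, width) subnets are trained as students in the same step."""
    optimizer.zero_grad()
    model.set_subnet(depth=MAX_LAYERS, width=1.0)          # hypothetical API
    with torch.no_grad():
        teacher_logits = model(**batch).logits
    total_loss = 0.0
    for _ in range(num_sampled_shapes):                    # cf. --random_sample_num_layer
        depth = random.randint(MIN_LAYERS, MAX_LAYERS)
        width = random.choice(WIDTH_CHOICES)
        model.set_subnet(depth=depth, width=width)         # apply the Stage 1 strategy
        out = model(**batch)
        kd = F.kl_div(F.log_softmax(out.logits, dim=-1),
                      F.softmax(teacher_logits, dim=-1), reduction="batchmean")
        total_loss = total_loss + out.loss + KD_WEIGHT * kd
    total_loss.backward()
    optimizer.step()
    return float(total_loss)
```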
### Evaluation

- In addition to your own fine-tuned model created via the two-stage process above, we also provide our AmoebaLLM fine-tuned LLaMA2-7B model, `amoeba_llama2`, here. You can download and unzip it with:

```
pip install gdown
gdown 1lwOiQa-UOYOXn72wo5gvzUvFat_PTg6b
unzip amoeba_llama2.zip
```
- Specify `--output_dir` as the path to the fine-tuned model, and set the target depth and width ratio using `--eval_num_layer` and `--eval_num_width`, respectively:

```
CUDA_VISIBLE_DEVICES=0 python main.py --model_name_or_path meta-llama/Llama-2-7b-hf --output_dir amoeba_llama2 --do_train False --do_eval True --do_mmlu_eval True --bits 8 --bf16 --enable_shrinking --min_num_layer 20 --shrinking_method calib_dp --shrinking_file dp_selection_strategy.npy --shrinkable_width --width_choice [1,7/8,3/4,5/8] --prune_width_method flap --use_moe_lora --moe_num_expert 5 --moe_topk 2 --eval_num_layer 24 --eval_num_width 0.875 --do_lm_eval True --do_lm_eval_task arc_easy,piqa,hellaswag
```
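As a small illustrative helper only, the snippet below enumerates the (depth, width) pairs you could pass to `--eval_num_layer` / `--eval_num_width`, assuming any integer depth between `--min_num_layer` (20) and LLaMA2-7B's 32 layers and the width ratios from `--width_choice`:

```python
# Illustrative only: candidate subnet shapes implied by the flags above.
depths = range(20, 33)                          # --min_num_layer 20 up to the full 32 layers
width_ratios = [1.0, 7/8, 3/4, 5/8]             # i.e., 1.0, 0.875, 0.75, 0.625
shapes = [(d, w) for d in depths for w in width_ratios]
print(f"{len(shapes)} candidate shapes, e.g., {shapes[:4]}")
```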
## Acknowledgment

We refer to the implementation of qlora.
## Citation

```bibtex
@inproceedings{fuamoeballm,
  title={AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment},
  author={Fu, Yonggan and Yu, Zhongzhi and Li, Junwei and Qian, Jiayi and Zhang, Yongan and Yuan, Xiangchi and Shi, Dachuan and Yakunin, Roman and Lin, Yingyan Celine},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems}
}
```