DiverseEvol: Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning
<p align="center"> <a href="https://arxiv.org/abs/2311.08182" target="_blank">Paper</a> <br> </p>
We introduce DiverseEvol, an efficient instruction-tuning method that allows the model itself to iteratively sample training subsets to improve its own performance, without any external supervision from humans or more advanced LLMs. Central to our data selection technique is the maintenance of high diversity in the chosen subsets, as the model selects new data points most distinct from any existing ones according to its current embedding space. In experiments across three datasets and benchmarks, our models, trained on less than 8% of the original dataset, maintain or improve performance compared with finetuning on full data.
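The selection rule described above (pick the points most distinct from anything already chosen) can be sketched as a greedy K-Center procedure. This is a minimal illustration, not the repository's implementation: the function name `k_center_greedy`, plain Python lists standing in for model embeddings, and Euclidean distance are all assumptions made for the example.

```python
import math


def k_center_greedy(embeddings, labeled_idx, n_query):
    """Pick n_query unlabeled points, each farthest from the current selection.

    embeddings: list of equal-length float vectors (hypothetical stand-in for
                the model's embeddings of the full data pool).
    labeled_idx: indices already in the training pool.
    """
    selected = set(labeled_idx)
    if not selected:
        raise ValueError("need at least one labeled point to start from")
    # Distance from every point to its nearest already-selected point.
    min_dist = [
        min(math.dist(vec, embeddings[j]) for j in selected)
        for vec in embeddings
    ]
    picked = []
    for _ in range(min(n_query, len(embeddings) - len(selected))):
        # Candidate = unselected point with the largest nearest-center distance.
        best = max(
            (i for i in range(len(embeddings)) if i not in selected),
            key=lambda i: min_dist[i],
        )
        picked.append(best)
        selected.add(best)
        # A new center can only shrink each point's nearest-center distance.
        for i, vec in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], math.dist(vec, embeddings[best]))
    return picked


if __name__ == "__main__":
    # Toy 2-D "embeddings"; index 0 plays the role of the initial pool.
    pool = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 4.0], [5.1, 5.0]]
    print(k_center_greedy(pool, labeled_idx=[0], n_query=2))
```

In the toy pool, the two picks land in the two regions far from the starting point rather than on its near-duplicate, which is the diversity-preserving behavior the method relies on.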
The figure below shows the overall workflow of DiverseEvol:
<p align="center"> <img src="assets/diverse_evol_model.png" width=100%> </p>
Getting Started
Prerequisites
Make sure you have the following prerequisites installed:
- Python >= 3.10
- PyTorch >= 2.0
- Cuda >= 11.7
- Transformers >= 4.28.1
- Vendi_Score: https://github.com/vertaix/Vendi-Score
Environment
Alternatively, create and activate a Conda environment from the provided environment.yml file:
conda env create -f environment.yml
source activate diverse_evol
Running DiverseEvol
All configurations are stored in .yml files. We provide two example configs:
- configs/config_debug.yml: for debugging on a small dataset.
- configs/config_example_kc_dolly.yml: for running KCenter-based DiverseEvol on the Databricks-Dolly dataset.
You can customize these configurations to suit your experiment needs. Key configurations to consider include:
- full_data_path: Path to the source dataset.
- model_name_or_path: Path to the foundation LLM weights and tokenizers.
- evol_schedule_name: The sampling method to apply, e.g., "KCenterSampling".
- result_dir_name: A name identifier for each DiverseEvol run. We use this to create a folder storing all the resulting data subsets and finetuned models.
- init_label_num: The number of samples randomly selected to train the initial model (rd_0).
- n_round: The number of iterations to run.
- n_query: The number of new data points to add to the next round's training.
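To see how `init_label_num`, `n_query`, and `n_round` interact, the sketch below computes the size of the training pool in each round. It assumes that every round adds exactly `n_query` points and that `n_round` counts the rounds after `rd_0`; check both assumptions against `train.py` before relying on them. The helper name `labeled_pool_sizes` and the numbers in the usage lines are made up for this example.

```python
def labeled_pool_sizes(init_label_num, n_query, n_round):
    """Return {round_name: training-pool size} for rd_0 .. rd_{n_round}.

    Assumption: rd_0 trains on init_label_num random samples and each later
    round adds n_query newly selected samples to the pool.
    """
    return {
        f"rd_{r}": init_label_num + r * n_query
        for r in range(n_round + 1)
    }


if __name__ == "__main__":
    # Hypothetical values, not taken from the provided configs.
    for name, size in labeled_pool_sizes(100, 100, 3).items():
        print(name, size)
```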
To start an experiment, use the following command, specifying your configuration file and wandb key:
CUDA_VISIBLE_DEVICES=0,1,2,3 nohup python -m torch.distributed.run --nproc_per_node=4 train.py --config_file {YOUR-CONFIG-FILE} --wandb_key {YOUR-WANDB-KEY} > {YOUR-LOG-FILE} 2>&1 &
Your experiment results will be saved in the following format:
evol_res
└── {YOUR-RESULT-DIR-NAME}
    ├── data  # training data pool and unselected data pool for each iteration
    │   ├── rd_0_labeled.json
    │   ├── rd_0_unlabeled.json
    │   ├── rd_1_labeled.json
    │   ├── rd_1_unlabeled.json
    │   ├── ...
    │   ├── rd_N_labeled.json
    │   └── rd_N_unlabeled.json
    └── output  # instruction-tuned chat model for each iteration
        ├── rd_0
        ├── rd_1
        ├── ...
        └── rd_N
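As a quick sanity check on a finished run, the sketch below counts the examples in each round's labeled and unlabeled pools. It assumes that every `rd_*_labeled.json` and `rd_*_unlabeled.json` file holds a JSON list with one entry per example, which is a guess about the file contents rather than something the layout above guarantees; `summarize_rounds` is a hypothetical helper, not part of the repository.

```python
import json
import re
from pathlib import Path


def summarize_rounds(result_dir):
    """Return {round: (n_labeled, n_unlabeled)} for one DiverseEvol run.

    Assumes each pool file is a JSON list with one entry per example.
    n_unlabeled is None when the matching unlabeled file is missing.
    """
    data_dir = Path(result_dir) / "data"
    summary = {}
    for labeled_path in data_dir.glob("rd_*_labeled.json"):
        match = re.fullmatch(r"rd_(\d+)_labeled\.json", labeled_path.name)
        if match is None:
            continue
        rd = int(match.group(1))
        unlabeled_path = data_dir / f"rd_{rd}_unlabeled.json"
        n_labeled = len(json.loads(labeled_path.read_text(encoding="utf-8")))
        n_unlabeled = (
            len(json.loads(unlabeled_path.read_text(encoding="utf-8")))
            if unlabeled_path.exists()
            else None
        )
        summary[rd] = (n_labeled, n_unlabeled)
    # Sort numerically so rd_10 comes after rd_9.
    return dict(sorted(summary.items()))
```

Calling `summarize_rounds("evol_res/{YOUR-RESULT-DIR-NAME}")` should show the labeled count growing and the unlabeled count shrinking from one round to the next.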
Generating Answers
With models trained in each iteration, you can generate responses to questions in different test sets and customize the range of iterations to consider.
To generate answers, use the following command:
nohup python eval.py --eval_stage generate_answer --device {GPU-IDX} --schedule {YOUR-RESULT-DIR-NAME} --rd_start 0 --rd_end 10 --testsets vicuna koala wizardlm > {YOUR-LOG-FILE} 2>&1 &
This command will create a folder structure for the generated answers in the following format:
evol_answer
├── koala_test
│   └── {YOUR-RESULT-DIR-NAME}
│       ├── rd_0.jsonl
│       ├── rd_1.jsonl
│       ├── ...
│       └── rd_N.jsonl  # koala_bench answers generated by chat model trained in iteration/round N
├── vicuna_test
│   └── {YOUR-RESULT-DIR-NAME}
│       ├── rd_0.jsonl
│       ├── rd_1.jsonl
│       ├── ...
│       └── rd_N.jsonl
└── wizardlm_test
    └── {YOUR-RESULT-DIR-NAME}
        ├── rd_0.jsonl
        ├── rd_1.jsonl
        ├── ...
        └── rd_N.jsonl
Feel free to customize the test sets (--testsets) and iteration range (--rd_start & --rd_end) to suit your specific analysis needs.
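To inspect the generated answers programmatically, a small loader along these lines may help. It relies only on the folder layout shown above and on the JSON Lines convention of one JSON object per line; the helper name `load_answers` is hypothetical, and the fields inside each record are not assumed.

```python
import json
from pathlib import Path


def load_answers(answer_root, testset, run_name, rd):
    """Read {answer_root}/{testset}_test/{run_name}/rd_{rd}.jsonl into a list."""
    path = Path(answer_root) / f"{testset}_test" / run_name / f"rd_{rd}.jsonl"
    with path.open(encoding="utf-8") as f:
        # JSON Lines: one JSON object per non-empty line.
        return [json.loads(line) for line in f if line.strip()]
```

For example, `load_answers("evol_answer", "vicuna", "{YOUR-RESULT-DIR-NAME}", 3)` would return the records written for round 3 on the vicuna test set.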
Additionally, we provide all the answers generated by our trained models in evol_answer_reference/, which directly lead to our reported results. We also include the answers by ChatGPT (GPT-3.5-TURBO), which serves as a general competitor in our paper.
GPT4-Judge
After generating answers for the test sets, you can compare them using GPT4 as the judge. For the judging prompt, refer to GPT4_JUDGE_TEMPLATE in consts.py or Appendix A of our paper.
In evol_gpt4_judge_reference/, you can find the GPT4-judge responses to all the comparisons reported in our paper.
Analyzing Dataset Diversity
You can also evaluate the selected data subsets in terms of diversity (measured by Vendi-Score) using the following command:
nohup python eval.py --eval_stage analyse_diversity --device {GPU-IDX} --schedule {YOUR-RESULT-DIR-NAME} --embed_model_path {PATH-TO-THE-MODEL-USED-FOR-EMBEDDING} --rd_start 0 --rd_end 10 > {YOUR-LOG-FILE} 2>&1 &
This command prints the Vendi-Score of each round's training data directly in the output. It should also produce a .pkl file that stores the reported diversity results:
evol_diversity
└── {YOUR-RESULT-DIR-NAME}_RD={rd_start}-{rd_end}_MEASURES=vendi.pkl
You can also adjust the iteration range (--rd_start & --rd_end).
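For intuition about what the diversity number means, the Vendi Score is the exponential of the Shannon entropy of the eigenvalues of K/n, where K is an n-by-n similarity matrix with ones on the diagonal. The sketch below computes it from scratch with NumPy. It is not the code path used by eval.py, which relies on the Vendi_Score package; the cosine-similarity kernel and the function name `vendi_score` are assumptions made for this example.

```python
import numpy as np


def vendi_score(embeddings):
    """Vendi Score of a set of vectors under a cosine-similarity kernel.

    embeddings: array-like of shape (n, d) with non-zero rows.
    Returns a value between 1 (all items identical) and n (all orthogonal).
    """
    x = np.asarray(embeddings, dtype=np.float64)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit-normalize rows
    k = x @ x.T                                       # similarity matrix, K_ii = 1
    eigvals = np.linalg.eigvalsh(k / len(x))          # eigenvalues sum to 1
    eigvals = eigvals[eigvals > 1e-12]                # treat 0 * log(0) as 0
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))
```

The score can be read as an effective number of distinct items: a subset of near-duplicates scores close to 1, while n mutually orthogonal embeddings score n.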
Acknowledgement
Our implementation is largely inspired by the following codebases:
- stanford_alpaca: A repository that shares the finetuning scripts of instruction-following LLaMA models.
- DeepAL: A toolbox for various data sampling logics.
Citation
If you find our paper and code useful in your research, please consider giving us a star and citing us:
@article{wu2023self,
title={Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning},
author={Wu, Shengguang and Lu, Keming and Xu, Benfeng and Lin, Junyang and Su, Qi and Zhou, Chang},
journal={arXiv preprint arXiv:2311.08182},
year={2023}
}
If you have any questions or need further assistance, please don't hesitate to reach out: wushengguang.wsg@alibaba-inc.com or lukeming.lkm@alibaba-inc.com.
Happy experimenting with DiverseEvol ^_^