<p align="center" width="100%"> </p> <div id="top" align="center"> <img src=imgs/dynosaur.png width=150 />

Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation (EMNLP 2023)

<h3> |<a href="https://arxiv.org/abs/2305.14327"> Paper </a> | <a href="https://dynosaur-it.github.io/"> Project Website </a> | <a href="https://huggingface.co/datasets?search=dynosaur"> 🤗 Data </a> | <a href="https://huggingface.co/models?sort=trending&search=dynosaur"> 🤗 Model </a> | </h3> <h4> <a href="https://wadeyin9712.github.io/">Da Yin</a>*, <a href="https://xxxiaol.github.io/">Xiao Liu</a>*, <a href="https://fanyin3639.github.io/">Fan Yin</a>*, <a href="https://maszhongming.github.io/">Ming Zhong</a>*, <a href="https://sites.google.com/view/hbansal">Hritik Bansal</a>, <a href="http://hanj.cs.illinois.edu/">Jiawei Han</a>, <a href="http://web.cs.ucla.edu/~kwchang/">Kai-Wei Chang</a> </h4> </div>

Dynosaur aims to 1) build a dynamically growing instruction tuning dataset with low maintenance cost, and 2) provide a venue to study how to dynamically improve instruction tuning models. This repo contains the released Dynosaur data, the data generation pipeline, and fine-tuning details.

Usage and License Notices: All the generated task instructions (except the instances of each task) are released under the Apache-2.0 license. The instances of each task are subject to the license under which the original dataset was released. This license information is available in the Dynosaur data and in instruction_data/license_info.json.
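If you need to check the license of a particular source dataset programmatically, here is a minimal sketch; it assumes instruction_data/license_info.json maps dataset names to license strings, which you should verify against the released file.

```python
# Minimal sketch: look up the license of a source dataset.
# Assumes instruction_data/license_info.json maps dataset names to license strings.
import json

with open("instruction_data/license_info.json") as f:
    license_info = json.load(f)

print(license_info.get("squad"))  # "squad" is just an illustrative dataset name
```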

Updates

Overview

We propose Dynosaur, a large-scale instruction tuning dataset obtained automatically with significantly lower generation costs. Dynosaur leverages the metadata of existing NLP datasets to generate task instructions and organize corresponding inputs/outputs. By utilizing LLMs, we generate multiple task instructions applicable to various NLP domains and determine the relevant data fields for constructing instruction tuning data.

Dynosaur offers several advantages, including

Data Release

We offer dynosaur-full, containing all the generated instruction tuning data in Dynosaur. It covers most licensed and non-null English datasets in 🤗 Huggingface Datasets as of Feb 23, 2023. The data is a dictionary containing the following keys:

We also provide the collected metadata huggingface_metadata.jsonl and data huggingface_data.jsonl from Huggingface. They are the foundation for synthesizing Dynosaur instructions.
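As a minimal sketch of how these artifacts can be loaded (the Hub identifier Dynosaur/dynosaur-full, the train split, and the JSONL layout are assumptions; see the 🤗 Data link above for the exact names):

```python
# Minimal sketch: load the released instruction tuning data and the collected metadata.
# The Hub id "Dynosaur/dynosaur-full" and the "train" split are assumptions.
import json
from datasets import load_dataset

dynosaur = load_dataset("Dynosaur/dynosaur-full", split="train")
print(dynosaur[0])                      # one instruction tuning example

# huggingface_metadata.jsonl is JSON Lines: one metadata record per line.
with open("huggingface_metadata.jsonl") as f:
    metadata = [json.loads(line) for line in f]
print(len(metadata), "metadata records")
```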

Data Generation Process

To generate Dynosaur-full, please run the following commands step by step:

Metadata Collection

python parse_hf.py                                            # crawl Huggingface datasets and collect metadata
python license_info.py                                        # capture license information of each dataset
python check_multilingual.py                                  # select English-only datasets

If you want to skip these steps, you may use the collected metadata huggingface_metadata.jsonl and data huggingface_data.jsonl on AWS S3.
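For a rough picture of what metadata collection involves, the sketch below lists datasets on the Hugging Face Hub and records their names, licenses, and languages. It is not the repo's parse_hf.py; the tag-parsing convention and output fields are assumptions.

```python
# Illustrative sketch (not parse_hf.py): collect basic metadata for Hub datasets.
# The "license:" / "language:" tag prefixes and the output fields are assumptions.
import json
from huggingface_hub import list_datasets

records = []
for info in list_datasets(full=True, limit=100):   # small limit keeps the sketch quick
    tags = info.tags or []
    records.append({
        "dataset": info.id,
        "licenses": [t.split(":", 1)[1] for t in tags if t.startswith("license:")],
        "languages": [t.split(":", 1)[1] for t in tags if t.startswith("language:")],
    })

with open("huggingface_metadata.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```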

Instruction Generation and Filtering

python generate_tasks_with_description.py                     # generate description-aware tasks
python generate_tasks_without_description.py                  # generate description-unaware tasks
python filter_invalid_tasks.py                                # filter out invalid tasks
python organize_data.py                                       # organize instruction data

These steps produce the instruction data instruction_data/instruction-full.json and the instruction tuning dataset dynosaur-full. We will provide the sampled data for evaluation on Super-NI and user instructions soon.
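The filtering step drops generated tasks that cannot be grounded in their source dataset. As a rough illustration (not the repo's filter_invalid_tasks.py; the file name and record fields are assumptions), a task can be discarded when its input/output fields are missing from the dataset or overlap with each other:

```python
# Illustrative sketch (not filter_invalid_tasks.py): keep only tasks whose
# input/output fields exist in the source dataset and do not overlap.
# The file name and the record fields below are assumptions.
import json

def is_valid(task):
    inputs = set(task.get("input_fields", []))
    outputs = set(task.get("output_fields", []))
    available = set(task.get("dataset_fields", []))
    if not inputs or not outputs:
        return False
    if inputs & outputs:                 # inputs and outputs must not overlap
        return False
    return (inputs | outputs) <= available

with open("generated_tasks.jsonl") as f:
    tasks = [json.loads(line) for line in f]

valid = [t for t in tasks if is_valid(t)]
print(f"kept {len(valid)} / {len(tasks)} tasks")
```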

Fine-tuning

We fine-tune T5-3B and LLaMA-7B with the following hyperparameters:

| Hyperparameter | T5-3B | LLaMA-7B |
| -------------- | ----- | -------- |
| Batch size     | 16    | 128      |
| Learning rate  | 1e-5  | 3e-4     |
| Epochs         | 2     | 3        |
| Max length     | 512   | 512      |

To fine-tune your own models, please refer to the training code of Tk-Instruct and Stanford Alpaca or Alpaca-LORA.
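If you prefer a self-contained starting point, the sketch below wires the LLaMA-7B hyperparameters from the table into the 🤗 Transformers Trainer. It is not the exact Tk-Instruct / Alpaca setup; the checkpoint name, Hub dataset id, field names, and single-GPU gradient accumulation are assumptions.

```python
# Minimal sketch of instruction tuning with the LLaMA-7B hyperparameters above,
# using the 🤗 Transformers Trainer. The checkpoint name, Hub dataset id, and
# field names are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "huggyllama/llama-7b"                     # assumed LLaMA-7B checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

data = load_dataset("Dynosaur/dynosaur-full", split="train")   # assumed Hub id

def tokenize(example):
    # Concatenate instruction, input, and output into one training sequence.
    text = f"{example['instruction']}\n{example['input']}\n{example['output']}"
    return tokenizer(text, truncation=True, max_length=512)    # max length from the table

tokenized = data.map(tokenize, remove_columns=data.column_names)

args = TrainingArguments(
    output_dir="dynosaur-llama-7b",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,    # 8 x 16 = 128 effective batch size (single GPU assumed)
    learning_rate=3e-4,                # LLaMA-7B setting from the table
    num_train_epochs=3,
    bf16=True,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```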

Limitations

Dynosaur is still under development and needs a lot of improvement. We are still studying methods that can better control instruction quality and generate more diverse instructions. We are also working to mitigate any biases introduced in Dynosaur. Stay tuned!

Citation

If you find this work relevant to your research, please feel free to cite our work!

@article{yin2023dynosaur,
  title={Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation},
  author={Yin, Da and Liu, Xiao and Yin, Fan and Zhong, Ming and Bansal, Hritik and Han, Jiawei and Chang, Kai-Wei},
  journal={EMNLP},
  year={2023}
}

As the Dynosaur data is built on Huggingface Datasets resources, please also cite the Huggingface Datasets paper.

Acknowledgements

We greatly appreciate Huggingface for their effort in open-sourcing fantastic NLP datasets! We also sincerely thank the authors of all the datasets incorporated in Dynosaur.

We also thank Yizhong Wang for providing the code for the diversity analysis plot and the Tk-Instruct training code, and the Stanford Alpaca team for releasing the code to fine-tune LLaMA.