Instruction in the Wild: A User-based Instruction Dataset

News

We have released InstructWild v2 in the data v2 directory. It includes over 110K high-quality user-based instructions, none of which were generated with self-instruct. We have also labeled a subset of these instructions with an instruction type and a special tag. Please see the README for details.

Introduction

Instruction tuning is a key component of ChatGPT. OpenAI used its own user-based instruction dataset, which unfortunately is not open-sourced. Self-Instruct released a small instruction dataset that includes 175 human-written instructions. The Stanford Alpaca team generated 52K instructions with the text-davinci-003 model, based on those 175 seed instructions.

This project aims at a larger and more diverse instruction dataset. To this end, we collected instructions from shared ChatGPT usage (110K in the v2 dataset, 429 in the v1 dataset) and released both English and Chinese versions. We found these instructions to be very diverse. We follow Alpaca to generate 52K instructions and their responses. All data can be found in the data and data v2 directories.

Note: This is an ongoing project. We are still collecting and improving our data. We are releasing this dataset as early as possible to speed up our LLM research. We will also release a whitepaper soon.

Data Release

Our dataset uses the same format as Alpaca for fast and easy use. Our instructions have no input field.
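
As a rough illustration of what this format implies for downstream use, the sketch below parses Alpaca-style records and renders each one with Alpaca's no-input prompt template. The key names `instruction` and `output` are assumed from the Alpaca format, and the sample records are invented for this example rather than taken from the dataset.

```python
import json

# Alpaca's prompt template for records that have no input field.
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def parse_records(json_text):
    """Parse an Alpaca-style JSON array and keep only well-formed records."""
    cleaned = []
    for rec in json.loads(json_text):
        # Key names are an assumption based on the Alpaca format.
        instruction = rec.get("instruction", "").strip()
        output = rec.get("output", "").strip()
        if instruction and output:
            cleaned.append({"instruction": instruction, "output": output})
    return cleaned


def build_prompt(record):
    """Render one record with the no-input template."""
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])


# Invented sample data; the second record is dropped because it has no instruction.
SAMPLE = json.dumps([
    {"instruction": "Write a haiku about autumn.", "output": "Leaves drift down slowly"},
    {"instruction": "", "output": "orphan response"},
])

records = parse_records(SAMPLE)
print(build_prompt(records[0]))
```

In practice the JSON text would be read from one of the files in the data directories; the file names are not listed here, so the sketch works on an in-memory string instead.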

Data Collection (InstructWild v1)

[Figure: data collection]

We scraped over 700 noisy instructions from Twitter and filtered out the noisy ones. We then picked 429 clean instructions to ensure high quality.
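
The exact filtering rules are not described here, so the following is only a hypothetical sketch of what such a cleaning pass might look like: it strips URLs, mentions, and hashtags, drops very short texts, and removes case-insensitive duplicates. The function names, the `min_words` threshold, and the sample tweets are all assumptions made for illustration.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so near-identical texts compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def filter_instructions(raw, min_words=3):
    """Hypothetical cleaning pass: strip tweet debris, drop short texts and duplicates."""
    seen = set()
    kept = []
    for text in raw:
        # Remove URLs, @mentions, and #hashtags, then tidy the whitespace.
        stripped = re.sub(r"https?://\S+|[@#]\w+", "", text)
        stripped = re.sub(r"\s+", " ", stripped).strip()
        key = normalize(stripped)
        if len(key.split()) < min_words:
            continue  # too short to be a usable instruction
        if key in seen:
            continue  # duplicate of an instruction already kept
        seen.add(key)
        kept.append(stripped)
    return kept


# Invented examples of noisy scraped text.
raw = [
    "Explain quantum computing in simple terms  https://t.co/abc",
    "explain quantum computing in simple terms",
    "lol #chatgpt",
    "@bot Write a limerick about a cat who codes",
]
clean = filter_instructions(raw)
print(clean)
```

A pass like this only removes obvious noise; the final selection of clean instructions would still need a manual check.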

We use a method similar to Alpaca's to collect instructions. However, we do not need outputs for the seed instructions, which avoids human involvement. The generated prompts are more diverse and cover more topics than Alpaca's.

We provide 5 prompts as examples when generating new instructions through the OpenAI API. After collecting the prompts, we collect responses to these instructions from the OpenAI API. The English and Chinese datasets are generated separately. In total, $880 was spent to collect the dataset. There are 52K instructions for English (around 24M tokens) and 52K instructions for Chinese.
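
The example-driven generation step could be sketched roughly as follows: sample 5 seed instructions and format them into a request for new ones. The prompt wording, the function name `build_generation_prompt`, and the seed pool are assumptions for illustration, not the project's actual prompt. The API call itself is deliberately left out.

```python
import random


def build_generation_prompt(seed_pool, num_examples=5, rng=None):
    """Sample seed instructions and format them as a few-shot generation prompt."""
    if rng is None:
        rng = random.Random()
    examples = rng.sample(seed_pool, num_examples)
    # Hypothetical wording; the project's real prompt is not shown in this document.
    lines = [
        "Here are some example instructions. "
        "Write new, diverse instructions in the same style.",
        "",
    ]
    for i, example in enumerate(examples, start=1):
        lines.append(f"{i}. {example}")
    # Leave the next slot open for the model to continue the list.
    lines.append(f"{num_examples + 1}.")
    return "\n".join(lines), examples


# Invented seed pool standing in for the collected clean instructions.
seed_pool = [
    "Summarize this article in three bullet points.",
    "Write a cover letter for a data analyst role.",
    "Explain recursion to a ten-year-old.",
    "Draft a polite email declining a meeting.",
    "Suggest five names for a coffee shop.",
    "Translate this sentence into French.",
    "Plan a three-day trip to Kyoto.",
    "Write a unit test for a sorting function.",
]

prompt, examples = build_generation_prompt(seed_pool, rng=random.Random(0))
print(prompt)
```

The returned prompt string would then be sent to the model, and the generated instructions would be sent back in a second round to collect their responses.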

How Good is InstructWild?

Colossal AI used our dataset to train the ColossalChat model. ColossalChat-7B (after stage 1 only) is trained on the combination of the original Alpaca dataset and our dataset. We compare ColossalChat-7B with Alpaca-7B to see what improvement our dataset brings.

Evaluating a chatbot is difficult. We manually evaluated several examples across different categories of instructions. Our main findings are:

Pros

Limitations for LLaMA-finetuned models

Limitations of dataset

Detailed Comparison

See HERE for detailed comparison.

TODO

Authors

This project is maintained by the following authors (currently):

We also acknowledge the valuable suggestions from Prof. Aixin Sun and Dr. Tom Young.

Citation

Please cite the repo if you use the data or code in this repo.

@misc{instructionwild,
  author = {Jinjie Ni and Fuzhao Xue and Kabir Jain and Mahir Hitesh Shah and Zangwei Zheng and Yang You},
  title = {Instruction in the Wild: A User-based Instruction Dataset},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/XueFuzhao/InstructionWild}},
}