<div align="center">

Awesome Instruction Datasets

中文 | English

</div>

Contents

Introduction

"Welcome to 'awesome-prompt-datasets', a comprehensive collection of high-quality open-source instruction tuning datasets to train chat-based LLMs (ChatGPT,LLaMA,Alpaca)。

Instruction Tuning / Reinforcement Learning from Human Feedback (RLHF) Dataset is a key component of instruction-following LLMs such as ChatGPT. This repo is dedicated to providing a comprehensive list of datasets used for instruction tuning in various LLMs, making it easier for researchers and developers to access and utilize these resources.

With 'awesome-prompt-dataset', you can accelerate your research and development in NLP and unlock new opportunities for innovation. Let's explore the possibilities together!"

Prompt Datasets

Referring to this work by @yaodongC, we labeled each collected dataset according to the following rules:

(Lang) Lingual-Tags:

- EN: Instruction datasets in English
- CN: Instruction datasets in Chinese
- ML: [Multi-lingual] Instruction datasets in multiple languages

(Task) Task-Tags:

- MT: [Multi-task] Datasets containing multiple tasks
- TS: [Task-specific] Datasets tailored for specific tasks

(Gen) Generation-method:

- HG: [Human Generated Dataset] Datasets created by humans
- SI: [Self-Instruct] Datasets generated using self-instruct methods
- MIX: [Mixed Dataset] Datasets containing both human and machine generated data
- COL: [Collection of Dataset] Datasets made from a collection of other datasets

Statistics

| Project | Datasets | Org | Nums | Lang | Task | Gen | Type | Src | URL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Chain of Thought | cot_data \| few_shot_data | Google | 74771 | EN/CN | MT | HG | instruct with CoT reasoning | annotating CoT on existing data | download |
| GPT4all | nomic-ai/gpt4all-j-prompt-generations | nomic-ai | 806199 | EN | MT | COL | code, stories and dialogs | distillation from GPT-3.5-turbo | download |
| GPTeacher | GPT-4 General-Instruct \| Roleplay-Instruct \| Code-Instruct \| Toolformer | teknium1 | 29013 | EN | MT | SI | general, roleplay, toolformer | GPT-4 & toolformer | download |
| Guanaco | JosephusCheung/GuanacoDataset | JosephusCheung | 534610 | ML | MT | SI | various linguistic tasks | text-davinci-003 | download |
| HC3 | Hello-SimpleAI/HC3 | Hello-SimpleAI \| Wind Information (万得资讯) | 37175 | EN/CN | TS | MIX | dialogue evaluation | human or ChatGPT | download |
| HC3-Chinese | Hello-SimpleAI/HC3-Chinese | Hello-SimpleAI \| Wind Information (万得资讯) | 13k | CN | TS | MIX | dialogue evaluation | human or ChatGPT | |
| alpaca | tatsu-lab/alpaca | tatsu-lab | 52002 | EN | MT | SI | general instruct | text-davinci-003 | download |
| AlpacaDataCleaned | yahma/alpaca-cleaned | yahma | 52k | EN | MT | SI | general instruct | text-davinci-003 | download |
| Chinese-LLaMA-Alpaca | alpaca_data_zh_51k | ymcui (iFLYTEK, 讯飞) | 51k | CN | MT | SI | general instruct | text-davinci-003 | |
| Luotuo-Chinese-LLM (骆驼) | trans_chinese_alpaca_data | LC1332 (SenseTime, 商汤) | 52k | CN | MT | SI | general instruct | text-davinci-003 | |
| Natural Instructions | Allen AI 61 task \| 1.5k task | Allen AI | 5040134 | ML | MT | COL | diverse NLP tasks | human annotated datasets collection | download |
| belle_cn | BelleGroup/train_1M_CN \| BelleGroup/train_0.5M_CN | BelleGroup (Lianjia, 链家) | 1079517 | CN | TS/MT | SI | general, mathematical reasoning, dialogue | text-davinci-003 | download |
| instinwild | instinwild_ch \| instinwild_en | | 52191 | EN/CN | MT | SI | generation, open-qa, mind-storm | text-davinci-003 | download |
| HuaTuo (华驼) | 中文医学知识 (Chinese medical knowledge) \| 肝癌 (liver cancer) | SCIR-HI (HIT, 哈工大) | 8K | CN | TS | SI | public and self-built Chinese medical knowledge bases | GPT-3.5 | |
| prosocial dialog | allenai/prosocial-dialog | allenai | 165681 | EN | TS | MIX | dialogue | GPT-3 rewrites questions + humans give feedback manually | download |
| finance_en | gbharti/finance-alpaca | | 68912 | EN | TS | COL | finance-related QA | GPT-3.5 | download |
| xP3 | bigscience/xP3 | bigscience | 78883588 | ML | MT | COL | a collection of prompts & datasets across 46 languages & 16 NLP tasks | human annotated datasets collection | download |
| firefly | YeungNLP/firefly-train-1.1M | | 1649398 | CN | MT | COL | 23 NLP tasks | human annotated datasets collection | download |
| instruct | swype/instruct | | 888969 | EN | MT | COL | augmentation of GPT4All, Alpaca, open-source Meta datasets | augmentation performed using the NLP tools provided by AllenAI | download |
| Code Alpaca | sahil280114/codealpaca | | 20022 | EN | TS | SI | code generation, editing, optimization | text-davinci-003 | download |
| Alpaca_GPT4 | alpaca_gpt4_data \| alpaca_gpt4_data_zh \| comparison_data_v2 | Microsoft (微软) | 52002 | EN/CN | MT | SI | general instruct | generated by GPT-4 using Alpaca | download |
| webGPT | openai/webgpt_comparisons | openai | 18994 | EN | TS | MIX | information retrieval (IR) QA | fine-tuned GPT-3; each instruction has two outputs, select the better one | download |
| dolly 2.0 | databricks/databricks-dolly-15k | databricks | 15015 | EN | TS | HG | closed QA, summarization, etc., with Wikipedia as references | human annotated | download |
| mosaicml/llm-foundry | mosaicml/dolly_hhrlhf | mosaicml | 59.3K | EN | TS | HG | a combination of Databricks' dolly-15k dataset and a filtered subset of Anthropic's HH-RLHF | human annotated | |
| baize (白泽) | alpaca_chat_data.json \| medical_chat_data.json \| quora_chat_data.json \| stackoverflow_chat_data.json | project-baize | 653699 | EN | MT | COL | a collection from Alpaca, Quora, StackOverflow and MedQuAD questions | human annotated datasets collection | download |
| hh-rlhf | Anthropic/hh-rlhf | Anthropic | 284517 | EN | TS | MIX | dialogue | dialog between human and RLHF models | download |
| OIG (part) | laion/OIG | laion | 49237 | EN | MT | COL | created from various tasks, such as question answering | data augmentation, human annotated datasets collection | download |
| GAOKAO | Fill-in-the-blank_Questions \| Multiple-choice_Questions \| Open-ended_Questions | OpenLMLab | 2785 | CN | MT | COL | multiple-choice, fill-in-the-blank and open-ended questions from examinations | human annotated | download |
| camel (骆驼) | camel-ai/code \| camel-ai/biology \| camel-ai/physics \| camel-ai/chemistry \| camel-ai/math | camel-ai | 760620 | EN | MT | SI | role-playing conversations in AI Society, Code, Math, Physics, Chemistry, Biology | gpt-3.5-turbo | download |
| FLAN-Muffin | Muennighoff/flan | | 1764800 | EN | MT | COL | 60 NLP tasks | human annotated datasets collection | download |
| COIG | COIG | BAAI (智源) | 298428 | CN | MT | COL | collected from Exam, Translated, Human Value Alignment Instructions and Counterfactual Correction Multi-round Chat | automatic tools plus manual verification | download |
| GPT4Tools | gpt4tools_71k.json | StevenGrove | 71446 | EN | MT | SI | a collection of tool-related instructions | gpt-3.5-turbo | download |
| ShareChat | RyokoAI/ShareGPT52K | RyokoAI | 1663241 | EN | MT | MIX | general instruct | crowdsourced conversations between people and ChatGPT (ShareGPT) | download |
| Auto CoT | kojima-takeshi188/zero_shot_cot/dataset \| kojima-takeshi188/zero_shot_cot/log | amazon-science | | EN | | | | | download |
| MOSS (Fudan, 复旦) | fnlp/moss-002-sft-data \| moss-003-sft-data | fnlp | 1583595 | EN/CN | | SI | | | download |
| ultrachat | stingning/ultrachat | thunlp | 28247446 | EN | | | | | download |
| StackLLaMA | lvwerra/stack-exchange-paired | | todo | EN | | HG | | | |
| Self-Instruct | yizhongw/self-instruct | | 82K | EN | | SI | | | |
| Zhihu-KOL | Zhihu-KOL | OpenAssistant | 1,000,000 (100w) | | SI | HG | Zhihu data for training Open Assistant | | |
| stanfordnlp/SHP | stanfordnlp/SHP | stanfordnlp | 385k | EN | MT | HG | human preferences over responses | | |
| LAION-AI/Open-Assistant | OpenAssistant/oasst1 | OpenAssistant | 84.4k | EN | MT | HG | OpenAssistant Conversations Dataset (OASST1) | human-generated, human-annotated | |
| akoksal/LongForm | akoksal/LongForm | akoksal | 30k | EN | SI | HG | diverse human-written documents selected from existing corpora (e.g., C4 and Wikipedia), with an instruction generated for each document by an LLM | | |
| sail-sg/symbolic-instruction-tuning | sail/symbolic-instruction-tuning | sail-sg | 800K | ML | | SI | human synthetic examples | | |
| Medical QA (医疗问答): michael-wzhu/PromptCBLUE | michaelwzhu/ChatMed_Consult_Dataset | michael-wzhu | 110113 | CN | | SI | medical consultation questions collected from the internet (110,113), reflecting the real-world consultation needs of different users/patients; responses are currently generated by the OpenAI GPT-3.5 engine | | |
| mbzuai-nlp/LaMini-LM | MBZUAI/LaMini-instruction | MBZUAI | 2.58M | EN | MT | SI | knowledge distilled offline from large language models | | |
| pCLUE | pCLUE | | 1.2M (120万) | | | | | | |
| WizardLM | victor123/evol_instruct_70k | WizardLM | 70k | EN | MT | | | | |
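
Most of the Hugging Face-hosted entries above can be loaded directly with the `datasets` library. A minimal sketch, assuming the `tatsu-lab/alpaca` and `databricks/databricks-dolly-15k` repositories are still published under those names and keep their current column layout:

```python
# pip install datasets
from datasets import load_dataset

# Alpaca: 52K self-instruct examples with instruction / input / output fields.
alpaca = load_dataset("tatsu-lab/alpaca", split="train")
print(alpaca[0]["instruction"], alpaca[0]["output"])

# Dolly 2.0: 15K human-written examples with instruction / context / response fields.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
print(dolly.column_names)
```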

RLHF Datasets

Statistics

| Project | Links | Org | Nums | Lang | Summary |
| --- | --- | --- | --- | --- | --- |
| webgpt_comparisons | | OpenAI | 19,578 | English | In the WebGPT paper, the authors trained a reward model from human feedback and used it to train a long-form question answering model to align with human preferences. This is the dataset of all comparisons that were marked as suitable for reward modeling by the end of the WebGPT project. There are 19,578 comparisons in total. |
| SHP | | stanfordnlp | 349K | English | SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. The preferences are meant to reflect the helpfulness of one response over another and are intended to be used for training RLHF reward models and NLG evaluation models (e.g., SteamSHP). |
| rlhf-reward-datasets | | yitingxie | 76.3k | English | |
| Dahoas/full-hh-rlhf | | Dahoas | 112k | English | Anthropic's HH dataset reformatted into prompt, chosen, rejected samples. |
| Dahoas/synthetic-instruct-gptj-pairwise | | Dahoas | | English | |
| Dahoas/rm-static | | Dahoas | 76.3k | English | Split of hh-static used for training reward models after supervised fine-tuning. |
| Anthropic/hh-rlhf | | Anthropic | 22k | English | An iterated "online" RLHF dataset that includes data from 52B language models. It contains 22k helpfulness comparisons and no red-teaming data. |
| Instruction-Tuning-with-GPT-4/GPT-4-LLM | | Instruction-Tuning-with-GPT-4 | 52k | English | Ranked responses (note: data is evaluated by the GPT-4 model, NOT humans) to Alpaca prompts from three models (GPT-4, GPT-3.5 and OPT-IML), obtained by asking GPT-4 to rate the quality. The authors believe "GPT-4 is capable of identifying and fixing its own mistakes, and accurately judging the quality of responses". |
| thu-coai/Safety-Prompts | thu-coai/Safety-Prompts | thu-coai | 100k | Chinese | Chinese safety prompts for evaluating and improving the safety of large models, aligning model outputs with human values. |
| Chatgpt-Comparison-Detection project | Hello-SimpleAI/HC3 | | 24.3K | English | Human ChatGPT Comparison Corpus: 60k human answers and 27K ChatGPT answers for around 24K questions. |
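
Most of the RLHF sets above boil down to a prompt / chosen / rejected layout. A minimal inspection sketch with the `datasets` library; the repository and column names reflect what is published on the Hugging Face Hub at the time of writing and may change:

```python
from datasets import load_dataset

# Anthropic HH-RLHF: each record is a pair of full dialogues,
# one preferred ("chosen") and one dispreferred ("rejected").
hh = load_dataset("Anthropic/hh-rlhf", split="train")
example = hh[0]
print(example["chosen"][:200])
print(example["rejected"][:200])

# Dahoas/rm-static exposes the same idea as explicit columns,
# which is the layout most reward-model trainers expect.
rm = load_dataset("Dahoas/rm-static", split="train")
print(rm.column_names)  # e.g. prompt / response / chosen / rejected
```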

Open ChatLLMs

| Release | Model_name | Base | Model_Size | Datasets | Number of Instances | Language |
| --- | --- | --- | --- | --- | --- | --- |
| 2022-12 | GPT-3 Self Inst. | GPT-3 | 175B | Self-Instruct | 82k | En |
| 2023-03-03 | alpaca | LLaMA | 7B | alpaca_data | 52k | En |
| 2023-03-19 | alpaca-lora | LLaMA | 7B, 13B, 30B | alpaca_data, alpaca_data_cleaned | 52k | En |
| 2023-03-23 | Chinese-Vicuna | LLaMA | 7B, 13B | BELLE, GuanacoDataset | 1M | Zh |
| 2023-03-24 | Alpaca-CoT | LLaMA | 7B | dataset | ---- | En, Zh |
| 2023-03-25 | dolly | dolly | 6B | alpaca_data | 52k | En |
| 2023-03-25 | guanaco | LLaMA | 7B | GuanacoDataset | 534k | En, Zh, Ja, De |
| 2023-03-28 | Chinese-LLaMA-Alpaca | LLaMA | 7B | alpaca_data_zh, pCLUE, translation2019zh, alpaca_data, Self-Instruct | 2M | Zh |
| 2023-03-29 | ColossalChat | LLaMA | 7B, 13B | InstructionWild | 104k | En, Zh |
| 2023-03-31 | Luotuo | LLaMA, ChatGLM | 7B, 6B | trans_chinese_alpaca_data | 52k | Zh |
| 2023-03-31 | cerebras-lora-alpaca | Cerebras-GPT | 2.7B | AlpacaDataCleaned | 52k | En |

The template

Append new projects at the end of the file using the following template:


[{Project-name}/{Dataset-name}](https://github.com/link/to/project)

- [paper/project link](link)
- [dataset link](link)
- Related work: (if applicable)

Some introductions ...

The Prompt Datasets List

Alpaca - Stanford

Alpaca, released by Stanford, is an instruction-tuned model fine-tuned from Meta AI's LLaMA model.

Alpaca automatically generated 52K instruction examples using GPT-3.5 and used them to fine-tune LLaMA. Experimental results show that it can match or even exceed the performance of GPT-3.5 on some tasks.
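
Each Alpaca record is a JSON object with `instruction`, `input`, and `output` fields, rendered into a fixed prompt before fine-tuning. A small sketch of that rendering; the wording follows the prompt template distributed with the Alpaca repo, but treat it as illustrative rather than the exact training string:

```python
def build_alpaca_prompt(example: dict) -> str:
    """Render one Alpaca record (instruction / input / output) into a training prompt."""
    if example.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            "### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        "### Response:\n"
    )

record = {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet ...",
}
print(build_alpaca_prompt(record) + record["output"])
```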

Instruction in the Wild

Instruction tuning is a key component of ChatGPT. OpenAI used an instruction dataset built from its users, but unfortunately this dataset is not open-sourced. Self-Instruct released a small instruction dataset of 175 instructions written by human labelers. The Stanford Alpaca team then generated 52K instructions with the text-davinci-003 model based on those 175 seed instructions.

This project targets a larger and more diverse instruction dataset. To this end, we collected 429 instructions from ChatGPT usage screenshots and released both English and Chinese versions. We found these instructions to be very diverse even though the scale is still small. We follow Alpaca to generate 52K instructions and their responses. All data can be found in the data directory.

Note: This is an ongoing project. We are still collecting and improving our data. We release this dataset as early as possible to speed up our LLM research. We will also release a whitepaper soon.

JosephusCheung/GuanacoDataset

52K instruction examples generated from a modified self-instruct pipeline with 429 human-written seed tasks.

Stanford Human Preferences Dataset (SHP)

SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. The preferences are meant to reflect the helpfulness of one response over another, and are intended to be used for training RLHF reward models and NLG evaluation models (e.g., SteamSHP).

Each example is a Reddit post with a question/instruction and a pair of top-level comments for that post, where one comment is more preferred by Reddit users (collectively). SHP exploits the fact that if comment A was written after comment B but has a higher score nonetheless, then A is ostensibly preferred to B. If A had been written before B, we could not conclude this, since its higher score could simply have been the result of more visibility. We chose data where the preference label is intended to reflect which response is more helpful rather than which is less harmful, the latter being the focus of much past work.
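
The rule above can be expressed directly in code. A hedged sketch of deriving such a preference label from two Reddit comments; the `created_utc` and `score` field names are assumptions for illustration, not the exact SHP pipeline:

```python
from typing import Optional

def shp_style_preference(comment_a: dict, comment_b: dict) -> Optional[str]:
    """Return 'A', 'B', or None following the SHP rule: a later-posted comment
    with a higher score is ostensibly preferred; otherwise no label is emitted,
    because an earlier comment's higher score may just reflect more visibility."""
    a_later = comment_a["created_utc"] > comment_b["created_utc"]
    b_later = comment_b["created_utc"] > comment_a["created_utc"]
    if a_later and comment_a["score"] > comment_b["score"]:
        return "A"
    if b_later and comment_b["score"] > comment_a["score"]:
        return "B"
    return None  # ambiguous pair: drop it

print(shp_style_preference(
    {"created_utc": 1650000100, "score": 120},   # written later, scored higher
    {"created_utc": 1650000000, "score": 45},
))  # -> 'A'
```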

How is SHP different from Anthropic's HH-RLHF dataset? Most notably, all the data in SHP is naturally occurring and human-written, whereas the responses in HH-RLHF are machine-written, giving us two very different distributions that can complement each other.

Hello-SimpleAI/HC3

Hello-SimpleAI/HC3-Chinese

allenai/prosocial-dialog

allenai/natural-instructions

PhoebusSi/Alpaca-CoT

nomic-ai/gpt4all

bigscience/xP3

teknium1/GPTeacher

thunlp/UltraChat

cascip/ChatAlpaca

YeungNLP/firefly-train-1.1M

orhonovich/unnatural-instructions

Instruction-Tuning-with-GPT-4/GPT-4-LLM

databrickslabs/dolly

OpenAssistant/oasst1

BELLE/data/1.5M

alpaca_chinese_dataset

Med-ChatGLM/data

pCLUE

COIG

https://github.com/FreedomIntelligence/InstructionZoo

https://github.com/lightaime/camel

The RLHF Datasets List

Anthropic/hh-rlhf

HuggingFaceH4/stack-exchange-preferences

stanfordnlp/SHP

Instruction-Tuning-with-GPT-4/GPT-4-LLM

Natural Instruction / Super-Natural Instruction

Allen AI was the first organization to try using instructions as prompts to fine-tune LLMs. The Natural Instructions paper gives a good picture of how the instructions were annotated.

The proposed dataset includes 61 different NLP tasks.

Super-Natural Instructions is a scaled-up version of Natural Instructions, containing more than 1,600 different NLP tasks across more than 76 distinct task types (e.g., classification, extraction, sequence labeling).
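
Each task in (Super-)Natural Instructions pairs a task definition and a few worked examples with the instance to be solved, and these pieces are concatenated into a single prompt. A rough sketch of that assembly; the `Definition` and `Positive Examples` key names mirror the public task files, but check the repository for the exact schema:

```python
def build_natural_instructions_prompt(task: dict, instance: dict, k: int = 2) -> str:
    """Turn a (Super-)Natural Instructions task definition plus k positive
    examples into a single prompt ending with the instance to be solved."""
    parts = ["Definition: " + task["Definition"][0]]
    for ex in task["Positive Examples"][:k]:
        parts.append(f"Input: {ex['input']}\nOutput: {ex['output']}")
    parts.append(f"Input: {instance['input']}\nOutput:")
    return "\n\n".join(parts)

task = {
    "Definition": ["Classify the sentiment of the sentence as positive or negative."],
    "Positive Examples": [{"input": "I loved this movie.", "output": "positive"}],
}
print(build_natural_instructions_prompt(task, {"input": "The plot was dull."}, k=1))
```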

BigScience/P3

BigScience is jointly organized by Hugging Face and French institutions such as CNRS, IDRIS, and GENCI. It is one of the largest open-source LLM organizations.

BigScience developed the PromptSource project at the end of 2021 and open-sourced a series of toolkits to help researchers build prompts on top of existing NLP tasks. So far, the PromptSource project contains more than 2,000 prompt templates for 270 NLP tasks.

On this basis, BigScience constructed the P3 dataset. You can find the P3 data on the Hugging Face Hub; its size is in the range of 100M to 1B examples.
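
PromptSource exposes these templates programmatically, so a raw NLP example can be turned into a (prompt, target) pair. A brief usage sketch; `ag_news` and the first template name are placeholders, and the exact API may differ between PromptSource versions:

```python
# pip install promptsource datasets
from datasets import load_dataset
from promptsource.templates import DatasetTemplates

# Load the prompt templates registered for AG News and pick one by name.
templates = DatasetTemplates("ag_news")
print(templates.all_template_names)

template = templates[templates.all_template_names[0]]
example = load_dataset("ag_news", split="train[:1]")[0]

# apply() renders the example into a prompt string and its target string.
prompt, target = template.apply(example)
print(prompt)
print("->", target)
```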

xMTF - BigScience

Building on its English prompts, BigScience extended them to multiple non-English languages.

The project contains 13 NLP tasks and is available in 46 different languages; each prompt may cover a varying number of languages.

After this multilingual fine-tuning, both BLOOM and T0 achieved strong multilingual ability.

HH-RLHF - Anthropic

Claude, developed by Anthropic, is one of ChatGPT's main competitors.

Anthropic has open-sourced the RLHF dataset it uses in its own product line.

The original intention of the HH-RLHF project is to train helpful and harmless (HH) LLMs. Therefore, the human feedback reflects not only the quality of a response but also whether it contains harmful information.

The paper describes how the RLHF data is used to align model behavior with human values, and documents the construction method and standards of the dataset.

Unnatural Instruction

Using LLMs to independently generate instruction data is an active direction in the field of instruction-tuning.

Unnatural Instructions uses GPT-3 (text-davinci-002) to generate 64K instruction prompts, then uses the same model to rewrite those 64K prompts, ultimately yielding 240K instruction examples.

The paper shows that prompts generated by LLMs for instruction tuning work well, even surpassing models such as T0 that were fine-tuned on P3 and other data.

Self-Instruct

Self-Instruct likewise uses LLMs to generate prompts for instruction tuning, but with a more fine-grained generation process.

Concepts such as a task pool and quality filtering were introduced to partially alleviate the noise problem of self-instruct-style data.
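
A highly simplified sketch of that loop. The `generate_instructions` callable stands in for whatever LLM API is used, and the ROUGE-L novelty filter mirrors the idea (not the exact code) of the Self-Instruct pipeline:

```python
import random
from rouge_score import rouge_scorer   # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def is_novel(candidate: str, pool: list, threshold: float = 0.7) -> bool:
    """Quality/novelty filter: drop candidates too similar to anything in the pool."""
    return all(
        scorer.score(existing, candidate)["rougeL"].fmeasure < threshold
        for existing in pool
    )

def self_instruct_step(task_pool: list, generate_instructions) -> list:
    """One iteration: sample seeds from the task pool, ask the LLM for new
    instructions, keep only sufficiently novel ones, and grow the pool."""
    seeds = random.sample(task_pool, k=min(8, len(task_pool)))
    for candidate in generate_instructions(seeds):   # hypothetical LLM call
        if is_novel(candidate, task_pool):
            task_pool.append(candidate)
    return task_pool
```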

UnifiedSKG - HKU

UnifiedSKG adds knowledge grounding to the text-to-text framework; that is, it augments the prompt-output setup with structured data as assistance.

As an example, some NLP tasks rely heavily on structured knowledge bases/databases. The idea of UnifiedSKG is to serialize the required database records and embed them into the prompt, as in the sketch below.
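
A toy illustration of that serialization idea; the `col: val | col: val` linearization used here is just one plausible convention, not necessarily the one UnifiedSKG adopts:

```python
def serialize_table(question: str, header: list, rows: list) -> str:
    """Linearize a small table and prepend the question, so a text-to-text
    model can answer over structured data."""
    flat_rows = " || ".join(
        " | ".join(f"{col}: {val}" for col, val in zip(header, row)) for row in rows
    )
    return f"question: {question} table: {flat_rows}"

prompt = serialize_table(
    "Which city has the larger population?",
    ["city", "population"],
    [["Tokyo", "37M"], ["Delhi", "32M"]],
)
print(prompt)
```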

UnifiedSKG represents a direction in the field of LLMs that attempts to use structured knowledge to enhance performance.

Google/Flan Collection

In this project, Google merged its own Flan 2021 data with some open-source instruction data (P3, Super-Natural Instructions, etc.).

In the Flan Collection paper, Google also summarizes some key points in training and inference for the Flan series of models, which may be a useful reference.

The Flan Collection compiles datasets from Flan 2021, P3, and Super-Natural Instructions, along with dozens more, into one place and formats them into a mix of zero-shot, few-shot, and chain-of-thought templates.

InstructDial

InstructDial is an attempt at instruction tuning on a specific task type. Experimental results show that after fine-tuning on dialogue instruction data, the model performs better on dialogue tasks than models fine-tuned on very large-scale general task collections.

ChatGPT Distillation Data

Public User-Shared Dialogues with ChatGPT (ShareGPT). Around 60K dialogues shared by users on ShareGPT were collected using public APIs. To maintain data quality, we deduplicated at the user-query level and removed any non-English conversations. This leaves approximately 30K examples.
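
A rough sketch of that cleaning step (query-level deduplication plus an English-only filter). The `conversations`/`value` field layout and the use of `langdetect` are assumptions for illustration, not the original pipeline:

```python
# pip install langdetect
from langdetect import detect

def clean_sharegpt(dialogues: list) -> list:
    """Deduplicate on the first user query and drop non-English conversations."""
    seen, kept = set(), []
    for dialogue in dialogues:
        first_query = dialogue["conversations"][0]["value"].strip().lower()
        if first_query in seen:
            continue
        try:
            if detect(first_query) != "en":
                continue
        except Exception:
            continue  # too short / undetectable: drop it
        seen.add(first_query)
        kept.append(dialogue)
    return kept
```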

Human ChatGPT Comparison Corpus (HC3). We use both the human and ChatGPT responses from the HC3 English dataset, which contains around 60K human answers and 27K ChatGPT answers for around 24K questions, resulting in around 87K question-answer examples in total.

Open Instruction Generalist (OIG).

We use a manually-selected subset of components from the Open Instruction Generalist dataset curated by LAION. Specifically, we use the grade-school-math-instructions, the poetry-to-songs, and the plot-screenplay-books-dialogue datasets. This results in a total of around 30k examples.

OpenAI WebGPT.

In the WebGPT paper, the authors trained a reward model from human feedback. They used the reward model to train a long form question answering model to align with human preferences. This is the dataset of all comparisons that were marked as suitable for reward modeling by the end of the WebGPT project. There are 19,578 comparisons in total.

Each example in the dataset contains a pair of model answers for a question, along with the associated metadata. Each answer has a preference score from humans that can be used to determine which of the two answers is better.
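
Converted to the usual reward-modeling layout, each comparison yields a (question, chosen, rejected) triple. A minimal sketch using the column names published with `openai/webgpt_comparisons` (`question.full_text`, `answer_0/1`, `score_0/1`) at the time of writing; ties are skipped:

```python
from datasets import load_dataset

def to_preference_pairs(split: str = "train") -> list:
    """Turn WebGPT comparisons into (question, chosen, rejected) triples."""
    data = load_dataset("openai/webgpt_comparisons", split=split)
    pairs = []
    for row in data:
        if row["score_0"] == row["score_1"]:
            continue  # no preference expressed
        chosen, rejected = (
            (row["answer_0"], row["answer_1"])
            if row["score_0"] > row["score_1"]
            else (row["answer_1"], row["answer_0"])
        )
        pairs.append({"question": row["question"]["full_text"],
                      "chosen": chosen, "rejected": rejected})
    return pairs
```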

OpenAI Summarization.

The OpenAI summarization dataset contains ~93K examples. Each example consists of human feedback on summaries generated by a model: evaluators chose the superior summary from two options.

Datasets without license information

alespalla/chatbot_instruction_prompts

Contributing

Our purpose is to make this repo even better. If you are interested in contributing, please refer to HERE for contribution instructions.

License

Awesome-Prompt-Dataset is released under the Apache 2.0 license.

Reference