Awesome
<div align="center">Awesome Instruction Datasets
</div> <div align="center">中文 | English
</div>Contents
- Awesome Prompt datasets
- Contents
- Introduction
- Prompt Datasets
- RLHF Datasets
- The template
- The Prompt Datasets List
- Alpaca -Stanford
- Instruction in the Wild
- JosephusCheung/GuanacoDataset
- Stanford Human Preferences Dataset (SHP)
- Hello-SimpleAI/HC3
- Hello-SimpleAI/HC3-Chinese
- allenai/prosocial-dialog
- allenai/natural-instructions
- PhoebusSi/Alpaca-CoT
- nomic-ai/gpt4all
- bigscience/xP3
- teknium1/GPTeacher
- thunlp/UltraChat
- cascip/ChatAlpaca
- YeungNLP/firefly-train-1.1M)
- orhonovich/unnatural-instructions
- Instruction-Tuning-with-GPT-4/GPT-4-LLM
- databrickslabs/dolly
- OpenAssistant/oasst1
- BELLE/data/1.5M
- alpaca_chinese_dataset
- Med-ChatGLM/data
- pCLUE
- COIG
- The RLHF Datasets List
- Anthropic/hh-rlhf
- HuggingFaceH4/stack-exchange-preferences
- stanfordnlp/SHP
- Instruction-Tuning-with-GPT-4/GPT-4-LLM
- Natural Instruction / Super-Natural Instruction
- BigScience/P3
- xMTF - BigScience
- HH-RLHF - Anthropic
- Unnatural Instruction
- Self-Instruct
- UnifiedSKG - HKU
- Google/Flan Collection
- InstructDial
- ChatGPT Distillation Data
- Open Instruction Generalist (OIG).
- OpenAI WebGPT.
- OpenAI Summarization.
- Datasets without license information
- Contributing
- License
Introduction
"Welcome to 'awesome-prompt-datasets', a comprehensive collection of high-quality open-source instruction tuning datasets to train chat-based LLMs (ChatGPT,LLaMA,Alpaca)。
Instruction Tuning / Reinforcement Learning from Human Feedback (RLHF) Dataset is a key component of instruction-following LLMs such as ChatGPT. This repo is dedicated to providing a comprehensive list of datasets used for instruction tuning in various LLMs, making it easier for researchers and developers to access and utilize these resources.
With 'awesome-prompt-dataset', you can accelerate your research and development in NLP and unlock new opportunities for innovation. Let's explore the possibilities together!"
Prompt Datasets
Referring to this (@yaodongC), we labeled each collected dataset according to the following rules:
(Lang)Lingual-Tags:
- EN: Instruction datasets in English
- CN: Instruction datasets in Chinese
- ML: [Multi-lingual] Instruction datasets in multiple languages
(Task)Task-Tags:
- MT: [Multi-task] Datasets containing multiple tasks
- TS: [Task-specific] Datasets tailored for specific tasks
(Gen)Generation-method:
- HG: [Human Generated Dataset] Datasets created by humans
- SI: [Self-Instruct] Datasets generated using self-instruct methods
- MIX: [Mixed Dataset] Dataset contains both human and machine generated data
- COL: [Collection of Dataset] Dataset made from a collection of other datasets
Statistics
Project | Datasets | Org | Nums | Lang | Task | Gen | Type | Src | Url |
---|---|---|---|---|---|---|---|---|---|
Chain of Thought | cot_data |few_shot_data | 74771 | EN/CN | MT | HG | instruct with cot reasoning | annotating CoT on existing data | download | |
GPT4all | nomic-ai/gpt4all-j-prompt-generations | nomic-ai | 806199 | EN | MT | COL | code, storys and dialogs | distillation from GPT-3.5-turbo | download |
GPTeacher | GPT-4 General-Instruct |Roleplay-Instruct |Code-Instruct | Toolformer | teknium1 | 29013 | EN | MT | SI | general, roleplay, toolformer | GPT-4 & toolformer | download |
Guanaco | JosephusCheung/GuanacoDataset | JosephusCheung | 534610 | ML | MT | SI | various linguistic tasks | text-davinci-003 | download |
HC3 | Hello-SimpleAI/HC3 | Hello-SimpleAI | 万得资讯 | 37175 | EN/CN | TS | MIX | dialogue evaluation | human or ChatGPT | download |
HC3-Chinese | Hello-SimpleAI/HC3-Chinese | Hello-SimpleAI|万得资讯 | 13k | CN | TS | MIX | dialogue evaluation | human or ChatGPT | |
alpaca | tatsu-lab/alpaca | tatsu-lab | 52002 | EN | MT | SI | general instruct | text-davinci-003 | download |
AlpacaDataCleaned | yahma/alpaca-cleaned | yahma | 52k | EN | MT | SI | general instruct | text-davinci-003 | download |
Chinese-LLaMA-Alpaca | alpaca_data_zh_51k | ymcui(讯飞) | 51k | CN | MT | SI | general instruct | text-davinci-003 | |
Luotuo-Chinese-LLM 骆驼 | trans_chinese_alpaca_data | LC1332(商汤) | 52k | CN | MT | SI | general instruct | text-davinci-003 | |
Natural Instructions | Allen AI 61 task|1.5k task | Allen AI | 5040134 | ML | MT | COL | diverse nlp tasks | human annotated datasets collection | download |
belle_cn | BelleGroup/train_1M_CN |BelleGroup/train_0.5M_CN | BelleGroup(链家) | 1079517 | CN | TS/MT | SI | general, mathematical reasoning, dialogue | text-davinci-003 | download |
instinwild | instinwild_ch | instinwild_en | 52191 | EN/CN | MT | SI | generation, open-qa, mind-storm | text-davinci-003 | download | |
华驼(HuaTuo) | 中文医学知识 |肝癌 | SCIR-HI(哈工大) | 8K | CN | TS | SI | 公开和自建的中文医学知识库 | GPT3.5 | |
prosocial dialog | allenai/prosocial-dialog | allenai | 165681 | EN | TS | MIX | dialogue | GPT-3 rewrites questions + humans feedback manually | download |
finance_en | gbharti/finance-alpaca | 68912 | EN | TS | COL | financial related qa | GPT3.5 | download | |
xP3 | bigscience/xP3 | bigscience | 78883588 | ML | MT | COL | a collection of prompts & datasets across 46 of languages & 16 NLP tasks | human annotated datasets collection | download |
firefly | YeungNLP/firefly-train-1.1M | 1649398 | CN | MT | COL | 23 nlp tasks | human annotated datasets collection | download | |
instruct | swype/instruct | 888969 | EN | MT | COL | augmented of GPT4All, Alpaca, open-source Meta datasets | augmentation performed using the advanced NLP tools provided by AllenAI | download | |
Code Alpaca | sahil280114/codealpaca | 20022 | EN | TS | SI | code generation, editing, optimization | text-davinci-003 | download | |
Alpaca_GPT4 | alpaca_gpt4_data|alpaca_gpt4_data_zh |comparison_data_v2 | 微软 | 52002 | EN/CN | MT | SI | general instruct | generated by GPT-4 using Alpaca | download |
webGPT | openai/webgpt_comparisons | openai | 18994 | EN | TS | MIX | information retrieval (IR) QA | fine-tuned GPT-3, each instruction has two outputs, select better one | download |
dolly 2.0 | databricks/databricks-dolly-15k | databricks | 15015 | EN | TS | HG | closed QA , summarization and etc, Wikipedia as references | human annotated | download |
mosaicml/llm-foundry | mosaicml/dolly_hhrlhf | mosaicml | 59.3K | EN | TS | HG | This dataset is a combination of Databrick's dolly-15k dataset and a filtered subset of Anthropic's HH-RLHF. | human annotated | |
baize 白泽 | alpaca_chat_data.json |medical_chat_data.json | quora_chat_data.json |stackoverflow_chat_data.json | project-baize | 653699 | EN | MT | COL | a collection from Alpaca, Quora, StackOverFlow and MedQuAD questions | human annotated datasets collection | download |
hh-rlhf | Anthropic/hh-rlhf | Anthropic | 284517 | EN | TS | MIX | dialogue | dialog between human and RLHF models | download |
OIG(part) | laion/OIG | laion | 49237 | EN | MT | COL | created from various tasks, such as question and answering | using data augmentation, human annotated datasets collection | download |
GAOKAO | Fill-in-the-blank_Questions | Multiple-choice_Questions | Open-ended_Questions | OpenLMLab | 2785 | CN | MT | COL | Multiple-choice, Fill-in-the-blank and Open-ended questions from examination | human annotated | download |
camel | 骆驼 | camel-ai/code|camel-ai/biology |camel-ai/physics |camel-ai/chemistry |camel-ai/math | camel-ai | 760620 | EN | MT | SI | Role-Playing conversations in AI Society, Code, Math, Physics, Chemistry, Biolog | gpt-3.5-turbo | download |
FLAN-Muffin | Muennighoff/flan | 1764800 | EN | MT | COL | 60 nlp tasks | human annotated datasets collection | download | |
COIG | COIG | BAAI|智源 | 298428 | CN | MT | COL | collect fron Exam, Translated, Human Value Alignment Instructions and Counterfactural Correction Multi-round Chat | using automatic tool and manual verification | download |
GPT4Tools | gpt4tools_71k.json | StevenGrove | 71446 | EN | MT | SI | a collection of tool-related instructions | gpt-3.5-turbo | download |
ShareChat | RyokoAI/ShareGPT52K | RyokoAI | 1663241 | EN | MT | MIX | general instruct | crowdsourcing to collect conversations between people and ChatGPT (ShareGPT) | download |
Auto CoT | kojima-takeshi188/zero_shot_cot/dataset |kojima-takeshi188/zero_shot_cot/log | amazon-science | EN | download | |||||
MOSS(复旦 Moss) | fnlp/moss-002-sft-data| moss-003-sft-data | fnlp | 1583595 | EN/CN | SI | download | |||
ultrachat | stingning/ultrachat | thnlp | 28247446 | EN | download | ||||
StackLLaMA | lvwerra/stack-exchange-paired | todo | EN | HG | |||||
Self-Instruct | yizhongw/self-instruct | 82 K | EN | SI | SI | ||||
Zhihu-KOL | Zhihu-KOL | Openassisent | 100 w | SI | HG | Zhihu data for training Open Assitant | |||
stanfordnlp/SHP | stanfordnlp/SHP | stanfordnlp | 385 k | EN | MT | HG | human preferences over responses | ||
LAION-AI/Open-Assistant | OpenAssistant/oasst1 | Openassisent | 84.4k | EN | MT | HG | OpenAssistant Conversations Dataset (OASST1) | human-generated, human-annotated | |
akoksal/LongForm | akoksal/LongForm | akoksal/LongForm | 30k | EN | SI | HG | 们从现有语料库(如 C4 和维基百科)中选择一组不同的人工文档,并通过 LLM 为给定的文档生成指令。 | ||
sail-sg/symbolic-instruction-tuning | sail/symbolic-instruction-tuning | sail-sg | 800K | ML | SI | Human Synthetic Examples | |||
医疗问答 michael-wzhu/PromptCBLUE | michaelwzhu/ChatMed_Consult_Dataset | michael-wzhu | 110113 | CN | SI | 互联网上的医疗问诊问题(110,113),反映了真实世界的不同用户/患者的医疗问诊需求。目前response都是由OpenAI GPT-3.5 引擎回答的。 | |||
mbzuai-nlp/LaMini-LM | MBZUAI/LaMini-instruction | MBZUAI/LaMini-instruction | 2.58M | EN | MT | SI | 通过离线蒸馏从大型语言模型中提取知识 | ||
pCLUE | pCLUE | 120 万 | |||||||
WizardLM | victor123/evol_instruct_70k | WizardLM | 70k | EN | MT | ||||
RLHF Datasets
Statistics
Project | Links | Org | Nums | Lang | Summary |
---|---|---|---|---|---|
webgpt_comparisons | Openai | 19,578 | English | In the WebGPT paper, the authors trained a reward model from human feedback. They used the reward model to train a long form question answering model to align with human preferences. This is the dataset of all comparisons that were marked as suitable for reward modeling by the end of the WebGPT project. There are 19,578 comparisons in total. | |
SHP | stanfordnlp | 349 K | English | SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. The preferences are meant to reflect the helpfulness of one response over another, and are intended to be used for training RLHF reward models and NLG evaluation models (e.g., SteamSHP). | |
rlhf-reward-datasets | yitingxie | 76.3 k | English | ||
Dahoas/full-hh-rlhf | Dahoas | 112 k | English | Anthropic's HH dataset reformatted into prompt, chosen, rejected samples. | |
Dahoas/synthetic-instruct-gptj-pairwise | Dahoas | English | |||
Dahoas/rm-static | Dahoas | 76.3k | English | Split of hh-static used for training reward models after supervised fine-tuning. | |
Anthropic/hh-rlhf | Anthropic | 22k | English | This RLHF dataset is an iterated 'online' dataset that includes data from 52B language models. It contains 22k helpfulness comparisons and no red-teaming data. | |
Instruction-Tuning-with-GPT-4/GPT-4-LLM | Instruction-Tuning-with-GPT-4 | 52k | English | Ranked responses (Note: Data is evaluated by GPT-4 model NOT human) of Alpaca prompts from three models (GPT-4, GPT-3.5 and OPT-IML) by asking GPT-4 to rate the quality. Author believes "GPT-4 is capable of identifying and fixing its own mistakes, and accurately judging the quality of responses" | |
thu-coai/Safety-Prompts | thu-coai/Safety-Prompts | thu-coai | 100k | Chinese | 中文安全prompts,用于评测和提升大模型的安全性,将模型的输出与人类的价值观对齐。 |
Chatgpt-Comparison-Detection project | Hello-SimpleAI/HC3 | 24.3K | English | Human ChatGPT Comparison Corpus, 60k human answers and 27K ChatGPT answers for around 24K questions. |
Open ChatLLMs
Release | Model_name | Base | Model_Size | Datasets | Number of Instances | Language |
---|---|---|---|---|---|---|
2022-12 | GPT-3 Self Inst. | GPT-3 | 175B | Self-Instruct | 82 k | En |
2023-03-03 | alpaca | LLaMA | 7B | alpaca_data | 52 k | En |
2023-03-19 | alpaca-lora | LLaMA | 7B 13B 30B | alpaca_data、alpaca_data_cleaned | 52 k | En |
2023-03-23 | Chinese-Vicuna | LLaMA | 7B 13B | BELLE、GuanacoDataset | 1M | Zh |
2023-03-24 | Alpaca-CoT | LLaMA | 7B | dataset | ---- | En Zh |
2023-03-25 | dolly | dolly | 6B | alpaca_data | 52 k | En |
2023-03-25 | guanaco | LLaMA | 7B | GuanacoDataset | 534 k | En Zh Ja De |
2023-03-28 | Chinese-LLaMA-Alpaca | LLaMA | 7B | alpaca_data_zh、pCLUE、translation2019zh、alpaca_data、Self-Instruct | 2M | Zh |
2023-03-29 | ColossalChat | LLaMA | 7B 13B | InstructionWild | 104 k | En Zh |
2023-03-31 | Luotuo | LLaMA ChatGLM | 7B 6B | trans_chinese_alpaca_data | 52k | Zh |
2023-03-31 | cerebras-lora-alpaca | Cerebras-GPT | 2.7B | AlpacaDataCleaned | 52k | En |
The template
Append the new project at the end of file
[{Project-name}/{Dataset-name}]{https://github.com/link/to/project}
- [paper/project link](link)
- [dataset link](link)
- Related work: (if applicable)
Some introductions ...
The Prompt Datasets List
Alpaca -Stanford
- Paper/Project Link
- Dataset Link
- Data generation model: text-davinci-003
- Cost: $600
The Alpaca of the Stanford release is a fine-tuning model for instruct-tuning based on the Meta Ai LLaMA model.
Alpaca automatically generated 52k instruction data using GPT-3.5 and used it to fine-tune the LLaMA model. Experimental results show that it can reach or even exceed the performance of GPT-3.5 on some tasks.
Instruction in the Wild
- Paper/Project Link
- Dataset Link
- Data generation model: text-davinci-003
Instruction Tuning is a key component of ChatGPT. OpenAI used their user-based Instruction dataset, but unfortunately, this dataset is not open-sourced. Self-Instruct released a small instruction dataset including 175 instructions written by human labors. Standford Alpaca Team generated 52K instructions by text-davinci-003 model based on the the 175 seed instructions above.
This project targets on a larger and more diverse instruction dataset. To this end, we collected 429 instructions from ChatGPT usage screenshots and released both English and Chinese versions. We found these instructions are very diverse even if the scale is still small. We follow Alpaca to generate 52K instructions and their responses. All data can be found in data dir.
Note: This is an ongoing project. We are still collecting and improving our data. We release this dataset as early as possible to speedup our LLM research. We will also release a whitepaper soon.
JosephusCheung/GuanacoDataset
- Data generation model: text-davinci-003
- Cost: $6000
52K instruction data generated from modified self-instruct pipeline with human written 429 seed task.
Stanford Human Preferences Dataset (SHP)
SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. The preferences are meant to reflect the helpfulness of one response over another, and are intended to be used for training RLHF reward models and NLG evaluation models (e.g., SteamSHP).
Each example is a Reddit post with a question/instruction and a pair of top-level comments for that post, where one comment is more preferred by Reddit users (collectively). SHP exploits the fact that if comment A was written after comment B but has a higher score nonetheless, then A is ostensibly more preferred to B. If A had been written before B, then we could not conclude this, since its higher score could have been the result of more visibility. We chose data where the preference label is intended to reflect which response is more helpful rather than which is less harmful, the latter being the focus of much past work.
How is SHP different from Anthropic's HH-RLHF dataset? Most notably, all the data in SHP is naturally occurring and human-written, whereas the responses in HH-RLHF are machine-written, giving us two very different distributions that can complement each other.
Hello-SimpleAI/HC3
- Summary:The the first human-ChatGPT comparison corpus (English Version), named HC3 dataset
- Data generation model:
gpt-3.5
,human generated
- paper: How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection
- Cost: N/A
Hello-SimpleAI/HC3-Chinese
- Summary:The the first human-ChatGPT comparison corpus (Chinese Version), named HC3 dataset
- Data generation model:
gpt-3.5
,human generated
- paper: How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection
- Cost: N/A
allenai/prosocial-dialog
- Summary: ProsocialDialog is the first large-scale multi-turn English dialogue dataset to teach conversational agents to respond to problematic content following social norms.
- Data generation model:
gpt-3.5
,human generated
- paper: ProsocialDialog: A Prosocial Backbone for Conversational Agents
- Cost: N/A
allenai/natural-instructions
- Summary: A community effort to create a large collection of
1,616 diverse NLP tasks
and their natural language definitions/instructions. - Data generation model:
Human generated
- paper: Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
- Cost: N/A
PhoebusSi/Alpaca-CoT
- Summary: A datset for Chain-of-Thoughts reasoning based on LLaMA and Alpaca. Note: Their repository will continuously collect various instruction tuning datasets. Github Repo
- paper: N/A
- Cost: N/A
nomic-ai/gpt4all
- Summary: gpt4all leverages three publicly available datasets: 1.laion/OIG, 2.pacovaldez/stackoverflow-questions 3. subset of bigscience/bloomz-p3
- Data generation model: N/A
- paper: GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo
- Cost: $500
bigscience/xP3
- Summary: [Prompt-resource] xP3 (Crosslingual Public Pool of Prompts) is a collection of prompts & datasets across 46 of languages & 16 NLP tasks.
- Data generation model: N/A
- paper: Crosslingual Generalization through Multitask Finetuning
- Cost: N/A
teknium1/GPTeacher
- Summary: A collection of modular datasets generated by GPT-4, General-Instruct - Roleplay-Instruct - Code-Instruct - and Toolformer
- Data generation model:
GPT-4
- paper: N/A
- Cost: N/A
thunlp/UltraChat
- Summary: UltraChat aims to construct an open-source, large-scale, and multi-round dialogue data. The first part of UltraChat (i.e., the Questions about the World sector) is released, which contains 280k diverse and informative dialogues. More dialogues about writing and creation, assistance on existing materials are to come.
- Data generation model:
GPT-3.5-turbo
- paper: N/A
- Cost: N/A
cascip/ChatAlpaca
- Summary: Based on the Stanford Alpaca data, ChatAlpaca extends the data to multi-turn instructions and their corresponding responses. More data (20k) and the Chinese translated version are to come.
- Data generation model:
GPT-3.5-turbo
- paper: N/A
- Cost: N/A
- Related: (tatsu-lab/Alpaca)|52K|EN|MT|SI
YeungNLP/firefly-train-1.1M)
- Summary: Chinese datasets of 23 tasks combined with human-written instruction templates.
- Data generation model: N/A
- paper: N/A
- Cost: N/A
orhonovich/unnatural-instructions
- Summary: 64K examples by prompting a language model with three seed examples of instructions and eliciting a fourth. Then the set is expanded to 240K by prompting the model to rephrase each instruction.
- Data generation model:
text-davinci-002
- paper: Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
- Cost: N/A
Instruction-Tuning-with-GPT-4/GPT-4-LLM
- Summary: 52K instruction-following data generated by GPT-4 with the original Alpaca prompts & Alpaca prompts translated into Chinese by ChatGPT + 9K instruction-following data generated by GPT-4 with prompts in Unnatural Instruction.
- Data generation model:
GPT-4
- paper: Instruction Tuning with GPT-4
- Cost: N/A
- Related: -(tatsu-lab/Alpaca)|52K|EN|MT|SI -(orhonovich/unnatural-instructions)|240K|EN|MT|MIX
databrickslabs/dolly
- Summary: This datset was generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.
- Data generation model: N/A
- paper: Free Dolly
- Cost: N/A
OpenAssistant/oasst1
- Summary: OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in 35 different languages, annotated with 461,292 quality ratings.
- Data generation model: N/A
- paper: OpenAssistant Conversations - Democratizing Large Language Model Alignment
- Cost: N/A
BELLE/data/1.5M
- 下载地址: https://github.com/LianjiaTech/BELLE/tree/main/data/1.5M
- 数据量: 1.5M
- 生成方式: self-instruct,使用了中文种子任务,以及openai的text-davinci-003接口
- 涉及任务: 包含175个种子任务,https://github.com/LianjiaTech/BELLE/blob/main/data/1.5M/zh_seed_tasks.json
- 数据示例: https://huggingface.co/datasets
alpaca_chinese_dataset
- 下载地址: https://github.com/hikariming/alpaca_chinese_dataset
- 数据量: 52k
- 生成方式: 借助chatgpt对原始的stanford_alpaca做机器翻译,并加入人工校验来保证质量
- 涉及任务: 与原始的stanford_alpaca一致,可以在原项目的seed_task.json中查到全部任务
Med-ChatGLM/data
- 下载地址: https://github.com/SCIR-HI/Med-ChatGLM
- 数据量: 7k
- 生成方式: 利用GPT3.5接口围绕医学知识库构建问答数据,并设置了多种Prompt形式来充分利用知识
- 涉及任务: 医学领域相关的问答,包含并发症,高危因素,组织学检查,临床症状,药物治疗,辅助治疗
pCLUE
- 下载地址: https://github.com/CLUEbenchmark/pCLUE
- 数据量: 1.2M
- 生成方式: 通过原有的NLP任务数据集,结合特定的prompt模板生成
- 涉及任务: 包含9个NLP数据集,涉及的NLP任务有文本分类/自然语言推理/语义匹配/指代消解/关键词识别/阅读理解
COIG
-
数据量:
-
- Translated Instructions (67,798)
- Exam Instructions (63,532)
- Human Value Alignment Instructions (34,471)
- Counterfactural Correction Multi-round Chat (13,653)
- Leetcode Instructions (11,737)
-
生成方式: 融合了多个领域的数据,具体可以参考论文Chinese Open Instruction Generalist: A Preliminary Release
https://github.com/FreedomIntelligence/InstructionZoo
https://github.com/lightaime/camel
The RLHF Datasets List
Anthropic/hh-rlhf
- Summary: This RLHF dataset is an iterated 'online' dataset that includes data from 52B language models. It contains 22k helpfulness comparisons and no red-teaming data.
- Data generation model:
Anthropic RL-CAI 52B
- paper: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
- Cost: N/A
HuggingFaceH4/stack-exchange-preferences
- Summary: This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training.
- Data generation model: N/A
- paper: A General Language Assistant as a Laboratory for Alignment
- Cost: N/A
stanfordnlp/SHP
- Summary: Each example is a Reddit post with a question/instruction and a pair of top-level comments for that post, where one comment is more preferred by Reddit users (collectively).
- Data generation model: N/A
- paper: N/A
- Cost: N/A
Instruction-Tuning-with-GPT-4/GPT-4-LLM
- Summary: Ranked responses (Note: Data is evaluated by
GPT-4
model NOT human) of Alpaca prompts from three models (GPT-4, GPT-3.5 and OPT-IML) by asking GPT-4 to rate the quality. Author believes "GPT-4 is capable of identifying and fixing its own mistakes, and accurately judging the quality of responses" - Data generation model:
GPT-4
- paper: Instruction Tuning with GPT-4
- Cost: N/A
- Related: -(tatsu-lab/Alpaca)|52K|EN|MT|SI
Natural Instruction / Super-Natural Instruction
Allen AI is the first organization to try Instruction as a prompt and fine-tune LLMs. In the Natural Instruction paper, you can basically understand the labeling ideas of the instruction.
In its proposed dataset, 61 and different NLP tasks are included.
Super-Natural Instruction is a super-intensive version of Natural Instruction, which contains more than 1,600 different NLP tasks, and there are more than 76 different types of NLP tasks (such as: classification, extraction, sequence labeling).
BigScience/P3
BigScience is jointly organized by Hugging Face and French CNRS, IDRIS, GENCI, etc. It is one of the largest open source LLMs organizations.
BigScience developed the PromptSource project at the end of 2021, and open sourced a series of toolkits to help researchers build prompts based on existing NLP tasks. So far, the PromptSource project contains more than 2000 prompt templates for 270 NLP tasks.
On this basis, BigScience constructed the P3 dataset. You can find P3 data on Hugging Face Hub, and the data size of P3 is between 100M-1B.
xMTF - BigScience
Based on the English prompt, BigScience extends its prompt to multiple non-English languages.
The project contains 13 NLP tasks and is available in 46 different languages. The corresponding prompt contains an indeterminate number of languages.
After fine-tuning on the basis of multilingual, both BLOOM and T0 have realized the ideal multilingual ability.
HH-RLHF - Anthropic
Claud under Anthropic is one of the main competitors of ChatGPT.
Anthropic has open-sourced the RLHF dataset it uses in its own product line.
The original intention of the HH-RLHF project is to train Helpful and Harmless (HH) LLMs. Therefore, in addition to the quality of the project's responses, whether it is harmful information is also reflected in its human feedback.
The paper records how to use the behavior of the RLHF data Align model to human values, and records the construction method and standards of the data set.
Unnatural Instruction
Using LLMs to independently generate instruction data is an active direction in the field of instruction-tuning.
Unnatural Instruction uses GPT3 (text-davinci-002) to generate 64k instruction prompt data. And use the same model to rewrite the 64k prompt, and finally get 240k instruction data.
The paper shows that the prompts generated by LLMs in Instruct-Tuning show good results, even surpassing models such as T0 that are fine-tuned on P3 and other data.
Self-Instruct
Self-Instruct is also the idea of using LLMs to generate prompts for instruction-tuning. However, a more fine-grained generation process is used.
Concepts such as Task pool and Quality filtering were introduced to partially alleviate the noise problem of self-intrauct type data.
UnifiedSKG - HKU
UnifiedSKG has added knowledge grounding in the Text-to-Text framework, that is, in the prompt-output framework, it has added structured data for assistance.
As an example, some NLP tasks rely heavily on structured knowledge bases/databases. The idea of UnifiedSKG is to serialize the required database and embed it into the prompt. As shown below.
UnifiedSKG represents a direction in the field of LLMs that attempts to use structured knowledge to enhance performance.
Google/Flan Collection
In this project, Google merged its own Flan 2021 data with some open source instruction data (P3, super-natural instruction, etc.).
In Flan Collection's paper, Google also summarizes some key points in Flan series model training/reasoning, which may have good reference value.
The Flan Collection compiles datasets from Flan 2021, P3, Super-Natural Instructions, along with dozens more datasets into one place, formats them into a mix of zero-shot, few-shot and chain-of-thought templates
InstructDial
InstructDial is an attempt to fine-tune instructions on a specific task type. Experimental results show that after fine-tuning on dialogue instruction data, the model performs better on dialogue tasks than on very large-scale task sets.
ChatGPT Distillation Data
Public User-Shared Dialogues with ChatGPT (ShareGPT) Around 60K dialogues shared by users on ShareGPT were collected using public APIs. To maintain data quality, we deduplicated on the user-query level and removed any non-English conversations. This leaves approximately 30K examples.
Human ChatGPT Comparison Corpus (HC3) We use both the human and ChatGPT responses from the HC3 english dataset, which contains around 60K human answers and 27K ChatGPT answers for around 24K questions, resulting in a total number of around 87K question-answer examples.
Open Instruction Generalist (OIG).
We use a manually-selected subset of components from the Open Instruction Generalist dataset curated by LAION. Specifically, we use the grade-school-math-instructions, the poetry-to-songs, and the plot-screenplay-books-dialogue datasets. This results in a total of around 30k examples.
OpenAI WebGPT.
In the WebGPT paper, the authors trained a reward model from human feedback. They used the reward model to train a long form question answering model to align with human preferences. This is the dataset of all comparisons that were marked as suitable for reward modeling by the end of the WebGPT project. There are 19,578 comparisons in total.
Each example in the dataset contains a pair of model answers for a question, and the associated metadata. Each answer has a preference score from humans that can be used to determine which of the two answers are better.
OpenAI Summarization.
The OpenAI summarization dataset contains ~93K examples, each example consists of feedback from humans regarding the summarizations generated by a model. Human evaluators chose the superior summary from two options.
Datasets without license information
alespalla/chatbot_instruction_prompts
- Summary: A compilation of
tatsu-lab/alpaca
,Dahoas/instruct-human-assistant-prompt
,allenai/prosocial-dialog
- Data generation model: N/A
- paper: N/A
- Cost: N/A
Contributing
Our purpose is to make this repo even better. If you are interested in contributing, please refer to HERE for instructions in contribution.
License
Awesome-Prompt-Dataset
is released under the Apache 2.0 license.