
awesome-text/visual-instruction-tuning-dataset

A collection of open-source instruction tuning datasets for training text and multi-modal chat-based LLMs (GPT-4, ChatGPT, LLaMA, Alpaca). We currently include three types of datasets:

visual-instruction-tuning datasets (e.g. image-instruction-answer)
text-instruction-tuning datasets
red-teaming | Reinforcement Learning from Human Feedback (RLHF) datasets

Instruction tuning and Reinforcement Learning from Human Feedback (RLHF) datasets are a key component of instruction-following LLMs such as ChatGPT. This repo aims to provide a comprehensive list of the datasets used for instruction tuning in various LLMs, making it easier for researchers and developers to find and use these resources.

Lists of codebases to train your LLMs:

nichtdax/awesome-totally-open-chatgpt: A codebase of totally open alternatives to ChatGPT

Size: The number of instruction tuning pairs

Lingual-Tags:

EN: Instruction datasets in English
CN: Instruction datasets in Chinese
ML: [Multi-lingual] Instruction datasets in multiple languages

Task-Tags:

MT: [Multi-task] Datasets containing multiple tasks
TS: [Task-specific] Datasets tailored for specific tasks

Generation-method:

HG: [Human Generated Dataset] Datasets created by humans
SI: [Self-Instruct] Datasets generated using self-instruct methods
MIX: [Mixed Dataset] Dataset contains both human and machine generated data
COL: [Collection of Dataset] Dataset made from a collection of other datasets
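
The tag legend above is regular enough to parse mechanically. The following is a minimal sketch, assuming entry headers follow the `(owner/project)|Size|Tags...` shape used throughout this list; the names `EntryHeader` and `parse_header` are hypothetical and not part of any existing tool. Any token that is not a known tag is treated as the size.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Tag vocabularies copied from the legend above.
LINGUAL_TAGS = {"EN", "CN", "ML"}
TASK_TAGS = {"MT", "TS"}
GENERATION_TAGS = {"HG", "SI", "MIX", "COL"}


@dataclass
class EntryHeader:
    """Structured view of one entry header (hypothetical helper type)."""
    project: str
    size: Optional[str] = None
    lingual: List[str] = field(default_factory=list)
    task: List[str] = field(default_factory=list)
    generation: List[str] = field(default_factory=list)


def parse_header(line: str) -> EntryHeader:
    """Split '(owner/project)|52K|EN|MT|SI' into its parts."""
    match = re.match(r"^\((?P<project>[^)]+)\)\|(?P<tags>.+)$", line.strip())
    if match is None:
        raise ValueError(f"not an entry header: {line!r}")
    header = EntryHeader(project=match.group("project"))
    for token in (t.strip() for t in match.group("tags").split("|")):
        if token in LINGUAL_TAGS:
            header.lingual.append(token)
        elif token in TASK_TAGS:
            header.task.append(token)
        elif token in GENERATION_TAGS:
            header.generation.append(token)
        elif header.size is None:
            # Assumption: the only non-tag token is the size (e.g. '52K', 'N/A').
            header.size = token
        else:
            raise ValueError(f"unrecognized token {token!r} in {line!r}")
    return header
```

For example, `parse_header("(XueFuzhao/InstructionWild)|52K|EN|CN|MT|SI")` keeps both lingual tags, and a header without a size leaves `size` as `None`.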

The template

Append the new project at the end of file

## [({owner}/{project-name})|Tags]({https://github.com/link/to/project})

- summary:
- Data generation model:
- paper:
- License:
- Related: (if applicable)
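
A new entry can also be generated rather than typed by hand. This is a rough sketch, assuming the header line is meant to be a Markdown link whose text is `(owner/project)|Tags`; the function `render_entry` and its parameter names are hypothetical, and the field labels are copied from the template above.

```python
from typing import List, Optional


def render_entry(
    owner: str,
    project: str,
    tags: List[str],
    url: str,
    summary: str,
    model: str = "N/A",
    paper: str = "N/A",
    license_name: str = "N/A",
    related: Optional[str] = None,
) -> str:
    """Render one list entry following the template's field order."""
    header = f"## [({owner}/{project})|{'|'.join(tags)}]({url})"
    lines = [
        header,
        "",
        f"- summary: {summary}",
        f"- Data generation model: {model}",
        f"- paper: {paper}",
        f"- License: {license_name}",
    ]
    # 'Related' is optional in the template, so emit it only when given.
    if related:
        lines.append(f"- Related: {related}")
    return "\n".join(lines)
```

The returned string can be appended to the end of the file, as the template instructions ask.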

The Multi-modal Instruction Datasets

(Vision-CAIR/MiniGPT-4)|5K|EN|MT|MIX

Summary: A high-quality, well-aligned image-text dataset (e.g. with more detailed image descriptions) created using a conversation between two bots, similar to ChatCaptioner. The image-text pairs can then be combined with predefined instruction templates for image-instruction-answer finetuning.
Modality: Text, Image
Data generation model: N/A
paper: MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models
License: BSD 3-Clause
Related:
- Interactive ChatCaptioner for image and video

(haotian-liu/LLaVA)|150K|EN|MT|MIX

Summary: LLaVA Visual Instruct 150K is a set of GPT-generated multimodal instruction-following data. It is constructed for visual instruction tuning and for building large multimodal models towards GPT-4 vision/language capability.
Modality: Text, Image
Data generation model: GPT-4-0314
paper: Visual Instruction Tuning
License: CC BY-NC 4.0

(sunrainyg/InstructCV)|EN|MT|MIX (https://github.com/AlaaLab/InstructCV)

Summary: Instruction-Tuned Text-To-Image Diffusion Models As Vision Generalists
Modality: Text, Image
paper: InstructCV
License: CC BY-NC 4.0

The Instruction-following Datasets

(tatsu-lab/Alpaca)|52K|EN|MT|SI

Summary: 52K examples generated by a modified self-instruct pipeline from 175 human-written seed tasks.
Data generation model: text-davinci-003
paper: alpaca-blog
License: CC BY-NC 4.0

(gururise/Cleaned Alpaca)|52K|EN|MT|SI

Summary: A project that manually cleaned the Alpaca 52K Dataset
Data generation model: text-davinci-003
paper: N/A
License: CC BY-NC 4.0

(XueFuzhao/InstructionWild)|52K|EN|CN|MT|SI

Summary: 52K examples generated by a modified self-instruct pipeline from 429 human-written seed tasks.
Data generation model: text-davinci-003
paper: N/A
License: The InstructWild dataset is intended for non-commercial research purposes only.

(JosephusCheung/GuanacoDataset)|534K|ML|MT|SI

Summary: 52K instruction examples generated by a modified self-instruct pipeline from 429 human-written seed tasks.
Data generation model: text-davinci-003
License: GPL-3.0

(Hello-SimpleAI/HC3)|24K|EN|MT|MIX

Summary: The first human-ChatGPT comparison corpus (English version), named the HC3 dataset.
Data generation model: gpt-3.5, human generated
paper: How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection
License: CC BY-SA 4.0

(Hello-SimpleAI/HC3-Chinese)|13K|CN|MT|MIX

Summary: The first human-ChatGPT comparison corpus (Chinese version), named the HC3 dataset.
Data generation model: gpt-3.5, human generated
paper: How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection
License: CC BY-SA 4.0

(allenai/prosocial-dialog)|58K|EN|MT|MIX

Summary: ProsocialDialog is the first large-scale multi-turn English dialogue dataset to teach conversational agents to respond to problematic content following social norms.
Data generation model: gpt-3.5, human generated
paper: ProsocialDialog: A Prosocial Backbone for Conversational Agents
License: CC BY 4.0

(allenai/natural-instructions)|1.6K|ML|MT|HG

Summary: A community effort to create a large collection of 1,616 diverse NLP tasks and their natural language definitions/instructions.
Data generation model: Human generated
paper: Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
License: Apache License 2.0

(bigscience/xP3)|N/A|ML|MT|MIX

Summary: [Prompt-resource] xP3 (Crosslingual Public Pool of Prompts) is a collection of prompts and datasets across 46 languages and 16 NLP tasks.
Data generation model: N/A
paper: Crosslingual Generalization through Multitask Finetuning
License: Apache License 2.0

(PhoebusSi/Alpaca-CoT)|500k|ML|MT|COL

Summary: A dataset for Chain-of-Thought reasoning based on LLaMA and Alpaca. Note: Their GitHub repository continuously collects and combines various instruction tuning datasets.
paper: N/A
License: Apache License 2.0

(nomic-ai/gpt4all)|437k|EN|MT|COL

Summary: gpt4all leverages three publicly available datasets: (1) laion/OIG, (2) pacovaldez/stackoverflow-questions, and (3) a subset of bigscience/bloomz-p3.
Data generation model: N/A
paper: GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo
License: MIT License

(teknium1/GPTeacher)|20k+|EN|MT|SI

Summary: A collection of modular datasets generated by GPT-4: General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer.
Data generation model: GPT-4
paper: N/A
License: MIT License

(google-research/FLAN)|N/A|EN|MT|MIX

Summary: The Flan Collection compiles datasets from Flan 2021, P3, and Super-Natural Instructions, along with dozens more datasets, into one place and formats them into a mix of zero-shot, few-shot and chain-of-thought templates.
Data generation model: N/A
paper: The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
License: Apache License 2.0

(thunlp/UltraChat)|280k|EN|TS|MIX

Summary: UltraChat aims to construct an open-source, large-scale, multi-round dialogue dataset. The first part of UltraChat (i.e., the Questions about the World sector) has been released and contains 280k diverse and informative dialogues. More dialogues about writing and creation, and about assistance on existing materials, are to come.
Data generation model: GPT-3.5-turbo
paper: N/A
License: CC BY-NC 4.0

(cascip/ChatAlpaca)|10k|EN|MT|MIX

Summary: Based on the Stanford Alpaca data, ChatAlpaca extends the data to multi-turn instructions and their corresponding responses. More data (20k) and the Chinese translated version are to come.
Data generation model: GPT-3.5-turbo
paper: N/A
License: Apache License 2.0
Related: (tatsu-lab/Alpaca)|52K|EN|MT|SI

(YeungNLP/firefly-train-1.1M)|1100k|CN|MT|COL

Summary: Chinese datasets of 23 tasks combined with human-written instruction templates.
Data generation model: N/A
paper: N/A
License: N/A

(orhonovich/unnatural-instructions)|240K|EN|MT|MIX

Summary: 64K examples collected by prompting a language model with three seed examples of instructions and eliciting a fourth. The set is then expanded to 240K by prompting the model to rephrase each instruction.
Data generation model: text-davinci-002
paper: Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
License: MIT License
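
The "three seed examples, elicit a fourth" idea in this entry can be sketched as a prompt builder. This is only an illustration: the prompt layout below is an assumption, not the format used by the Unnatural Instructions authors, and `build_elicitation_prompt` is a hypothetical name.

```python
from typing import Sequence


def build_elicitation_prompt(seed_instructions: Sequence[str]) -> str:
    """Lay out three seed instructions and leave a fourth slot open.

    The open 'Example 4' slot is what a language model would be asked
    to complete; the exact wording here is illustrative only.
    """
    if len(seed_instructions) != 3:
        raise ValueError("expected exactly three seed instructions")
    parts = [
        f"Example {index}\nInstruction: {text}\n"
        for index, text in enumerate(seed_instructions, start=1)
    ]
    # The prompt ends mid-example so the model continues with a new instruction.
    parts.append("Example 4\nInstruction:")
    return "\n".join(parts)
```

The completion returned by the model for the open slot would become a new instruction; rephrasing each collected instruction is a separate prompting step not shown here.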

(Instruction-Tuning-with-GPT-4/GPT-4-LLM)|52K|EN|CN|MT|SI

Summary: 52K instruction-following examples generated by GPT-4 with the original Alpaca prompts and with Alpaca prompts translated into Chinese by ChatGPT, plus 9K instruction-following examples generated by GPT-4 with prompts from Unnatural Instructions.
Data generation model: GPT-4
paper: Instruction Tuning with GPT-4
License: CC BY-NC 4.0
Related:
- (tatsu-lab/Alpaca)|52K|EN|MT|SI
- (orhonovich/unnatural-instructions)|240K|EN|MT|MIX

(databrickslabs/dolly)|15K|EN|MT|HG

Summary: This dataset was generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.
Data generation model: N/A
paper: Free Dolly
License: CC BY-SA 3.0

(OpenAssistant/oasst1)|161K|ML|MT|HG

Summary: OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in 35 different languages, annotated with 461,292 quality ratings.
Data generation model: N/A
paper: OpenAssistant Conversations - Democratizing Large Language Model Alignment
License: Apache License 2.0

(RyokoAI/ShareGPT52K)|90K|ML|MT|SI

Summary: 90,000 conversations scraped via the ShareGPT API before it was shut down. These conversations include both user prompts and responses from OpenAI's ChatGPT.
Data generation model: GPT-4,GPT-3.5
paper: N/A
License: CC0 1.0 Universal

(zjunlp/Mol-Instructions)|2043K|ML|MT|MIX

Summary: An open, large-scale biomolecular instruction dataset consisting of 148,4K molecule-oriented, 505K protein-oriented, and 53K biomolecular text instructions.
Data generation model: GPT-3.5
paper: Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models
License: CC BY 4.0

Reinforcement Learning from Human Feedback (RLHF) | Red-Teaming Datasets

(Anthropic/hh-rlhf)|22k|EN|MT|MIX

Summary: This RLHF dataset is an iterated 'online' dataset that includes data from 52B language models. It contains 22k helpfulness comparisons and no red-teaming data.
Data generation model: Anthropic RL-CAI 52B
paper: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
License: MIT License
Related:
- (Hello-SimpleAI/HC3)|24K|EN|MT|MIX
- (Hello-SimpleAI/HC3-Chinese)|13K|CN|MT|MIX

(thu-coai/Safety-Prompts)|100k|CN|MT|MIX

Summary: Chinese safety prompts for evaluating and improving the safety of LLMs. This repository includes 100k Chinese safety-scenario prompts and ChatGPT responses, covering various safety scenarios and instruction attacks. It can be used to comprehensively evaluate and improve model safety, to strengthen the model's knowledge of safety, and to align model output with human values.
Data generation model: GPT-3.5
paper: Safety Assessment of Chinese Large Language Models
License: Apache License 2.0

(HuggingFaceH4/stack-exchange-preferences)|10741k|EN|TS|HG

Summary: This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training.
Data generation model: N/A
paper: A General Language Assistant as a Laboratory for Alignment
License: CC BY-SA 4.0
Related:
- stack-exchange-paired

(stanfordnlp/SHP)|385k|EN|MT|HG

Summary: Each example is a Reddit post with a question/instruction and a pair of top-level comments for that post, where one comment is more preferred by Reddit users (collectively).
Data generation model: N/A
paper: N/A
License: N/A
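
For preference model training, a pairwise example like the one described in this entry is usually reshaped into a prompt with a preferred and a non-preferred response. The sketch below assumes a simplified record; the field names (`post`, `comment_a`, `comment_b`, `a_preferred`) are hypothetical and are not the dataset's actual column names.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PreferenceExample:
    """One post with two top-level comments (hypothetical field names)."""
    post: str
    comment_a: str
    comment_b: str
    a_preferred: bool  # True if comment A is the collectively preferred one


def to_chosen_rejected(example: PreferenceExample) -> Dict[str, str]:
    """Reshape a pairwise example into prompt / chosen / rejected form."""
    if example.a_preferred:
        chosen, rejected = example.comment_a, example.comment_b
    else:
        chosen, rejected = example.comment_b, example.comment_a
    return {"prompt": example.post, "chosen": chosen, "rejected": rejected}
```

The `prompt` / `chosen` / `rejected` layout is a common convention for pairwise reward-model data, chosen here for illustration rather than taken from this dataset.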

(Instruction-Tuning-with-GPT-4/GPT-4-LLM)|52K|EN|MT|MIX

Summary: Responses to Alpaca prompts from three models (GPT-4, GPT-3.5 and OPT-IML), ranked by asking GPT-4 to rate their quality (note: the data is evaluated by the GPT-4 model, NOT by humans). The authors believe "GPT-4 is capable of identifying and fixing its own mistakes, and accurately judging the quality of responses".
Data generation model: GPT-4
paper: Instruction Tuning with GPT-4
License: CC BY-NC 4.0
Related:
- (tatsu-lab/Alpaca)|52K|EN|MT|SI

(Reddit/eli5)|500k|EN|MT|HG

Summary: This dataset contains questions and answers from the subreddits r/explainlikeimfive, r/askhistorians and r/askscience.
Data generation model: N/A
paper: N/A
License: N/A
Related: eli5 dataset, a transformation of the eli5 dataset into a format similar to stack-exchange-paired.

Licenses that Allow Commercial Use

Note: While these licenses permit commercial use, they may have different requirements for attribution, distribution, or modification. Be sure to review the specific terms of each license before using it in a commercial project.

Commercial use licenses:

Apache License 2.0
MIT License
BSD 3-Clause License
BSD 2-Clause License
GNU Lesser General Public License v3.0 (LGPLv3)
GNU Affero General Public License v3.0 (AGPLv3)
Mozilla Public License 2.0 (MPL-2.0)
Eclipse Public License 2.0 (EPL-2.0)
Microsoft Public License (Ms-PL)
Creative Commons Attribution 4.0 International (CC BY 4.0)
Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
zlib License
Boost Software License 1.0
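
When filtering the datasets above by license, the list can be turned into a small lookup. This is a minimal sketch: the long names are copied from the list above, while the short forms in the second column are added aliases (an assumption about how licenses are abbreviated in the entries), and `allows_commercial_use` is a hypothetical helper.

```python
import re

# (name as listed above, short alias added for matching entry fields)
_COMMERCIAL_LICENSES = [
    ("Apache License 2.0", "Apache-2.0"),
    ("MIT License", "MIT"),
    ("BSD 3-Clause License", "BSD 3-Clause"),
    ("BSD 2-Clause License", "BSD 2-Clause"),
    ("GNU Lesser General Public License v3.0", "LGPLv3"),
    ("GNU Affero General Public License v3.0", "AGPLv3"),
    ("Mozilla Public License 2.0", "MPL-2.0"),
    ("Eclipse Public License 2.0", "EPL-2.0"),
    ("Microsoft Public License", "Ms-PL"),
    ("Creative Commons Attribution 4.0 International", "CC BY 4.0"),
    ("Creative Commons Attribution-ShareAlike 4.0 International", "CC BY-SA 4.0"),
    ("zlib License", "zlib"),
    ("Boost Software License 1.0", "BSL-1.0"),
]


def _normalize(name: str) -> str:
    """Lowercase and collapse whitespace so minor spelling variants match."""
    return re.sub(r"\s+", " ", name).strip().lower()


_ALLOWED = {_normalize(name) for pair in _COMMERCIAL_LICENSES for name in pair}


def allows_commercial_use(license_name: str) -> bool:
    """Return True only if the license appears in the list above."""
    return _normalize(license_name) in _ALLOWED
```

A `False` result only means the license is not on this list (for example `CC BY-NC 4.0` or `N/A`); it is not a legal determination, and the terms of each license still need to be reviewed.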