# awesome-text/visual-instruction-tuning-dataset

A collection of open-source instruction-tuning datasets for training text and multi-modal chat-based LLMs (e.g., GPT-4, ChatGPT, LLaMA, Alpaca). We currently include three types of datasets:

  1. visual-instruction-tuning datasets (e.g., image-instruction-answer triples)
  2. text-instruction-tuning datasets
  3. red-teaming / Reinforcement Learning from Human Feedback (RLHF) datasets

Instruction-tuning / Reinforcement Learning from Human Feedback (RLHF) datasets are a key component of instruction-following LLMs such as ChatGPT. This repo is dedicated to providing a comprehensive list of datasets used for instruction tuning in various LLMs, making it easier for researchers and developers to access and use these resources.

Lists of codebases to train your LLMs:

Size: the number of instruction-tuning pairs.

Lingual-Tags:

- EN: English
- CN: Chinese
- ML: multilingual

Task-Tags:

- MT: multi-task
- TS: task-specific

Generation-method:

- HG: human-generated
- SI: self-instruct (model-generated)
- MIX: mixed (human- and model-generated)
- COL: collection of existing datasets

## Table of Contents

  1. The template
  2. The Multi-modal Instruction Datasets
  3. The Instruction-following Datasets
  4. Reinforcement Learning from Human Feedback (RLHF) Datasets
  5. Licenses that Allow Commercial Use

## The template

Append new projects at the end of the file.

## [({owner}/{project-name})|{Tags}](https://github.com/link/to/project)

- summary:
- Data generation model:
- paper:
- License:
- Related: (if applicable)
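For instance, a filled-in entry might look like the following (the project name, numbers, and metadata here are entirely hypothetical, chosen only to illustrate the template):

```markdown
## [(example-org/example-dataset)|52K|EN|MT|SI](https://github.com/example-org/example-dataset)

- summary: 52K English instruction-response pairs generated via self-instruct.
- Data generation model: GPT-4
- paper: N/A
- License: Apache 2.0
- Related: N/A
```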

## The Multi-modal Instruction Datasets

| Project | Size | Lingual-Tags | Task-Tags | Generation-method |
|---|---|---|---|---|
| Vision-CAIR/MiniGPT-4 | 5K | EN | MT | MIX |
| haotian-liu/LLaVA | 150K | EN | MT | MIX |
| [sunrainyg/InstructCV](https://github.com/AlaaLab/InstructCV) | | EN | MT | MIX |

## The Instruction-following Datasets

| Project | Size | Lingual-Tags | Task-Tags | Generation-method |
|---|---|---|---|---|
| tatsu-lab/Alpaca | 52K | EN | MT | SI |
| gururise/Cleaned Alpaca | 52K | EN | MT | SI |
| XueFuzhao/InstructionWild | 52K | EN, CN | MT | SI |
| JosephusCheung/GuanacoDataset | 534K | ML | MT | SI |
| Hello-SimpleAI/HC3 | 24K | EN | MT | MIX |
| Hello-SimpleAI/HC3-Chinese | 13K | CN | MT | MIX |
| allenai/prosocial-dialog | 58K | EN | MT | MIX |
| allenai/natural-instructions | 1.6K | ML | MT | HG |
| bigscience/xP3 | N/A | ML | MT | MIX |
| PhoebusSi/Alpaca-CoT | 500K | ML | MT | COL |
| nomic-ai/gpt4all | 437K | EN | MT | COL |
| teknium1/GPTeacher | 20K+ | EN | MT | SI |
| google-research/FLAN | N/A | EN | MT | MIX |
| thunlp/UltraChat | 280K | EN | TS | MIX |
| cascip/ChatAlpaca | 10K | EN | MT | MIX |
| YeungNLP/firefly-train-1.1M | 1,100K | CN | MT | COL |
| orhonovich/unnatural-instructions | 240K | EN | MT | MIX |
| Instruction-Tuning-with-GPT-4/GPT-4-LLM | 52K | EN, CN | MT | SI |
| databrickslabs/dolly | 15K | EN | MT | HG |
| OpenAssistant/oasst1 | 161K | ML | MT | HG |
| RyokoAI/ShareGPT52K | 90K | ML | MT | SI |
| zjunlp/Mol-Instructions | 2,043K | ML | MT | MIX |
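Most entries above consist of instruction-tuning pairs in an Alpaca-like schema (`instruction`/`input`/`output` fields). As a minimal sketch, assuming that schema, a record can be rendered into a single training string with the prompt template used by the Stanford Alpaca repo; other datasets in this list use different templates, so adapt accordingly:

```python
# Sketch: render one Alpaca-style instruction-tuning pair into a training
# string. The template wording follows the Stanford Alpaca repo; the
# record fields assumed here are instruction/input/output.

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)

def format_example(record: dict) -> str:
    """Render one pair; the 'output' field becomes the training target."""
    if record.get("input"):
        prompt = PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    else:
        prompt = PROMPT_NO_INPUT.format(instruction=record["instruction"])
    return prompt + record["output"]

example = {"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"}
text = format_example(example)
```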

## Reinforcement Learning from Human Feedback (RLHF) / Red-Teaming Datasets

| Project | Size | Lingual-Tags | Task-Tags | Generation-method |
|---|---|---|---|---|
| Anthropic/hh-rlhf | 22K | EN | MT | MIX |
| thu-coai/Safety-Prompts | 100K | CN | MT | MIX |
| HuggingFaceH4/stack-exchange-preferences | 10,741K | EN | TS | HG |
| stanfordnlp/SHP | 385K | EN | MT | HG |
| Instruction-Tuning-with-GPT-4/GPT-4-LLM | 52K | EN | MT | MIX |
| Reddit/eli5 | 500K | EN | MT | HG |
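RLHF preference datasets such as Anthropic/hh-rlhf pair each prompt with a preferred ("chosen") and a dispreferred ("rejected") response. The sketch below shows one simple way to flatten such a pair into labeled rows; the field names and the label scheme are illustrative (reward models are often trained with a pairwise ranking loss instead), so check each dataset's actual schema:

```python
# Sketch: flatten an RLHF preference pair into (text, label) rows.
# Field names (prompt/chosen/rejected) are illustrative, modeled on
# the chosen/rejected split in Anthropic/hh-rlhf.

from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # preferred continuation
    rejected: str  # dispreferred continuation

def to_reward_model_rows(pair: PreferencePair) -> list[tuple[str, int]]:
    """Emit two rows per pair: label 1 for the preferred continuation,
    0 for the rejected one."""
    return [
        (pair.prompt + pair.chosen, 1),
        (pair.prompt + pair.rejected, 0),
    ]

pair = PreferencePair(
    prompt="Human: How do I boil an egg?\n\nAssistant: ",
    chosen="Place the egg in boiling water for about 8 minutes.",
    rejected="I don't know.",
)
rows = to_reward_model_rows(pair)
```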

## Licenses that Allow Commercial Use

Note: While these licenses permit commercial use, they may impose different requirements for attribution, distribution, or modification. Be sure to review the specific terms of each license before using a dataset in a commercial project.

Commercial use licenses:

  1. Apache License 2.0
  2. MIT License
  3. BSD 3-Clause License
  4. BSD 2-Clause License
  5. GNU Lesser General Public License v3.0 (LGPLv3)
  6. GNU Affero General Public License v3.0 (AGPLv3)
  7. Mozilla Public License 2.0 (MPL-2.0)
  8. Eclipse Public License 2.0 (EPL-2.0)
  9. Microsoft Public License (Ms-PL)
  10. Creative Commons Attribution 4.0 International (CC BY 4.0)
  11. Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
  12. zlib License
  13. Boost Software License 1.0