<div align="center"> <h1>💾 LLM Datasets</h1> <p> 🐦 <a href="https://twitter.com/maximelabonne">Follow me on X</a> • 🤗 <a href="https://huggingface.co/mlabonne">Hugging Face</a> • 💻 <a href="https://mlabonne.github.io/blog">Blog</a> • 📙 <a href="https://github.com/PacktPublishing/Hands-On-Graph-Neural-Networks-Using-Python">Hands-on GNN</a> </p> <p><em>High-quality datasets, tools, and concepts for LLM fine-tuning.</em></p> </div> <br/>

👍 What is a good dataset?

Data is the most valuable asset in LLM development. While datasets can't be directly evaluated like models, high-quality datasets share three characteristics: accuracy, diversity, and complexity.

Measuring accuracy can be easy for mathematical problems (using a Python interpreter) but near-impossible for open-ended, subjective questions. Clustering datasets by topic is a good way of measuring diversity. Finally, complexity can be assessed using other LLMs acting as judges.
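As a concrete illustration of the first point, accuracy on math samples can be checked by executing a model-generated Python solution and comparing it to a reference answer. This is only a sketch: the `answer` variable convention and the sample solution are hypothetical, and real pipelines should sandbox untrusted code.

```python
def check_math_answer(solution_code: str, expected: float, tol: float = 1e-6) -> bool:
    # Execute model-generated Python in an isolated namespace and
    # compare its `answer` variable against the reference answer.
    namespace: dict = {}
    exec(solution_code, namespace)  # untrusted code needs real sandboxing
    return abs(namespace["answer"] - expected) < tol

# Hypothetical model output for "What is 12 * 7 + 6?"
print(check_math_answer("answer = (12 * 7) + 6", 90))
```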

📅 Open SFT datasets

Once a model has been pre-trained on a next-token prediction task, supervised fine-tuning is used to turn it into an assistant capable of answering questions and achieving tasks. These datasets contain pairs of instructions and outputs to train LLMs to go beyond their pre-training objective. All the datasets listed here should be under permissive licensing (Apache 2.0, MIT, cc-by-4.0, etc.).
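At training time, these instruction/output pairs are rendered into a chat template. A minimal sketch, assuming a hypothetical sample with `instruction` and `output` fields; the ChatML tags are the ones used by models such as OpenHermes:

```python
sample = {
    "instruction": "What is the capital of France?",
    "output": "The capital of France is Paris.",
}

def to_chatml(instruction: str, output: str) -> str:
    # Wrap each turn in ChatML role tags so the model learns
    # where the user prompt ends and the assistant answer begins
    return (
        f"<|im_start|>user\n{instruction}<|im_end|>\n"
        f"<|im_start|>assistant\n{output}<|im_end|>\n"
    )

print(to_chatml(sample["instruction"], sample["output"]))
```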

General-purpose

The goal of general-purpose datasets is to transform base models into versatile and capable assistants by exposing them to a wide range of high-quality data. These datasets often include a diverse mix of real-world and synthetic data, commonly generated using models like GPT-4.

| Dataset | # | Authors | Date | Notes |
| --- | --- | --- | --- | --- |
| Infinity-Instruct | 7.45M | BAAI | Aug 2024 | High-quality evolved samples based on a collection of open-source datasets. |
| WebInstructSub | 2.39M | Yue et al. | May 2024 | Instructions created by retrieving documents from Common Crawl, extracting QA pairs, and refining them. See the MAmmoTH2 paper (this is a subset). |
| The-Tome | 1.75M | Arcee AI | Jul 2024 | Reranked and filtered collection of datasets with a focus on instruction following. See my 100k subset. |
| Hercules v4.5 | 1.72M | Sebastian Gabarain | Apr 2024 | Large-scale general-purpose dataset with math, code, RP, etc. See v4 for the list of datasets. |
| Dolphin-2.9 | 1.39M | Cognitive Computations | Apr 2023 | Large-scale general-purpose dataset used by the Dolphin models. |
| WildChat-1M | 1.04M | Zhao et al. | May 2023 | Real conversations between human users and GPT-3.5/4, including metadata. See the WildChat paper. |
| OpenHermes-2.5 | 1M | Teknium | Nov 2023 | Another large-scale dataset used by the OpenHermes models. |
| SlimOrca | 518k | Lian et al. | Sep 2023 | Curated subset of OpenOrca using GPT-4-as-a-judge to remove wrong answers. |
| Tulu V2 Mix | 326k | Ivison et al. | Nov 2023 | Mix of high-quality datasets. See the Tulu 2 paper. |
| UltraInteract SFT | 289k | Yuan et al. | Apr 2024 | Focus on math, coding, and logic tasks with step-by-step answers. See the Eurus paper. |
| NeurIPS-LLM-data | 204k | Jindal et al. | Nov 2023 | Winner of the NeurIPS LLM Efficiency Challenge, with an interesting data preparation strategy. |
| UltraChat 200k | 200k | Tunstall et al., Ding et al. | Oct 2023 | Heavily filtered version of the UltraChat dataset, consisting of 1.4M dialogues generated by ChatGPT. |
| WizardLM_evol_instruct_V2 | 143k | Xu et al. | Jun 2023 | Latest version of Evol-Instruct applied to Alpaca and ShareGPT data. See the WizardLM paper. |
| Synthia-v1.3 | 119k | Migel Tissera | Nov 2023 | High-quality synthetic data generated using GPT-4. |
| oasst1 | 84.4k | Köpf et al. | Mar 2023 | Human-generated assistant-style conversation corpus in 35 different languages. See the OASST1 paper and oasst2. |
| WizardLM_evol_instruct_70k | 70k | Xu et al. | Apr 2023 | Evol-Instruct applied to Alpaca and ShareGPT data. See the WizardLM paper. |
| airoboros-3.2 | 58.7k | Jon Durbin | Dec 2023 | High-quality uncensored dataset. |
| ShareGPT_Vicuna_unfiltered | 53k | anon823 1489123 | Mar 2023 | Filtered version of the ShareGPT dataset, consisting of real conversations between users and ChatGPT. |
| lmsys-chat-1m-smortmodelsonly | 45.8k | Nebulous, Zheng et al. | Sep 2023 | Filtered version of lmsys-chat-1m with responses from GPT-4, GPT-3.5-turbo, Claude-2, Claude-1, and Claude-instant-1. |
| Open-Platypus | 24.9k | Lee et al. | Sep 2023 | Collection of datasets that were deduplicated using Sentence Transformers (it contains an NC dataset). See the Platypus paper. |
| databricks-dolly-15k | 15k | Conover et al. | May 2023 | Prompt/response pairs generated by Databricks employees in eight instruction categories, including the seven outlined in the InstructGPT paper. |

Math & Logic

LLMs often struggle with mathematical reasoning and formal logic, which has led to the creation of specialized datasets. These datasets extend beyond pure mathematics, encompassing a wide range of problems that require systematic thinking and step-by-step reasoning, ultimately enabling LLMs to tackle complex real-world challenges that involve logical deduction and quantitative analysis.

| Dataset | # | Authors | Date | Notes |
| --- | --- | --- | --- | --- |
| OpenMathInstruct-1 | 5.75M | Toshniwal et al. | Feb 2024 | Problems from GSM8K and MATH, with solutions generated by Mixtral-8x7B. |
| NuminaMath-CoT | 859k | Jia Li et al. | 2024 | Sources range from Chinese high school math exercises to US and international mathematics olympiad competition problems. |
| MetaMathQA | 395k | Yu et al. | Dec 2023 | Bootstraps mathematical questions by rewriting them from multiple perspectives. See the MetaMath paper. |
| MathInstruct | 262k | Yue et al. | Sep 2023 | Compiled from 13 math rationale datasets, six of which are newly curated, with a focus on chain-of-thought and program-of-thought. |
| Orca-Math | 200k | Mitra et al. | Feb 2024 | Grade school math word problems generated using GPT-4 Turbo. See the Orca-Math paper. |
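Several of these datasets mark the final answer at the end of a chain-of-thought solution, which makes automatic verification easy. A minimal sketch of extracting the final answer from a GSM8K-style solution (the sample solution string is hypothetical):

```python
import re

def extract_final_answer(solution: str):
    # GSM8K-style solutions end with a '#### <answer>' marker
    m = re.search(r"####\s*(-?[\d,\.]+)", solution)
    return m.group(1).replace(",", "") if m else None

solution = "She sells 16 - 3 - 4 = 9 eggs.\nShe earns 9 * 2 = 18 dollars.\n#### 18"
print(extract_final_answer(solution))
```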

Code

Code is another challenging domain for LLMs that lack specialized pre-training. Code datasets, containing diverse programming language examples, are used to fine-tune LLMs and enhance their ability to understand, generate, and analyze code, enabling them to serve as effective coding assistants.

| Dataset | # | Authors | Date | Notes |
| --- | --- | --- | --- | --- |
| CodeFeedback-Filtered-Instruction | 157k | Zheng et al. | Feb 2024 | Filtered version of Magicoder-OSS-Instruct, ShareGPT (Python), Magicoder-Evol-Instruct, and Evol-Instruct-Code. |
| Tested-143k-Python-Alpaca | 143k | Vezora | Mar 2024 | Collection of generated Python code that passed automatic tests to ensure high quality. |
| glaive-code-assistant | 136k | Glaive.ai | Sep 2023 | Synthetic data of problems and solutions with ~60% Python samples. Also see the v2 version. |
| Magicoder-Evol-Instruct-110K | 110k | Wei et al. | Nov 2023 | A decontaminated version of evol-codealpaca-v1. Decontamination is done in the same way as StarCoder (bigcode decontamination process). See the Magicoder paper. |
| dolphin-coder | 109k | Eric Hartford | Nov 2023 | Dataset transformed from leetcode-rosetta. |
| synthetic_text_to_sql | 100k | Gretel.ai | Apr 2024 | Synthetic text-to-SQL samples (~23M tokens), covering diverse domains. |
| sql-create-context | 78.6k | b-mc2 | Apr 2023 | Cleansed and augmented version of the WikiSQL and Spider datasets. |
| Magicoder-OSS-Instruct-75K | 75k | Wei et al. | Nov 2023 | OSS-Instruct dataset generated by gpt-3.5-turbo-1106. See the Magicoder paper. |
| Code-Feedback | 66.4k | Zheng et al. | Feb 2024 | Diverse Code Interpreter-like dataset with multi-turn dialogues and interleaved text and code responses. See the OpenCodeInterpreter paper. |
| Open-Critic-GPT | 55.1k | Vezora | Jul 2024 | Uses a local model to create, introduce, and identify bugs in code across multiple programming languages. |
| self-oss-instruct-sc2-exec-filter-50k | 50.7k | Lozhkov et al. | Apr 2024 | Created in three steps: seed functions from TheStack v1, self-instruction with StarCoder2, and self-validation. See the blog post. |
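Datasets like Tested-143k-Python-Alpaca keep only generated code that passes automatic tests. A minimal sketch of that filtering step, assuming each sample pairs a candidate solution with assertion-style tests (real pipelines add stronger sandboxing than a plain subprocess):

```python
import os
import subprocess
import sys
import tempfile

def passes_tests(candidate: str, tests: str, timeout: float = 5.0) -> bool:
    # Write the generated code plus its tests to a temp file and run it
    # in a subprocess; a zero exit code means every assertion passed.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate + "\n\n" + tests + "\n")
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)
```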

Conversation & Role-Play

Many datasets focus on pairs of instructions and outputs, but chat models are often used in conversational settings. Conversational and role-play datasets expose LLMs to the patterns, nuances, and context-dependent nature of real conversations, allowing them to generate more natural and engaging dialogues.

| Dataset | # | Authors | Date | Notes |
| --- | --- | --- | --- | --- |
| Bluemoon | 290k | Squish42 | Jun 2023 | Posts from the Blue Moon roleplaying forum, cleaned and scraped by a third party. |
| PIPPA | 16.8k | Gosling et al., kingbri | Aug 2023 | Deduped version of Pygmalion's PIPPA in ShareGPT format. |
| Capybara | 16k | LDJnr | Dec 2023 | Strong focus on information diversity across a wide range of domains, with multi-turn conversations. |
| RPGPT_PublicDomain-alpaca | 4.26k | practical dreamer | May 2023 | Synthetic dataset of public domain character dialogue in roleplay format, made with build-a-dataset. |
| Pure-Dove | 3.86k | LDJnr | Sep 2023 | Highly filtered multi-turn conversations between GPT-4 and real humans. |
| Opus Samantha | 1.85k | macadelicc | Apr 2024 | Multi-turn conversations with Claude 3 Opus. |
| LimaRP-augmented | 804 | lemonilia, grimulkan | Jan 2024 | Augmented and cleansed version of LimaRP, consisting of human roleplaying conversations. |
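Several of the conversational datasets above use the ShareGPT format, where each sample is a list of alternating `human`/`gpt` turns. A minimal sketch of that structure with a simple validity check (the conversation text is hypothetical):

```python
# One multi-turn sample in the ShareGPT format: 'from' is the role
# ('human' or 'gpt') and 'value' is the message content.
sample = {
    "conversations": [
        {"from": "human", "value": "Can you recommend a book on LLMs?"},
        {"from": "gpt", "value": "Sure! 'Hands-On Large Language Models' is popular."},
        {"from": "human", "value": "Is it beginner-friendly?"},
        {"from": "gpt", "value": "Yes, it starts from the basics."},
    ]
}

def is_valid(convo: dict) -> bool:
    # Roles must strictly alternate, starting with the human
    turns = [t["from"] for t in convo["conversations"]]
    expected = (["human", "gpt"] * len(turns))[: len(turns)]
    return len(turns) > 0 and turns == expected
```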

Multilingual

Learning a new language "from scratch" is a pre-training task, but providing multilingual instruction samples is useful to boost performance in the languages of interest.

| Dataset | # | Authors | Date | Notes |
| --- | --- | --- | --- | --- |
| M2Lingual | 175k | ServiceNow AI | Jun 2024 | Dataset spanning 70+ languages and 20 NLP tasks, generated from GPT-4 using task-based taxonomy-guided evolutions. More details in the M2Lingual paper. |

Agent & Function calling

Function calling allows large language models (LLMs) to execute predefined functions with parameters inferred from user prompts, rather than generating standard text responses. This enables LLMs to seamlessly integrate with external systems, perform complex operations, and provide more accurate and contextually relevant responses.

| Dataset | # | Authors | Date | Notes |
| --- | --- | --- | --- | --- |
| glaive-function-calling-v2 | 113k | Sahil Chaudhary | Sep 2023 | High-quality dataset with pairs of instructions and answers in different languages. See Locutusque/function-calling-chatml for a variant without conversation tags. |
| xlam-function-calling-60k | 60k | Salesforce | Jun 2024 | Samples created using a data generation pipeline designed to produce verifiable data for function-calling applications. |
| Agent-FLAN | 34.4k | internlm | Mar 2024 | Mix of AgentInstruct, ToolBench, and ShareGPT datasets. |
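To make the mechanism concrete, a function-calling model emits a structured call instead of free text, and the application executes it. A minimal sketch, assuming the common `{"name": ..., "arguments": {...}}` JSON convention; the tool registry and weather function are hypothetical, not a specific provider's API:

```python
import json

# Hypothetical tool registry mapping tool names to callables
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
}

def dispatch(model_output: str) -> str:
    # Parse the model's structured call and invoke the matching tool
    call = json.loads(model_output)
    return TOOLS[call["name"]](**call["arguments"])

print(dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}'))
# → Sunny in Paris
```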

⚖️ Preference alignment

Preference alignment datasets are collections of data used to align an LLM's goals with those of its human operators or users. These datasets help LLMs stay consistent with human values and preferences.

| Dataset | # | Authors | Date | Notes |
| --- | --- | --- | --- | --- |
| ultrafeedback-binarized-preferences-cleaned | 158k | Argilla | 2023 | Created to explore whether DPO fine-tuning with more than one rejection per chosen response helps the model perform better on the AlpacaEval, MT-Bench, and LM Eval Harness benchmarks. |
| ultrafeedback-multi-binarized-preferences-cleaned | 158k | Argilla | 2023 | A new iteration on top of argilla/ultrafeedback-binarized-preferences-cleaned, built with the same goal. |
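A minimal sketch of what one binarized preference record looks like; the `prompt`/`chosen`/`rejected` field names follow the common DPO convention, and the example answers are hypothetical:

```python
record = {
    "prompt": "Explain overfitting in one sentence.",
    "chosen": "Overfitting occurs when a model memorizes training noise "
              "instead of learning patterns that generalize.",
    "rejected": "Overfitting means the model is very accurate.",
}

def is_valid_pair(r: dict) -> bool:
    # A usable pair needs non-empty fields and distinct answers
    return (
        all(isinstance(r.get(k), str) and r[k] for k in ("prompt", "chosen", "rejected"))
        and r["chosen"] != r["rejected"]
    )
```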

🔧 Tools

To create a high-quality dataset, focus on carefully curating a diverse set of relevant, accurate, and informative examples rather than simply maximizing dataset size.

Start by aggregating available data from various sources (open-source or not) and applying filters like data deduplication and data quality. If the initial dataset is small or insufficient, consider synthetically generating additional data that mirrors its quality and style. Iteratively explore and refine the dataset by assessing model performance, identifying gaps, and collecting or generating data to address those shortcomings.
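The deduplication step mentioned above can start as simply as hashing normalized text and keeping the first occurrence. A minimal sketch of exact deduplication (fuzzy methods like MinHash go further, but this illustrates the idea):

```python
import hashlib
import re

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different
    # copies of the same sample hash to the same value
    return re.sub(r"\s+", " ", text.lower()).strip()

def dedupe(samples):
    seen, unique = set(), []
    for s in samples:
        h = hashlib.sha256(normalize(s).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(s)
    return unique

print(dedupe(["Hello  World", "hello world", "Bye"]))
# → ['Hello  World', 'Bye']
```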

Tools listed in this section may belong to several categories but appear in only one for clarity.

Data deduplication and decontamination

Data quality evaluation

Data generation

SFT datasets

Pre-training datasets

Data exploration

Data scraping

Acknowledgments

Special thanks to geronimi73, Bytes-Explorer, euclaise, RishabhMaheshwary, and ParagEkbote for their PRs.

References

Please let me know if a dataset is not properly credited.

Citation

@misc{llm_datasets,
  author = {Labonne, Maxime},
  month = {04},
  title = {{LLM Datasets}},
  url = {https://github.com/mlabonne/llm-datasets/},
  year = {2024}
}