<p align="center" width="60%"> <img src="LOGO.png" width="40%" height="40%"> </p>

<div align="center">LLMDataHub: Awesome Datasets for LLM Training </div>


<p align="center"> 🔥 <a href="#general_aligment" target="_blank">Alignment Datasets</a> • 💡 <a href="#domain-specific" target="_blank">Domain-specific Datasets</a> • :atom: <a href="#pretrain" target="_blank">Pretraining Datasets</a> • 🖼️ <a href="#multimodal" target="_blank">Multimodal Datasets</a> <br> </p> <p align="center"> <img alt="GitHub last commit" src="https://img.shields.io/github/last-commit/Zjh-819/LLMDataHub"> <img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/Zjh-819/LLMDataHub"> </p>

Introduction 📄

Large language models (LLMs), such as OpenAI's GPT series, Google's Bard, and Baidu's Wenxin Yiyan, are driving profound technological changes. With the recent emergence of open-source large model frameworks like LLaMA and ChatGLM, training an LLM is no longer the exclusive domain of resource-rich companies. Training LLMs in small organizations or as an individual has become an important focus of the open-source community, with notable works including Alpaca, Vicuna, and Luotuo. Beyond model frameworks, large-scale, high-quality training corpora are also essential for training LLMs. The open-source corpora available in the community are still scattered, so the goal of this repository is to continuously collect high-quality training corpora for LLMs from the open-source community.

Training a chatbot LLM that follows human instructions effectively requires high-quality datasets that cover a range of conversation domains and styles. This repository provides a curated collection of datasets for chatbot training, listing the link, size, language, usage, and a brief description of each dataset. Our goal is to make it easier for researchers and practitioners to identify and select the most relevant and useful datasets for their chatbot LLM training needs. Whether you're working on improving chatbot dialogue quality, response generation, or language understanding, this repository has something for you.

Contact 📬 <br/>

If you want to contribute, you can contact:

Junhao Zhao 📧 <br/> Advised by Prof. Wanyun Cui

<div id="general_aligment">General Open Access Datasets for Alignment 🟢:</div>

Type Tags 🏷️:

Datasets released in November 2023

| Dataset name | Used by | Type | Language | Size | Description |
|---|---|---|---|---|---|
| helpSteer | / | RLHF | English | 37k instances | An RLHF dataset annotated by humans with helpfulness, correctness, coherence, complexity and verbosity measures. |
| no_robots | / | SFT | English | 10k instances | High-quality human-created SFT data, single turn. |

Datasets released in September 2023

| Dataset name | Used by | Type | Language | Size | Description |
|---|---|---|---|---|---|
| Anthropic_<br/>HH_Golden | ULMA | SFT / RLHF | English | train 42.5k + test 2.3k | An improved version of the harmless dataset from Anthropic's Helpful and Harmless (HH) datasets, in which GPT-4 rewrites the original "chosen" answers. Compared with the original harmless dataset, it empirically improves the performance of RLHF, DPO or ULMA methods significantly on harmlessness metrics. |

Datasets released in August 2023

| Dataset name | Used by | Type | Language | Size | Description |
|---|---|---|---|---|---|
| function_<br/>calling_<br/>extended | / | Pairs | English<br/>code | / | High-quality human-created dataset for enhancing LM's ability to use APIs. |
| AmericanStories | / | PT | English | / | Vast corpus scanned from the US Library of Congress. |
| dolma | OLMo | PT | / | 3T tokens | A large, diverse open-source corpus for LM pretraining. |
| Platypus | Platypus2 | Pairs | English | 25K | A very high-quality dataset for improving LM's STEM reasoning ability. |
| Puffin | Redmond-Puffin<br/>Series | Dialog | English | ~3k entries | A dataset consisting of conversations between real humans and GPT-4, featuring long context (over 1k tokens per conversation) and multi-turn dialogs. |
| tiny series | / | Pairs | English | / | A series of short and concise codes or texts aimed at improving LM's reasoning ability. |
| LongBench | / | Evaluation<br/>Only | English<br/>Chinese | 17 tasks | A benchmark for evaluating LLMs' long-context understanding capability. |

Datasets released in July 2023

| Dataset name | Used by | Type | Language | Size | Description |
|---|---|---|---|---|---|
| orca-chat | / | Dialog | English | 198,463 entries | An Orca-style dialog dataset aimed at improving LM's long-context conversational ability. |
| DialogStudio | / | Dialog | Multilingual | / | A collection of diverse datasets aimed at building conversational chatbots. |
| chatbot_arena<br/>_conversations | / | RLHF<br/>Dialog | Multilingual | 33k conversations | Cleaned conversations with pairwise human preferences collected on Chatbot Arena. |
| WebGLM-qa | WebGLM | Pairs | English | 43.6k entries | Dataset used by WebGLM, a QA system based on an LLM and the Internet. Each entry comprises a question, a response and a reference. The response is grounded in the reference. |
| phi-1 | phi-1 | Dialog | English | / | A dataset generated using the method in Textbooks Are All You Need. It focuses on math and CS problems. |
| Linly-<br/>pretraining-<br/>dataset | Linly series | PT | Chinese | 3.4GB | Chinese pretraining dataset used by the Linly series models, comprising ClueCorpusSmall, CSL news-crawl, etc. |
| FineGrainedRLHF | / | RLHF | English | ~5K examples | A repo aimed at developing a new framework for collecting human feedback. The data is collected to improve LLMs' factual correctness, topic relevance and other abilities. |
| dolphin | / | Pairs | English | 4.5M entries | An attempt to replicate Microsoft's Orca. Based on FLANv2. |
| openchat_<br/>sharegpt4_<br/>dataset | OpenChat | Dialog | English | 6k dialogs | A high-quality dataset generated by using GPT-4 to complete refined ShareGPT prompts. |

Datasets released in June 2023

| Dataset name | Used by | Type | Language | Size | Description |
|---|---|---|---|---|---|
| OpenOrca | / | Pairs | English | 4.5M completions | A collection of augmented FLAN data, generated using the method in the Orca paper. |
| COIG-PC <br/> COIG-Lite | / | Pairs | Chinese | / | Enhanced version of COIG. |
| WizardLM_Orca | orca_mini series | Pairs | English | 55K entries | Enhanced WizardLM data, generated using Orca's method. |
| arxiv instruct datasets<br/> math <br/> CS <br/> Physics | / | Pairs | English | 50K/<br/>50K/<br/>30K entries | A dataset consisting of question-answer pairs derived from ArXiv abstracts. Questions are generated using the t5-base model, while the answers are generated using the GPT-3.5-turbo model. |
| im-feeling-<br/>curious | / | Pairs | English | 2595 entries | Random questions and corresponding facts generated by Google's "I'm feeling curious" feature. |
| ign_clean<br/>_instruct<br/>_dataset_500k | / | Pairs | / | 509K entries | A large-scale SFT dataset synthetically created from a subset of Ultrachat prompts. ⚠ Lacks a detailed data card. |
| WizardLM<br/>evolve_instruct V2 | WizardLM | Dialog | English | 196k entries | The latest version of the Evolve Instruct dataset. |
| Dynosaur | / | Pairs | English | 800K entries | The dataset generated by applying the method in this paper. Its highlight is generating high-quality data at low cost. |
| SlimPajama | / | PT | Primarily<br/>English | / | A cleaned and deduplicated version of RedPajama. |
| LIMA dataset | LIMA | Pairs | English | 1k entries | High-quality SFT dataset used by LIMA: Less Is More for Alignment. |
| TigerBot Series | TigerBot | PT<br/>Pairs | Chinese<br/>English | / | Datasets used to train TigerBot, including pretraining data, SFT data and some domain-specific datasets like financial research reports. |
| TSI-v0 | / | Pairs | English | 30k examples<br/>per task | A multi-task instruction-tuning dataset recast from 475 of the tasksource datasets. Similar to the Flan dataset and Natural Instruction. |
| MNBVC | / | PT | Chinese | / | A large-scale, continuously updated Chinese pretraining dataset. |
| StackOverflow<br/>post | / | PT | / | 35GB | Raw StackOverflow data in markdown format, for pretraining. |

Datasets released before June 2023

| Dataset name | Used by | Type | Language | Size | Description |
|---|---|---|---|---|---|
| LaMini-Instruction | / | Pairs | English | 2.8M entries | A dataset distilled from the Flan collection, P3 and self-instruction. |
| ultraChat | / | Dialog | English | 1.57M dialogs | A large-scale dialog dataset created using two ChatGPT instances, one acting as the user and the other generating the responses. |
| ShareGPT_<br/>Vicuna_unfiltered | Vicuna | Pairs | Multilingual | 53K entries | Cleaned ShareGPT dataset. |
| pku-saferlhf-dataset | Beaver | RLHF | English | 10K + 1M | The first dataset of its kind; contains 10k instances with safety preferences. |
| RefGPT-Dataset<br/>nonofficial link | RefGPT | Pairs, Dialog | Chinese | ~50K entries | A Chinese dialog dataset aimed at improving the factual correctness of LLMs (mitigating LLM hallucination). |
| Luotuo-QA-A<br/>CoQA-Chinese | Luotuo project | Context | Chinese | 127K QA pairs | A dataset built upon translated CoQA. Augmented using the OpenAI API. |
| Wizard-LM-Chinese<br/>instruct-evol | Luotuo project | Pairs | Chinese | ~70K entries | Chinese version of WizardLM 70K. Answers are obtained by feeding the translated questions to OpenAI's GPT API. |
| alpaca_chinese<br/>dataset | / | Pairs | Chinese | / | GPT-4-translated Alpaca data, including some supplementary data (like Chinese poetry, application, etc.). Inspected by humans. |
| Zhihu-KOL | Open Assistant | Pairs | Chinese | 1.5GB | QA data from Zhihu, the well-known Chinese QA platform. |
| Alpaca-GPT-4_zh-cn | / | Pairs | Chinese | about 50K entries | A Chinese Alpaca-style dataset, generated by GPT-4 originally in Chinese, not translated. |
| hh-rlhf <br/> on Huggingface | Koala | RLHF | English | 161k pairs<br/>79.3MB | A pairwise dataset for training reward models in reinforcement learning to improve language models' harmlessness and helpfulness. |
| Panther-dataset_v1 | Panther | Pairs | English | 377 entries | A dataset derived from hh-rlhf, rewritten into the form of input-output pairs. |
| Baize Dataset | Baize | Dialog | English | 100K dialogs | A dialog dataset generated by GPT-4 using self-talking. Questions and topics are collected from Quora, StackOverflow and some medical knowledge sources. |
| h2ogpt-fortune2000<br/>personalized | h2ogpt | Pairs | English | 11363 entries | An instruction finetuning dataset developed by h2oai, covering various topics. |
| SHP | StableVicuna,<br/>chat-opt,<br/>SteamSHP | RLHF | English | 385K entries | An RLHF dataset that, unlike the ones mentioned previously, uses scores and timestamps to infer users' preferences. Covers 18 domains; collected by Stanford. |
| ELI5 | MiniLM series | FT,<br/>RLHF | English | 270K entries | Questions and answers collected from Reddit, including scores. Might be used for RLHF reward model training. |
| WizardLM<br/>evol_instruct <br/> V2 | WizardLM | Pairs | English | | An instruction finetuning dataset derived from Alpaca-52K, using the evolution method in this paper. |
| MOSS SFT data | MOSS | Pairs,<br/>Dialog | Chinese, English | 1.1M entries | A conversational dataset collected and developed by the MOSS team. Every entry has usefulness, loyalty and harmlessness labels. |
| ShareGPT52K | Koala, Stable LLM | Pairs | Multilingual | 52K | This dataset comprises conversations collected from ShareGPT, with a specific focus on customized creative conversation. |
| GPT-4all Dataset | GPT-4all | Pairs | English, <br/> Might have <br/> a translated version | 400k entries | A combination of some subsets of OIG, P3 and Stackoverflow. Covers topics like general QA and customized creative questions. |
| COIG | / | Pairs | Chinese,<br/>code | 200K entries | A Chinese-based dataset. It contains domains like general-purpose QA, Chinese exams and code. Its quality is checked by human annotators. |
| RedPajama-Data-1T | RedPajama | PT | Primarily English | 1.2T tokens <br/> 5TB | A fully open pretraining dataset that follows LLaMA's method. |
| OASST1 | OpenAssistant | Pairs,<br/> Dialog | Multilingual<br/>(English, Spanish, etc.) | 66,497 conversation trees | A large, human-written, human-annotated, high-quality conversation dataset. It aims at making LLMs generate more natural responses. |
| Alpaca-COT | Phoenix | Pairs,<br/> Dialog,<br/> CoT | English | / | A mixture of many datasets, like the classic Alpaca dataset, OIG, Guanaco and some CoT (Chain-of-Thought) datasets like FLAN-CoT. May be handy to use. |
| Bactrian-X | / | Pairs | Multilingual<br/> (52 languages) | 67K entries per language | A multilingual version of Alpaca and Dolly-15K. |
| databricks-dolly-15k <br/> zh-cn Ver | Dolly2.0 | Pairs | English | 15K+ entries | A dataset of human-written prompts and responses, featuring tasks such as open-domain question-answering, brainstorming, summarization, and more. |
| AlpacaDataCleaned | Some Alpaca/ LLaMA-like models | Pairs | English | / | Cleaned version of Alpaca, GPT_LLM and GPTeacher. |
| GPT-4-LLM Dataset | Some Alpaca-like models | Pairs,<br/> RLHF | English,<br/> Chinese | 52K entries for English and Chinese respectively <br/> 9K entries unnatural-instruction | NOT the dataset used by GPT-4!! It is generated by GPT-4 and some other LLMs for better Pairs and RLHF. It includes instruction data as well as comparison data in RLHF style. |
| GPTeacher | / | Pairs | English | 20k entries | A dataset containing targets generated by GPT-4, including many of the same seed tasks as the Alpaca dataset, with the addition of some new tasks such as roleplay. |
| HC3 | Koala | RLHF | English,<br/> Chinese | 24322 English <br/> 12853 Chinese | A multi-domain, human-vs-ChatGPT comparison dataset. Can be used for reward model training or ChatGPT detector training. |
| Alpaca data <br/> Download | Alpaca, ChatGLM-finetune-LoRA, Koala | Dialog,<br/> Pairs | English | 52K entries<br/>21.4MB | A dataset generated by text-davinci-003 to improve language models' ability to follow human instructions. |
| OIG <br/> OIG-small-chip2 | Pythia-Chat-Base-7B, GPT-NeoXT-Chat-Base-20B, Koala | Dialog,<br/> Pairs | English,<br/> code | 44M entries | A large conversational instruction dataset with medium and high quality subsets (OIG-small-chip2) for multi-task learning. |
| ChatAlpaca data | / | Dialog,<br/> Pairs | English,<br/> Chinese version coming soon | 10k entries<br/>39.5MB | A dataset aimed at helping researchers develop models for instruction-following in multi-turn conversations. |
| InstructionWild | ColossalChat | Pairs | English, Chinese | 10K entries | An Alpaca-style dataset, but with seed tasks that come from ChatGPT screenshots. |
| Firefly(流萤) | Firefly(流萤) | Pairs | Chinese | 1.1M entries<br/>1.17GB | A Chinese instruction-tuning dataset with 1.1 million human-written examples across 23 tasks, but no conversation. |
| BELLE <br/> 0.5M version <br/> 1M version <br/> 2M version | BELLE series, Chunhua (春华) | Pairs | Chinese | 2.67B in total | A Chinese instruction dataset similar to Alpaca data, constructed by generating answers from seed tasks, but no conversation. |
| GuanacoDataset | Guanaco | Dialog,<br/> Pairs | English,<br/> Chinese,<br/> Japanese | 534,530 entries | A multilingual instruction dataset for enhancing language models' capabilities in various linguistic tasks, such as natural language understanding and explicit content recognition. |
| OpenAI WebGPT | WebGPT's reward model, Koala | RLHF | English | 19,578 pairs | Dataset used in the WebGPT paper. Used for training the reward model in RLHF. |
| OpenAI<br/>Summarization<br/>Comparison | Koala | RLHF | English | ~93K entries<br/>420MB | A dataset of human feedback that helps train a reward model. The reward model was then used to train a summarization model to align with human preferences. |
| self-instruct | / | Pairs | English | 82K entries | The dataset generated using the well-known self-instruction method. |
| unnatural-instructions | / | Pairs | English | 240,670 examples | An early attempt to use a powerful model (text-davinci-002) to generate data. |
| xP3 (and some variant) | BLOOMZ, mT0 | Pairs | Multilingual,<br/> code | 79M entries<br/>88GB | An instruction dataset for improving language models' generalization ability, similar to Natural Instruct. |
| Flan V2 | / | / | English | / | A dataset that compiles Flan 2021, P3, Super-Natural Instructions and dozens more datasets into one, formatted into a mix of zero-shot, few-shot and chain-of-thought templates. |
| Natural Instruction <br/> GitHub&Download | tk-instruct series | Pairs, <br/> evaluation | Multilingual | / | A benchmark of over 1,600 tasks with instructions and definitions for evaluating and improving language models' multi-task generalization under natural language instruction. |
| CrossWOZ | / | Dialog | English,<br/>Chinese | 6K dialogs | The dataset introduced by this paper, mainly about tourism in Beijing; answers are generated automatically by rules. |
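
The SHP entry above notes that preferences are inferred from scores and timestamps rather than from explicit labels. The following is a minimal sketch of one way such a rule could work. It assumes (this is not spelled out in the table) that a reply counts as preferred when it was posted later than another reply to the same post yet still earned a higher score. The `Comment` type and `infer_preference` function are hypothetical names used only for illustration, not part of any dataset's API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Comment:
    text: str
    score: int         # net upvotes on the comment
    created_at: float  # Unix timestamp of posting


def infer_preference(a: Comment, b: Comment) -> Optional[Tuple[Comment, Comment]]:
    """Return (preferred, rejected), or None if no preference can be inferred.

    Assumed rule: the later comment had less time to collect votes, so if it
    still has the higher score it is taken as the preferred one.
    """
    if a.created_at == b.created_at:
        return None  # no ordering in time, so the rule does not apply
    early, late = (a, b) if a.created_at < b.created_at else (b, a)
    if late.score > early.score:
        return late, early
    return None  # the earlier comment may lead only because it had more exposure


first = Comment("Use a slow cooker.", score=12, created_at=1000.0)
second = Comment("Sear it first, then braise.", score=40, created_at=2000.0)
pair = infer_preference(first, second)
```

Here `pair` is `(second, first)`. Swapping the two scores yields `None`, because an earlier comment with more votes may simply have had more time to accumulate them.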

Potential Overlaps ⚠️

Each row item is the subject of the relation. For example, the OIG row reads "OIG contains hh-rlhf".

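| | OIG | hh-rlhf | xP3 | natural instruct | AlpacaDataCleaned | GPT-4-LLM | Alpaca-CoT |
|---|---|---|---|---|---|---|---|
| OIG | / | contains | overlap | overlap | overlap | | overlap |
| hh-rlhf | part of | / | | | | | overlap |
| xP3 | overlap | | / | overlap | | | overlap |
| natural instruct | overlap | | overlap | / | | | overlap |
| AlpacaDataCleaned | overlap | | | | / | overlap | overlap |
| GPT-4-LLM | | | | | overlap | / | overlap |
| Alpaca-CoT | overlap | overlap | overlap | overlap | overlap | overlap | / |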
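
When several of these datasets are mixed into one training set, the table above can be encoded as a lookup that flags pairs sharing data. The sketch below is illustrative only. The pair list is transcribed by hand from the table, reading the cells under the assumption that the relation is symmetric, so it should be checked against the table before use. `OVERLAPS` and `find_overlaps` are hypothetical names.

```python
from itertools import combinations

# Unordered dataset pairs marked "overlap", "contains" or "part of" in the table.
# Transcribed by hand; treat as an assumption, not an authoritative list.
OVERLAPS = {
    frozenset(pair)
    for pair in [
        ("OIG", "hh-rlhf"),
        ("OIG", "xP3"),
        ("OIG", "natural instruct"),
        ("OIG", "AlpacaDataCleaned"),
        ("OIG", "Alpaca-CoT"),
        ("hh-rlhf", "Alpaca-CoT"),
        ("xP3", "natural instruct"),
        ("xP3", "Alpaca-CoT"),
        ("natural instruct", "Alpaca-CoT"),
        ("AlpacaDataCleaned", "GPT-4-LLM"),
        ("AlpacaDataCleaned", "Alpaca-CoT"),
        ("GPT-4-LLM", "Alpaca-CoT"),
    ]
}


def find_overlaps(mix):
    """Return the sorted list of dataset pairs in `mix` that the table flags."""
    return sorted(
        tuple(sorted(pair))
        for pair in map(frozenset, combinations(mix, 2))
        if pair in OVERLAPS
    )


flagged = find_overlaps(["OIG", "xP3", "GPT-4-LLM"])
```

With this mix, `flagged` holds the single pair `("OIG", "xP3")`, which suggests deduplicating those two sources before training on both.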

<div id="pretrain">Open Datasets for Pretraining 🟢 :atom:</div>

| Dataset name | Used by | Type | Language | Size | Description |
|---|---|---|---|---|---|
| proof-pile | proof-GPT | PT | English<br/>LaTeX | 13GB | A pretraining dataset similar to the Pile but with a LaTeX corpus to enhance LM's ability in proofs. |
| peS2o | / | PT | English | 7.5GB | A high-quality academic paper dataset for pretraining. |
| StackOverflow<br/>post | / | PT | / | 35GB | Raw StackOverflow data in markdown format, for pretraining. |
| SlimPajama | / | PT | Primarily<br/>English | / | A cleaned and deduplicated version of RedPajama. |
| MNBVC | / | PT | Chinese | / | A large-scale, continuously updated Chinese pretraining dataset. |
| falcon-refinedweb | tiiuae/falcon series | PT | English | / | A refined subset of CommonCrawl. |
| CBook-150K | / | PT, <br/> building dataset | Chinese | 150K+ books | A raw Chinese books dataset. Needs a preprocessing pipeline. |
| Common Crawl | LLaMA (After some process) | building datasets, <br/> PT | / | / | The most well-known raw dataset, rarely used directly. One possible preprocessing pipeline is CCNet. |
| nlp_Chinese_Corpus | / | PT,<br/>TF | Chinese | / | A Chinese pretraining corpus. Includes Wikipedia, Baidu Baike, Baidu QA, some forum QA and news corpora. |
| The Pile (V1) | GLM (partly), LLaMA (partly), GPT-J, GPT-NeoX-20B, Cerebras-GPT 6.7B, OPT-175b | PT | Multilingual,<br/> code | 825GB | A diverse open-source language modeling dataset consisting of 22 smaller, high-quality datasets that cover many domains and tasks. |
| C4 <br/> Huggingface dataset <br/> TensorFlow dataset | Google T5 Series, LLaMA | PT | English | 305GB | A colossal, cleaned version of Common Crawl's web crawl corpus. Frequently used. |
| ROOTS | BLOOM | PT | Multilingual,<br/> code | 1.6TB | A diverse open-source dataset consisting of sub-datasets like Wikipedia and StackExchange for language modeling. |
| Pushshift reddit <br/> paper | OPT-175b | PT | / | / | Raw Reddit data; one possible processing pipeline is described in this paper. |
| Gutenberg project | LLaMA | PT | Multilingual | / | A book dataset, mostly novels. Not preprocessed. |
| CLUECorpus | / | PT, <br/> finetune, <br/> evaluation | Chinese | 100GB | A Chinese pretraining corpus sourced from Common Crawl. |

<div id="domain-specific">Domain-specific Datasets 🟢 💡</div>

| Dataset name | Used by | Type | Language | Size | Description |
|---|---|---|---|---|---|
| starcoderdata | starcoder<br/>series | PT | code | 783GB | A large pretraining dataset for improving LM's coding ability. |
| code_<br/>instructions<br/>_120k_alpaca | / | Pairs | English/code | 121,959 entries | code_instruction in instruction finetune format. |
| function-<br/>invocations-25k | some MPT <br/> variants | Pairs | English code | 25K entries | A dataset aimed at teaching AI models how to correctly invoke APIsGuru functions based on natural language prompts. |
| TheoremQA | / | Pairs | English | 800 | A high-quality STEM theorem QA dataset. |
| phi-1 | phi-1 | Dialog | English | / | A dataset generated using the method in Textbooks Are All You Need. It focuses on math and CS problems. |
| FinNLP | FinGPT | Raw data | English,<br/>Chinese | / | Open-source raw financial text data. Includes news, social media, etc. |
| PRM800K | A variant of<br/>GPT-4 | Context | English | 800K entries | A process supervision dataset for mathematical problems. |
| MeChat data ⚠️use with care | MeChat | Dialog | Chinese | 355733 utterances | A Chinese SFT dataset for training a mental healthcare chatbot. |
| ChatGPT-Jailbreak-Prompts <br/> ⚠️RISKY | / | / | English | 163KB file size | Prompts for bypassing the safety regulation of ChatGPT. Can be used for probing the harmlessness of LLMs. |
| awesome chinese<br/>legal resources | LaWGPT | / | Chinese | / | A collection of Chinese legal data for LLM training. |
| Long Form | / | Pairs | English | 23.7K entries | A dataset aimed at improving the long-text generation ability of LLMs. |
| symbolic-instruction-tuning | / | Pairs | English,<br/> code | 796 | A dataset focused on "symbolic" tasks such as SQL coding and mathematical computation. |
| Safety Prompt | / | Evaluation only | Chinese | 100k entries | Chinese safety prompts for evaluating and improving the safety of LLMs. |
| Tapir-Cleaned | / | Pairs | English | 116k entries | A revised version of the DAISLab dataset of IFTTT rules, which has been thoroughly cleaned, scored, and adjusted for the purpose of instruction-tuning. |
| instructional_<br/>codesearchnet_python | / | Pairs | English &<br/> Python | 192MB | A template-generated instructional Python dataset built from an annotated version of the code-search-net dataset for the Open-Assistant project. |
| finance-alpaca | / | Pairs | English | 1.3K entries | An Alpaca-style dataset focused on financial topics. |

<div id="multimodal">Multimodal Datasets for VLM </div>

| Dataset name | Used by | Type | Language | Size | Description |
|---|---|---|---|---|---|
| ShareGPT4V | / | image-prompt-caption | English | 1.2M instances | A set of GPT4-Vision-powered multimodal caption data. |
| OBELICS | idefics<br/>series | image-document | English | 141M documents | An open, massive, and curated collection of interleaved image-text web documents. |
| JourneyDB | / | image-prompt-caption | English | 4M instances | A large-scale dataset comprising QA, caption, and text prompting tasks, based on Midjourney images. |
| M3IT | Ying-VLM | instruction-image | Multilingual | 2.4M instances | A dataset comprising 40 tasks with 400 human-written instructions. |
| MIMIC-IT | Otter | instruction-image | Multilingual | 2.2M instances | High-quality multimodal instruction-response pairs based on images and videos. |
| LLaVA Instruction | LLaVA | instruction-image | English | 158k samples | A multimodal dataset generated from the COCO dataset by prompting GPT-4 for instructions. |

Private Datasets 🔴

| Dataset name | Used by | Type | Language | Size | Description |
|---|---|---|---|---|---|
| WebText(Reddit links) | GPT-2 | PT | English | / | Data crawled from Reddit and filtered for GPT-2 pretraining. |
| MassiveText | Gopher, Chinchilla | PT | 99% English, 1% other (including code) | | |
| WuDao(悟道) Corpora | GLM | PT | Chinese | 200GB | A large-scale Chinese corpus. Some components may have been open-sourced originally but are not available now. |