Awesome

<div align="center">LLMDataHub: Awesome Datasets for LLM Training </div>

<p align="center"> 🔥 <a href="#general_aligment" target="_blank">Alignment Datasets</a> • 💡 <a href="#domain-specific" target="_blank">Domain-specific Datasets</a> • :atom: <a href="#pretrain" target="_blank">Pretraining Datasets</a> 🖼️ <a href="#multimodal" target="_blank">Multimodal Datasets</a> <br> </p> <p align="center"> <img alt="GitHub last commit" src="https://img.shields.io/github/last-commit/Zjh-819/LLMDataHub"> <img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/Zjh-819/LLMDataHub"> </p>

Introduction 📄

Large language models (LLMs), such as OpenAI's GPT series, Google's Bard, and Baidu's Wenxin Yiyan, are driving profound technological changes. Recently, with the emergence of open-source large model frameworks like LlaMa and ChatGLM, training an LLM is no longer the exclusive domain of resource-rich companies. Training LLMs by small organizations or individuals has become an important interest in the open-source community, with some notable works including Alpaca, Vicuna, and Luotuo. In addition to large model frameworks, large-scale and high-quality training corpora are also essential for training large language models. Currently, relevant open-source corpora in the community are still scattered. Therefore, the goal of this repository is to continuously collect high-quality training corpora for LLMs in the open-source community.

Training a chatbot LLM that can follow human instruction effectively requires access to high-quality datasets that cover a range of conversation domains and styles. In this repository, we provide a curated collection of datasets specifically designed for chatbot training, including links, size, language, usage, and a brief description of each dataset. Our goal is to make it easier for researchers and practitioners to identify and select the most relevant and useful datasets for their chatbot LLM training needs. Whether you're working on improving chatbot dialogue quality, response generation, or language understanding, this repository has something for you.

Contact 📬 <br/>

If you want to contribute, you can contact:

Junhao Zhao 📧 <br/> Advised by Prof. Wanyun Cui

<div id="general_aligment">General Open Access Datasets for Alignment 🟢:</div>

Type Tags 🏷️:

SFT: Supervised Finetune
- Dialog: Each entry contains continuous conversations
- Pairs: Each entry is an input-output pair
- Context: Each entry has a context text and related QA pairs
PT: pretrain
CoT: Chain-of-Thought Finetune
RLHF: train reward model in Reinforcement Learning with Human Feedback

Datasets released in November 2023

Dataset name	Used by	Type	Language	Size	Description ️
helpSteer	/	RLHF	English	37k instances	An RLHF dataset that is annotated by human with helpfulness, correctness, coherence, complexity and verbosity measures
no_robots	/	SFT	English	10k instance	High-quality human-created STF data, single turn.

Datasets released in September 2023

Dataset name	Used by	Type	Language	Size	Description ️
Anthropic_<br/>HH_Golden	ULMA	SFT / RLHF	English	train 42.5k + test 2.3k	Improved on the harmless dataset of Anthropic's Helpful and Harmless (HH) datasets. Using GPT4 to rewrite the original "chosen" answer. Compared with the original Harmless dataset, empirically this dataset improves the performance of RLHF, DPO or ULMA methods significantly on harmless metrics.

Datasets released in August 2023

Dataset name	Used by	Type	Language	Size	Description ️
function_<br/>calling_<br/>extended	/	Pairs	English<br/>code	/	High quality human created dataset from enhance LM's API using ability.
AmericanStories	/	PT	English	/	Vast sized corpus scanned from US Library of Congress.
dolma	OLMo	PT	/	3T tokens	A large diverse open-source corpus for LM pretraining.
Platypus	Platypus2	Pairs	English	25K	A very high quality dataset for improving LM's STEM reasoning ability.
Puffin	Redmond-Puffin<br/>Series	Dialog	English	~3k entries	A dataset consists of conversations between real human and GPT-4，which features long context (over 1k tokens per conversation) and multi-turn dialogs.
tiny series	/	Pairs	English	/	A series of short and concise codes or texts aim at improving LM's reasoning ability.
LongBench	/	Evaluation<br/>Only	English<br/>Chinese	17 tasks	A benchmark for evaluate LLM's long context understanding capability.

Datasets released in July 2023

Dataset name	Used by	Type	Language	Size	Description ️
orca-chat	/	Dialog	English	198,463 entries	An Orca-style dialog dataset aims at improving LM's long context conversational ability.
DialogStudio	/	Dialog	Multilingual	/	A collection of diverse datasets aim at building conversational Chatbot.
chatbot_arena<br/>_conversations	/	RLHF<br/>Dialog	Multilingual	33k conversations	Cleaned conversations with pairwise human preferences collected on Chatbot Arena.
WebGLM-qa	WebGLm	Pairs	English	43.6k entries	Dataset used by WebGLM, which is a QA system based on LLM and Internet. Each of the entry in this dataset comprise a question, a response and a reference. The response is grounded in the reference.
phi-1	phi-1	Dialog	English	/	A dataset generated by using the method in Textbooks Are All You Need. It focuses on math and CS problems.
Linly-<br/>pretraining-<br/>dataset	Linly series	PT	Chinese	3.4GB	Chinese pretraining dataset used by Linly series model, comprises ClueCorpusSmall, CSL news-crawl and etc.
FineGrainedRLHF	/	RLHF	English	~5K examples	A repo aims at develop a new framework to collect human feedbacks. Data collected is with the purpose to improve LLMs factual correctness, topic relevance and other abilities.
dolphin	/	Pairs	English	4.5M entries	An attempt to replicate Microsoft's Orca. Based on FLANv2.
openchat_<br/>sharegpt4_<br/>dataset	OpenChat	Dialog	English	6k dialogs	A high quality dataset generated by using GPT-4 to complete refined ShareGPT prompts.

Datasets released in June 2023

Dataset name	Used by	Type	Language	Size	Description ️
OpenOrca	/	Pairs	English	4.5M completions	A collection of augmented FLAN data. Generated by using method is Orca paper.
COIG-PC <br/> COIG-Lite	/	Pairs	Chinese	/	Enhanced version of COIG.
WizardLM_Orca	orca_mini series	Pairs	English	55K entries	Enhanced WizardLM data. Generated by using orca's method.
arxiv instruct datasets<br/> math <br/> CS <br/> Physics	/	Pairs	English	50K/<br/>50K/<br/>30K entries	dataset consists of question-answer pairs derived from ArXiv abstracts. Questions are generated using the t5-base model, while the answers are generated using the GPT-3.5-turbo model.
im-feeling-<br/>curious	/	Pairs	English	2595 entries	Random questions and correspond facts generated by Google I'm feeling curious features.
ign_clean<br/>_instruct<br/>_dataset_500k	/	Pairs	/	509K entries	A large scale SFT dataset which is synthetically created from a subset of Ultrachat prompts. ⚠ lack of detailed datacard
WizardLM<br/>evolve_instruct V2	WizardLM	Dialog	English	196k entries	The latest version of Evolve Instruct dataset.
Dynosaur	/	Pairs	English	800K entries	The dataset generated by applying method in this paper. Highlight is generating high-quality data at low cost.
SlimPajama	/	PT	Primarily<br/>English	/	A cleaned and deduplicated version of RedPajama
LIMA dataset	LIMA	Pairs	English	1k entries	High quality SFT dataset used by LIMA: Less Is More for Alignment
TigerBot Series	TigerBot	PT<br/>Pairs	Chinese<br/>English	/	Datasets used to train the TigerBot, including pretraining data, STF data and some domain specific datasets like financial research reports.
TSI-v0	/	Pairs	English	30k examples<br/>per task	A Multi-task instruction-tuning data recasted from 475 of the tasksource datasets. Similar to Flan dataset and Natural instruction.
NMBVC	/	PT	Chinese	/	A large scale, continuously updating Chinese pretraining dataset.
StackOverflow<br/>post	/	PT	/	35GB	Raw StackOverflow data in markdown format, for pretraining.

Datasets released before June 2023

Dataset name	Used by	Type	Language	Size	Description ️
LaMini-Instruction	/	Pairs	English	2.8M entries	A dataset distilled from flan collection, p3 and self-instruction.
ultraChat	/	Dialog	English	1.57M dialogs	A large scale dialog dataset created by using two ChatGPT, one of which act as the user, another generates response.
ShareGPT_<br/>Vicuna_unfiltered	Vicuna	Pairs	Multilingual	53K entries	Cleaned ShareGPT dataset.
pku-saferlhf-dataset	Beaver	RLHF	English	10K + 1M	The first dataset of its kind and contains 10k instances with safety preferences.
RefGPT-Dataset<br/>nonofficial link	RefGPT	Pairs, Dialog	Chinese	~50K entries	A Chinese dialog dataset aims at improve the correctness of fact in LLMs (mitigate the hallucination of LLM).
Luotuo-QA-A<br/>CoQA-Chinese	Luotuo project	Context	Chinese	127K QA pairs	A dataset built upon translated CoQA. Augmented by using OpenAI API.
Wizard-LM-Chinese<br/>instruct-evol	Luotuo project	Pairs	Chinese	~70K entries	Chinese version WizardLM 70K. Answers are obtained by feed translated questions in OpenAI's GPT API and then get responses.
alpaca_chinese<br/>dataset	/	Pairs	Chinese	/	GPT-4 translated alpaca data includes some complement data (like Chinese poetry, application, etc.). Inspected by human.
Zhihu-KOL	Open Assistant	Pairs	Chinese	1.5GB	QA data on well-know Chinese Zhihu QA platform.
Alpaca-GPT-4_zh-cn	/	Pairs	Chinese	about 50K entries	A Chinese Alpaca-style dataset, generated by GPT-4 originally in Chinese, not translated.
hh-rlhf <br/> on Huggingface	Koala	RLHF	English	161k pairs<br/>79.3MB	A pairwise dataset for training reward models in reinforcement learning for improving language models' harmlessness and helpfulness.
Panther-dataset_v1	Panther	Pairs	English	377 entries	A dataset comes from the hh-rlhf. It rewrite hh-rlhf into the form of input-output pairs.
Baize Dataset	Baize	Dialog	English	100K dialogs	A dialog dataset generated by GPT-4 using self-talking. Questions and topics are collected from Quora, StackOverflow and some medical knowledge source.
h2ogpt-fortune2000<br/>personalized	h2ogpt	Pairs	English	11363 entries	A instruction finetune developed by h2oai, covered various topics.
SHP	StableVicuna,<br/>chat-opt,<br/>, SteamSHP	RLHF	English	385K entries	An RLHF dataset different from previously mentioned ones, it use scores+timestamps to infer the users' preferences. Covers 18 domains, collected by Stanford.
ELI5	MiniLM series	FT,<br/>RLHF	English	270K entries	Questions and Answers collected from Reddit, including score. Might be used for RLHF reward model training.
WizardLM<br/>evol_instruct <br/> V2	WizardLM	Pairs	English		An instruction finetune dataset derived from Alpaca-52K, using the evolution method in this paper
MOSS SFT data	MOSS	Pairs,<br/>Dialog	Chinese, English	1.1M entries	A conversational dataset collected and developed by MOSS team. It has usefulness, loyalty and harmlessness labels for every data entries.
ShareGPT52K	Koala, Stable LLM	Pairs	Multilingual	52K	This dataset comprises conversations collected from ShareGPT, with a specific focus on customized creative conversation.
GPT-4all Dataset	GPT-4all	Pairs	English, <br/> Might have <br/> a translated version	400k entries	A combination of some subsets of OIG, P3 and Stackoverflow. Covers topics like general QA, customized creative questions.
COIG	/	Pairs	Chinese,<br/>code	200K entries	A Chinese-based dataset. It contains domains like general purpose QA, Chinese exams, code. Its quality is checked by human annotators.
RedPajama-Data-1T	RedPajama	PT	Primarily English	1.2T tokens <br/> 5TB	A fully open pretraining dataset follows the LLaMA's method.
OASST1	OpenAssistant	Pairs,<br/> Dialog	Multilingual<br/>(English, Spanish, etc.)	66,497 conversation trees	A large, human-written, human-annotated high quality conversation dataset. It aims at making LLM generates more natural response.
Alpaca-COT	Phoenix	Pairs,<br/> Dialog,<br/> CoT	English	/	A mixture a many dataset like classic Alpaca dataset, OIG, Guanaco and some CoT(Chain-of-Thought) datasets like FLAN-CoT. May be handy to use.
Bactrian-X	/	Pairs	Multilingual<br/> (52 languages)	67K entries per language	A multilingual version of Alpaca and Dolly-15K.
databricks-dolly-15k <br/> zh-cn Ver	Dolly2.0	Pairs	English	15K+ entries	A dataset of human-written prompts and responses, featuring tasks such as open-domain question-answering, brainstorming, summarization, and more.
AlpacaDataCleaned	Some Alpaca/ LLaMA-like models	Pairs	English	/	Cleaned version of Alpaca, GPT_LLM and GPTeacher.
GPT-4-LLM Dataset	Some Alpaca-like models	Pairs,<br/> RLHF	English,<br/> Chinese	52K entries for English and Chinese respectively <br/> 9K entries unnatural-instruction	NOT the dataset used by GPT-4!! It is generated by GPT-4 and some other LLM for better Pairs and RLHF. It includes instruction data as well as comparison data in RLHF style.
GPTeacher	/	Pairs	English	20k entries	A dataset contains targets generated by GPT-4 and includes many of the same seed tasks as the Alpaca dataset, with the addition of some new tasks such as roleplay.
HC3	Koala	RLHF	English,<br/> Chinese	24322 English <br/> 12853 Chinese	A multi-domain, human-vs-ChatGPT comparison dataset. Can be used for reward model training or ChatGPT detector training.
Alpaca data <br/> Download	Alpaca, ChatGLM-finetune-LoRA, Koala	Dialog,<br/> Pairs	English	52K entries<br/>21.4MB	A dataset generated by text-davinci-003 to improve language models' ability to follow human instruction.
OIG <br/> OIG-small-chip2	Pythia-Chat-Base-7B, GPT-NeoXT-Chat-Base-20B, Koala	Dialog,<br/> Pairs	English,<br/> code	44M entries	A large conversational instruction dataset with medium and high quality subsets (OIG-small-chip2) for multi-task learning.
ChatAlpaca data	/	Dialog,<br/> Pairs	English,<br/> Chinese version coming soon	10k entries<br/>39.5MB	A dataset aims to help researchers develop models for instruction-following in multi-turn conversations.
InstructionWild	ColossalChat	Pairs	English, Chinese	10K enreues	A Alpaca-style dataset, but with seed tasks comes from chatgpt screenshot.
Firefly(流萤)	Firefly(流萤)	Pairs	Chinese	1.1M entries<br/>1.17GB	A Chinese instruction-tuning dataset with 1.1 million human-written examples across 23 tasks, but no conversation.
BELLE <br/> 0.5M version <br/> 1M version <br/> 2M version	BELLE series, Chunhua (春华)	Pairs	Chinese	2.67B in total	A Chinese instruction dataset similar to Alpaca data constructed by generating answers from seed tasks, but no conversation.
GuanacoDataset	Guanaco	Dialog,<br/> Pairs	English,<br/> Chinese,<br/> Japanese	534,530 entries	A multilingual instruction dataset for enhancing language models' capabilities in various linguistic tasks, such as natural language understanding and explicit content recognition.
OpenAI WebGPT	WebGPT's reward model, Koala	RLHF	English	19,578 pairs	Data set used in WebGPT paper. Used for training reward model in RLHF.
OpenAI<br/>Summarization<br/>Comparison	Koala	RLHF	English	~93K entries<br/>420MB	A dataset of human feedback which helps training a reward model. The reward model was then used to train a summarization model to align with human preferences.
self-instruct	/	Pairs	English	82K entries	The dataset generated by using the well-known self-instruction method
unnatural-instructions	/	Pairs	English	240,670 examples	An early attempt to use powerful model (text-davinci-002) to generate data.
xP3 (and some variant)	BLOOMZ, mT0	Pairs	Multilingual,<br/> code	79M entries<br/>88GB	An instruction dataset for improving language models' generalization ability, similar to Natural Instruct.
Flan V2	/	/	English	/	A dataset compiles datasets from Flan 2021, P3, Super-Natural Instructions, along with dozens more datasets into one and formats them into a mix of zero-shot, few-shot and chain-of-thought templates
Natural Instruction <br/> GitHub&Download	tk-instruct series	Pairs, <br/> evaluation	Multilingual	/	A benchmark with over 1,600 tasks with instruction and definition for evaluating and improving language models' multi-task generalization under natural language instruction.
CrossWOZ	/	Dialog	English,<br/>Chinese	6K dialogs	The dataset introduced by this paper, mainly about tourism topic in Beijing, answers are generated automatically by rules.

Potential Overlaps ⚠️

We consider row items as subject.

	OIG	hh-rlhf	xP3	natural instruct	AlpacaDataCleaned	GPT-4-LLM	Alpaca-CoT
OIG	/	contains	overlap	overlap	overlap		overlap
hh-rlhf	part of	/					overlap
xP3	overlap		/	overlap			overlap
natural instruct	overlap		overlap	/			overlap
AlpacaDataCleaned	overlap				/	overlap	overlap
GPT-4-LLM					overlap	/	overlap
Alpaca-CoT	overlap	overlap	overlap	overlap	overlap	overlap	/

<div id="pretrain">Open Datasets for Pretraining 🟢 :atom:</div>

Dataset name	Used by	Type	Language	Size	Description ️
proof-pile	proof-GPT	PT	English<br/>LaTeX	13GB	A pretraining dataset which is similar to the pile but have LaTeX corpus to enhance LM's ability in proof.
peS2o	/	PT	English	7.5GB	A high quality academic paper dataset for pretraining.
StackOverflow<br/>post	/	PT	/	35GB	Raw StackOverflow data in markdown format, for pretraining.
SlimPajama	/	PT	Primarily<br/>English	/	A cleaned and deduplicated version of RedPajama
NMBVC	/	PT	Chinese	/	A large scale, continuously updating Chinese pretraining dataset.
falcon-refinedweb	tiiuae/falcon series	PT	English	/	A refined subset of CommonCrawl.
CBook-150K	/	PT, <br/> building dataset	Chinese	150K+ books	A raw Chinese books dataset. Need some preprocess pipeline.
Common Crawl	LLaMA (After some process)	building datasets, <br/> PT	/	/	The most well-known raw dataset, rarely be used directly. One possible preprocess pipeline is CCNet
nlp_Chinese_Corpus	/	PT,<br/>TF	Chinese	/	A Chinese pretrain corpus. Includes Wikipedia, Baidu Baike, Baidu QA, some forums QA and news corpus.
The Pile (V1)	GLM (partly), LLaMA (partly), GPT-J, GPT-NeoX-20B, Cerebras-GPT 6.7B, OPT-175b	PT	Multilingual,<br/> code	825GB	A diverse open-source language modeling dataset consisting of 22 smaller, high-quality datasets that includes many domains and tasks.
C4 <br/> Huggingface dataset <br/> TensorFlow dataset	Google T5 Series, LLaMA	PT	English	305GB	A colossal, cleaned version of Common Crawl's web crawl corpus. Frequently be used.
ROOTS	BLOOM	PT	Multilingual,<br/> code	1.6TB	A diverse open-source dataset consisting of sub-datasets like Wikipedia and StackExchange for language modeling.
PushshPairs reddit <br/> paper	OPT-175b	PT	/	/	Raw reddit data, one possible processing pipeline in this paper
Gutenberg project	LLaMA	PT	Multilingual	/	A book dataset, mostly novels. Not be preprocessed.
CLUECorpus	/	PT, <br/> finetune, <br/> evaluation	Chinese	100GB	A Chinese pretraining Corpus sourced from Common Crawl.

<div id="domain-specific">Domain-specific Datasets 🟢 💡</div>

Dataset name	Used by	Type	Language	Size	Description ️
starcoderdata	starcoder<br/>series	PT	code	783GB	A large pretraining dataset for improving LM's coding ability.
code_<br/>instructions<br/>_120k_alpaca	/	Pairs	English/code	121,959 entries	code_instruction in instruction finetune format.
function-<br/>invocations-25k	some MPT <br/> variants	Pairs	English code	25K entries	A dataset aims at teaching AI models how to correctly invoke APIsGuru functions based on natural language prompts.
TheoremQA	/	Pairs	English	800	A high quality STEM theorm QA dataset.
phi-1	phi-1	Dialog	English	/	A dataset generated by using the method in Textbooks Are All You Need. It focuses on math and CS problems.
FinNLP	FinGPT	Raw data	English,<br/>Chinese	/	Open-source raw financial text data. Includes news, social media and etc.
PRM800K	A variant of<br/>GPT-4	Context	English	800K entries	A process supervision dataset for mathematical problems
MeChat data ⚠️use with care	MeChat	Dialog	Chinese	355733 utterances	A Chinese SFT dataset for training a mental healthcare chatbot.
ChatGPT-Jailbreak-Prompts <br/> ⚠️RISKY	/	/	English	163KB file size	Prompts for bypassing the safety regulation of ChatGPT. Can be use for probing the harmlessness of LLMs
awesome chinese<br/>legal resources	LaWGPT	/	Chinese	/	A collection of Chinese legal data for LLM training.
Long Form	/	Pairs	English	23.7K entries	A dataset aims at improving the long text generation ability of LLM.
symbolic-instruction-tuning	/	Pairs	English,<br/> code	796	A dataset focuses on the 'symbolic' tasks: like SQL coding, mathematical computation, etc.
Safety Prompt	/	Evaluation only	Chinese	100k entries	Chinese safety prompts for evaluating and improving the safety of LLMs.
Tapir-Cleaned	/	Pairs	English,	116k entries	This is a revised version of the DAISLab dataset of PairsTT rules, which has been thoroughly cleaned, scored, and adjusted for the purpose of instruction-tuning
instructional_<br/>codesearchnet_python	/	Pairs	English &<br/> Python	192MB	This dataset is a template generated instructional Python datastet generated from an annotated version of the code-search-net dataset for the Open-Assistant project.
finance-alpaca	/	Pairs	English	1.3K entries	An Alpaca-style dataset but focus on financial topics

<div id="multimodal">Multimodal Datasets for VLM </div>

Dataset name	Used by	Type	Language	Size	Description ️
ShareGPT4V	/	image-prompt-caption	English	1.2M instances	A set of GPT4-Vision-powered multi-modal captions data.
OBELICS	idefics<br/>series	image-document	English	141M documents	an open, massive, and curated collection of interleaved image-text web documents.
JourneyDB	/	image-prompt-caption	English	4M instances	A large scale dataset comprises QA, caption, and text prompting tasks, which is based on Midjourney images.
M3IT	Ying-VLM	instruction-image	Multilingual	2.4M instances	A dataset comprises 40 tasks with 400 human written instruction.
MIMIC-IT	Otter	instruction-image	Multilingial	2.2M instances	High quality multi-modal instructions-response pairs based on images and videos.
LLaVA Instruction	LLaVA	instruction-image	English	158k samples	A multimodal dataset generated upon COCO dataset by prompting GPT-4 to get instructions.

Private Datasets 🔴

Dataset name	Used by	Type	Language	Size	Description ️
WebText(Reddit links)	GPT-2	PT	English	/	Data crawled from Reddit and filtered for GPT-2 pretraining.
MassiveText	Gopher, Chinchilla	PT	99% English, 1% other(including code)
WuDao(悟道) Corpora	GLM	PT	Chinese	200GB	A large scale Chinese corpus, Possible component originally open-sourced but not available now.