Awesome-LLMs-Datasets

Paper

The paper "Datasets for Large Language Models: A Comprehensive Survey" has been released.(2024/2)

Abstract:

This paper embarks on an exploration of Large Language Model (LLM) datasets, which play a crucial role in the remarkable advancements of LLMs. The datasets serve as the foundational infrastructure, analogous to a root system that sustains and nurtures the development of LLMs. Consequently, the examination of these datasets emerges as a critical topic in research. To address the current lack of a comprehensive overview and thorough analysis of LLM datasets, and to gain insights into their current status and future trends, this survey consolidates and categorizes the fundamental aspects of LLM datasets from five perspectives: (1) Pre-training Corpora; (2) Instruction Fine-tuning Datasets; (3) Preference Datasets; (4) Evaluation Datasets; (5) Traditional Natural Language Processing (NLP) Datasets. The survey sheds light on the prevailing challenges and points out potential avenues for future investigation. Additionally, a comprehensive review of existing dataset resources is provided, including statistics from 444 datasets, covering 8 language categories and spanning 32 domains. Information from 20 dimensions is incorporated into the dataset statistics. The total data size surveyed surpasses 774.5 TB for pre-training corpora and 700M instances for other datasets. We aim to present the entire landscape of LLM text datasets, serving as a comprehensive reference for researchers in this field and contributing to future studies.

<p align="center"> <img src="Fig1.jpg" width="800"/> <p> <p align="center"> <strong>Fig 1. The overall architecture of the survey. Zoom in for better view</strong> <p>

Dataset Information Module

The following is a summary of the dataset information module.

Changelog

Table of Contents

Pre-training Corpora

The pre-training corpora are large collections of text data used during the pre-training process of LLMs.

General Pre-training Corpora

The general pre-training corpora are large-scale datasets composed of extensive text from diverse domains and sources. Their primary characteristic is that the text content is not confined to a single domain, making them more suitable for training general foundational models. Corpora are classified based on data categories.

Dataset information format:

- Dataset name  Release Time | Public or Not | Language | Construction Method | Paper | Github | Dataset | Website
  - Publisher:
  - Size:
  - License:
  - Source:
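
To make the template concrete, the sketch below represents one such entry as a Python dictionary. The dataset name and every field value are hypothetical placeholders, not an entry from this collection.

```python
# A hypothetical catalog entry following the template above.
# All values are placeholders, not data from the survey.
entry = {
    "name": "ExampleCorpus",        # hypothetical dataset name
    "release_time": "2023-01",
    "public": True,
    "language": "EN",
    "construction_method": "Crawling and filtering",  # illustrative label
    "publisher": "Example Lab",
    "size": "100 GB",
    "license": "Apache-2.0",
    "source": "Webpages",
}
```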

Webpages

Language Texts

Books

Academic Materials

Code <a id="code01"></a>

Parallel Corpus

Social Media

Encyclopedia

Multi-category

Domain-specific Pre-training Corpora

Domain-specific pre-training corpora are LLM datasets customized for specific fields or topics. This type of corpus is typically employed in the incremental pre-training phase of LLMs. Corpora are classified based on data domains.

Dataset information format:

- Dataset name  Release Time | Public or Not | Language | Construction Method | Paper | Github | Dataset | Website
  - Publisher:
  - Size:
  - License:
  - Source:
  - Category:
  - Domain:

Financial <a id="financial01"></a>

Medical <a id="medical01"></a>

Math <a id="math03"></a>

Other <a id="other01"></a>

Instruction Fine-tuning Datasets

Instruction fine-tuning datasets consist of a series of text pairs comprising “instruction inputs” and “answer outputs.” “Instruction inputs” are requests made by humans to the model, covering various instruction types such as classification, summarization, and paraphrasing. “Answer outputs” are the responses that follow the instruction and align with human expectations.
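
For illustration, a single instruction-answer pair is commonly stored as a JSON-style record. The minimal Python sketch below shows one hypothetical example; the field names and values are illustrative, not drawn from any specific dataset listed here.

```python
# A hypothetical instruction fine-tuning record: an "instruction input"
# (optionally with extra context) paired with an "answer output".
record = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Large language models are trained on vast text corpora ...",
    "output": "LLMs learn language patterns from large-scale text data.",
}
```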

General Instruction Fine-tuning Datasets

General instruction fine-tuning datasets contain one or more instruction categories with no domain restrictions, primarily aiming to enhance the instruction-following capability of LLMs in general tasks. Datasets are classified based on construction methods.

Dataset information format:

- Dataset name  Release Time | Public or Not | Language | Construction Method | Paper | Github | Dataset | Website
  - Publisher:
  - Size:
  - License:
  - Source:
  - Instruction Category:

Human Generated Datasets (HG)

Model Constructed Datasets (MC)

Collection and Improvement of Existing Datasets (CI)

HG & CI

HG & MC

CI & MC

HG & CI & MC

Domain-specific Instruction Fine-tuning Datasets

The domain-specific instruction fine-tuning datasets are constructed for a particular domain by formulating instructions that encapsulate knowledge and task types closely related to that domain.

Dataset information format:

- Dataset name  Release Time | Public or Not | Language | Construction Method | Paper | Github | Dataset | Website
  - Publisher:
  - Size:
  - License:
  - Source:
  - Instruction Category:
  - Domain:

Medical <a id="medical02"></a>

Code <a id="code02"></a>

Legal

Math <a id="math01"></a>

Education

Other <a id="other02"></a>

Preference Datasets

Preference datasets are collections of instructions paired with multiple responses, where preference evaluations are provided over the different responses to the same instruction input.

Preference Evaluation Methods

The preference evaluation methods for preference datasets can be categorized into voting, sorting, scoring, and other methods. Datasets are classified based on preference evaluation methods.
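
As a sketch of how these methods shape the data, the hypothetical Python records below show one possible layout for vote-, sort-, and score-style annotations; the field names are illustrative, not a standard schema used by the datasets listed here.

```python
# Hypothetical preference records (all field names are illustrative).
vote_record = {   # vote: pick the better of two responses
    "instruction": "Explain overfitting.",
    "chosen": "Overfitting occurs when a model memorizes training data ...",
    "rejected": "Overfitting means the model is too small ...",
}
sort_record = {   # sort: rank several responses from best to worst
    "instruction": "Explain overfitting.",
    "responses": ["response A ...", "response B ...", "response C ..."],
    "ranking": [1, 2, 0],  # indices of responses, best first
}
score_record = {  # score: rate a single response on a fixed scale
    "instruction": "Explain overfitting.",
    "response": "Overfitting occurs when ...",
    "score": 4,   # e.g., a 1-5 helpfulness scale
}
```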

Dataset information format:

- Dataset name  Release Time | Public or Not | Language | Construction Method | Paper | Github | Dataset | Website
  - Publisher:
  - Size:
  - License:
  - Domain:
  - Instruction Category: 
  - Preference Evaluation Method: 
  - Source: 

Vote

Sort

Score

Other <a id="other03"></a>

Evaluation Datasets

Evaluation datasets are carefully curated and annotated collections of data samples used to assess the performance of LLMs across various tasks. Datasets are classified based on evaluation domains.
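
As a minimal sketch of how such a dataset is typically consumed, the Python snippet below scores hypothetical multiple-choice predictions against gold labels with simple accuracy; the records and metric are illustrative, not a protocol prescribed by the survey.

```python
# Hypothetical multiple-choice evaluation: compare model predictions
# against gold answer indices and report accuracy.
samples = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5"], "answer": 1},
    {"question": "Capital of France?", "choices": ["Paris", "Rome"], "answer": 0},
]
predictions = [1, 0]  # the option index the model chose for each sample

correct = sum(pred == s["answer"] for pred, s in zip(predictions, samples))
print(f"accuracy = {correct / len(samples):.2f}")  # -> accuracy = 1.00
```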

Dataset information format:

- Dataset name  Release Time | Public or Not | Language | Construction Method | Paper | Github | Dataset | Website
  - Publisher:
  - Size:
  - License:
  - Question Type: 
  - Evaluation Method: 
  - Focus: 
  - Number of Evaluation Categories/Subcategories: 
  - Evaluation Category: 

General

Exam

Subject

NLU

Reasoning

Knowledge

Long Text

Tool

Agent

Code <a id="code03"></a>

OOD

Law

Medical <a id="medical03"></a>

Financial <a id="financial02"></a>

Social Norms

Factuality

Evaluation

Multitask <a id="multitask01"></a>

Multilingual

Other <a id="other04"></a>

Evaluation Platform

Traditional NLP Datasets

Diverging from instruction fine-tuning datasets, we categorize text datasets dedicated to natural language tasks before the widespread adoption of LLMs as traditional NLP datasets.

Dataset information format:

- Dataset name  Release Time | Language | Paper | Github | Dataset | Website
  - Publisher:
  - Train/Dev/Test/All Size: 
  - License:
  - Number of Entity Categories: (NER Task)
  - Number of Relationship Categories: (RE Task)

Question Answering

The question answering task requires the model to use its knowledge and reasoning capabilities to answer questions, optionally drawing on a provided text.

Reading Comprehension

The task of reading comprehension entails presenting a model with a designated text passage and associated questions, prompting the model to understand the text for the purpose of answering the questions.
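
A minimal hypothetical record in the extractive (SQuAD-style) format illustrates the task; the passage and offset below are invented for illustration.

```python
# Hypothetical extractive reading-comprehension record: the answer
# is a span of the given passage, located by a character offset.
record = {
    "passage": "The Nile is a major river in northeastern Africa.",
    "question": "Where is the Nile located?",
    "answer": "northeastern Africa",
    "answer_start": 29,  # character offset of the answer span
}
assert record["passage"][record["answer_start"]:].startswith(record["answer"])
```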

Selection & Judgment
Cloze Test
Answer Extraction
Unrestricted QA

Knowledge QA

In the knowledge QA task, models respond to questions by leveraging world knowledge, common sense, scientific insights, domain-specific information, and more.

Reasoning QA

The focal point of reasoning QA tasks is the requirement for models to apply abilities such as logical reasoning, multi-step inference, and causal reasoning in answering questions.

Recognizing Textual Entailment

The primary objective of tasks related to Recognizing Textual Entailment (RTE) is to assess whether information in one textual segment can be logically inferred from another.
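
A single hypothetical premise-hypothesis pair illustrates the format; the three-way label set shown is a common convention, though individual datasets may differ.

```python
# Hypothetical RTE record: can the hypothesis be inferred from the premise?
record = {
    "premise": "A man is playing a guitar on stage.",
    "hypothesis": "Someone is performing music.",
    "label": "entailment",  # commonly entailment / neutral / contradiction
}
```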

Math <a id="math02"></a>

Mathematical assignments commonly involve standard mathematical calculations, theorem validations, and mathematical reasoning tasks, among others.

Coreference Resolution

The core objective of tasks related to coreference resolution is the identification of referential relationships within texts.

Sentiment Analysis

The sentiment analysis task, commonly known as emotion classification, seeks to analyze and deduce the emotional inclination of provided texts, commonly categorized as positive, negative, or neutral sentiments.

Semantic Matching

The task of semantic matching entails evaluating the semantic similarity or degree of correspondence between two sequences of text.

Text Generation

In its narrow definition, the text generation task is bounded by the provided content and specific requirements: it uses given data, such as descriptive terms and triplets, to generate corresponding textual descriptions.

Text Translation

Text translation involves transforming text from one language to another.

Text Summarization

The task of text summarization pertains to the extraction or generation of a brief summary or headline from an extended text to encapsulate its primary content.

Text Classification

Text classification tasks aim to assign various text instances to predefined categories, comprising text data and category labels as pivotal components.

Text Quality Evaluation

The task of text quality evaluation, also referred to as text correction, involves the identification and correction of grammatical, spelling, or language usage errors in text.

Text-to-Code

The Text-to-Code task involves models converting user-provided natural language descriptions into computer-executable code, thereby achieving the desired functionality or operation.
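
A minimal hypothetical pair for this task shows a natural-language request alongside the executable Python a model would be expected to produce.

```python
# Hypothetical Text-to-Code pair: a natural-language description and
# runnable code implementing it.
description = "Write a function that returns the n-th Fibonacci number."

def fibonacci(n: int) -> int:
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

assert fibonacci(10) == 55  # sanity check on the reference solution
```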

Named Entity Recognition

The Named Entity Recognition (NER) task aims to discern and categorize named entities within a given text.
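
For illustration, here is a hypothetical sentence annotated in the widely used BIO scheme (B- begins an entity, I- continues it, O marks non-entity tokens); tag inventories vary across the datasets listed below.

```python
# Hypothetical NER example with BIO tags aligned to tokens.
tokens = ["Barack", "Obama", "visited", "Paris", "."]
tags   = ["B-PER",  "I-PER", "O",       "B-LOC", "O"]
```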

Relation Extraction

The endeavor of Relation Extraction (RE) necessitates the identification of connections between entities within textual content. This process typically includes recognizing and labeling pertinent entities, followed by the determination of the specific types of relationships that exist among them.
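
A hypothetical relation-extraction record pairs two marked entities with the relation that holds between them; the schema below is illustrative only.

```python
# Hypothetical RE record: a subject entity, an object entity, and
# the relation linking them in the sentence.
record = {
    "sentence": "Marie Curie was born in Warsaw.",
    "head": "Marie Curie",       # subject entity
    "tail": "Warsaw",            # object entity
    "relation": "place_of_birth",
}
```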

Multitask <a id="multitask02"></a>

Multitask datasets hold significance as they can be concurrently utilized for different categories of NLP tasks.

Multi-modal Large Language Models (MLLMs) Datasets <a id="multi-modal-large-language-models-mllms-datasets"></a>

Pre-training Corpora <a id="mllmpre"></a>

Documents

Instruction Fine-tuning Datasets <a id="instruction02"></a>

Remote Sensing

Images + Videos

Visual Document Understanding

General

Evaluation Datasets <a id="evaluation02"></a>

Video Understanding

Subject

Multitask

Long Input

Factuality

Medical

Image Understanding

Retrieval Augmented Generation (RAG) Datasets <a id="retrieval-augmented-generation-rag-datasets"></a>

Contact

Contact information:

  Lianwen Jin: lianwen.jin@gmail.com

  Yang Liu: ly10061105@gmail.com

Due to our limited human resources, we regret that we are currently unable to manage and include every available data resource. If you find any important data resources that have not yet been included, we warmly invite you to submit the relevant papers, data links, and other information to us. We will evaluate them and, if appropriate, include the data in Awesome-LLMs-Datasets and the survey paper. Your assistance and support are greatly appreciated!

Citation

If you wish to cite this project, please use the following citation format:

@article{liu2024survey,
  title={Datasets for Large Language Models: A Comprehensive Survey},
  author={Liu, Yang and Cao, Jiahuan and Liu, Chongyu and Ding, Kai and Jin, Lianwen},
  journal={arXiv preprint arXiv:2402.18041},
  year={2024}
}