Home

Awesome

<h1 align="center">🌊Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models</h1> <p align="center"><i>Fantastic Data Engineering for Large Language Models</i></p> <div align="center"> <a href="https://github.com/yuleiqin/fantastic-data-engineering/stargazers"><img src="https://img.shields.io/github/stars/yuleiqin/fantastic-data-engineering" alt="Stars Badge"/></a> <a href="https://github.com/yuleiqin/fantastic-data-engineering/network/members"><img src="https://img.shields.io/github/forks/yuleiqin/fantastic-data-engineering" alt="Forks Badge"/></a> <a href="https://github.com/yuleiqin/fantastic-data-engineering/pulls"><img src="https://img.shields.io/github/issues-pr/yuleiqin/fantastic-data-engineering" alt="Pull Requests Badge"/></a> <a href="https://github.com/yuleiqin/fantastic-data-engineering/issues"><img src="https://img.shields.io/github/issues/yuleiqin/fantastic-data-engineering" alt="Issues Badge"/></a> <a href="https://github.com/yuleiqin/fantastic-data-engineering/graphs/contributors"><img alt="GitHub contributors" src="https://img.shields.io/github/contributors/yuleiqin/fantastic-data-engineering?color=2b9348"></a> <a href="https://github.com/yuleiqin/fantastic-data-engineering/blob/master/LICENSE"><img src="https://img.shields.io/github/license/yuleiqin/fantastic-data-engineering?color=2b9348" alt="License Badge"/></a> </div>

This repository is for readers interested in data assessment and selection methods for instruction tuning of LLMs.

The most recent studies from 2023 and 2024 are investigated in the present survey. If your papers are missing or you have other requests, please contact yuleichin@126.com. We will update this repository and the paper on a regular basis to keep them up to date.

News📰

Citation🎓

```bibtex
@article{qin2024unleashingpowerdatatsunami,
    title={Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models},
    author={Yulei Qin and Yuncheng Yang and Pengcheng Guo and Gang Li and Hang Shao and Yuchen Shi and Zihan Xu and Yun Gu and Ke Li and Xing Sun},
    year={2024},
    url={https://arxiv.org/abs/2408.02085},
}
```

Papers📑

Overview📊

We present a unified organization of existing studies and categorize them in terms of the dimensionality of data assessment and selection.

Categorization of data assessment and selection methods for efficient LLM instruction tuning.

List of Papers with Categorization

All papers are sorted chronologically within the three categories above, so that you can find related papers more quickly.

[Index: Quality-based Selection, Diversity-based Selection, Importance-based Selection]

Comprehensive Data Assessment and Selection
|--- A. Quality-based Selection
     |--- A.1. Hand-crafted Indicators: Vocabulary; Linguistics; Lexical and Semantic Analysis.
     |--- A.2. Model-based Indicators: Perplexity; Learning Complexity; Reward Score; Error Norm; Memorization; Instruction Following Difficulty; Predictability; Uncertainty.
     |--- A.3. GPT Score: Direct Scoring; Pairwise Ranking; Justification.
     |--- A.4. Human Evaluation: Instruction and Response Annotation; Direct Scoring; Pairwise Ranking.
|--- B. Diversity-based Selection
     |--- B.1. Hand-crafted Indicators: Lexical and Semantic Diversity; Statistics.
     |--- B.2. Model-based Indicators: Entropy; Simpson's Index; Vendi Score; Fisher Information Matrix; Tagging.
     |--- B.3. Geometry-based Coreset Sampling: __K__-center Greedy; Herding; Clustering.
     |--- B.4. Bilevel Optimization-based Coreset Sampling: Retrieve; Glister; Consistency Regularization; Entropy Regularization; Refined Minimal Size.
|--- C. Importance-based Selection
     |--- C.1. Hand-crafted Indicators: Readability Indices; Education-level Difficulty.
     |--- C.2. Model-based Indicators: Uncertainty; Reward Score; Datamodels.
     |--- C.3. Loss and Error-based Coreset Sampling: Forgetting; Memorization; Influence by Loss.
     |--- C.4. Gradient-based Coreset Sampling: Gradient Matching; Influence by Gradient.

Other Topics related to Data Curation in Training Language Models

We also provide lists of papers on practical data curation for language modeling, covering dataset construction, dataset deduplication, and data mixture.

[Index: Dataset Construction and Synthesis, Dataset Deduplication, Dataset Mixture]

Tools

Some frequently used tools are collected for efficient processing of NLP corpora.

[Index: Tools]

Related Work

We provide related surveys on data measurement and selection for reference.

[Index: Surveys]

<a name="A"></a>

A. Quality-based Selection🏋️

A.1 Hand-crafted Indicators

| Year | Title | Summary |
|------|-------|---------|
| 2020 | Do We Need to Create Big Datasets to Learn a Task? | Small-yet-important datasets can be efficiently collected simply by iteratively adding sampled subsets from big sets that contribute to downstream metrics. The cost-efficient AFLite filtering strategy, together with pre-defined data quality indicators (DQI), further reduces the size of the chosen datasets. |
| 2020 | DQI: Measuring Data Quality in NLP | The intuitive, manually designed metrics for evaluating data quality in NLP include vocabulary, inter-sample N-gram frequency and relation, inter-sample STS, intra-sample word similarity, intra-sample STS, N-gram frequency per label, and inter-split STS. |
| 2024 | Data quality in NLP: Metrics and a comprehensive taxonomy | A comprehensive taxonomy for data quality in NLP is reviewed in terms of linguistic, semantic, anomaly, diversity, and classifier-performance aspects. |

A.2 Model-based Indicators

| Year | Title | Summary |
|------|-------|---------|
| 2019 | WinoGrande: An Adversarial Winograd Schema Challenge at Scale | To filter out low-quality datapoints, it is feasible to evaluate their predictability scores (the number of times a datapoint is correctly classified divided by the total number of testing runs) and choose the top-ranked datapoints. Step 1: pre-compute representation embeddings for all datapoints. Step 2: randomly partition the dataset into training and validation splits and train proxy models (e.g., a linear classifier) on the training split. Step 3: evaluate the held-out split. Step 4: iterate Steps 2-3 until a pre-defined number of testing runs is reached. Step 5: calculate the predictability scores and choose the top-ranked, thresholded datapoints. Step 6: iterate Steps 2-5 until the data quota is met. |
| 2020 | Adversarial Filters of Dataset Biases | AFLite aims at reducing the spurious biases of datasets and thereby improves model generalization on OOD samples. |
| 2022 | Towards a Unified Multi-Dimensional Evaluator for Text Generation | The evaluation of a text corpus can be explained via naturalness, coherence, and understandability. |
| 2023 | Instruction mining: High-quality instruction data selection for large language models | The loss of the base model on dev and test sets can be viewed as a proxy for the quality of datasets. To avoid the high cost of retraining and re-evaluating base models, one efficient alternative is to directly estimate the per-datapoint loss via linear regression on quality indicators (e.g., input length, output length, reward score, perplexity, MTLD, KNN-i, and UniEval metrics). |
| 2023 | From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning | Select 1K samples from each cluster of the fine-tuning datasets and construct "experienced" models. Evaluate all datapoints with these models via instruction-following difficulty (IFD), defined as the conditioned answer score divided by the direct answer score. Choose the datapoints with moderate IFD scores (a minimal IFD sketch follows this table). |
| 2023 | When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale | Common indicators (e.g., perplexity, EL2N, memorization ranking) are investigated for data quality measurement and dataset cleaning. |
| 2023 | MoDS: Model-oriented data selection for instruction tuning | LLMs themselves are leveraged during data selection, with three dataset metrics defined: 1) quality, 2) coverage, and 3) necessity. For quality, a reward model (preference scoring) ranks all datapoints. For coverage, k-center greedy sampling reduces the number of samples without losing generalizability (see the k-center greedy sketch at the end of Section B.3). For necessity, responses generated by the model fine-tuned on the k-center-greedy-sampled datapoints are evaluated by the reward model; samples with low scores need to be added to the fine-tuning set. |
| 2024 | An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models | The selection of datapoints (e.g., prompts) for supervised fine-tuning can be categorized as: 1) uncertainty-based selection; 2) k-center selection (e.g., k-center greedy); and 3) submodular selection (maximized diversity). The uncertainty metrics are: 1) mean entropy; 2) least confidence; 3) mean margin; 4) min margin. |
| 2024 | Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning | A small model (GPT-2, 125M) can be used to filter instruction-tuning datasets for training a much larger and stronger model. The perplexity and IFD scores of a datapoint computed with a small language model are highly correlated with those from an LLM. Samples with top-k IFD scores (IFD score < 1) are chosen for dataset reduction. |
| 2024 | Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models | The perplexity of datapoints inferred by a small reference model can be used to prune datasets for training LLMs. Medium- and high-perplexity datapoints (selected via frequency) are the most beneficial. However, the marginal utility diminishes when more data than required by scaling laws is involved or when training epochs are repeated. |
| 2024 | Technical Report: Competition Solution For BetterMixture | Given existing popular open-sourced datasets and a training budget (e.g., a maximum number of training tokens), the best option for mixing datasets for downstream performance lies in the details of filtering and balancing datapoints. The full pipeline includes deduplication (exact match), quality thresholding (language identification, perplexity, IFD scoring and voting), and diversity selection (k-center greedy sampling). |
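
The IFD score referenced above (From Quantity to Quality; Superfiltering) has a compact formulation: the average cross-entropy of the answer conditioned on the instruction divided by the average cross-entropy of the answer alone. Below is a minimal sketch with Hugging Face Transformers; the model name and prompt handling are illustrative placeholders, not the papers' exact settings.

```python
# Minimal sketch of the Instruction-Following Difficulty (IFD) score:
# IFD(Q, A) = loss(A | Q) / loss(A). Model choice is a placeholder;
# Superfiltering suggests a small model such as GPT-2 already suffices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def avg_answer_loss(prompt: str, answer: str) -> float:
    """Average cross-entropy over answer tokens, optionally conditioned on a prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids if prompt else None
    answer_ids = tokenizer(answer, return_tensors="pt").input_ids
    if prompt_ids is not None:
        input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
        labels = input_ids.clone()
        labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt tokens in the loss
    else:
        input_ids, labels = answer_ids, answer_ids.clone()
    return model(input_ids, labels=labels).loss.item()

def ifd_score(instruction: str, answer: str) -> float:
    conditioned = avg_answer_loss(instruction, answer)  # s(A | Q)
    direct = avg_answer_loss("", answer)                # s(A)
    return conditioned / direct  # e.g., keep moderate / top-k scores with IFD < 1
```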

A.3. GPT Score

| Year | Title | Summary |
|------|-------|---------|
| 2023 | AlpaGasus: Training a better alpaca with fewer data | It is surprisingly easy and effective to employ stronger models (e.g., GPT-3.5, GPT-4) to directly score datapoints in terms of helpfulness and accuracy (a prompt sketch follows this table). Note that coding datasets might not be fairly scored due to their nature. |
| 2023 | Quantifying uncertainty in answers from any language model and enhancing their trustworthiness | The BSDetector pipeline uses both self-consistency and direct scoring to estimate the confidence of an LLM on any given triplet (instruction, content, answer). |
| 2023 | Rethinking the Instruction Quality: LIFT is What You Need | A pipeline of expansion first and compression next enhances both the diversity and quality of the original dataset. Data expansion is performed by GPT-4 rewriting (depth, breadth, CoT). Diversity is defined via PCA, where samples with the top row variances are kept. Quality is measured by GPT-4 direct scoring. |
| 2024 | Automated data curation for robust language model fine-tuning | The auto-cleaning pipeline for instruction datasets consists of auto-filter and auto-correct. High-confidence samples are first selected via BSDetector to fine-tune an LLM. Then, candidate answers are generated for all low-confidence samples using the fine-tuned LLM. Finally, preference scores between the original ground-truth answer and the generated answers are obtained from a base model, and highly confident generated answers are kept as the corrected answers. |
| 2024 | Autonomous data selection with language models for mathematical texts | A simple pipeline to filter mathematical samples from open-sourced corpora for continued pretraining. Direct scoring via LLMs is effective in selecting high-quality samples. |
| 2024 | QuRating: Selecting High-Quality Data for Training Language Models | QuRating defines quality criteria such as writing style, facts and trivia, educational value, and required expertise, and uses GPT-3.5-turbo to judge text pairs and generate labels for training the QuRater model. The fine-tuned QuRater model then rates the quality of text. Experiments show that language models trained on data selected by QuRating outperform those trained with other data selection methods, and different quality criteria have different impacts on performance, with educational value and required expertise being the most significant. |
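
As a concrete illustration of GPT direct scoring (AlpaGasus-style), the sketch below asks a judge model to rate an (instruction, response) pair and keeps samples above a threshold. The prompt wording, judge model name, and threshold are assumptions for illustration, not the papers' exact settings.

```python
# Minimal sketch of direct scoring with a stronger judge model via the OpenAI SDK.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SCORING_PROMPT = (
    "Rate the following (instruction, response) pair for helpfulness and accuracy "
    "on a scale of 1 to 5. Reply with a single number.\n\n"
    "Instruction: {instruction}\n\nResponse: {response}"
)

def gpt_score(instruction: str, response: str, judge: str = "gpt-4o-mini") -> float:
    reply = client.chat.completions.create(
        model=judge,
        messages=[{"role": "user", "content": SCORING_PROMPT.format(
            instruction=instruction, response=response)}],
        temperature=0,
    )
    return float(reply.choices[0].message.content.strip())

# Keep only datapoints whose judge score clears a threshold (e.g., 4.5 on a 5-point scale).
```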

A.4. Human Evaluation

| Year | Title | Summary |
|------|-------|---------|
| 2023 | OpenAssistant Conversations - Democratizing Large Language Model Alignment | Each example, following the conversation tree structure, is collected and annotated by humans. Dataset pruning is performed with human preference labels (e.g., creativity, quality, humor, helpfulness, violence, rudeness). |

<a name="B"></a>

B. Diversity-based Selection

B.1. Hand-crafted Indicators

| Year | Title | Summary |
|------|-------|---------|
| 2010 | MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment | Metrics for lexical diversity measurement (a simplified diversity-scoring sketch follows this table). |
| 2011 | Efficient k-nearest neighbor graph construction for generic similarity measures | The distance to the approximate i-th nearest neighbor can be used as a diversity measure. |
| 2023 | Measuring Lexical Diversity in Texts: The Twofold Length Problem | Metrics for diversity measurement with respect to text length. |
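
For intuition on the lexical diversity metrics above, here is a simplified moving-average type-token ratio. It is not MTLD, vocd-D, or HD-D themselves, but it shows how window-based measures reduce sensitivity to text length, which is the core concern of these papers.

```python
# Simplified moving-average type-token ratio (a length-robust diversity proxy).
# Whitespace tokenization and the window size are illustrative choices.
def moving_average_ttr(text: str, window: int = 100) -> float:
    tokens = text.lower().split()
    if len(tokens) < window:
        return len(set(tokens)) / max(len(tokens), 1)
    ratios = [
        len(set(tokens[i : i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)
```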

B.2. Model-based Indicators

| Year | Title | Summary |
|------|-------|---------|
| 2023 | Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning | One of the most cost-efficient ways to perform instruction fine-tuning is to simply choose the datapoints that closely resemble the downstream datapoints with limited instruction formats. The data pruning pipeline consists of embedding encoding and projection, clustering, sampling, and model training and inference. In particular, sampling by diversity is often adopted as an effective coreset sampling method. |
| 2023 | Data similarity is not enough to explain language model performance | Similarity metrics are not well correlated with downstream performance of pre-trained models, and existing similarity measures (e.g., embedding similarities, token/n-gram distributions, and perplexity) do not correlate with each other. The difficulty/complexity of downstream task datapoints (e.g., performance on them) is NOT necessarily associated with the presence of similar counterparts in the pre-training corpus. |

B.3. Geometry-based Coreset Sampling

| Year | Title | Summary |
|------|-------|---------|
| 2023 | Dataset Quantization | Proposes a scalable dataset compression method (DQ) that first recursively divides the entire set into non-overlapping bins via submodular gains. This division is performed in the embedding space to maximize diversity gains. Samples are then drawn uniformly from each bin to maximize overall diversity. DQ outperforms traditional diversity-based sampling methods. |
| 2023 | Data diversity matters for robust instruction tuning | A trade-off exists between the quality and diversity of datapoints; diversity-prioritized datapoints can improve worst-case performance. Diversity measure: the maximized similarity between the sum of selected datapoints and the newly added one from the remaining full set. Quality measures: ChatGPT direct scoring and reward-model preference scoring. |
| 2023 | What makes good data for alignment? A comprehensive study of automatic data selection in instruction tuning | Datasets can be measured along three dimensions: complexity, quality, and diversity. A dataset is obtained by first evolving the instruction complexity and response quality of each sample via Evol-Instruct and ranking the resulting variants from high to low. Diversity is then enforced by requiring newly added samples to share low similarity with the already selected ones. |
| 2024 | From Random to Informed Data Selection: A Diversity-Based Approach to Optimize Human Annotation and Few-Shot Learning | Proposes an automatic data selection architecture for few-shot learning with three methods: Reverse Semantic Search (RSS), Ordered Clustering (OC), and Limited Lexical Similarity (LLS). Built on active learning principles, the architecture identifies the most informative and representative datapoints for annotation; an extensive analysis of its implementations highlights its effectiveness in building the first version of a dataset for low-resource text classification. |
| 2024 | Balanced Data Sampling for Language Model Training with Clustering | Cluster first, then uniformly sample datapoints from each cluster until exhaustion. |
| 2024 | Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond | A new data selection approach based on k-means clustering and sensitivity sampling can be applied to fine-tuning foundation models. |
| 2024 | Exploring Learning Complexity for Downstream Data Pruning | The learning complexity of a datapoint is defined as the averaged prediction confidence of subnets with different capacities (predicted label consistency between kNN samples for classification, and the sum of perplexity reciprocals for regression). The pruning principle is to keep easy and diverse samples. |
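
K-center greedy, the first item under the B.3 taxonomy and the sampler used by MoDS and the BetterMixture solution in Section A.2, can be implemented in a few lines. The sketch below assumes precomputed embeddings and uses Euclidean distance; both are illustrative choices rather than any single paper's exact setup.

```python
# Minimal k-center greedy sketch over precomputed embedding vectors.
import numpy as np

def k_center_greedy(embeddings: np.ndarray, budget: int, seed: int = 0) -> list[int]:
    """Greedily pick `budget` indices so that selected points cover the set."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    selected = [int(rng.integers(n))]  # random first center
    min_dist = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    while len(selected) < budget:
        nxt = int(min_dist.argmax())  # farthest point from the current centers
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return selected
```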

B.4. Bilevel Optimization-based Coreset Sampling

| Year | Title | Summary |
|------|-------|---------|
| 2024 | Refined Coreset Selection: Towards Minimal Coreset Size under Model Performance Constraints | The selection of datapoints under a constrained budget can be implemented as lexicographic bilevel optimization, where the inner loop optimizes model parameters and the outer loop optimizes data selection. When optimizing the selection mask, the minimization of the loss terms is relaxed to allow a smaller dataset size. |

<a name="C"></a>

C. Importance-based Selection🏗️

C.1. Hand-crafted Indicators

| Year | Title | Summary |
|------|-------|---------|
| 2022 | Do NLP and machine learning improve traditional readability formulas? | The investigation of NLP-enabled features and machine learning techniques benefits the development of readability metrics. These "non-classic" and classic readability formulas can be combined for better readability measurement. |
| 2024 | Dele: Data Efficient LLM Evaluation | An adaptive, effective sampling method can expedite LLM evaluation without losing the discriminability of existing benchmarks. The candidate pool of sampling methods includes: 1) random sampling; 2) clustering-based sampling (e.g., topic modeling, DBSCAN, LDA, k-means, spectral); 3) quality-based sampling (spelling errors, average word length, count of repeated words, compound probability distribution, lexical diversity); 4) difficulty-based sampling (percentage of difficult words, Dale-Chall formula, Flesch reading ease, Gunning fog index). A Flesch reading ease sketch follows this table. |
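
As referenced in the difficulty-based sampling entry above, the Flesch reading ease score is 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words). The sketch below computes it with a rough vowel-group syllable heuristic, so treat the resulting scores as approximate.

```python
# Back-of-the-envelope Flesch reading ease for difficulty-based sampling.
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count contiguous vowel groups, at least one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)
```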

C.2. Model-based Indicators

| Year | Title | Summary |
|------|-------|---------|
| 2024 | Data selection for language models via importance resampling | Hashed n-gram features are fast and effective representation embeddings. The importance of each datapoint is estimated with a bag-of-hashed-n-grams model, where samples with higher probability under the target data are assigned higher weights (a sketch follows this table). Given any downstream dataset, the most similar samples in the pretraining corpus do NOT necessarily lead to the best downstream performance, but the most dissimilar ones do perform worst. |
| 2024 | DsDm: Model-aware dataset selection with datamodels | Datamodels can be simply defined as a linear model (e.g., logistic regression) and optimized via the TRAK estimator. The selected datapoints are NOT necessarily similar to the samples in the downstream tasks. |
| 2024 | MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models | A data influence model (e.g., BERT-base), updated alternately, continuously adapts to the evolving data preferences of the pretrained model (e.g., Pythia-410M/1B) and selects the most effective datapoints for the current pretraining stage. |
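
A simplified sketch of the hashed n-gram importance weighting above (DSIR-style): fit bag-of-hashed-n-grams distributions on target and raw text, then weight each raw example by its log-likelihood ratio. The bucket count, tokenization, smoothing, and use of Python's built-in hash are illustrative assumptions, not the paper's exact settings.

```python
# Importance weights from hashed n-gram features (simplified).
import numpy as np

NUM_BUCKETS = 10_000

def hashed_ngram_counts(text: str, n: int = 2) -> np.ndarray:
    # hash() is process-local; a stable hash (e.g., from hashlib) is preferable in practice.
    counts = np.zeros(NUM_BUCKETS)
    tokens = text.lower().split()
    for i in range(len(tokens) - n + 1):
        counts[hash(" ".join(tokens[i : i + n])) % NUM_BUCKETS] += 1
    return counts

def fit_distribution(texts: list[str]) -> np.ndarray:
    total = sum(hashed_ngram_counts(t) for t in texts) + 1.0  # add-one smoothing
    return total / total.sum()

def log_importance_weights(raw_texts: list[str], target_texts: list[str]) -> np.ndarray:
    p_target, p_raw = fit_distribution(target_texts), fit_distribution(raw_texts)
    log_ratio = np.log(p_target) - np.log(p_raw)
    return np.array([hashed_ngram_counts(t) @ log_ratio for t in raw_texts])

# Resample raw examples according to the (optionally Gumbel-perturbed) top-k log weights.
```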

C.3. Loss and Error-based Coreset Sampling

| Year | Title | Summary |
|------|-------|---------|
| 2020 | What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation | Estimators of memorization and influence help pinpoint the most important datapoints that significantly affect test-time performance (a related forgetting-event counter is sketched after this table). |
| 2022 | Beyond neural scaling laws: beating power law scaling via data pruning | Beyond the data measurement metrics themselves, the proportion of pruned data with respect to the dataset size matters: keep easy samples for small datasets and difficult samples for big datasets. |
| 2023 | Data selection for fine-tuning large language models using transferred shapley values | Shapley values can be approximated and aggregated over sampled datasets, where the lowest-scored samples are removed first until the proxy model (A_src) reaches optimal performance on the validation set. The selected dataset is then used to train the target model (A_tgt). |
| 2023 | Skill-it! A data-driven skills framework for understanding and training language models | The training of LLMs should follow a certain natural order that mimics how humans acquire skills and knowledge. Skill-it first estimates the skill affinity matrix (prerequisite edges) over training and validation skills, and then performs skill-graph-aware coreset sampling for online learning. |
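
The C.3 taxonomy lists forgetting as a loss- and error-based signal. The sketch below counts forgetting events, i.e., a sample flipping from correctly to incorrectly predicted between consecutive evaluations; never-forgotten samples are common pruning candidates. The class and method names are hypothetical.

```python
# Counting forgetting events during training (simplified tracker).
from collections import defaultdict

class ForgettingTracker:
    def __init__(self):
        self.prev_correct = {}
        self.forgetting_events = defaultdict(int)

    def update(self, sample_ids, correct_flags):
        """Call after each evaluation pass with per-sample correctness (bools)."""
        for sid, correct in zip(sample_ids, correct_flags):
            if self.prev_correct.get(sid, False) and not correct:
                self.forgetting_events[sid] += 1  # correct -> incorrect transition
            self.prev_correct[sid] = correct

    def never_forgotten(self):
        """Samples with zero forgetting events are the usual candidates for pruning."""
        return [sid for sid in self.prev_correct if self.forgetting_events[sid] == 0]
```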

C.4. Gradient-based Coreset Sampling

| Year | Title | Summary |
|------|-------|---------|
| 2021 | Deep Learning on a Data Diet: Finding Important Examples Early in Training | Important samples can be found at an early training stage using indicators such as the forgetting score, gradient norm (GraNd), and error L2 norm (EL2N). A high pruning ratio degrades overall performance due to overfitting on samples with label errors or high difficulty. |
| 2023 | GIO: Gradient information optimization for training dataset selection | To keep the selected dataset representative, the KL divergence between the sampled dataset and the target dataset (e.g., downstream datasets) can be used as a measure. Given an embedding model, selection proceeds by minimizing the KL divergence between the two distributions, where each newly added datapoint is determined by the gradient information of the KL divergence. |
| 2024 | LESS: Selecting Influential Data for Targeted Instruction Tuning | Targeted instruction tuning aims at finding the most influential data for improving the downstream performance of LLMs, where low-rank gradient similarity search efficiently pinpoints the training examples that resemble the few-shot testing cases. For each dataset, the base model is first warmed up on a small portion of samples for more accurate and stable gradient estimation. The gradient-based trajectory influence estimation is then extended to the Adam optimizer. LoRA, together with random projection, is used to efficiently compute and store gradients. Finally, the average gradient of the testing cases over epochs is compared against the gradient of each training case, and the top-ranked samples are the most influential ones (a simplified gradient-similarity sketch follows this table). |
| 2024 | TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data | TAGCOS uses sample gradients as datapoint representations and performs k-means clustering to group similar datapoints. Within each cluster, an efficient greedy algorithm selects the subset whose gradient is closest to that of the entire cluster. |
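
A heavily simplified sketch of gradient-similarity scoring in the spirit of LESS and TAGCOS: score each training example by the cosine similarity between its loss gradient and the averaged gradient of a few target (validation) examples. Real implementations add warmup, Adam-aware influence, LoRA adapters, and random projection to keep gradients small; the helper names below are hypothetical.

```python
# Core scoring step of gradient-similarity data selection (simplified).
import torch

def flat_grad(model, loss):
    """Flatten the gradient of `loss` w.r.t. all trainable parameters."""
    grads = torch.autograd.grad(loss, [p for p in model.parameters() if p.requires_grad])
    return torch.cat([g.reshape(-1) for g in grads])

def gradient_scores(model, loss_fn, train_batches, val_batches):
    # Averaged gradient of the target (validation) examples.
    # In practice, gradients are projected to a low dimension to save memory.
    val_grad = torch.stack([flat_grad(model, loss_fn(model, b)) for b in val_batches]).mean(0)
    scores = []
    for batch in train_batches:
        g = flat_grad(model, loss_fn(model, batch))
        scores.append(torch.nn.functional.cosine_similarity(g, val_grad, dim=0).item())
    return scores  # select the top-scoring training examples
```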

<a name="D"></a>

D. Dataset Construction and Synthesis🚧

Self-Instruct🤳

Evol-Instruct🐛

<a name="E"></a>

E. Dataset Deduplication🦄

Exact Match🟰

Semantics🔤

<a name="F"></a>

F. Dataset Mixture🥄

<a name="G"></a>

G. Tools🛠️

<a name="H"></a>

H. Surveys📝

:man_astronaut: Show your support☕️

Give a ⭐️ if this project helped you!