A Survey on Data Selection for Language Models
This repo is a convenient listing of papers relevant to data selection for language models across all stages of training. It is meant to be a resource for the community, so please contribute if you see anything missing!
For more detail on these and other works, see our survey paper, A Survey on Data Selection for Language Models, by this incredible team: Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, William Yang Wang
<img src="fig1.png" alt="A conceptual demonstration of the data pipeline for language model training" width=75% align="center">
Table of Contents
- Data Selection for Pretraining
- Data Selection for Instruction-Tuning and Multitask Training
- Data Selection for Preference Fine-tuning: Alignment
- Data Selection for In-Context Learning
- Data Selection for Task-specific Fine-tuning
Data Selection for Pretraining
<img src="learning-stages-pretraining.png" alt="Conceptualization of objectives and constraints on data selection for pretraining" width=50% align="right">
Language Filtering
- FastText.zip: Compressing text classification models: 2016<br/> Armand Joulin and Edouard Grave and Piotr Bojanowski and Matthijs Douze and Hervé Jégou and Tomas Mikolov<br/>
- Learning Word Vectors for 157 Languages: 2018<br/> Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas<br/>
- Cross-lingual Language Model Pretraining: 2019<br/> Conneau, Alexis and Lample, Guillaume<br/>
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer: 2020<br/> Raffel, Colin and Shazeer, Noam and Roberts, Adam... 3 hidden ... Zhou, Yanqi and Li, Wei and Liu, Peter J.<br/>
- Language ID in the wild: Unexpected challenges on the path to a thousand-language web text corpus: 2020<br/> Caswell, Isaac and Breiner, Theresa and van Esch, Daan and Bapna, Ankur<br/>
- Unsupervised Cross-lingual Representation Learning at Scale: 2020<br/> Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman... 4 hidden ... Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin<br/>
- CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data: 2020<br/> Wenzek, Guillaume and Lachaux, Marie-Anne and Conneau, Alexis... 1 hidden ... Guzmán, Francisco and Joulin, Armand and Grave, Edouard<br/>
- A reproduction of Apple's bi-directional LSTM models for language identification in short strings: 2021<br/> Toftrup, Mads and Asger Sorensen, Soren and Ciosici, Manuel R. and Assent, Ira<br/>
- Evaluating Large Language Models Trained on Code: 2021<br/> Mark Chen and Jerry Tworek and Heewoo Jun... 52 hidden ... Sam McCandlish and Ilya Sutskever and Wojciech Zaremba<br/>
- mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer: 2021<br/> Xue, Linting and Constant, Noah and Roberts, Adam... 2 hidden ... Siddhant, Aditya and Barua, Aditya and Raffel, Colin<br/>
- Competition-level code generation with AlphaCode: 2022<br/> Li, Yujia and Choi, David and Chung, Junyoung... 20 hidden ... de Freitas, Nando and Kavukcuoglu, Koray and Vinyals, Oriol<br/>
- PaLM: Scaling Language Modeling with Pathways: 2022<br/> Aakanksha Chowdhery and Sharan Narang and Jacob Devlin... 61 hidden ... Jeff Dean and Slav Petrov and Noah Fiedel<br/>
- The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset: 2022<br/> Laurençon, Hugo and Saulnier, Lucile and Wang, Thomas... 48 hidden ... Mitchell, Margaret and Luccioni, Sasha Alexandra and Jernite, Yacine<br/>
- Writing System and Speaker Metadata for 2,800+ Language Varieties: 2022<br/> van Esch, Daan and Lucassen, Tamar and Ruder, Sebastian and Caswell, Isaac and Rivera, Clara<br/>
- FinGPT: Large Generative Models for a Small Language: 2023<br/> Luukkonen, Risto and Komulainen, Ville and Luoma, Jouni... 5 hidden ... Muennighoff, Niklas and Piktus, Aleksandra and others<br/>
- MC^2: A Multilingual Corpus of Minority Languages in China: 2023<br/> Zhang, Chen and Tao, Mingxu and Huang, Quzhe and Lin, Jiuheng and Chen, Zhibin and Feng, Yansong<br/>
- MADLAD-400: A multilingual and document-level large audited dataset: 2023<br/> Kudugunta, Sneha and Caswell, Isaac and Zhang, Biao... 5 hidden ... Stella, Romi and Bapna, Ankur and others<br/>
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only: 2023<br/> Guilherme Penedo and Quentin Malartic and Daniel Hesslow... 3 hidden ... Baptiste Pannier and Ebtesam Almazrouei and Julien Launay<br/>
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research: 2024<br/> Luca Soldaini and Rodney Kinney and Akshita Bhagia... 30 hidden ... Dirk Groeneveld and Jesse Dodge and Kyle Lo<br/>
Heuristic Approaches
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer: 2020<br/> Raffel, Colin and Shazeer, Noam and Roberts, Adam... 3 hidden ... Zhou, Yanqi and Li, Wei and Liu, Peter J.<br/>
- Language Models are Few-Shot Learners: 2020<br/> Brown, Tom and Mann, Benjamin and Ryder, Nick... 25 hidden ... Radford, Alec and Sutskever, Ilya and Amodei, Dario<br/>
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling: 2020<br/> Leo Gao and Stella Biderman and Sid Black... 6 hidden ... Noa Nabeshima and Shawn Presser and Connor Leahy<br/>
- Evaluating Large Language Models Trained on Code: 2021<br/> Mark Chen and Jerry Tworek and Heewoo Jun... 52 hidden ... Sam McCandlish and Ilya Sutskever and Wojciech Zaremba<br/>
- mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer: 2021<br/> Xue, Linting and Constant, Noah and Roberts, Adam... 2 hidden ... Siddhant, Aditya and Barua, Aditya and Raffel, Colin<br/>
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher: 2022<br/> Jack W. Rae and Sebastian Borgeaud and Trevor Cai... 74 hidden ... Demis Hassabis and Koray Kavukcuoglu and Geoffrey Irving<br/>
- The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset: 2022<br/> Laurençon, Hugo and Saulnier, Lucile and Wang, Thomas... 48 hidden ... Mitchell, Margaret and Luccioni, Sasha Alexandra and Jernite, Yacine<br/>
- HTLM: Hyper-Text Pre-Training and Prompting of Language Models: 2022<br/> Armen Aghajanyan and Dmytro Okhonko and Mike Lewis... 1 hidden ... Hu Xu and Gargi Ghosh and Luke Zettlemoyer<br/>
- LLaMA: Open and Efficient Foundation Language Models: 2023<br/> Hugo Touvron and Thibaut Lavril and Gautier Izacard... 8 hidden ... Armand Joulin and Edouard Grave and Guillaume Lample<br/>
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only: 2023<br/> Guilherme Penedo and Quentin Malartic and Daniel Hesslow... 3 hidden ... Baptiste Pannier and Ebtesam Almazrouei and Julien Launay<br/>
- The foundation model transparency index: 2023<br/> Bommasani, Rishi and Klyman, Kevin and Longpre, Shayne... 2 hidden ... Xiong, Betty and Zhang, Daniel and Liang, Percy<br/>
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research: 2024<br/> Luca Soldaini and Rodney Kinney and Akshita Bhagia... 30 hidden ... Dirk Groeneveld and Jesse Dodge and Kyle Lo<br/>
Data Quality
- KenLM: Faster and Smaller Language Model Queries: 2011<br/> Heafield, Kenneth<br/>
- FastText.zip: Compressing text classification models: 2016<br/> Armand Joulin and Edouard Grave and Piotr Bojanowski and Matthijs Douze and Hervé Jégou and Tomas Mikolov<br/>
- Learning Word Vectors for 157 Languages: 2018<br/> Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas<br/>
- Language Models are Unsupervised Multitask Learners: 2019<br/> Alec Radford and Jeff Wu and Rewon Child and David Luan and Dario Amodei and Ilya Sutskever<br/>
- Language Models are Few-Shot Learners: 2020<br/> Brown, Tom and Mann, Benjamin and Ryder, Nick... 25 hidden ... Radford, Alec and Sutskever, Ilya and Amodei, Dario<br/>
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling: 2020<br/> Leo Gao and Stella Biderman and Sid Black... 6 hidden ... Noa Nabeshima and Shawn Presser and Connor Leahy<br/>
- CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data: 2020<br/> Wenzek, Guillaume and Lachaux, Marie-Anne and Conneau, Alexis... 1 hidden ... Guzmán, Francisco and Joulin, Armand and Grave, Edouard<br/>
- Detoxifying language models risks marginalizing minority voices: 2021<br/> Xu, Albert and Pathak, Eshaan and Wallace, Eric and Gururangan, Suchin and Sap, Maarten and Klein, Dan<br/>
- PaLM: Scaling Language Modeling with Pathways: 2022<br/> Aakanksha Chowdhery and Sharan Narang and Jacob Devlin... 61 hidden ... Jeff Dean and Slav Petrov and Noah Fiedel<br/>
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher: 2022<br/> Jack W. Rae and Sebastian Borgeaud and Trevor Cai... 74 hidden ... Demis Hassabis and Koray Kavukcuoglu and Geoffrey Irving<br/>
- Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection: 2022<br/> Gururangan, Suchin and Card, Dallas and Dreier, Sarah... 2 hidden ... Wang, Zeyu and Zettlemoyer, Luke and Smith, Noah A.<br/>
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts: 2022<br/> Du, Nan and Huang, Yanping and Dai, Andrew M... 21 hidden ... Wu, Yonghui and Chen, Zhifeng and Cui, Claire<br/>
- A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity: 2023<br/> Shayne Longpre and Gregory Yauney and Emily Reif... 5 hidden ... Kevin Robinson and David Mimno and Daphne Ippolito<br/>
- Data Selection for Language Models via Importance Resampling: 2023<br/> Sang Michael Xie and Shibani Santurkar and Tengyu Ma and Percy Liang<br/>
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only: 2023<br/> Guilherme Penedo and Quentin Malartic and Daniel Hesslow... 3 hidden ... Baptiste Pannier and Ebtesam Almazrouei and Julien Launay<br/>
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research: 2024<br/> Luca Soldaini and Rodney Kinney and Akshita Bhagia... 30 hidden ... Dirk Groeneveld and Jesse Dodge and Kyle Lo<br/>
- Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale: 2024<br/> Fan Zhou and Zengzhi Wang and Qian Liu and Junlong Li and Pengfei Liu<br/>
Domain-Specific Selection
- Text data acquisition for domain-specific language models: 2006<br/> Sethy, Abhinav and Georgiou, Panayiotis G. and Narayanan, Shrikanth<br/>
- Intelligent Selection of Language Model Training Data: 2010<br/> Moore, Robert C. and Lewis, William<br/>
- Cynical Selection of Language Model Training Data: 2017<br/> Amittai Axelrod<br/>
- Automatic Document Selection for Efficient Encoder Pretraining: 2022<br/> Feng, Yukun and Xia, Patrick and Van Durme, Benjamin and Sedoc, João<br/>
- Data Selection for Language Models via Importance Resampling: 2023<br/> Sang Michael Xie and Shibani Santurkar and Tengyu Ma and Percy Liang<br/>
- DsDm: Model-Aware Dataset Selection with Datamodels: 2024<br/> Logan Engstrom and Axel Feldmann and Aleksander Madry<br/>
Data Deduplication
- Space/time trade-offs in hash coding with allowable errors: 1970<br/> Bloom, Burton H.<br/>
- Suffix Arrays: A New Method for On-Line String Searches: 1993<br/> Manber, Udi and Myers, Gene<br/>
- On the resemblance and containment of documents: 1997<br/> Broder, A.Z.<br/>
- Similarity Estimation Techniques from Rounding Algorithms: 2002<br/> Charikar, Moses S.<br/>
- URL normalization for de-duplication of web pages: 2009<br/> Agarwal, Amit and Koppula, Hema Swetha and Leela, Krishna P.... 3 hidden ... Haty, Chittaranjan and Roy, Anirban and Sasturkar, Amit<br/>
- Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures: 2019<br/> Pedro Javier Ortiz Suárez and Benoît Sagot and Laurent Romary<br/>
- Language Models are Few-Shot Learners: 2020<br/> Brown, Tom and Mann, Benjamin and Ryder, Nick... 25 hidden ... Radford, Alec and Sutskever, Ilya and Amodei, Dario<br/>
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling: 2020<br/> Leo Gao and Stella Biderman and Sid Black... 6 hidden ... Noa Nabeshima and Shawn Presser and Connor Leahy<br/>
- CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data: 2020<br/> Wenzek, Guillaume and Lachaux, Marie-Anne and Conneau, Alexis... 1 hidden ... Guzmán, Francisco and Joulin, Armand and Grave, Edouard<br/>
- Beyond neural scaling laws: beating power law scaling via data pruning: 2022<br/> Ben Sorscher and Robert Geirhos and Shashank Shekhar and Surya Ganguli and Ari S. Morcos<br/>
- Deduplicating Training Data Makes Language Models Better: 2022<br/> Lee, Katherine and Ippolito, Daphne and Nystrom, Andrew... 1 hidden ... Eck, Douglas and Callison-Burch, Chris and Carlini, Nicholas<br/>
- MTEB: Massive text embedding benchmark: 2022<br/> Muennighoff, Niklas and Tazi, Nouamane and Magne, Loïc and Reimers, Nils<br/>
- PaLM: Scaling Language Modeling with Pathways: 2022<br/> Aakanksha Chowdhery and Sharan Narang and Jacob Devlin... 61 hidden ... Jeff Dean and Slav Petrov and Noah Fiedel<br/>
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher: 2022<br/> Jack W. Rae and Sebastian Borgeaud and Trevor Cai... 74 hidden ... Demis Hassabis and Koray Kavukcuoglu and Geoffrey Irving<br/>
- SGPT: GPT sentence embeddings for semantic search: 2022<br/> Muennighoff, Niklas<br/>
- The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset: 2022<br/> Laurençon, Hugo and Saulnier, Lucile and Wang, Thomas... 48 hidden ... Mitchell, Margaret and Luccioni, Sasha Alexandra and Jernite, Yacine<br/>
- C-Pack: Packaged resources to advance general Chinese embedding: 2023<br/> Xiao, Shitao and Liu, Zheng and Zhang, Peitian and Muennighoff, Niklas<br/>
- D4: Improving LLM Pretraining via Document De-Duplication and Diversification: 2023<br/> Kushal Tirumala and Daniel Simig and Armen Aghajanyan and Ari S. Morcos<br/>
- Large-scale Near-deduplication Behind BigCode: 2023<br/> Mou, Chenghao<br/>
- Paloma: A Benchmark for Evaluating Language Model Fit: 2023<br/> Ian Magnusson and Akshita Bhagia and Valentin Hofmann... 10 hidden ... Noah A. Smith and Kyle Richardson and Jesse Dodge<br/>
- Quantifying Memorization Across Neural Language Models: 2023<br/> Nicholas Carlini and Daphne Ippolito and Matthew Jagielski and Katherine Lee and Florian Tramer and Chiyuan Zhang<br/>
- SemDeDup: Data-efficient learning at web-scale through semantic deduplication: 2023<br/> Abbas, Amro and Tirumala, Kushal and Simig, Dániel and Ganguli, Surya and Morcos, Ari S<br/>
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only: 2023<br/> Guilherme Penedo and Quentin Malartic and Daniel Hesslow... 3 hidden ... Baptiste Pannier and Ebtesam Almazrouei and Julien Launay<br/>
- What's In My Big Data?: 2023<br/> Elazar, Yanai and Bhagia, Akshita and Magnusson, Ian... 5 hidden ... Soldaini, Luca and Singh, Sameer and others<br/>
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research: 2024<br/> Luca Soldaini and Rodney Kinney and Akshita Bhagia... 30 hidden ... Dirk Groeneveld and Jesse Dodge and Kyle Lo<br/>
- Generative Representational Instruction Tuning: 2024<br/> Muennighoff, Niklas and Su, Hongjin and Wang, Liang... 2 hidden ... Yu, Tao and Singh, Amanpreet and Kiela, Douwe<br/>
Filtering Toxic and Explicit Content
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer: 2020<br/> Raffel, Colin and Shazeer, Noam and Roberts, Adam... 3 hidden ... Zhou, Yanqi and Li, Wei and Liu, Peter J.<br/>
- mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer: 2021<br/> Xue, Linting and Constant, Noah and Roberts, Adam... 2 hidden ... Siddhant, Aditya and Barua, Aditya and Raffel, Colin<br/>
- Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data: 2022<br/> Tim Jansen and Yangling Tong and Victoria Zevallos and Pedro Ortiz Suarez<br/>
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher: 2022<br/> Jack W. Rae and Sebastian Borgeaud and Trevor Cai... 74 hidden ... Demis Hassabis and Koray Kavukcuoglu and Geoffrey Irving<br/>
- The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset: 2022<br/> Laurençon, Hugo and Saulnier, Lucile and Wang, Thomas... 48 hidden ... Mitchell, Margaret and Luccioni, Sasha Alexandra and Jernite, Yacine<br/>
- Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection: 2022<br/> Gururangan, Suchin and Card, Dallas and Dreier, Sarah... 2 hidden ... Wang, Zeyu and Zettlemoyer, Luke and Smith, Noah A.<br/>
- A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity: 2023<br/> Shayne Longpre and Gregory Yauney and Emily Reif... 5 hidden ... Kevin Robinson and David Mimno and Daphne Ippolito<br/>
- AI image training dataset found to include child sexual abuse imagery: 2023<br/> David, Emilia<br/>
- Detecting Personal Information in Training Corpora: an Analysis: 2023<br/> Subramani, Nishant and Luccioni, Sasha and Dodge, Jesse and Mitchell, Margaret<br/>
- GPT-4 Technical Report: 2023<br/> OpenAI and Josh Achiam... 276 hidden ... Juntang Zhuang and William Zhuk and Barret Zoph<br/>
- SantaCoder: don't reach for the stars!: 2023<br/> Allal, Loubna Ben and Li, Raymond and Kocetkov, Denis... 5 hidden ... Gu, Alex and Dey, Manan and others<br/>
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only: 2023<br/> Guilherme Penedo and Quentin Malartic and Daniel Hesslow... 3 hidden ... Baptiste Pannier and Ebtesam Almazrouei and Julien Launay<br/>
- The foundation model transparency index: 2023<br/> Bommasani, Rishi and Klyman, Kevin and Longpre, Shayne... 2 hidden ... Xiong, Betty and Zhang, Daniel and Liang, Percy<br/>
- What's In My Big Data?: 2023<br/> Elazar, Yanai and Bhagia, Akshita and Magnusson, Ian... 5 hidden ... Soldaini, Luca and Singh, Sameer and others<br/>
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research: 2024<br/> Luca Soldaini and Rodney Kinney and Akshita Bhagia... 30 hidden ... Dirk Groeneveld and Jesse Dodge and Kyle Lo<br/>
- OLMo: Accelerating the Science of Language Models: 2024<br/> Groeneveld, Dirk and Beltagy, Iz and Walsh, Pete... 5 hidden ... Magnusson, Ian and Wang, Yizhong and others<br/>
Specialized Selection for Multilingual Models
- BLOOM: A 176B-parameter open-access multilingual language model: 2022<br/> Workshop, BigScience and Scao, Teven Le and Fan, Angela... 5 hidden ... Luccioni, Alexandra Sasha and Yvon, François and others<br/>
- Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets: 2022<br/> Kreutzer, Julia and Caswell, Isaac and Wang, Lisa... 46 hidden ... Ahia, Oghenefego and Agrawal, Sweta and Adeyemi, Mofetoluwa<br/>
- The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset: 2022<br/> Laurençon, Hugo and Saulnier, Lucile and Wang, Thomas... 48 hidden ... Mitchell, Margaret and Luccioni, Sasha Alexandra and Jernite, Yacine<br/>
- What language model to train if you have one million GPU hours?: 2022<br/> Scao, Teven Le and Wang, Thomas and Hesslow, Daniel... 5 hidden ... Muennighoff, Niklas and Phang, Jason and others<br/>
- MADLAD-400: A multilingual and document-level large audited dataset: 2023<br/> Kudugunta, Sneha and Caswell, Isaac and Zhang, Biao... 5 hidden ... Stella, Romi and Bapna, Ankur and others<br/>
- Scaling multilingual language models under constrained data: 2023<br/> Scao, Teven Le<br/>
- Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning: 2024<br/> Shivalika Singh and Freddie Vargus and Daniel Dsouza... 27 hidden ... Ahmet Üstün and Marzieh Fadaee and Sara Hooker<br/>
Data Mixing
- The Nonstochastic Multiarmed Bandit Problem: 2002<br/> Auer, Peter and Cesa-Bianchi, Nicolò and Freund, Yoav and Schapire, Robert E.<br/>
- Distributionally Robust Language Modeling: 2019<br/> Oren, Yonatan and Sagawa, Shiori and Hashimoto, Tatsunori B. and Liang, Percy<br/>
- Distributionally Robust Neural Networks: 2020<br/> Shiori Sagawa and Pang Wei Koh and Tatsunori B. Hashimoto and Percy Liang<br/>
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer: 2020<br/> Raffel, Colin and Shazeer, Noam and Roberts, Adam... 3 hidden ... Zhou, Yanqi and Li, Wei and Liu, Peter J.<br/>
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling: 2020<br/> Leo Gao and Stella Biderman and Sid Black... 6 hidden ... Noa Nabeshima and Shawn Presser and Connor Leahy<br/>
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher: 2022<br/> Jack W. Rae and Sebastian Borgeaud and Trevor Cai... 74 hidden ... Demis Hassabis and Koray Kavukcuoglu and Geoffrey Irving<br/>
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts: 2022<br/> Du, Nan and Huang, Yanping and Dai, Andrew M... 21 hidden ... Wu, Yonghui and Chen, Zhifeng and Cui, Claire<br/>
- Cross-Lingual Supervision improves Large Language Models Pre-training: 2023<br/> Andrea Schioppa and Xavier Garcia and Orhan Firat<br/>
- DoGE: Domain Reweighting with Generalization Estimation: 2023<br/> Simin Fan and Matteo Pagliardini and Martin Jaggi<br/>
- DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining: 2023<br/> Sang Michael Xie and Hieu Pham and Xuanyi Dong... 4 hidden ... Quoc V Le and Tengyu Ma and Adams Wei Yu<br/>
- Efficient Online Data Mixing For Language Model Pre-Training: 2023<br/> Alon Albalak and Liangming Pan and Colin Raffel and William Yang Wang<br/>
- LLaMA: Open and Efficient Foundation Language Models: 2023<br/> Hugo Touvron and Thibaut Lavril and Gautier Izacard... 8 hidden ... Armand Joulin and Edouard Grave and Guillaume Lample<br/>
- Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling: 2023<br/> Biderman, Stella and Schoelkopf, Hailey and Anthony, Quentin Gregory... 7 hidden ... Skowron, Aviya and Sutawika, Lintang and Van Der Wal, Oskar<br/>
- Scaling Data-Constrained Language Models: 2023<br/> Niklas Muennighoff and Alexander M Rush and Boaz Barak... 3 hidden ... Sampo Pyysalo and Thomas Wolf and Colin Raffel<br/>
- Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning: 2023<br/> Mengzhou Xia and Tianyu Gao and Zhiyuan Zeng and Danqi Chen<br/>
- Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models: 2023<br/> Mayee F. Chen and Nicholas Roberts and Kush Bhatia... 1 hidden ... Ce Zhang and Frederic Sala and Christopher Ré<br/>
Data Selection for Instruction-Tuning and Multitask Training
<img src="learning-stages-instruction.png" alt="Conceptualization of objectives and constraints on data selection for instruction-tuning" width=50% align="right">
- The natural language decathlon: Multitask learning as question answering: 2018<br/> McCann, Bryan and Keskar, Nitish Shirish and Xiong, Caiming and Socher, Richard<br/>
- Unifying question answering, text classification, and regression via span extraction: 2019<br/> Keskar, Nitish Shirish and McCann, Bryan and Xiong, Caiming and Socher, Richard<br/>
- Multi-Task Deep Neural Networks for Natural Language Understanding: 2019<br/> Liu, Xiaodong and He, Pengcheng and Chen, Weizhu and Gao, Jianfeng<br/>
- UnifiedQA: Crossing Format Boundaries with a Single QA System: 2020<br/> Khashabi, Daniel and Min, Sewon and Khot, Tushar... 1 hidden ... Tafjord, Oyvind and Clark, Peter and Hajishirzi, Hannaneh<br/>
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer: 2020<br/> Raffel, Colin and Shazeer, Noam and Roberts, Adam... 3 hidden ... Zhou, Yanqi and Li, Wei and Liu, Peter J.<br/>
- Muppet: Massive Multi-task Representations with Pre-Finetuning: 2021<br/> Aghajanyan, Armen and Gupta, Anchit and Shrivastava, Akshat and Chen, Xilun and Zettlemoyer, Luke and Gupta, Sonal<br/>
- Finetuned language models are zero-shot learners: 2021<br/> Wei, Jason and Bosma, Maarten and Zhao, Vincent Y.... 3 hidden ... Du, Nan and Dai, Andrew M. and Le, Quoc V.<br/>
- Cross-task generalization via natural language crowdsourcing instructions: 2021<br/> Mishra, Swaroop and Khashabi, Daniel and Baral, Chitta and Hajishirzi, Hannaneh<br/>
- NL-Augmenter: A framework for task-sensitive natural language augmentation: 2021<br/> Dhole, Kaustubh D and Gangal, Varun and Gehrmann, Sebastian... 5 hidden ... Shrivastava, Ashish and Tan, Samson and others<br/>
- ExT5: Towards extreme multi-task scaling for transfer learning: 2021<br/> Aribandi, Vamsi and Tay, Yi and Schuster, Tal... 5 hidden ... Bahri, Dara and Ni, Jianmo and others<br/>
- Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks: 2022<br/> Wang, Yizhong and Mishra, Swaroop and Alipoormolabashi, Pegah... 29 hidden ... Patro, Sumanta and Dixit, Tanay and Shen, Xudong<br/>
- Scaling instruction-finetuned language models: 2022<br/> Chung, Hyung Won and Hou, Le and Longpre, Shayne... 5 hidden ... Dehghani, Mostafa and Brahma, Siddhartha and others<br/>
- BLOOM+1: Adding language support to BLOOM for zero-shot prompting: 2022<br/> Yong, Zheng-Xin and Schoelkopf, Hailey and Muennighoff, Niklas... 5 hidden ... Kasai, Jungo and Baruwa, Ahmed and others<br/>
- OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization: 2022<br/> Srinivasan Iyer and Xi Victoria Lin and Ramakanth Pasunuru... 12 hidden ... Asli Celikyilmaz and Luke Zettlemoyer and Ves Stoyanov<br/>
- MetaICL: Learning to Learn In Context: 2022<br/> Min, Sewon and Lewis, Mike and Zettlemoyer, Luke and Hajishirzi, Hannaneh<br/>
- Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor: 2022<br/> Honovich, Or and Scialom, Thomas and Levy, Omer and Schick, Timo<br/>
- Crosslingual generalization through multitask finetuning: 2022<br/> Muennighoff, Niklas and Wang, Thomas and Sutawika, Lintang... 5 hidden ... Yong, Zheng-Xin and Schoelkopf, Hailey and others<br/>
- Multitask Prompted Training Enables Zero-Shot Task Generalization: 2022<br/> Victor Sanh and Albert Webson and Colin Raffel... 34 hidden ... Leo Gao and Thomas Wolf and Alexander M Rush<br/>
- Prometheus: Inducing fine-grained evaluation capability in language models: 2023<br/> Kim, Seungone and Shin, Jamin and Cho, Yejin... 5 hidden ... Kim, Sungdong and Thorne, James and others<br/>
- SlimOrca: An Open Dataset of GPT-4 Augmented FLAN Reasoning Traces, with Verification: 2023<br/> Wing Lian and Guan Wang and Bleys Goodson... 1 hidden ... Austin Cook and Chanvichet Vong and "Teknium"<br/>
- Is A.I. Art Stealing from Artists?: 2023<br/> Chayka, Kyle<br/>
- Paul Tremblay, Mona Awad vs. OpenAI, Inc., et al.: 2023<br/> Saveri, Joseph R. and Zirpoli, Cadio and Young, Christopher K.L. and McMahon, Kathleen J.<br/>
- Making Large Language Models Better Data Creators: 2023<br/> Lee, Dong-Ho and Pujara, Jay and Sewak, Mohit and White, Ryen and Jauhar, Sujay<br/>
- The Flan Collection: Designing Data and Methods for Effective Instruction Tuning: 2023<br/> Shayne Longpre and Le Hou and Tu Vu... 5 hidden ... Barret Zoph and Jason Wei and Adam Roberts<br/>
- WizardLM: Empowering large language models to follow complex instructions: 2023<br/> Xu, Can and Sun, Qingfeng and Zheng, Kai... 2 hidden ... Feng, Jiazhan and Tao, Chongyang and Jiang, Daxin<br/>
- LIMA: Less Is More for Alignment: 2023<br/> Chunting Zhou and Pengfei Liu and Puxin Xu... 9 hidden ... Mike Lewis and Luke Zettlemoyer and Omer Levy<br/>
- Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2: 2023<br/> Hamish Ivison and Yizhong Wang and Valentina Pyatkin... 5 hidden ... Noah A. Smith and Iz Beltagy and Hannaneh Hajishirzi<br/>
- Self-Instruct: Aligning Language Models with Self-Generated Instructions: 2023<br/> Wang, Yizhong and Kordi, Yeganeh and Mishra, Swaroop... 1 hidden ... Smith, Noah A. and Khashabi, Daniel and Hajishirzi, Hannaneh<br/>
- What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning: 2023<br/> Liu, Wei and Zeng, Weihao and He, Keqing and Jiang, Yong and He, Junxian<br/>
- Instruction Tuning for Large Language Models: A Survey: 2023<br/> Shengyu Zhang and Linfeng Dong and Xiaoya Li... 5 hidden ... Tianwei Zhang and Fei Wu and Guoyin Wang<br/>
- Stanford Alpaca: An Instruction-following LLaMA model: 2023<br/> Rohan Taori and Ishaan Gulrajani and Tianyi Zhang... 2 hidden ... Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto<br/>
- How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources: 2023<br/> Yizhong Wang and Hamish Ivison and Pradeep Dasigi... 5 hidden ... Noah A. Smith and Iz Beltagy and Hannaneh Hajishirzi<br/>
- OpenAssistant Conversations--Democratizing Large Language Model Alignment: 2023<br/> Köpf, Andreas and Kilcher, Yannic and von Rütte, Dimitri... 5 hidden ... Stanley, Oliver and Nagyfi, Richárd and others<br/>
- OctoPack: Instruction Tuning Code Large Language Models: 2023<br/> Niklas Muennighoff and Qian Liu and Armel Zebaze... 4 hidden ... Xiangru Tang and Leandro von Werra and Shayne Longpre<br/>
- SELF: Language-driven self-evolution for large language model: 2023<br/> Lu, Jianqiao and Zhong, Wanjun and Huang, Wenyong... 3 hidden ... Wang, Weichao and Shang, Lifeng and Liu, Qun<br/>
- #InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models: 2023<br/> Keming Lu and Hongyi Yuan and Zheng Yuan... 2 hidden ... Chuanqi Tan and Chang Zhou and Jingren Zhou<br/>
- Instruction Mining: When Data Mining Meets Large Language Model Finetuning: 2023<br/> Yihan Cao and Yanbin Kang and Chi Wang and Lichao Sun<br/>
- Active Instruction Tuning: Improving Cross-Task Generalization by Training on Prompt Sensitive Tasks: 2023<br/> Po-Nien Kung and Fan Yin and Di Wu and Kai-Wei Chang and Nanyun Peng<br/>
- The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI: 2023<br/> Longpre, Shayne and Mahari, Robert and Chen, Anthony... 5 hidden ... Kabbara, Jad and Perisetla, Kartik and others<br/>
- Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning: 2024<br/> Shivalika Singh and Freddie Vargus and Daniel Dsouza... 27 hidden ... Ahmet Üstün and Marzieh Fadaee and Sara Hooker<br/>
- Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models: 2024<br/> Zhuo, Terry Yue and Zebaze, Armel and Suppattarachai, Nitchakarn... 1 hidden ... de Vries, Harm and Liu, Qian and Muennighoff, Niklas<br/>
- Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model: 2024<br/> Üstün, Ahmet and Aryabumi, Viraat and Yong, Zheng-Xin... 5 hidden ... Ooi, Hui-Lee and Kayid, Amr and others<br/>
- Smaller Language Models are capable of selecting Instruction-Tuning Training Data for Larger Language Models: 2024<br/> Dheeraj Mekala and Alex Nguyen and Jingbo Shang<br/>
- Automated Data Curation for Robust Language Model Fine-Tuning: 2024<br/> Jihai Chen and Jonas Mueller<br/>
Data Selection for Preference Fine-tuning: Alignment
<img src="learning-stages-alignment.png" alt="Conceptualization of objectives and constraints on data selection for alignment" width=50% align="right">
- WebGPT: Browser-assisted question-answering with human feedback: 2021<br/> Nakano, Reiichiro and Hilton, Jacob and Balaji, Suchir... 5 hidden ... Kosaraju, Vineet and Saunders, William and others<br/>
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback: 2022<br/> Bai, Yuntao and Jones, Andy and Ndousse, Kamal... 5 hidden ... Ganguli, Deep and Henighan, Tom and others<br/>
- Understanding Dataset Difficulty with $\mathcal{V}$-Usable Information: 2022<br/> Ethayarajh, Kawin and Choi, Yejin and Swayamdipta, Swabha<br/>
- Constitutional AI: Harmlessness from AI Feedback: 2022<br/> Bai, Yuntao and Kadavath, Saurav and Kundu, Sandipan... 5 hidden ... Mirhoseini, Azalia and McKinnon, Cameron and others<br/>
- Prometheus: Inducing fine-grained evaluation capability in language models: 2023<br/> Kim, Seungone and Shin, Jamin and Cho, Yejin... 5 hidden ... Kim, Sungdong and Thorne, James and others<br/>
- Notus: 2023<br/> Alvaro Bartolome and Gabriel Martin and Daniel Vila<br/>
- UltraFeedback: Boosting Language Models with High-quality Feedback: 2023<br/> Ganqu Cui and Lifan Yuan and Ning Ding... 3 hidden ... Guotong Xie and Zhiyuan Liu and Maosong Sun<br/>
- Exploration with Principles for Diverse AI Supervision: 2023<br/> Liu, Hao and Zaharia, Matei and Abbeel, Pieter<br/>
- WizardLM: Empowering large language models to follow complex instructions: 2023<br/> Xu, Can and Sun, Qingfeng and Zheng, Kai... 2 hidden ... Feng, Jiazhan and Tao, Chongyang and Jiang, Daxin<br/>
- LIMA: Less Is More for Alignment: 2023<br/> Chunting Zhou and Pengfei Liu and Puxin Xu... 9 hidden ... Mike Lewis and Luke Zettlemoyer and Omer Levy<br/>
- Shepherd: A Critic for Language Model Generation: 2023<br/> Tianlu Wang and Ping Yu and Xiaoqing Ellen Tan... 4 hidden ... Luke Zettlemoyer and Maryam Fazel-Zarandi and Asli Celikyilmaz<br/>
- No Robots: 2023<br/> Nazneen Rajani and Lewis Tunstall and Edward Beeching and Nathan Lambert and Alexander M. Rush and Thomas Wolf<br/>
- Starling-7B: Improving LLM Helpfulness & Harmlessness with RLAIF: 2023<br/> Zhu, Banghua and Frick, Evan and Wu, Tianhao and Zhu, Hanlin and Jiao, Jiantao<br/>
- Scaling laws for reward model overoptimization: 2023<br/> Gao, Leo and Schulman, John and Hilton, Jacob<br/>
- SALMON: Self-Alignment with Principle-Following Reward Models: 2023<br/> Zhiqing Sun and Yikang Shen and Hongxin Zhang... 2 hidden ... David Cox and Yiming Yang and Chuang Gan<br/>
- Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback: 2023<br/> Stephen Casper and Xander Davies and Claudia Shi... 26 hidden ... David Krueger and Dorsa Sadigh and Dylan Hadfield-Menell<br/>
- Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2: 2023<br/> Hamish Ivison and Yizhong Wang and Valentina Pyatkin... 5 hidden ... Noah A. Smith and Iz Beltagy and Hannaneh Hajishirzi<br/>
- Llama 2: Open Foundation and Fine-Tuned Chat Models: 2023<br/> Hugo Touvron and Louis Martin and Kevin Stone... 62 hidden ... Robert Stojnic and Sergey Edunov and Thomas Scialom<br/>
- What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning: 2023<br/> Liu, Wei and Zeng, Weihao and He, Keqing and Jiang, Yong and He, Junxian<br/>
- HuggingFace H4 Stack Exchange Preference Dataset: 2023<br/> Lambert, Nathan and Tunstall, Lewis and Rajani, Nazneen and Thrush, Tristan<br/>
- Textbooks Are All You Need: 2023<br/> Gunasekar, Suriya and Zhang, Yi and Aneja, Jyoti... 5 hidden ... de Rosa, Gustavo and Saarikivi, Olli and others<br/>
- Quality-Diversity through AI Feedback: 2023<br/> Herbie Bradley and Andrew Dai and Hannah Teufel... 4 hidden ... Kenneth Stanley and Grégory Schott and Joel Lehman<br/>
- Direct preference optimization: Your language model is secretly a reward model: 2023<br/> Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Ermon, Stefano and Manning, Christopher D and Finn, Chelsea<br/>
- Scaling relationship on learning mathematical reasoning with large language models: 2023<br/> Yuan, Zheng and Yuan, Hongyi and Li, Chengpeng and Dong, Guanting and Tan, Chuanqi and Zhou, Chang<br/>
- The History and Risks of Reinforcement Learning and Human Feedback: 2023<br/> Lambert, Nathan and Gilbert, Thomas Krendl and Zick, Tom<br/>
- Zephyr: Direct distillation of LM alignment: 2023<br/> Tunstall, Lewis and Beeching, Edward and Lambert, Nathan... 5 hidden ... Fourrier, Clémentine and Habib, Nathan and others<br/>
- Perils of Self-Feedback: Self-Bias Amplifies in Large Language Models: 2024<br/> Wenda Xu and Guanglei Zhu and Xuandong Zhao and Liangming Pan and Lei Li and William Yang Wang<br/>
- Suppressing Pink Elephants with Direct Principle Feedback: 2024<br/> Louis Castricato and Nathan Lile and Suraj Anand and Hailey Schoelkopf and Siddharth Verma and Stella Biderman<br/>
- West-of-N: Synthetic Preference Generation for Improved Reward Modeling: 2024<br/> Alizée Pace and Jonathan Mallinson and Eric Malmi and Sebastian Krause and Aliaksei Severyn<br/>
- Statistical Rejection Sampling Improves Preference Optimization: 2024<br/> Liu, Tianqi and Zhao, Yao and Joshi, Rishabh... 1 hidden ... Saleh, Mohammad and Liu, Peter J and Liu, Jialu<br/>
- Self-play fine-tuning converts weak language models to strong language models: 2024<br/> Chen, Zixiang and Deng, Yihe and Yuan, Huizhuo and Ji, Kaixuan and Gu, Quanquan<br/>
- Self-Rewarding Language Models: 2024<br/> Weizhe Yuan and Richard Yuanzhe Pang and Kyunghyun Cho and Sainbayar Sukhbaatar and Jing Xu and Jason Weston<br/>
- Theoretical guarantees on the best-of-n alignment policy: 2024<br/> Beirami, Ahmad and Agarwal, Alekh and Berant, Jonathan... 1 hidden ... Eisenstein, Jacob and Nagpal, Chirag and Suresh, Ananda Theertha<br/>
- KTO: Model Alignment as Prospect Theoretic Optimization: 2024<br/> Ethayarajh, Kawin and Xu, Winnie and Muennighoff, Niklas and Jurafsky, Dan and Kiela, Douwe<br/>
Data Selection for In-Context Learning
<img src="learning-stages-icl.png" alt="Conceptualization of objectives and constraints on data selection for in-context learning" width=50% align="right">

- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks: 2019<br/> Reimers, Nils and Gurevych, Iryna<br/>
- Language Models are Few-Shot Learners: 2020<br/> Brown, Tom and Mann, Benjamin and Ryder, Nick... 25 hidden ... Radford, Alec and Sutskever, Ilya and Amodei, Dario<br/>
- True Few-Shot Learning with Language Models: 2021<br/> Ethan Perez and Douwe Kiela and Kyunghyun Cho<br/>
- Active Example Selection for In-Context Learning: 2022<br/> Zhang, Yiming and Feng, Shi and Tan, Chenhao<br/>
- Careful Data Curation Stabilizes In-context Learning: 2022<br/> Chang, Ting-Yun and Jia, Robin<br/>
- Learning To Retrieve Prompts for In-Context Learning: 2022<br/> Rubin, Ohad and Herzig, Jonathan and Berant, Jonathan<br/>
- Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity: 2022<br/> Lu, Yao and Bartolo, Max and Moore, Alastair and Riedel, Sebastian and Stenetorp, Pontus<br/>
- What Makes Good In-Context Examples for GPT-3?: 2022<br/> Liu, Jiachang and Shen, Dinghan and Zhang, Yizhe and Dolan, Bill and Carin, Lawrence and Chen, Weizhu<br/>
- MetaICL: Learning to Learn In Context: 2022<br/> Min, Sewon and Lewis, Mike and Zettlemoyer, Luke and Hajishirzi, Hannaneh<br/>
- Unified Demonstration Retriever for In-Context Learning: 2023<br/> Li, Xiaonan and Lv, Kai and Yan, Hang... 3 hidden ... Xie, Guotong and Wang, Xiaoling and Qiu, Xipeng<br/>
- Which Examples to Annotate for In-Context Learning? Towards Effective and Efficient Selection: 2023<br/> Mavromatis, Costas and Srinivasan, Balasubramaniam and Shen, Zhengyuan... 1 hidden ... Rangwala, Huzefa and Faloutsos, Christos and Karypis, George<br/>
- Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning: 2023<br/> Xinyi Wang and Wanrong Zhu and Michael Saxon and Mark Steyvers and William Yang Wang<br/>
- Selective Annotation Makes Language Models Better Few-Shot Learners: 2023<br/> Hongjin Su and Jungo Kasai and Chen Henry Wu... 5 hidden ... Luke Zettlemoyer and Noah A. Smith and Tao Yu<br/>
- In-context Example Selection with Influences: 2023<br/> Nguyen, Tai and Wong, Eric<br/>
- Coverage-based Example Selection for In-Context Learning: 2023<br/> Gupta, Shivanshu and Singh, Sameer and Gardner, Matt<br/>
- Compositional exemplars for in-context learning: 2023<br/> Ye, Jiacheng and Wu, Zhiyong and Feng, Jiangtao and Yu, Tao and Kong, Lingpeng<br/>
- Take one step at a time to know incremental utility of demonstration: An analysis on reranking for few-shot in-context learning: 2023<br/> Hashimoto, Kazuma and Raman, Karthik and Bendersky, Michael<br/>
- Ambiguity-aware in-context learning with large language models: 2023<br/> Gao, Lingyu and Chaudhary, Aditi and Srinivasan, Krishna and Hashimoto, Kazuma and Raman, Karthik and Bendersky, Michael<br/>
- IDEAL: Influence-Driven Selective Annotations Empower In-Context Learners in Large Language Models: 2023<br/> Zhang, Shaokun and Xia, Xiaobo and Wang, Zhaoqing... 1 hidden ... Liu, Jiale and Wu, Qingyun and Liu, Tongliang<br/>
- ScatterShot: Interactive In-context Example Curation for Text Transformation: 2023<br/> Wu, Sherry and Shen, Hua and Weld, Daniel S and Heer, Jeffrey and Ribeiro, Marco Tulio<br/>
- Diverse Demonstrations Improve In-context Compositional Generalization: 2023<br/> Levy, Itay and Bogin, Ben and Berant, Jonathan<br/>
- Finding supporting examples for in-context learning: 2023<br/> Li, Xiaonan and Qiu, Xipeng<br/>
- Misconfidence-based Demonstration Selection for LLM In-Context Learning: 2024<br/> Xu, Shangqing and Zhang, Chao<br/>
- In-context Learning with Retrieved Demonstrations for Language Models: A Survey: 2024<br/> Xu, Xin and Liu, Yue and Pasupat, Panupong and Kazemi, Mehran and others<br/>
Data Selection for Task-specific Fine-tuning
<img src="learning-stages-finetuning.png" alt="Conceptualization of objectives and constraints on data selection for task-specific fine-tuning" width=50% align="right">

- A large annotated corpus for learning natural language inference: 2015<br/> Bowman, Samuel R. and Angeli, Gabor and Potts, Christopher and Manning, Christopher D.<br/>
- GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding: 2018<br/> Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel<br/>
- A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference: 2018<br/> Williams, Adina and Nangia, Nikita and Bowman, Samuel<br/>
- Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks: 2019<br/> Jason Phang and Thibault Févry and Samuel R. Bowman<br/>
- Distributionally Robust Neural Networks: 2020<br/> Shiori Sagawa and Pang Wei Koh and Tatsunori B. Hashimoto and Percy Liang<br/>
- Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics: 2020<br/> Swayamdipta, Swabha and Schwartz, Roy and Lourie, Nicholas... 1 hidden ... Hajishirzi, Hannaneh and Smith, Noah A. and Choi, Yejin<br/>
- Intermediate-Task Transfer Learning with Pretrained Language Models: When and Why Does It Work?: 2020<br/> Pruksachatkun, Yada and Phang, Jason and Liu, Haokun... 3 hidden ... Vania, Clara and Kann, Katharina and Bowman, Samuel R.<br/>
- On the Complementarity of Data Selection and Fine Tuning for Domain Adaptation: 2021<br/> Dan Iter and David Grangier<br/>
- FETA: A Benchmark for Few-Sample Task Transfer in Open-Domain Dialogue: 2022<br/> Albalak, Alon and Tuan, Yi-Lin and Jandaghi, Pegah... 3 hidden ... Getoor, Lise and Pujara, Jay and Wang, William Yang<br/>
- LoRA: Low-Rank Adaptation of Large Language Models: 2022<br/> Edward J Hu and Yelong Shen and Phillip Wallis... 2 hidden ... Shean Wang and Lu Wang and Weizhu Chen<br/>
- Training Subset Selection for Weak Supervision: 2022<br/> Lang, Hunter and Vijayaraghavan, Aravindan and Sontag, David<br/>
- On-Demand Sampling: Learning Optimally from Multiple Distributions: 2022<br/> Haghtalab, Nika and Jordan, Michael and Zhao, Eric<br/>
- The Trade-offs of Domain Adaptation for Neural Language Models: 2022<br/> Grangier, David and Iter, Dan<br/>
- Data Pruning for Efficient Model Pruning in Neural Machine Translation: 2023<br/> Azeemi, Abdul and Qazi, Ihsan and Raza, Agha<br/>
- Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models: 2023<br/> Mayee F. Chen and Nicholas Roberts and Kush Bhatia... 1 hidden ... Ce Zhang and Frederic Sala and Christopher Ré<br/>
- D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning: 2023<br/> Adyasha Maharana and Prateek Yadav and Mohit Bansal<br/>
- Improving Few-Shot Generalization by Exploring and Exploiting Auxiliary Data: 2023<br/> Alon Albalak and Colin Raffel and William Yang Wang<br/>
- Efficient Online Data Mixing For Language Model Pre-Training: 2023<br/> Alon Albalak and Liangming Pan and Colin Raffel and William Yang Wang<br/>
- Data-Efficient Finetuning Using Cross-Task Nearest Neighbors: 2023<br/> Ivison, Hamish and Smith, Noah A. and Hajishirzi, Hannaneh and Dasigi, Pradeep<br/>
- Make Every Example Count: On the Stability and Utility of Self-Influence for Learning from Noisy NLP Datasets: 2023<br/> Bejan, Irina and Sokolov, Artem and Filippova, Katja<br/>
- LESS: Selecting Influential Data for Targeted Instruction Tuning: 2024<br/> Mengzhou Xia and Sadhika Malladi and Suchin Gururangan and Sanjeev Arora and Danqi Chen<br/>
Contribution
There are likely some amazing works in the field that we missed, so please contribute to the repo.
Feel free to open a pull request with new papers or create an issue and we can add them for you. Thank you in advance for your efforts!
Citation
We hope this work serves as inspiration for many impactful future works. If you found our work useful, please cite this paper as:
@article{albalak2024survey,
title={A Survey on Data Selection for Language Models},
author={Alon Albalak and Yanai Elazar and Sang Michael Xie and Shayne Longpre and Nathan Lambert and Xinyi Wang and Niklas Muennighoff and Bairu Hou and Liangming Pan and Haewon Jeong and Colin Raffel and Shiyu Chang and Tatsunori Hashimoto and William Yang Wang},
year={2024},
journal={arXiv preprint arXiv:2402.16827},
note={\url{https://arxiv.org/abs/2402.16827}}
}