Awesome

Awesome-instruction-tuning

A curated list of open-source instruction-tuning datasets, models, papers, and repositories.

Datasets and Models

Modified from Traditional NLP

Following Longpre et al., we list all existing instruction tuning datasets modified from traditional NLP tasks.

Release	Dataset	Number of Tasks	Number of Instances	Model Name	Base	Model Size
2020-05	UnifiedQA	46	750k	UnifiedQA	RoBERTa	110-340M
2021-04	CrossFit	159	7.1M	BART-CrossFit	BART	140M
2021-04	Natural Inst v1.0	61	620k	Gen. BART	BART	140M
2021-09	Flan 2021	62	4.4M	Flan-LaMDA	LaMDA	137B
2021-10	P3	62	12M	T0, T0+, T0++	T5-LM	3-11B
2021-10	MetaICL	142	3.5M	MetaICL	GPT-2	770M
2021-11	ExMix	107	500k	ExT5	T5	220M-11B
2022-04	Super-Natural Inst.	1613	5M	Tk-Instruct	T5-LM, mT5	11-13B
2022-10	GLM	77	12M	GLM-130B	GLM	130B
2022-10	Flan 2022	1836	15M	Flan-T5, Flan-PaLM	T5-LM, PaLM	10M-540B
2022-11	xP3	71	81M	BLOOMz, mT0	BLOOM, mT5	13-176B
2022-12	Unnatural Inst.	117	64k	T5-LM-Unnat. Inst.	T5-LM	11B

Generated by LLMs

Release	Model Name	Base	Model Size	Datasets	Number of Instances	Language
2022-12	GPT-3 Self Inst.	GPT-3	175B	Self-Instruct	82k	En
2023-03-03	alpaca	LLaMA	7B	alpaca_data	52k	En
2023-03-19	alpaca-lora	LLaMA	7B 13B 30B	alpaca_data, alpaca_data_cleaned	52k	En
2023-03-23	Chinese-Vicuna	LLaMA	7B 13B	BELLE, GuanacoDataset	1M	Zh
2023-03-24	Alpaca-CoT	LLaMA	7B	dataset	----	En Zh
2023-03-25	dolly	GPT-J	6B	alpaca_data	52k	En
2023-03-25	guanaco	LLaMA	7B	GuanacoDataset	534k	En Zh Ja De
2023-03-28	Chinese-LLaMA-Alpaca	LLaMA	7B	alpaca_data_zh, pCLUE, translation2019zh, alpaca_data, Self-Instruct	2M	Zh
2023-03-29	ColossalChat	LLaMA	7B 13B	InstructionWild	104k	En Zh
2023-03-31	Luotuo	LLaMA ChatGLM	7B 6B	trans_chinese_alpaca_data	52k	Zh
2023-03-31	cerebras-lora-alpaca	Cerebras-GPT	2.7B	AlpacaDataCleaned	52k	En

Multilingual tools

Most existing datasets are in English, yet most of the world's population is under-served when it comes to data in their own languages. How can we ensure that everyone across the world is able to benefit from generative AI? We have developed a simple, open-source translation tool based on Helsinki-NLP models that can translate English datasets into 100+ languages at no cost. Although the translated datasets may contain some noise, they serve as a viable alternative to costly, high-quality data. See below.

Use of translator.py:

python translator.py model_name source_data_path

Example:

python translator.py Helsinki-NLP/opus-mt-en-zh alpaca_data.json
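
The internals of translator.py are not shown in this README. As a minimal sketch of what the translation step might look like, assuming the usual Alpaca record layout (`instruction`, `input`, `output` keys) and the Marian classes from Hugging Face `transformers`, consider the following. The function names `translate_records` and `make_marian_translator` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List

# Fields of an Alpaca-style record that hold natural-language text (assumed layout).
TEXT_FIELDS = ("instruction", "input", "output")


def translate_records(
    records: List[Dict[str, str]],
    translate: Callable[[List[str]], List[str]],
) -> List[Dict[str, str]]:
    """Return copies of `records` with every non-empty text field translated."""
    translated = []
    for record in records:
        new_record = dict(record)  # leave the caller's data untouched
        # Skip empty fields; an empty "input" is common in Alpaca data.
        keys = [k for k in TEXT_FIELDS if record.get(k)]
        if keys:
            outputs = translate([record[k] for k in keys])
            for key, text in zip(keys, outputs):
                new_record[key] = text
        translated.append(new_record)
    return translated


def make_marian_translator(model_name: str) -> Callable[[List[str]], List[str]]:
    """Build a batch translator from a Helsinki-NLP checkpoint (downloads weights)."""
    # Imported lazily so the helper above can be used without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translate(texts: List[str]) -> List[str]:
        batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        generated = model.generate(**batch)
        return tokenizer.batch_decode(generated, skip_special_tokens=True)

    return translate
```

With this sketch, translating a file would amount to loading the JSON list, calling `translate_records(data, make_marian_translator("Helsinki-NLP/opus-mt-en-zh"))`, and writing the result back out with `json.dump(..., ensure_ascii=False)` so that non-ASCII text stays readable. Passing the translator in as a callable keeps the record handling independent of any particular model.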

Our tool is designed to work with Alpaca data and the Helsinki-NLP/opus-mt-en-zh model. Different datasets or Helsinki-NLP models yield varying results. Constrained by the model's capabilities, the translation quality may not always be optimal. For example, we observed repeated words in the English-to-Chinese translations, which led us to develop "process.py" to remove translated prompts containing a string of any length that appears three times consecutively. We provide the final version in "translated_alpaca_data.json".

Use of process.py:

python process.py unprocessed_data_path

Example:

python process.py translated_data.json
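
The filtering rule described above (drop any prompt in which some string appears three times in a row) can be expressed with a regular-expression backreference. The snippet below is only a sketch of that rule, not the actual contents of process.py; the names `has_triple_repeat` and `drop_repetitive` and the record field names are assumptions.

```python
import re
from typing import Dict, List, Sequence

# A non-empty substring immediately followed by two more copies of itself,
# i.e. the same string three times in a row. DOTALL lets "." span newlines.
TRIPLE_REPEAT = re.compile(r"(.+?)\1\1", re.DOTALL)


def has_triple_repeat(text: str) -> bool:
    """True if any substring of `text` occurs three consecutive times."""
    return TRIPLE_REPEAT.search(text) is not None


def drop_repetitive(
    records: List[Dict[str, str]],
    fields: Sequence[str] = ("instruction", "input", "output"),  # assumed keys
) -> List[Dict[str, str]]:
    """Keep only records in which no listed field contains a triple repeat."""
    return [
        r for r in records
        if not any(has_triple_repeat(r.get(f, "")) for f in fields)
    ]
```

Because the rule applies to strings of any length, it also rejects harmless text such as an ellipsis written as three dots or a number like 1000. If that turns out to be too aggressive for a given dataset, requiring a minimum length for the repeated unit (for example `(.{2,}?)` in place of `(.+?)`) is one possible adjustment.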

Note: The Helsinki-NLP model may have a maximum input sentence length. We discard prompts that exceed this limit before translating them.
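
How the length limit is measured is not stated here. As a rough sketch, assuming the limit is counted in tokenizer tokens and that records use the Alpaca keys, the pre-translation filter could look like this. `split_by_length` and its parameters are hypothetical names.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def within_limit(
    record: Dict[str, str],
    count_tokens: Callable[[str], int],
    max_tokens: int,
    fields: Sequence[str] = ("instruction", "input", "output"),  # assumed keys
) -> bool:
    """True if every listed field fits within `max_tokens`."""
    return all(count_tokens(record.get(f, "")) <= max_tokens for f in fields)


def split_by_length(
    records: List[Dict[str, str]],
    count_tokens: Callable[[str], int],
    max_tokens: int,
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Partition records into (kept, dropped) according to the token limit."""
    kept, dropped = [], []
    for record in records:
        target = kept if within_limit(record, count_tokens, max_tokens) else dropped
        target.append(record)
    return kept, dropped


# With a Hugging Face tokenizer, one plausible (assumed) choice would be:
#   count_tokens = lambda s: len(tokenizer(s)["input_ids"])
#   max_tokens = tokenizer.model_max_length
```

Returning the dropped records as well makes it easy to check how much data the limit removes before committing to a translation run.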

Papers

We have extensively reviewed papers in this field and have listed the most valuable ones below:

Finetuned Language Models Are Zero-Shot Learners 2021.9

Multitask Prompted Training Enables Zero-Shot Task Generalization 2021.10

Training language models to follow instructions with human feedback 2022.3

Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks 2022.4

Unsupervised Cross-Task Generalization via Retrieval Augmentation 2022.4

Instruction Induction: From Few Examples to Natural Language Task Descriptions 2022.5

Scaling Instruction-Finetuned Language Models 2022.10

Guess the Instruction! Flipped Learning Makes Language Models Stronger Zero-Shot Learners 2022.10

Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor 2022.12

Improving Cross-task Generalization of Unified Table-to-text Models with Compositional Task Configurations 2022.12

Self-Instruct: Aligning Language Model with Self Generated Instructions 2022.12

MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning 2022.12

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning 2023.1

In-Context Instruction Learning 2023.2

Repositories

Additionally, we have provided a list of related repositories for further reference.

Instruction

ICL

Reason

Framework