Awesome-instruction-tuning

A curated list of open-source instruction tuning datasets, models, papers, and repositories.

Datasets and Models

Modified from Traditional NLP

Following Longpre et al., we list all existing instruction tuning datasets modified from traditional NLP tasks.

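| Release | Datasets | Number of Tasks | Number of Instances | Model name | Base | Model size |
| --- | --- | --- | --- | --- | --- | --- |
| 2020-05 | UnifiedQA | 46 | 750k | UnifiedQA | RoBERTa | 110-340M |
| 2021-04 | CrossFit | 159 | 71.M | BART-CrossFit | BART | 140M |
| 2021-04 | Natural Inst v1.0 | 61 | 620k | Gen. BART | BART | 140M |
| 2021-09 | Flan 2021 | 62 | 4.4M | Flan-LaMDA | LaMDA | 137B |
| 2021-10 | P3 | 62 | 12M | T0, T0+, T0++ | T5-LM | 3-11B |
| 2021-10 | MetaICL | 142 | 3.5M | MetaICL | GPT-2 | 770M |
| 2021-11 | ExMix | 107 | 500k | ExT5 | T5 | 220M-11B |
| 2022-04 | Super-Natural Inst. | 1613 | 5M | Tk-Instruct | T5-LM, mT5 | 17-13B |
| 2022-10 | GLM | 77 | 12M | GLM-130B | GLM | 130B |
| 2022-10 | Flan 2022 | 1836 | 15M | Flan-T5, Flan-PaLM | T5-LM, PaLM | 10M-540B |
| 2022-11 | xP3 | 71 | 81M | BLOOMz, mT0 | BLOOM, mT5 | 13-176B |
| 2022-12 | Unnatural Inst. | 117 | 64k | T5-LM-Unnat. Inst. | T5-LM | 11B |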

Generated by LLMs

| Release | Model name | Base | Model size | Datasets | Number of Instances | Language |
| --- | --- | --- | --- | --- | --- | --- |
| 2022-12 | GPT-3 Self Inst. | GPT-3 | 175B | Self-Instruct | 82k | En |
| 2023-03-03 | alpaca | LLaMA | 7B | alpaca_data | 52k | En |
| 2023-03-19 | alpaca-lora | LLaMA | 7B, 13B, 30B | alpaca_data, alpaca_data_cleaned | 52k | En |
| 2023-03-23 | Chinese-Vicuna | LLaMA | 7B, 13B | BELLE, GuanacoDataset | 1M | Zh |
| 2023-03-24 | Alpaca-CoT | LLaMA | 7B | dataset | ---- | En, Zh |
| 2023-03-25 | dolly | dolly | 6B | alpaca_data | 52k | En |
| 2023-03-25 | guanaco | LLaMA | 7B | GuanacoDataset | 534k | En, Zh, Ja, De |
| 2023-03-28 | Chinese-LLaMA-Alpaca | LLaMA | 7B | alpaca_data_zh, pCLUE, translation2019zh, alpaca_data, Self-Instruct | 2M | Zh |
| 2023-03-29 | ColossalChat | LLaMA | 7B, 13B | InstructionWild | 104k | En, Zh |
| 2023-03-31 | Luotuo | LLaMA, ChatGLM | 7B, 6B | trans_chinese_alpaca_data | 52k | Zh |
| 2023-03-31 | cerebras-lora-alpaca | Cerebras-GPT | 2.7B | AlpacaDataCleaned | 52k | En |

Multilingual tools

Most existing datasets are in English, yet most of the world's population is under-served in terms of data available in their own languages. How can we ensure that everyone across the world is able to benefit from generative AI? We have developed a straightforward, open-source translation tool based on Helsinki-NLP that can translate English datasets into 100+ languages at no cost. Although the translated datasets may contain some noise, they serve as a viable alternative to costly, high-quality data. See below.

Use of translator.py:

python translator.py model_name source_data_path

Example:

python translator.py Helsinki-NLP/opus-mt-en-zh alpaca_data.json
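For readers who want a feel for what such a translation pass could look like, here is a minimal sketch. It is not the contents of translator.py. It assumes Alpaca-style records with `instruction`, `input`, and `output` fields, and the helper names `make_marian_translator` and `translate_records` are hypothetical. The Helsinki-NLP checkpoint is loaded through the Hugging Face `transformers` Marian classes, which are imported lazily so the record-handling logic can be tried without downloading a model.

```python
from typing import Callable, Dict, List, Sequence

Record = Dict[str, str]
Translator = Callable[[List[str]], List[str]]


def make_marian_translator(model_name: str) -> Translator:
    """Build a batch translator from a Helsinki-NLP checkpoint.

    Hypothetical helper: requires `transformers` and a model download.
    """
    from transformers import MarianMTModel, MarianTokenizer  # lazy import

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translate(texts: List[str]) -> List[str]:
        batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        generated = model.generate(**batch)
        return tokenizer.batch_decode(generated, skip_special_tokens=True)

    return translate


def translate_records(
    records: Sequence[Record],
    translate: Translator,
    fields: Sequence[str] = ("instruction", "input", "output"),  # assumed schema
) -> List[Record]:
    """Return copies of the records with the chosen text fields translated."""
    translated = []
    for record in records:
        new_record = dict(record)
        for field in fields:
            text = record.get(field, "")
            if text:  # leave empty fields untouched
                new_record[field] = translate([text])[0]
        translated.append(new_record)
    return translated
```

With a real model, one might call `translate_records(data, make_marian_translator("Helsinki-NLP/opus-mt-en-zh"))`, where `data` is the list loaded from alpaca_data.json. Passing the translator in as a function keeps the record loop easy to test with a stand-in.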

Our tool is designed to work with alpaca data and the Helsinki-NLP/opus-mt-en-zh model. Different datasets or Helsinki-NLP models yield varying results. Constrained by the model's capabilities, the translation quality may not always be optimal. For example, we observed instances of repeated words in the translations from English to Chinese, which led us to develop "process.py" to eliminate translated prompts containing strings of any length that appear three consecutive times. We provide the final version in "translated_alpaca_data.json".

Use of process.py:

python process.py unprocessed_data_path

Example:

python process.py translated_data.json
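The repetition rule described above (drop a prompt when some string appears three times in a row) can be sketched with a regular-expression backreference. This is only an illustration of the idea, not the actual process.py. The function names, the `min_len` parameter, and the Alpaca-style field names are assumptions.

```python
import re
from typing import Dict, List, Sequence

Record = Dict[str, str]


def has_triple_repeat(text: str, min_len: int = 1) -> bool:
    """True if any substring of at least `min_len` characters occurs
    three times consecutively, e.g. "abcabcabc"."""
    # (.{n,}?) captures a candidate unit; \1\1 requires two more copies
    # immediately after it. DOTALL lets the unit span line breaks.
    pattern = re.compile(r"(.{%d,}?)\1\1" % min_len, re.DOTALL)
    return pattern.search(text) is not None


def drop_repetitive(
    records: Sequence[Record],
    fields: Sequence[str] = ("instruction", "input", "output"),  # assumed schema
) -> List[Record]:
    """Keep only records in which no inspected field has a triple repeat."""
    return [
        record
        for record in records
        if not any(has_triple_repeat(record.get(field, "")) for field in fields)
    ]
```

Note that with `min_len=1` even a run such as "aaa" or "..." counts as a repeat, which matches the "strings of any length" wording literally. Raising `min_len` would make the filter less aggressive.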

Note: the Helsinki-NLP model may have a maximum input sentence length. We discard prompts that exceed this limit before translating them.
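A length pre-filter of this kind might look like the sketch below. It is an assumption about how such a step could be written, not the tool's own code. The default whitespace token counter and the 512-token default are placeholders, and `drop_overlong` is a hypothetical name.

```python
from typing import Callable, Dict, List, Sequence

Record = Dict[str, str]


def drop_overlong(
    records: Sequence[Record],
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
    max_tokens: int = 512,  # placeholder; check the limit of the model in use
    fields: Sequence[str] = ("instruction", "input", "output"),  # assumed schema
) -> List[Record]:
    """Keep only records whose inspected fields all fit within the limit."""
    return [
        record
        for record in records
        if all(count_tokens(record.get(field, "")) <= max_tokens for field in fields)
    ]
```

With a Hugging Face tokenizer loaded for the chosen checkpoint, one could pass `count_tokens=lambda t: len(tokenizer(t).input_ids)` and `max_tokens=tokenizer.model_max_length` so that the count reflects the model's own tokenization rather than whitespace splitting.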

Papers

We have extensively reviewed papers in this field and have listed the most valuable ones below:

Finetuned language models are zero-shot learners 2021.9

Multitask Prompted Training Enables Zero-Shot Task Generalization 2021.10

Training language models to follow instructions with human feedback 2022.3

Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks 2022.4

Unsupervised Cross-Task Generalization via Retrieval Augmentation 2022.4

Instruction Induction: From Few Examples to Natural Language Task Descriptions 2022.5

Scaling Instruction-Finetuned Language Models 2022.10

Guess the Instruction! Flipped Learning Makes Language Models Stronger Zero-Shot Learners 2022.10

Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor 2022.12

Improving Cross-task Generalization of Unified Table-to-text Models with Compositional Task Configurations 2022.12

Self-Instruct: Aligning Language Model with Self Generated Instructions 2022.12

MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning 2022.12

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning 2023.1

In-Context Instruction Learning 2023.2

Repositories

Additionally, we have provided a list of related repositories for further reference.

Instruction

awesome-instruction-learning

awesome-instruction-dataset

ICL

ICL_PaperList

prompt-in-context-learning

Reason

LM-reasoning

LLM-Reasoning-Papers

Chain-of-ThoughtsPapers

Framework

OpenICL