Infinity Instruct

<p align="center"> <img src="static/Bk3NbjnJko51MTx1ZCScT2sqnGg.png" width="300"> </p> <p align="center"> <em>Beijing Academy of Artificial Intelligence (BAAI)</em><br/> <em>[Paper][Code][🤗] (to be released soon)</em> </p>

The quality and scale of instruction data are crucial for model performance. Recently, open-source models have increasingly relied on fine-tuning datasets comprising millions of instances, necessitating both high quality and large scale. However, the open-source community has long been constrained by the high costs associated with building such extensive and high-quality instruction fine-tuning datasets, which has limited related research and applications. To address this gap, we are introducing the Infinity Instruct project, aiming to develop a large-scale, high-quality instruction dataset.

News

Flopsera: http://open.flopsera.com/flopsera-open/details/InfinityInstruct

Hugging Face: https://huggingface.co/datasets/BAAI/Infinity-Instruct

GPT-4 automatic evaluation

| Model | MT-Bench | AlpacaEval2.0 |
|---|---|---|
| OpenHermes-2.5-Mistral-7B | 7.5 | 16.2 |
| Mistral-7B-Instruct-v0.2 | 7.6 | 17.1 |
| Llama-3-8B-Instruct* | 8.1 | 22.9 |
| GPT-3.5 Turbo 0613 | 8.4 | 22.7 |
| Mixtral 8x7B v0.1 | 8.3 | 23.7 |
| Gemini Pro | -- | 24.4 |
| InfInstruct-3M-Mistral-7B | 7.3 | 14.3 |
| InfInstruct-Mistral-7B 0608 | 7.8 | 16.9 |
| InfInstruct-Mistral-7B 0612 | 7.9 | 25.1 |
| GPT-4-0613 | 9.2 | 30.2 |
| Llama-3-70B-Instruct* | 9.0 | 34.4 |
| InfInstruct-3M-Llama-3-70B | 8.4 | 21.8 |
| InfInstruct-Llama-3-70B 0608 | 8.9 | 27.1 |
| InfInstruct-Llama-3-70B 0612 | 8.6 | 30.7 |
| InfInstruct-Llama-3-70B 0613 | 8.7 | 31.5 |

\* denotes results taken from public web reports.

Performance on Downstream Tasks

| Model | MMLU | GSM8K | HumanEval | HellaSwag | Average |
|---|---|---|---|---|---|
| Mistral-7B | 56.5 | 48.1 | 14.0 | 35.5 | 38.5 |
| Mistral-7B-Instruct-v0.2 | 59.6 | 45.9 | 32.9 | 64.4 | 50.7 |
| OpenHermes-2.5-Mistral-7B | 61.7 | 73.0 | 41.5 | 80.6 | 64.2 |
| InfInstruct-3M-Mistral-7B | 62.9 | 78.1 | 50.6 | 84.8 | 69.1 |

Overview of Infinity Instruct

Data sources

We collect large-scale instruction data from the open-source community. The data sources are listed below:

| Raw Dataset | Numbers of Rows |
|---|---|
| glaiveai/glaive-code-assistant-v3 | 138157 |
| Replete-AI/code_bagel_hermes-2.5 | 506346 |
| m-a-p/CodeFeedback-Filtered-Instruction | 104848 |
| bigcode/self-oss-instruct-sc2-exec-filter-50k | 50661 |
| codefuse-ai/CodeExercise-Python-27k | 27224 |
| nickrosh/Evol-Instruct-Code-80k-v1 | 78264 |
| TIGER-Lab/MathInstruct | 188486 |
| microsoft/orca-math-word-problems-200k | 200035 |
| MetaMathQa | 395000 |
| teknium/Openhermes-2.5 | 1001551 |
| Math | 320130 |
| Selected subjective instructions | 1362000 |
| Summary | 4372702 |
| Raw Dataset | Numbers of Rows |
|---|---|
| Alpaca GPT4 data | 13490 |
| Alpaca GPT4 data zh | 32589 |
| Baize | 14906 |
| BELLE Generated Chat | 43775 |
| BELLE Multiturn Chat | 210685 |
| BELLE 3.5M CN | 312598 |
| databricks-dolly-15K | 10307 |
| LIMA-sft | 712 |
| CodeContest | 523 |
| LongForm | 3290 |
| ShareGPT-Chinese-English-90k | 8919 |
| UltraChat | 276345 |
| Wizard evol instruct zh | 44738 |
| Wizard evol instruct 196K | 88681 |
| BELLE School Math | 38329 |
| Code Alpaca 20K | 13296 |
| WildChat | 61873 |
| COIG-CQIA | 45793 |
| BAGEL | 55193 |
| DEITA | 10000 |
| Math | 320130 |
| Summary | 1362000 |

The domain distribution of the subjective instruction category is shown in the following figure.

Instruction Selection for Downstream Tasks

To create an objective ranking, we utilize datasets such as Flan and OpenHermes, with a focus on enhancing code and math capabilities. The method includes detailed topic-distribution tagging of the evaluation set (e.g., data structures and sorting in HumanEval). We apply heuristic rules to filter out irrelevant data based on the dataset source (e.g., removing samples involving network or file I/O operations). We then retrieve a subset of the training set that matches the topic distribution of the validation sets.
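The heuristic filtering step can be sketched as follows. This is a minimal illustration, not the project's actual rules: the patterns below (file access, network imports) are hypothetical stand-ins for the kinds of irrelevance criteria described above.

```python
import re

# Hypothetical irrelevance patterns: drop code instructions that involve
# network or file I/O, which sandboxed benchmarks such as HumanEval never test.
IO_PATTERNS = [
    r"\bopen\s*\(",                      # file access
    r"\bimport\s+(os|socket|requests)\b",  # OS / network modules
    r"\burllib\b",
]

def is_relevant(instruction: str) -> bool:
    """Return False if the instruction matches any irrelevance pattern."""
    return not any(re.search(p, instruction) for p in IO_PATTERNS)

samples = [
    "Implement merge sort over a list of integers.",
    "Download a CSV file and save it with open('out.csv', 'w').",
]
kept = [s for s in samples if is_relevant(s)]  # only the merge-sort sample survives
```

In practice such rules would be applied per source dataset, since each source has its own characteristic noise.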

Instruction Generation for High-Quality Response

High-Quality Open Source Instruction Collection and Tag System

We start by collecting high-quality open-source instruction sets. We assign each instruction in the collection a set of tags that describe the abilities and knowledge necessary to complete the instruction. With this tagging system, we can recognize the content distribution of the collection and the abilities required for completing different tasks.
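The tagging system described above can be sketched as a toy example. The project presumably uses an automatic tagging model; the keyword lookup below is purely an illustrative stand-in, and the tag names are hypothetical.

```python
from collections import Counter

# Hypothetical keyword-to-tag lookup standing in for an automatic tagger.
KEYWORD_TAGS = {
    "sort": "data structures & algorithms",
    "integral": "calculus",
    "translate": "multilingual",
}

def tag_instruction(instruction: str) -> list[str]:
    """Assign every tag whose keyword appears in the instruction."""
    text = instruction.lower()
    return [tag for kw, tag in KEYWORD_TAGS.items() if kw in text]

collection = [
    "Sort a linked list in O(n log n) time.",
    "Compute the integral of x**2 from 0 to 1.",
]
# Tag counts over the collection approximate its content distribution.
distribution = Counter(tag for ins in collection for tag in tag_instruction(ins))
```

Aggregating tag counts like this is what makes it possible to see which abilities a collection covers and where it is thin.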

Informative Instruction Selection

This step aims to select the most informative instructions from the whole collection, enhancing LLM performance and improving user experience.
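A minimal sketch of the selection idea: rank samples by a per-sample informativeness score and keep the top-k. The scoring function here (response length as a crude proxy) is purely illustrative; the README does not specify the project's actual scorer.

```python
from typing import Callable

def select_top_k(samples: list[str], score: Callable[[str], float], k: int) -> list[str]:
    """Keep the k samples with the highest informativeness score."""
    return sorted(samples, key=score, reverse=True)[:k]

pool = [
    "Hi.",
    "Explain quicksort's worst case and how randomized pivots avoid it.",
    "Say hello.",
]
# With length as the (toy) score, the substantive instruction wins.
chosen = select_top_k(pool, score=len, k=1)
```

Any real scorer (e.g., a model-based quality or difficulty estimate, as in filtering work such as [1]) can be dropped in for `score` without changing the selection loop.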

Instruction Generation by Data Evolution Strategy

We expand the seed instructions along the directions of breadth, depth, difficulty, and complexity with a method built on [2], and use AI assistants to generate multi-turn data.
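The four evolution directions can be sketched as prompt templates fed to an AI assistant, in the spirit of WizardLM [2]. The template wordings below are illustrative assumptions, not the project's actual prompts.

```python
# One hypothetical rewriting template per evolution axis.
EVOLUTION_PROMPTS = {
    "breadth": "Create a new instruction in the same domain as, but rarer than: {seed}",
    "depth": "Rewrite the following instruction to require one extra reasoning step: {seed}",
    "difficulty": "Rewrite the following instruction to be harder while staying answerable: {seed}",
    "complexity": "Add a realistic constraint to the following instruction: {seed}",
}

def build_evolution_prompt(seed: str, axis: str) -> str:
    """Fill the template for the chosen evolution axis."""
    return EVOLUTION_PROMPTS[axis].format(seed=seed)

prompt = build_evolution_prompt("Sum the integers from 1 to 100.", "depth")
```

Each generated instruction can itself be fed back as a new seed, so repeated rounds compound the difficulty and diversity of the pool.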

Instruction Generation by Model Ability Deficient Diagnosis

We automatically identify weaknesses in the model's capabilities to guide the synthesis of targeted data.
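One way to operationalize this, sketched under the assumption that per-category accuracies are available from a held-out evaluation: flag the categories below a threshold and direct synthesis toward them. The category names and threshold are hypothetical.

```python
def weakest_categories(accuracy: dict[str, float], threshold: float) -> list[str]:
    """Return the (sorted) ability categories scoring below the threshold."""
    return sorted(cat for cat, acc in accuracy.items() if acc < threshold)

# Toy per-category accuracies from a hypothetical held-out evaluation.
acc = {"math": 0.42, "coding": 0.71, "roleplay": 0.88}
targets = weakest_categories(acc, threshold=0.6)  # categories to synthesize more data for
```

Looping evaluate → diagnose → synthesize → retrain turns the diagnosis into a closed feedback cycle rather than a one-off filter.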

Disclaimer

The resources, including code, data, and model weights, associated with this project are restricted for academic research purposes only and cannot be used for commercial purposes. The content produced by any version of Infinity Instruct is influenced by uncontrollable variables such as randomness, and therefore, the accuracy of the output cannot be guaranteed by this project. This project does not accept any legal liability for the content of the model output, nor does it assume responsibility for any losses incurred due to the use of associated resources and output results.

Reference

[1] Li M, Zhang Y, He S, et al. Superfiltering: Weak-to-strong data filtering for fast instruction-tuning[J]. arXiv preprint arXiv:2402.00530, 2024.

[2] Xu C, Sun Q, Zheng K, et al. WizardLM: Empowering large pre-trained language models to follow complex instructions[C]//The Twelfth International Conference on Learning Representations. 2023.

Citation

Our paper, detailing the development and features of the Infinity Instruct dataset, will be released soon on arXiv. Stay tuned!

@article{InfinityInstruct2024,
  title={Infinity Instruct},
  author={Beijing Academy of Artificial Intelligence (BAAI)},
  journal={arXiv preprint arXiv:2406.XXXX},
  year={2024}
}