
<p align="center"> <img src="assets/icon.png" width="150"> <br /> <br /> <a href="https://huggingface.co/datasets/microsoft/RedStone"><img alt="Hugging Face Data Index" src="https://img.shields.io/badge/Hugging%20Face-Data%20Index-orange?logo=huggingface" /></a> <a href="https://arxiv.org/abs/2412.03398"><img alt="ArXiv Paper" src="https://img.shields.io/badge/ArXiv-2412.03398-green.svg" /></a> <a href="https://github.com/microsoft/RedStone/blob/main/LICENSE"><img alt="MIT License" src="https://img.shields.io/badge/license-MIT-blue.svg" /></a> </p>

# REDSTONE: Curating General, Code, Math, and QA Data for Large Language Models

RedStone is an innovative and scalable pipeline designed to extract and process data from a vast amount of web content, facilitating the creation of diverse and comprehensive pre-training datasets. We demonstrate its capabilities by building pre-training datasets across multiple domains, including general, code, mathematics, and question-answering. REDSTONE's flexibility allows it to easily adapt to various specialized fields.

## Dataset

| Datasets | Tokens (B) |
| --- | --- |
| REDSTONE-Web | 3,170.2 |
| REDSTONE-Code | 250.2 |
| REDSTONE-Math | 15.9 |
| REDSTONE-QA | 51.4 |

Note: Since we do not have permission to open-source the processed data, we provide all the code for RedStone to process both general and domain-specific data, along with an index of the high-quality pages that remain after filtering Common Crawl. You can download the raw Common Crawl data, use the provided index to locate the high-quality pages, and process them with RedStone's scripts.
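For illustration, a minimal Python sketch of that workflow is shown below. The column names (`warc_filename`, `warc_record_offset`, `warc_record_length`) are assumptions about the index schema made for this example; check the Hugging Face dataset card for the actual fields and available configs.

```python
# Minimal sketch: fetch one indexed page from Common Crawl.
# NOTE: the index column names used here are assumptions for
# illustration; consult the dataset card for the real schema.
import gzip
import requests
from datasets import load_dataset

CC_BASE = "https://data.commoncrawl.org/"

# Stream the index so the whole dataset is not downloaded up front.
index = load_dataset("microsoft/RedStone", split="train", streaming=True)

for row in index:
    # Each entry is assumed to point at one WARC record in Common Crawl.
    url = CC_BASE + row["warc_filename"]
    start = row["warc_record_offset"]
    end = start + row["warc_record_length"] - 1

    # Common Crawl WARC files are concatenations of individually
    # gzip-compressed records, so a byte-range request returns a
    # self-contained gzip member that can be decompressed on its own.
    resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"}, timeout=60)
    resp.raise_for_status()
    record = gzip.decompress(resp.content)

    print(record[:500].decode("utf-8", errors="replace"))
    break  # demo: stop after the first record
```

Each fetched record can then be passed through RedStone's processing scripts (see Getting Started below).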

If you have the appropriate licenses, we encourage you to use these scripts to reproduce the dataset and contribute it to the open-source community; we will link to the data here for easy access. We also welcome you to use RedStone to build domain-specific datasets beyond code, math, and QA.

## Performance

### General Domain Data

| Datasets | ARC-c | ARC-e | HellaSwag | OpenBookQA | PIQA | Winogrande | AVERAGE |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RedPajama | 0.2270 | 0.4386 | 0.3171 | 0.1900 | 0.5968 | 0.5296 | 0.3832 |
| FineWeb | 0.1928 | 0.4428 | 0.3506 | 0.1740 | 0.6681 | 0.5288 | 0.3929 |
| RefinedWeb | 0.2125 | 0.4369 | 0.3380 | 0.2100 | 0.6491 | 0.5264 | 0.3955 |
| DCLM | 0.2159 | 0.4848 | 0.3614 | 0.1760 | 0.6615 | 0.5082 | 0.4013 |
| FineWeb-Edu | 0.2722 | 0.5648 | 0.3637 | 0.1940 | 0.6676 | 0.5051 | 0.4279 |
| REDSTONE-Web | 0.2662 | 0.5181 | 0.3722 | 0.2340 | 0.6795 | 0.5162 | 0.4310 |

<sub>The results are based on models with 1.3 billion parameters trained on 50 billion tokens.</sub>

### Domain-specific Data

#### REDSTONE-Code

| Dataset | HumanEval pass@1 | HumanEval pass@10 | MBPP pass@1 | MBPP pass@10 |
| --- | --- | --- | --- | --- |
| REDSTONE-Web | 0.0125 | 0.0168 | 0.0751 | 0.1566 |
| + REDSTONE-Code | 0.0555 | 0.1035 | 0.1311 | 0.2458 |

#### REDSTONE-Math

| Dataset | GSM8k | MATH |
| --- | --- | --- |
| OpenWebMath | 3.2503 | 3.1288 |
| REDSTONE-Math | 3.1125 | 3.0557 |

#### REDSTONE-QA

| Model | MMLU | Arc Challenge | Arc Easy | OpenbookQA | Winogrande | AVERAGE |
| --- | --- | --- | --- | --- | --- | --- |
| StableLM-2-1.6B | 0.3135 | 0.3481 | 0.6860 | 0.2780 | 0.6354 | 0.4522 |
| + FLAN v2 | 0.3525 | 0.3601 | 0.6406 | 0.2860 | 0.6125 | 0.4503 |
| + Open Orca | 0.3569 | 0.3089 | 0.5821 | 0.2660 | 0.5675 | 0.4163 |
| + REDSTONE-QA | 0.4582 | 0.3643 | 0.6839 | 0.2760 | 0.6377 | 0.4840 |

<sub>For evaluations on the domain-specific datasets, we used the same architecture as StableLM-2-1.6B.</sub>

## Getting Started

| Domain | Link |
| --- | --- |
| General Domain Data | Getting Started |
| Domain-specific Data | Getting Started |

## Responsible AI FAQ

## Citation

If you find this repository useful, please consider citing our work:

```bibtex
@article{redstone,
  title={{RedStone}: {Curating} General, Code, Math, and {QA} Data for Large Language Models},
  author={Chang, Yaoyao and Cui, Lei and Dong, Li and Huang, Shaohan and Huang, Yangyu and Huang, Yupan and Li, Scarlett and Lv, Tengchao and Ma, Shuming and Sun, Qinzheng and others},
  journal={arXiv preprint arXiv:2412.03398},
  year={2024}
}
```

## License

The content of this project itself is licensed under the [MIT License](LICENSE).

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).

## Contact

For help or issues using RedStone, please submit a GitHub issue.

For other communications related to RedStone, please contact Lei Cui or Furu Wei.