Generative AI for Math: MathPile

This is the official repository for "Generative AI for Math: Part I - MathPile: A Billion-Token-Scale Pretraining Corpus for Math".

Please note that the corpus may be updated; we will announce each new release. We recommend using the latest version.

🔥News

🚀Introduction

High-quality, large-scale corpora are the cornerstone of building powerful foundation models. In this work, we introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Our work differs significantly from previous work in the following characteristics:

<div align="center"> <img src="./static/images/mathpile-features.png" width=45%/> </div>
<div align="center"> <img src="./static/images/mathpile-overview.png" width=75%/> </div> <p>

We hope MathPile helps enhance the mathematical reasoning abilities of language models. See our paper for more technical details.

😋Limitations

👊Statements & License

If the source data of MathPile is governed by a license more restrictive than CC BY-NC-SA 4.0, MathPile adheres to that stricter licensing. In all other cases, it operates under the CC BY-NC-SA 4.0 license. We also plan to release a commercially usable version of the dataset soon.

🌟Projects Using MathPile

Below are some projects that use MathPile, covering scenarios including but not limited to pre-training, data synthesis, and benchmarking:

🥳Citation

If you find our work useful or use MathPile, please cite our paper:

@article{wang2023mathpile,
      title={Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale Pretraining Corpus for Math},
      author={Wang, Zengzhi and Xia, Rui and Liu, Pengfei},
      journal={arXiv preprint arXiv:2312.17120},
      year={2023}
}