Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
Code and data for "Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance"
Data Mixing Laws
The following notebooks contain the code to reproduce the experiments and figures used to discover the data mixing laws:
mix_2_domains.ipynb: two training domains, single validation domain
mix_3_domains.ipynb: multiple training domains, single validation domain
mix_5_domains.ipynb: multiple training domains, multiple validation domains
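The notebooks remain the reference for the actual experiments. As a rough, self-contained illustration of what fitting a mixing law involves in the two-domain case, the sketch below assumes the exponential form L(r) = c + k * exp(t * r) described in the paper, where r is the proportion of one training domain. The function names (fit_mixing_law, predict_loss), the grid search over c, and the synthetic losses are assumptions made for this example and are not taken from the repository.

```python
import math


def fit_mixing_law(ratios, losses, n_grid=2000):
    """Fit L(r) = c + k * exp(t * r) to (ratio, loss) pairs.

    Hypothetical helper: scans candidate values of c below the smallest
    observed loss, and for each one solves the linear least-squares
    problem log(L - c) = log(k) + t * r. Assumes the losses are not all equal.
    """
    lo, hi = min(losses), max(losses)
    mean_r = sum(ratios) / len(ratios)
    var_r = sum((r - mean_r) ** 2 for r in ratios)
    best = None
    for i in range(1, n_grid + 1):
        # Candidate c stays strictly below every loss so the log is defined.
        c = lo - (hi - lo) * i / n_grid
        ys = [math.log(loss - c) for loss in losses]
        mean_y = sum(ys) / len(ys)
        t = sum((r - mean_r) * (y - mean_y) for r, y in zip(ratios, ys)) / var_r
        k = math.exp(mean_y - t * mean_r)
        # Score each candidate by squared error in the original loss space.
        sse = sum((c + k * math.exp(t * r) - loss) ** 2
                  for r, loss in zip(ratios, losses))
        if best is None or sse < best[0]:
            best = (sse, c, k, t)
    return best[1:]


def predict_loss(params, ratio):
    """Evaluate the fitted law at a mixture proportion."""
    c, k, t = params
    return c + k * math.exp(t * ratio)


# Synthetic data generated from known parameters (illustrative only,
# not results from the paper).
true_c, true_k, true_t = 2.0, 1.5, -3.0
train_ratios = [0.1, 0.3, 0.5, 0.7, 0.9]
train_losses = [true_c + true_k * math.exp(true_t * r) for r in train_ratios]

params = fit_mixing_law(train_ratios, train_losses)
predicted = predict_loss(params, 0.6)  # loss at a mixture not used for fitting
print(params, predicted)
```

Fitting on a handful of mixtures and then evaluating the fitted function at unseen proportions is the basic idea; a grid search over c is used here only to keep the example dependency-free.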
Prediction Pipeline
The full prediction pipeline can be reproduced with:
cd pipeline
bash run.sh
Citation
@article{ye2024datamixinglaws,
  title={Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance},
  author={Ye, Jiasheng and Liu, Peiju and Sun, Tianxiang and Zhou, Yunhua and Zhan, Jun and Qiu, Xipeng},
  journal={arXiv preprint arXiv:2403.16952},
  year={2024}
}