OpenWebText2

This project is part of EleutherAI's quest to create a massive repository of high-quality text data for training language models.

Very briefly, OpenWebText2 is a large, filtered dataset of text documents scraped from URLs found in Reddit submissions.
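As a rough illustration of the idea (not the project's actual pipeline), the sketch below collects candidate URLs from a list of submission records before any scraping happens. The record layout (`url` and `score` keys), the `min_score` threshold, and the function name `extract_candidate_urls` are all assumptions made for this example; see the documentation for the real filtering rules.

```python
from urllib.parse import urlsplit


def extract_candidate_urls(submissions, min_score=3):
    """Return deduplicated outbound URLs from submission records.

    Assumptions for this sketch: each record is a dict with "url" and
    "score" keys, and min_score is an illustrative threshold only.
    """
    seen = set()
    urls = []
    for sub in submissions:
        url = sub.get("url", "")
        parts = urlsplit(url)

        # Keep only well-formed web links.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue

        # Drop low-scoring submissions (hypothetical quality filter).
        if sub.get("score", 0) < min_score:
            continue

        # Skip links that point back to Reddit itself (assumed rule).
        host = parts.netloc.lower()
        if host == "reddit.com" or host.endswith(".reddit.com"):
            continue

        # Deduplicate while preserving first-seen order.
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

The returned list would then be handed to a scraper; exact-match deduplication here is deliberately simple and stands in for whatever deduplication the real dataset applies.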

The plug and play version of OpenWebText2 contains:

Download Dataset / Documentation

For further information please visit our documentation.

Acknowledgements

researcher2 wrote much of this code, with inspiration and some straight copying of the scraping code found here.

sdtblck kindly put together the Colab notebook and performed a chunk of the scraping.

leogao2 provided overall design guidance and lm_dataformat, and performed another chunk of the scraping.

Colaboratory VMs helped us with about 10% of our overall scraping.

The Eye hosts our processed datasets.

Read The Docs hosts our documentation.