OpenWebText2
This project is part of EleutherAI's quest to create a massive repository of high-quality text data for training language models.
Very briefly, OpenWebText2 is a large, filtered dataset of text documents scraped from URLs found in Reddit submissions.
The plug and play version of OpenWebText2 contains:
- 17,103,059 documents
- 65.86GB uncompressed text
Download Dataset / Documentation
For further information please visit our documentation.
Acknowledgements
- researcher2 wrote much of this code, with inspiration and some straight copying of the scraping code found here.
- sdtblck kindly put together the Colab notebook and performed a chunk of the scraping.
- leogao2 provided overall design guidance, lm_dataformat, and performed another chunk of the scraping.
- Colaboratory VMs helped us with about 10% of our overall scraping.
- The Eye hosts our processed datasets.
- Read The Docs hosts our documentation.