# commoncrawl_downloader
## Example usage
```
docker build -t ccdl .
docker run -e NUM_CORES=8 -v $PWD/output:/app/output -it ccdl 0,1,2,3,4,5,6,7,8,9
```
There are 3679 blocks in total (numbered 0-3678 inclusive). To specify which blocks to download, pass a comma-separated list of block numbers (no spaces) as the argument.
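Typing long block lists by hand gets tedious; for a contiguous range, the argument string can be built with a one-liner in the same style as the repo's own (see "Download order" below):

```python
# Build the comma-separated block argument for blocks 0-9, matching the example above.
print(",".join(map(str, range(0, 10))))  # -> 0,1,2,3,4,5,6,7,8,9
```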
## Resources required
Downloading everything requires about 3.5PB of network ingress in total. The final dataset should be about 200TB (warning: this number is very rough and extrapolated; leave some slack space to be safe!). About 40k core-days (non-hyperthreaded) are also required (again, a very rough estimate from extrapolation).
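For capacity planning it can help to see the per-block averages these (rough) totals imply; the implied ~54GB of output per block is in the same ballpark as the ~40GB per-file figure below, which is about as close as extrapolated numbers get:

```python
# Back-of-the-envelope per-block averages from the rough totals above.
TOTAL_BLOCKS = 3679
INGRESS_PB = 3.5
OUTPUT_TB = 200
CORE_DAYS = 40_000

print(f"ingress per block:   ~{INGRESS_PB * 1000 / TOTAL_BLOCKS:.2f} TB")  # ~0.95 TB
print(f"output per block:    ~{OUTPUT_TB * 1000 / TOTAL_BLOCKS:.1f} GB")   # ~54.4 GB
print(f"core-days per block: ~{CORE_DAYS / TOTAL_BLOCKS:.1f}")             # ~10.9
```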
## Output format
Each block is written out as a ~40GB `.jsonl.zst` file (jsonlines compressed with zstd). Each JSON object in the file has a `text` field containing the webpage and a `meta` field containing metadata about the language, the WARC headers, and the HTTP response headers.
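A minimal sketch for reading one of these files in Python, assuming the `zstandard` package (`pip install zstandard`); the filename is illustrative:

```python
# Stream-decode a block file and inspect the first record.
import io
import json

import zstandard as zstd

# If decompression fails with a window-size error, construct the decompressor
# as zstd.ZstdDecompressor(max_window_size=2**31) instead.
with open("block_0000.jsonl.zst", "rb") as fh:  # illustrative filename
    reader = zstd.ZstdDecompressor().stream_reader(fh)
    for line in io.TextIOWrapper(reader, encoding="utf-8"):
        record = json.loads(line)
        print(record["meta"])        # language, WARC headers, HTTP response headers
        print(record["text"][:200])  # first 200 characters of the page text
        break
```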
## Download order
To process blocks in a randomized but reproducible order, generate the shuffled block list with:

```
python3 -c 'import random; random.seed(42); x = list(range(3679)); random.shuffle(x); print(",".join(map(str, x)))'
```
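If you are splitting the shuffled order across several machines, a sketch like the following (not part of the repo; `NUM_MACHINES` is a hypothetical value for your setup) emits one argument string per machine:

```python
# Sketch: reproduce the seed-42 shuffle from the one-liner above, then
# split the order into one comma-separated argument string per machine.
import random

NUM_BLOCKS = 3679
NUM_MACHINES = 4  # hypothetical: set to the number of machines you have

blocks = list(range(NUM_BLOCKS))
random.seed(42)
random.shuffle(blocks)

chunk = -(-NUM_BLOCKS // NUM_MACHINES)  # ceiling division
for i in range(NUM_MACHINES):
    part = blocks[i * chunk:(i + 1) * chunk]
    print(",".join(map(str, part)))  # pass each line to one `docker run` invocation
```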