what's in the box?

Tools associated with "What's in the Box? A Preliminary Analysis of Undesirable Content in the Common Crawl Corpus" by Alexandra Sasha Luccioni and Joseph D. Viviano (ACL-IJCNLP 2021): https://arxiv.org/abs/2105.02732

These tools perform a simple analysis of common NLP corpora used for training language models.

installation

conda env create -f environment.yml

NOTE: environment.yml does not capture all of the dependencies. We ran many different published models against the corpus and generated different environments on different systems to produce the results. This is unfortunate; if you want to use the model wrappers from this tool and run into issues, please file an issue here and I can try to help debug.

quickstart

To see all remote Common Crawl data available:

witb/list_cc.py

The idx parameter selects a single WET file from the Common Crawl to process. This makes it easy to process many files in parallel across multiple machines on a compute cluster: submitting jobs that loop over idx should be sufficient.
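
As a rough illustration of what selecting a file by idx might look like, the sketch below picks the idx-th entry from a gzipped path listing such as wet.paths.gz and turns it into a full URL. This is an assumption about the general approach, not the actual logic in witb/main.py; the function name select_wet_url, the BASE_URL prefix, and the sample paths are all hypothetical.

```python
import gzip

# Assumed prefix: entries in a wet.paths.gz listing are relative paths,
# so a base URL has to be prepended. This value is a guess based on the
# --remote URL in the example command, not taken from witb/main.py.
BASE_URL = "https://commoncrawl.s3.amazonaws.com/"


def select_wet_url(paths_gz: bytes, idx: int, base_url: str = BASE_URL) -> str:
    """Return the full URL of the idx-th WET file in a gzipped path listing."""
    text = gzip.decompress(paths_gz).decode("utf-8")
    # One relative path per line; skip any blank lines.
    paths = [line.strip() for line in text.splitlines() if line.strip()]
    return base_url + paths[idx]


if __name__ == "__main__":
    # Placeholder listing standing in for a downloaded wet.paths.gz.
    sample = gzip.compress(
        b"crawl-data/EXAMPLE/wet/file-0.warc.wet.gz\n"
        b"crawl-data/EXAMPLE/wet/file-1.warc.wet.gz\n"
    )
    print(select_wet_url(sample, 1))
```

Because each idx maps to exactly one file, jobs with different idx values do not overlap and can run independently.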

See below for example usage for a single idx. See data/example.pkl for an example output file.

witb/main.py \
  --remote=https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-50/wet.paths.gz \
  --idx=0 \
  --output=outputs/test.pkl \
  --overwrite
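
For running over many files, one option is to generate one command per idx and hand each to a cluster scheduler. The sketch below only builds and prints the commands, using the flags from the example above. The helper name build_command and the per-idx output naming (outputs/cc_00000.pkl and so on) are assumptions made for illustration; the submission step itself depends on the scheduler and is left out.

```python
import shlex
from typing import List

REMOTE = (
    "https://commoncrawl.s3.amazonaws.com/"
    "crawl-data/CC-MAIN-2020-50/wet.paths.gz"
)


def build_command(idx: int, output_dir: str = "outputs") -> List[str]:
    """Build the witb/main.py invocation for a single WET file index."""
    return [
        "witb/main.py",
        f"--remote={REMOTE}",
        f"--idx={idx}",
        # Hypothetical naming scheme: one output file per idx, so that
        # parallel jobs never write to the same path.
        f"--output={output_dir}/cc_{idx:05d}.pkl",
        "--overwrite",
    ]


if __name__ == "__main__":
    # Print the commands for the first three indices; each line could be
    # passed to a job scheduler as a separate job.
    for idx in range(3):
        print(shlex.join(build_command(idx)))
```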