Jupyter Notebooks to Analyze Common Crawl Data
- analyzing data using the columnar index
  - blocking of internet connections from and to the Islamic Republic of Iran during the November 2019 crawl: net-blocking-iran-cc-main-2019-47.ipynb
  - total number of captures 2013 – 2019, domain coverage and approximation of unique URLs for the .edu top-level domain: cc-main-2013-2019-metrics.ipynb
  - correlations between character sets and languages: correlation-language-charset.ipynb
- analyze the Common Crawl webgraph data sets and interactively explore the graphs: cc-webgraph-statistics
- how to explore WARC files running a notebook on AWS EMR
- truncated record payloads in WARC files:
  - verify that all truncated payloads are annotated by the WARC-Truncated header
  - which MIME types are mostly affected by truncation? Aggregations using the columnar index.
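
As a rough illustration of the MIME type aggregation mentioned in the last item, the following sketch counts truncated captures per detected MIME type. It is not taken from the notebooks. It assumes rows shaped like columnar index records with the columns `content_mime_detected` and `content_truncated`, treats any non-empty `content_truncated` value as "truncated", and uses made-up sample rows and a hypothetical helper name `truncation_by_mime`.

```python
from collections import Counter


def truncation_by_mime(rows):
    """Map each MIME type to (truncated_count, total_count, truncated_share).

    Hypothetical helper: the column names below are assumptions about the
    columnar index schema, not something stated in this README.
    """
    total = Counter()
    truncated = Counter()
    for row in rows:
        mime = row.get("content_mime_detected") or "unknown"
        total[mime] += 1
        # Assumption: a non-empty value (for example "length") marks truncation.
        if row.get("content_truncated"):
            truncated[mime] += 1
    stats = {m: (truncated[m], n, truncated[m] / n) for m, n in total.items()}
    # Most affected MIME types first.
    return dict(sorted(stats.items(), key=lambda kv: kv[1][2], reverse=True))


# Made-up sample rows, for illustration only.
sample_rows = [
    {"content_mime_detected": "application/pdf", "content_truncated": "length"},
    {"content_mime_detected": "application/pdf", "content_truncated": None},
    {"content_mime_detected": "text/html", "content_truncated": None},
    {"content_mime_detected": "text/html", "content_truncated": None},
]

result = truncation_by_mime(sample_rows)
for mime, (n_truncated, n_total, share) in result.items():
    print(f"{mime}: {n_truncated}/{n_total} truncated ({share:.0%})")
```

On real data the same grouping would more likely be expressed as a SQL aggregation over the columnar index; the in-memory version only shows the shape of the computation.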