Cognito Common Crawl

This program uses pywren to search the Common Crawl dataset.

Setup

virtualenv env/
source env/bin/activate
pip install -r requirements.txt

Set your AWS credentials as [default] in ~/.aws/credentials and make sure your default region is set to us-east-1.
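Before configuring pywren, it can help to sanity-check the AWS files. The sketch below is not part of this project; it is a hypothetical helper that assumes Python 3 and the usual AWS CLI layout, where the `[default]` profile lives in `~/.aws/credentials` and the default region is set under `[default]` in `~/.aws/config`. If you set the region some other way (for example through an environment variable), the region check does not apply.

```python
import configparser


def check_aws_setup(credentials_path, config_path, expected_region="us-east-1"):
    """Return a list of problems found in AWS CLI-style config files.

    Hypothetical helper: the function name and messages are illustrative.
    """
    problems = []

    creds = configparser.ConfigParser()
    creds.read(credentials_path)  # a missing file is silently skipped
    if not creds.has_section("default"):
        problems.append("no [default] profile in credentials file")
    else:
        for key in ("aws_access_key_id", "aws_secret_access_key"):
            if not creds.has_option("default", key):
                problems.append("[default] is missing " + key)

    config = configparser.ConfigParser()
    config.read(config_path)
    # fallback=None also covers the case where the section is absent
    region = config.get("default", "region", fallback=None)
    if region != expected_region:
        problems.append(
            "default region is {!r}, expected {!r}".format(region, expected_region)
        )

    return problems
```

Calling `check_aws_setup` with the expanded paths of `~/.aws/credentials` and `~/.aws/config` returns an empty list when both checks pass.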

Then configure pywren:

pywren get_aws_account_id
pywren create_config --force

Edit the ~/.pywren_config file and specify:

pywren create_bucket
pywren create_role
pywren deploy_lambda

Confirm that everything is working by running pywren test_function.

Configuration

Change the following in cc-lambda.py:

Running the application

Each run spawns multiple AWS Lambda functions that analyze Common Crawl WARC files at scale. Running the application will have an impact on your AWS bill!

The application reads the input/warc.paths file and writes to:

Each time cc-lambda.py is called, it checks the input for WARC paths that have not yet been recorded as processed or failed, and works through only those. Remove processed.paths and failed.paths if you want to re-process all WARC paths.
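The resume behaviour described above can be sketched as a set difference. This is a minimal illustration, not the code in cc-lambda.py: the function names are hypothetical, the files are assumed to hold one WARC path per line, and the locations of processed.paths and failed.paths are passed in as arguments because they are not stated here.

```python
from pathlib import Path


def read_paths(path):
    """Return the non-empty lines of a paths file, or [] if the file is missing."""
    p = Path(path)
    if not p.exists():
        return []
    return [line.strip() for line in p.read_text().splitlines() if line.strip()]


def pending_warc_paths(input_file, processed_file, failed_file):
    """Input WARC paths that appear in neither the processed nor the failed file."""
    done = set(read_paths(processed_file)) | set(read_paths(failed_file))
    # Keep the input order so runs are reproducible
    return [p for p in read_paths(input_file) if p not in done]
```

Deleting the processed and failed files makes `pending_warc_paths` return the full input again, which matches the re-processing advice above.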

HTTP responses that match the search are stored in the MATCH_S3_BUCKET S3 bucket.

$ python cc-lambda.py 
No handlers could be found for logger "pywren.executor"
Overall progress: 1.55%
Going to process 250 WARC paths
Got futures from map(), waiting for results...

crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/warc/CC-MAIN-20190215183319-20190215205319-00000.warc.gz
  - Time (seconds): 191.205149174
  - Processed pages: 44969
  - Ignored pages: 93005
  - Matches: {'aws_re_matcher': 9, 'cognito_matcher': 4}
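Each WARC summary reports its matches as a dictionary keyed by matcher name. If you collect those dictionaries yourself, totals across many WARC files can be computed as in the sketch below. The function name and the idea of gathering the dictionaries into a list are assumptions for illustration; the tool itself may not expose them this way.

```python
from collections import Counter


def total_matches(per_warc_matches):
    """Sum per-matcher counts over an iterable of {matcher_name: count} dicts."""
    totals = Counter()
    for matches in per_warc_matches:
        totals.update(matches)  # adds counts key by key
    return dict(totals)


# Example using the counts from the run above plus a second, made-up WARC result
example = total_matches([
    {"aws_re_matcher": 9, "cognito_matcher": 4},
    {"aws_re_matcher": 1},
])
```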

After running the application a few times and fine-tuning your search, you can leave it running against the entire Common Crawl dataset:

while python cc-lambda.py; do :; done

Debugging

Remember: AWS Lambda sends its logs to CloudWatch, where you can inspect them.

PYWREN_LOGLEVEL=INFO python cc-lambda.py

Costs

As the Common Crawl dataset lives in the Amazon Public Datasets program, you can access and process it without incurring any transfer costs.

The costs you'll incur by running this software are:

The highest cost will come from AWS Lambda. In order to reduce this cost you should:

After running the tool a few times, make sure you also run lambda-cost-calculator:

+---------------------+-----------+--------------------------+-----------------------------+
| Function            | Region    | Cost in the Last Day ($) | Monthly Cost Estimation ($) |
+---------------------+-----------+--------------------------+-----------------------------+
| pywren_cc_search_v3 | us-east-1 | 6.410                    | 192.296                     |
+---------------------+-----------+--------------------------+-----------------------------+
Total monthly cost estimation: $192.296
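The figures in the table are consistent with the monthly estimate being about 30 times the last day's cost (6.410 × 30 ≈ 192.3). The sketch below shows that extrapolation, together with the usual shape of Lambda compute billing (duration multiplied by configured memory). It is an illustration only: the function names are hypothetical, the price per GB-second is a parameter you must take from current AWS pricing, and per-request charges are ignored.

```python
def lambda_compute_cost(duration_seconds, memory_mb, price_per_gb_second):
    """Approximate compute cost of one invocation from duration and memory size."""
    gb_seconds = duration_seconds * (memory_mb / 1024.0)
    return gb_seconds * price_per_gb_second


def monthly_estimate(cost_last_day, days_per_month=30):
    """Extrapolate a monthly cost from one day of spend (assumes steady usage)."""
    return cost_last_day * days_per_month
```

Multiplying the result of `lambda_compute_cost` by the number of WARC paths you plan to process gives a rough budget before starting a long run.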

Monitoring

It is possible to monitor the progress of the analysis functions using:

pywren print_latest_logs | grep total_seen

And the progress of the whole solution using:

pywren print_latest_logs | grep -v Running