
A simple scanning system for the cloud

A lightweight scan pipeline for orchestrating third party tools, at scale and (optionally) using serverless infrastructure.

The point of this project is to make it easy to coordinate and parallelize third party tools with a simple scanning interface that produces consistent and database-agnostic output.

Outputs aggregate CSV for humans, and detailed JSON for machines.

Can scan websites and domains for data on their HTTPS and email configuration, third party service usage, accessibility, and other things. Adding new scanners is relatively straightforward.

All scanners can be run locally using native Python multi-threading.

Some scanners can be executed inside Amazon Lambda for much higher levels of parallelization.

Most scanners work by using specialized third party tools, such as SSLyze or trustymail. Each scanner in this repo is meant to add the smallest wrapper possible around the responses returned from these tools.

There is also built-in support for using headless Chrome to efficiently measure sophisticated properties of web services. Especially powerful when combined with Amazon Lambda.

Requirements

domain-scan requires Python 3.6 or 3.7.

To install core dependencies:

pip install -r requirements.txt

You can install scanner- or gatherer-specific dependencies yourself. Or, you can "quick start" by just installing all dependencies for all scanners and/or all gatherers:

pip install -r requirements-scanners.txt
pip install -r requirements-gatherers.txt

If you plan on developing/testing domain-scan itself, install development requirements:

pip install -r requirements-dev.txt

Usage

Scan a domain. You must specify at least one "scanner" with --scan.

./scan whitehouse.gov --scan=pshtt

Scan a list of domains from a CSV. The CSV's header row will be ignored if the first cell starts with "Domain" (case-insensitive).

./scan domains.csv --scan=pshtt
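For example, a domains.csv like the following would have its first row skipped, since the first cell starts with "Domain" (the specific hostnames are just illustrative):

```
Domain
whitehouse.gov
gsa.gov
```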

Run multiple scanners on each domain:

./scan whitehouse.gov --scan=pshtt,sslyze

Append columns to each row with metadata about the scan itself, such as how long each individual scan took:

./scan example.com --scan=pshtt --meta

Scanners

Parallelization

It's important to understand that scans run in parallel by default, and data is streamed to disk immediately after each scan is done.

This makes domain-scan fast, as well as memory-efficient (the entire dataset doesn't need to be read into memory), but the order of result data is unpredictable.

By default, each scanner will spin up 10 parallel threads. You can override this value with --workers. To disable this and run sequentially through each domain (1 worker), use --serial.

If row order is important to you, either disable parallelization, or use the --sort parameter to sort the resulting CSVs once the scans have completed. (Note: Using --sort will cause the entire dataset to be read into memory.)
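The behavior described above can be sketched with Python's standard thread pool: rows are streamed out as each worker finishes, so output order follows completion order, not input order. This is an illustrative sketch, not domain-scan's actual implementation; scan_one and the CSV columns are invented:

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor, as_completed

def scan_one(domain):
    # Stand-in for a real scanner invocation (e.g. pshtt).
    return {"domain": domain, "ok": True}

def scan_all(domains, workers=10):
    # Stream each row out as soon as its scan completes, so the
    # whole dataset never has to sit in memory -- but row order
    # is whatever order the scans happen to finish in.
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["domain", "ok"])
    writer.writeheader()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(scan_one, d) for d in domains]
        for future in as_completed(futures):
            writer.writerow(future.result())
    return out.getvalue()

result = scan_all(["example.com", "example.org"])
```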

Lambda

The domain-scan tool can execute certain compatible scanners in Amazon Lambda, instead of locally.

This can allow the use of hundreds of parallel workers, and can speed up large scans by orders of magnitude. (Assuming that the domains you're scanning are disparate enough to avoid DDoS-ing any particular service!)

See docs/lambda.md for instructions on configuring scanners for use with Amazon Lambda.

Once configured, scans can be run in Lambda using the --lambda flag, like so:

./scan example.com --scan=pshtt,sslyze --lambda

Headless Chrome

This tool has some built-in support for instrumenting headless Chrome, both locally and inside of Amazon Lambda.

Install a recent version of Node (using a user-space version manager such as nvm or nodeenv is recommended).

Then install dependencies:

npm install

Chrome-based scanners use Puppeteer, a Node-based wrapper for headless Chrome that is maintained by the Chrome team. This means that Chrome-based scanners make use of Node, even while domain-scan itself is instrumented in Python. This makes initial setup a little more complicated.

It is recommended to use Lambda in production for Chrome-based scanners -- not only for the increased speed, but because they use a simpler and cleaner method of cross-language communication (the HTTP-based function call to Amazon Lambda itself).

Support for running headless Chrome locally is intended mostly for testing and debugging with fewer moving parts (and without risk of AWS costs). Lambda support is the expected method for production scanning use cases.

See below for how to structure a Chrome-based scanner.

See docs/lambda.md for how to build and deploy Lambda-based scanners.

Options

General options:

Output

All output files are placed into cache/ and results/ directories, whose location defaults to the current directory (./). Override this base directory with --output.

Example: cache/pshtt/whitehouse.gov.json

Example: results/pshtt.csv

It's possible for scans to save multiple CSV rows per domain. For example, the a11y scan produces a row with details for each detected accessibility error.

Example: results/meta.json

Using with Docker

If you're using Docker Compose, run:

docker-compose up

(You may need to use sudo.)

To scan, prefix commands with docker-compose run:

docker-compose run scan <domain> --scan=<scanner>

Gathering hostnames

This tool also includes a facility for gathering domain names that end in one or more given suffixes (e.g. .gov or yahoo.com or .gov.uk) from various sources.

By default, it only fetches third-level and higher domains (second-level domains are excluded).

Usage:

./gather [source] [options]

Or gather hostnames from multiple sources separated by commas:

./gather [source1,source2,...,sourceN] [options]

Right now there is one specific source (Censys.io), plus a general way of sourcing hostnames from URLs or local files under whatever gatherer name is convenient.

Censys.io - The censys gatherer uses data from Censys.io, which has hostnames gathered from observed certificates, through the Google BigQuery API. Censys provides certificates observed from a nightly zmap scan of the IPv4 space, as well as certificates published to public Certificate Transparency logs.

Remote or local CSV - By using any other name besides censys, this will define a gatherer based on an HTTP/HTTPS URL or local path to a CSV. Its only option is a flag named after itself. For example, using a gatherer name of dap will mean that domain-scan expects --dap to point to the URL or local file.

Hostnames found from multiple sources are deduped, and filtered by suffix or base domain according to the options given.
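The dedupe-and-filter step can be illustrated with a small sketch. The function below is invented for illustration and is not gather's actual implementation; see the gather options for the real filtering rules:

```python
def filter_hostnames(sources, suffixes):
    # Merge hostnames from every source, dedupe case-insensitively,
    # and keep only names ending in one of the given suffixes.
    seen = set()
    for hostnames in sources:
        for name in hostnames:
            name = name.lower().strip()
            if name in seen:
                continue
            if any(name.endswith(suffix) for suffix in suffixes):
                seen.add(name)
    return sorted(seen)

hosts = filter_hostnames(
    [["app.gsa.gov", "APP.GSA.GOV", "example.com"], ["vote.fed.us"]],
    suffixes=[".gov", ".fed.us"],
)
```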

The resulting gathered.csv will have the following columns:

See specific usage examples below.

General options:

censys: Data from Censys.io via Google BigQuery

Gathers hostnames from Censys.io via the Google BigQuery API.

Before using this, you need to:

For details on concepts, and how to test access in the web console:

Note that the web console access is based on access given to a Google account, but BigQuery API access via this script depends on access given to Google Cloud service account credentials.

To configure access, set one of two environment variables:

Options:

Example:

Find hostnames ending in either .gov or .fed.us from within Censys.io's certificate database

./gather censys --suffix=.gov,.fed.us

Gathering Usage Examples

To gather .gov hostnames from Censys.io:

./gather censys --suffix=.gov --debug

To gather .gov hostnames from a hosted CSV, such as one from the Digital Analytics Program:

./gather dap --suffix=.gov --dap=https://analytics.usa.gov/data/live/sites-extended.csv

Or to gather federal-only .gov hostnames from Censys' API, a remote CSV, and a local CSV:

./gather censys,dap,private --suffix=.gov --dap=https://analytics.usa.gov/data/live/sites-extended.csv --private=/path/to/private-research.csv --parents=https://github.com/GSA/data/raw/master/dotgov-domains/current-federal.csv

a11y setup

pa11y expects a config file at config/pa11y_config.json. Details and documentation for this config can be found in the pa11y repo.


A brief note on redirects:

For the accessibility scans we're running at 18F, we're using the pshtt scanner to follow redirects before the accessibility scan runs. Pulse.cio.gov is set up to show accessibility scans for live, non-redirecting sites. For example, if aaa.gov redirects to bbb.gov, we will show results for bbb.gov on the site, but not aaa.gov.

However, if you want to include results for redirecting sites, note the following: if aaa.gov redirects to bbb.gov, pa11y will run against bbb.gov, but the result will be recorded for aaa.gov.

In order to get the benefits of the pshtt scanner, all a11y scans must include it. For example, to scan gsa.gov:

./scan gsa.gov --scan=pshtt,a11y

Because of domain-scan's caching, the results of a pshtt scan will be saved in the cache/pshtt folder, and the scan probably does not need to be re-run for every single a11y scan.

Developing new scanners

Scanners are registered by creating a single Python file in the scanners/ directory, where the file is given the name of the scanner (plus the .py extension).

(Scanners that use Chrome are slightly different, require both a Python and JavaScript file, and their differences are documented below.)

Each scanner should define a few top-level functions and one variable that will be referenced at different points.

For an example of how a scanner works, start with scanners/noop.py. The noop scanner is a test scanner that does nothing (no-op), but it implements and documents a scanner's basic Python contract.

Scanners can implement 4 functions (2 required, 2 optional). In order of being called:

Scanners can implement a few top-level variables (1 required, others sometimes required):

In all of the above functions that receive it, environment is a dict that will contain (at least) a scan_method key whose value is either "local" or "lambda".

The environment dict will also include any key/value pairs returned by previous function calls. This means that data returned from init will be contained in the environment dict sent to init_domain. Similarly, data returned from both init and init_domain for a particular domain will be contained in the environment dict sent to the scan method for that domain.

In all of the above functions that receive it, options is a dict that contains a direct representation of the command-line flags given to the ./scan executable.

For example, if the ./scan command is run with the flags --scan=pshtt,sslyze --lambda, they will translate to an options dict that contains (at least) {"scan": "pshtt,sslyze", "lambda": True}.
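Putting these pieces together, a minimal scanner module might look like the following sketch, modeled loosely on scanners/noop.py. The init, init_domain, and scan names come from the contract described above; the to_rows function and headers variable shown here are assumptions about the remaining required pieces, so check scanners/noop.py for the authoritative names:

```python
# scanners/example.py -- a hypothetical no-op style scanner.

def init(environment, options):
    # Runs once per execution. Returned keys are merged into the
    # environment dict passed to later calls.
    return {"started": True}

def init_domain(domain, environment, options):
    # Runs once per domain, before scan(); also merges into environment.
    return {"url": "http://" + domain}

def scan(domain, environment, options):
    # environment contains scan_method plus keys from init/init_domain.
    return {"domain": domain, "url": environment.get("url"), "constant": 42}

def to_rows(data):
    # Translate scan() output into one or more CSV rows.
    return [[data["domain"], data["url"], data["constant"]]]

# Column headers for the CSV rows produced by to_rows().
headers = ["Domain", "URL", "Constant"]

# Simulate the calling sequence and environment merging:
environment = {"scan_method": "local"}
environment.update(init(environment, {}))
environment.update(init_domain("example.com", environment, {}))
rows = to_rows(scan("example.com", environment, {}))
```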

Developing Chrome scanners

This tool has some built-in support for instrumenting headless Chrome, both locally and inside of Amazon Lambda.

To make a scanner that uses headless Chrome, create two files:

The Node file must export the following method as part of its module.exports:

Below is a simplified example of a scan() method. A full scanner will be a bit more complicated -- see scanners/third_parties.js for a real use case.

module.exports = {
  scan: async (domain, environment, options, browser, page) => {

    // Catch each HTTP request made in the page context.
    page.on('request', (request) => {
      // process the request somehow
    });

    // Navigate to the page
    try {
      await page.goto(environment.url);
    } catch (exc) {
      // Error handling, including timeout handling.
    }

  }
}

Note that the corresponding Python file (e.g. scanners/third_parties.py) is still needed, and its init() and init_domain() functions can affect the environment object.

This can be used, for example, to provide a modified starting URL to the Node scan() function in the environment object, based on the results of previous (Python-based) scanners such as pshtt.
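For instance, the Python half's init_domain could choose the starting URL that the Node scan() later reads from environment.url. This is a hypothetical sketch: the supports_https key and the URL-selection logic are invented for illustration, not taken from any real scanner:

```python
# Hypothetical Python half of a Chrome-based scanner
# (e.g. scanners/third_parties.py).

def init_domain(domain, environment, options):
    # Prefer HTTPS if a previous scanner (such as pshtt) marked the
    # domain as supporting it; the Node scan() will start at
    # environment["url"].
    scheme = "https" if environment.get("supports_https") else "http"
    return {"url": "%s://%s/" % (scheme, domain)}

env = {"scan_method": "lambda", "supports_https": True}
env.update(init_domain("example.com", env, {}))
```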

Public domain

This project is in the worldwide public domain. As stated in CONTRIBUTING:

This project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication.

All contributions to this project will be released under the CC0 dedication. By submitting a pull request, you are agreeing to comply with this waiver of copyright interest.