crawlergo

A powerful browser crawler for web vulnerability scanners

English Document | Chinese Document (中文文档)

crawlergo is a browser crawler that uses Chrome headless mode for URL collection. It hooks key positions throughout the page during the DOM rendering stage, automatically fills and submits forms, intelligently triggers JS events, and collects as many entry points exposed by the website as possible. The built-in URL de-duplication module filters out large numbers of pseudo-static URLs, keeps parsing and crawling fast even on large websites, and ultimately produces a high-quality collection of request results.

crawlergo currently supports the following features:

Screenshot

Installation

Please read and confirm the disclaimer carefully before installing and using.

Build

make build
make build_all
  1. crawlergo relies only on the chrome environment to run; download the latest version of chromium.
  2. Go to the download page for the latest version of crawlergo and extract it to any directory. If you are on Linux or macOS, give crawlergo executable permissions (+x).
  3. Alternatively, modify the code and build it yourself.

If you are using a Linux system and chrome reports missing dependencies, please see the TroubleShooting section below.

Quick Start

Go!

Assuming your chromium installation directory is /tmp/chromium/, the following opens up to 10 tabs at the same time and crawls testphp.vulnweb.com:

bin/crawlergo -c /tmp/chromium/chrome -t 10 http://testphp.vulnweb.com/

Docker usage

You can also run crawlergo with Docker, without having to set up the chrome environment yourself:

git clone https://github.com/Qianlitp/crawlergo
docker build . -t crawlergo
docker run crawlergo http://testphp.vulnweb.com/

Using Proxy

bin/crawlergo -c /tmp/chromium/chrome -t 10 --request-proxy socks5://127.0.0.1:7891 http://testphp.vulnweb.com/

Calling crawlergo with python

By default, crawlergo prints the results directly to the screen. Next, we set the output mode to json; sample code for calling it from Python is as follows:

#!/usr/bin/python3
# coding: utf-8

import simplejson
import subprocess


def main():
    target = "http://testphp.vulnweb.com/"
    cmd = ["bin/crawlergo", "-c", "/tmp/chromium/chrome", "-o", "json", target]
    rsp = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    output, error = rsp.communicate()
	#  "--[Mission Complete]--"  is the end-of-task separator string
    result = simplejson.loads(output.decode().split("--[Mission Complete]--")[1])
    req_list = result["req_list"]
    print(req_list[0])


if __name__ == '__main__':
    main()

Crawl Results

When the output mode is set to json, the returned result, after JSON deserialization, contains four parts:
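For a quick look at what those parts are called, the keys of the deserialized result can be printed directly. A minimal sketch, written as a few extra lines at the end of main() in the script above:

    # list the top-level sections of the crawl result
    print(sorted(result.keys()))
    # req_list holds the requests collected for the crawled target's domain
    print(len(result["req_list"]), "requests collected")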

Examples

crawlergo returns the full request and URL, which can be used in a variety of ways:
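For example, each entry in req_list can be replayed with the requests library and handed off to other tooling. A minimal sketch, assuming each entry carries url, method, headers and data fields (as in typical crawlergo JSON output; field names may differ between versions):

#!/usr/bin/python3
# coding: utf-8

import requests


def replay(req_list):
    # re-send every request collected by crawlergo
    for req in req_list:
        requests.request(
            method=req.get("method", "GET"),
            url=req["url"],
            headers=req.get("headers") or {},
            data=req.get("data") or None,
            timeout=10,
        )

Combined with the earlier script, replay(result["req_list"]) re-issues every collected request.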

Bypass headless detect

crawlergo can bypass headless mode detection by default. You can verify this against the test page below:

https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html

TroubleShooting

Parameters

Required parameters

Basic parameters

Expand input URL

Form auto-fill

Advanced settings for the crawling process

Other

Follow me

Weibo: @9ian1i | Twitter: @9ian1i

Related articles: A browser crawler practice for web vulnerability scanning