Website Crawler for PHP

A highly extensible, dependency-free crawler for HTML, PDFs or any other type of document.

Why another page crawler? There are already very good crawlers around, so these were my goals:

Installation

Composer is required to install this library:

composer require nadar/crawler

In order to use the PDF Parser, the optional library smalot/pdfparser must be installed:

composer require smalot/pdfparser

Usage

  1. First, tell the crawler what to do with the results of a crawl run:

Create a handler. Handlers are the classes that interact with the crawler in order to store your content or results somewhere. The afterRun() method runs whenever a URL has been crawled and receives the results:

class MyCrawlHandler implements \Nadar\Crawler\Interfaces\HandlerInterface
{
    // Called once for every crawled URL with the parsed result.
    public function afterRun(\Nadar\Crawler\Result $result)
    {
        echo $result->title . " with content " . $result->content . " for url " . $result->url->getNormalized();
    }

    public function onSetup(\Nadar\Crawler\Crawler $crawler)
    {
        // do some stuff before the crawler runs, maybe truncate the temporary table where the results should be stored.
    }

    public function onEnd(\Nadar\Crawler\Crawler $crawler)
    {
        // runs when the crawler is finished, maybe synchronize your temporary index table with the "real" site index.
    }
}
  2. Then attach the handler and set up all the information the crawler requires:
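// import the classes used below (the storage and runner namespaces are assumed from the package layout)
use Nadar\Crawler\Crawler;
use Nadar\Crawler\Storage\ArrayStorage;
use Nadar\Crawler\Runners\LoopRunner;

$crawler = new Crawler('https://luya.io', new ArrayStorage, new LoopRunner);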

// what kind of document types would you like to parse?
$crawler->addParser(new Nadar\Crawler\Parsers\Html);

// adding the PDF parser increases memory consumption
// $crawler->addParser(new Nadar\Crawler\Parsers\Pdf);

// register your handler in order to interact with the results, maybe store them in a database?
$crawler->addHandler(new MyCrawlHandler);

// setup and start the crawl process
$crawler->setup();
$crawler->run();

Attention: Keep in mind that when you enable the PDF parser and run multiple concurrent requests, memory usage can increase drastically (especially with large PDFs). It is therefore recommended to lower the concurrent value when enabling the PDF parser.
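The comments in the handler above hint at a common pattern: clear a temporary index in onSetup(), fill it in afterRun(), and publish it in onEnd(). A minimal sketch of that pattern with plain PHP arrays follows. The TemporaryIndex class and its method names are hypothetical and not part of nadar/crawler; a real handler would call them from the three hook methods and would probably use a database table instead of arrays.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical in-memory index, not part of nadar/crawler.
 * It stages crawl results and only replaces the live index
 * once the whole run has finished.
 */
final class TemporaryIndex
{
    /** @var array<string, array{title: string, content: string}> staged rows keyed by URL */
    private array $staged = [];

    /** @var array<string, array{title: string, content: string}> rows visible to readers */
    private array $live = [];

    // Call from onSetup(): start every run with an empty staging area.
    public function truncate(): void
    {
        $this->staged = [];
    }

    // Call from afterRun(): keying by URL stores a page crawled twice only once.
    public function add(string $url, string $title, string $content): void
    {
        $this->staged[$url] = ['title' => $title, 'content' => $content];
    }

    // Call from onEnd(): swap the staged rows in as the new live index.
    public function publish(): void
    {
        $this->live = $this->staged;
        $this->staged = [];
    }

    /** @return array<string, array{title: string, content: string}> */
    public function live(): array
    {
        return $this->live;
    }
}

$index = new TemporaryIndex();
$index->truncate();
$index->add('https://luya.io', 'LUYA', 'First version');
$index->add('https://luya.io', 'LUYA', 'Second version'); // same URL, overwrites the first row
$index->publish();

echo count($index->live()), PHP_EOL; // prints 1
```

Inside afterRun() the call could look like $index->add($result->url->getNormalized(), $result->title, $result->content), which uses only the properties shown in the handler above and assumes that title and content are strings.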

Benchmark

These benchmarks will of course vary depending on internet connection, bandwidth and servers, but all tests were run under the same circumstances. The memory peak varies strongly when the PDF parser is used, so we only test with the HTML parser:

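| Index Size | Concurrent Requests | Memory Peak | Time | Storage      |
|------------|---------------------|-------------|------|--------------|
| 308        | 30                  | 6MB         | 19s  | ArrayStorage |
| 308        | 30                  | 6MB         | 20s  | FileStorage  |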

Still looking for a good website to use for benchmarking. See the benchmark.php file for the test setup.
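To collect comparable numbers for your own site, the two measured values can be captured with PHP's built-in microtime() and memory_get_peak_usage(). The sketch below is not the contents of benchmark.php: the measure() helper is a hypothetical name, and the closure is a stand-in workload where the calls to $crawler->setup() and $crawler->run() would go.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: runs a task and reports wall time and peak memory.
 *
 * @return array{seconds: float, peakMb: float}
 */
function measure(callable $task): array
{
    $start = microtime(true);
    $task();
    $seconds = microtime(true) - $start;

    // true = report memory allocated from the system, not only the part PHP is using.
    $peakBytes = memory_get_peak_usage(true);

    return [
        'seconds' => round($seconds, 2),
        'peakMb' => round($peakBytes / 1024 / 1024, 2),
    ];
}

// Stand-in workload; replace the body with $crawler->setup(); $crawler->run();
$stats = measure(function (): void {
    $pages = [];
    for ($i = 0; $i < 1000; $i++) {
        $pages[] = str_repeat('x', 1024);
    }
});

printf("Time: %.2fs, memory peak: %.2fMB\n", $stats['seconds'], $stats['peakMb']);
```

Note that the peak is reported for the whole PHP process, so anything allocated before the task starts is included in the figure.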

Developer Information

For a better understanding, here is an explanation of how the classes are encapsulated and what they are used for.

Lifecycle

Crawler -> Job -> (ItemQueue -> Storage) -> RequestResponse -> Parser -> ParserResult -> Result
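To make the chain above more concrete, here is a toy model of the same flow in plain PHP. It is an illustration only, not the library's code: the function names, the in-memory $site array that replaces real HTTP requests, and the regular-expression link extraction are all assumptions made for the sketch.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for the request/response step: URL => HTML body, no network involved.
$site = [
    '/' => '<title>Home</title><a href="/docs">Docs</a><a href="/about">About</a>',
    '/docs' => '<title>Docs</title><a href="/">Home</a>',
    '/about' => '<title>About</title><a href="/docs">Docs</a>',
];

/**
 * Parser step: turn a response body into a parser result (title and links).
 *
 * @return array{title: string, links: list<string>}
 */
function parseHtml(string $html): array
{
    preg_match('~<title>(.*?)</title>~s', $html, $title);
    preg_match_all('~href="([^"]+)"~', $html, $links);

    return ['title' => $title[1] ?? '', 'links' => $links[1]];
}

/**
 * Crawl loop: take a URL from the queue, fetch, parse, collect, enqueue new links.
 *
 * @param array<string, string> $site
 * @return array<string, string> crawled URL => page title
 */
function crawl(string $startUrl, array $site): array
{
    $queue = [$startUrl];          // plays the role of the item queue
    $seen = [$startUrl => true];   // plays the role of the storage of known URLs
    $results = [];

    while ($queue !== []) {
        $url = array_shift($queue);        // one job per URL
        if (!isset($site[$url])) {
            continue;                      // no response for this URL
        }

        $parsed = parseHtml($site[$url]);  // parser produces a parser result
        $results[$url] = $parsed['title']; // the result a handler would receive

        foreach ($parsed['links'] as $link) {
            if (!isset($seen[$link])) {    // enqueue each URL only once
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }

    return $results;
}

print_r(crawl('/', $site));
```

Each page is visited once even though the pages link to each other, because a URL is only queued the first time it is seen.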