RENDLER :interrobang:
A rendering web-crawler framework for Apache Mesos.
See the accompanying slides for more context.
RENDLER consists of three main components:
- CrawlExecutor extends mesos.Executor
- RenderExecutor extends mesos.Executor
- RenderingCrawler extends mesos.Scheduler and launches tasks with the executors
Quick Start with Vagrant
Requirements
- VirtualBox 4.1.18+
- Vagrant 1.3+
- git (command line tool)
Start the mesos-demo VM
$ wget http://downloads.mesosphere.io/demo/mesos.box -O /tmp/mesos.box
$ vagrant box add --name mesos-demo /tmp/mesos.box
$ git clone https://github.com/mesosphere/RENDLER.git
$ cd RENDLER
$ vagrant up
Now that the VM is running, you can view the Mesos Web UI here: http://10.141.141.10:5050
You can see that one slave is registered and that some CPUs and memory are idle, so let's start RENDLER!
Run RENDLER in the mesos-demo VM
Check implementations of the RENDLER scheduler in the python, go, scala, and cpp directories. Run instructions are provided in each directory.
Feel free to contribute your own!
Generating a pdf of your render graph output
With GraphViz (which dot) installed:
vagrant@mesos:hostfiles $ bin/make-pdf
Generating '/home/vagrant/hostfiles/result.pdf'
Open result.pdf in your favorite viewer to see the rendered result!
Sample Output
Shutting down the mesos-demo VM
# Exit out of the VM
vagrant@mesos:hostfiles $ exit
# Stop the VM
$ vagrant halt
# To delete all traces of the vagrant machine
$ vagrant destroy
Rendler Architecture
Crawl Executor
- Interprets incoming tasks' task.data field as a URL
- Fetches the resource, extracts links from the document
- Sends a framework message to the scheduler containing the crawl result.
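The fetch-and-extract step of the crawl executor could look roughly like the sketch below. It uses only the Python standard library and is not taken from the RENDLER sources: the names LinkParser, extract_links, and crawl are hypothetical, and only anchor tags with an href attribute are treated as links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links such as "/a" against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the absolute URLs linked from an HTML document."""
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links


def crawl(url):
    """Fetch url (the value of task.data) and return the links it contains."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_links(url, html)
```

In a real executor, the list returned by crawl would be packed into a crawl result and sent to the scheduler as a framework message.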
Render Executor
- Interprets incoming tasks' task.data field as a URL
- Fetches the resource, saves a png image to a location accessible to the scheduler.
- Sends a framework message to the scheduler containing the render result.
Intermediate Data Structures
We define some common data types to facilitate communication between the scheduler and the executors. Their default representation is JSON.
results.CrawlResult(
"1234", # taskId
"http://foo.co", # url
["http://foo.co/a", "http://foo.co/b"] # links
)
results.RenderResult(
"1234", # taskId
"http://foo.co", # url
"http://dl.mega.corp/foo.png" # imageUrl
)
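One way to model these two types with a JSON representation is sketched below. The field names taskId, url, links, and imageUrl follow the comments in the example above; the use of dataclasses, the to_json and from_json helpers, and the exact JSON key names are assumptions made for illustration, not necessarily how the results module is written.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class CrawlResult:
    taskId: str
    url: str
    links: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        # Serialize to the JSON form carried in a framework message.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload: str) -> "CrawlResult":
        return cls(**json.loads(payload))


@dataclass
class RenderResult:
    taskId: str
    url: str
    imageUrl: str

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload: str) -> "RenderResult":
        return cls(**json.loads(payload))
```

An executor would call to_json before sending a framework message, and the scheduler would call from_json on receipt.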
Rendler Scheduler
Data Structures
- crawlQueue: list of urls
- renderQueue: list of urls
- processedURLs: set of urls
- crawlResults: list of url tuples
- renderResults: map of urls to imageUrls
Scheduler Behavior
The scheduler accepts one URL as a command-line parameter to seed the render and crawl queues.
- For each URL, create a task in both the render queue and the crawl queue.
- Upon receipt of a crawl result, add an element to the crawl results adjacency list. Append to the render and crawl queues each URL that is not present in the set of processed URLs. Add these enqueued URLs to the set of processed URLs.
- Upon receipt of a render result, add an element to the render results map.
- The crawl and render queues are drained in FCFS order at a rate determined by the resource offer stream. When the queues are empty, the scheduler declines resource offers to make them available to other frameworks running on the cluster.
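The bookkeeping described above can be sketched without any Mesos dependency. The class below is illustrative only: RendlerState, its method names, and the slots parameter (a stand-in for how many tasks a resource offer can hold) are hypothetical, and marking the seed URL as processed up front is an assumption. An empty list from next_tasks corresponds to declining the offer.

```python
from collections import deque


class RendlerState:
    """Queue and result bookkeeping for the scheduler (no Mesos calls)."""

    def __init__(self, seed_url):
        self.crawlQueue = deque([seed_url])
        self.renderQueue = deque([seed_url])
        self.processedURLs = {seed_url}  # assumption: the seed counts as processed
        self.crawlResults = []           # adjacency list of (url, link) tuples
        self.renderResults = {}          # url -> imageUrl

    def on_crawl_result(self, url, links):
        for link in links:
            self.crawlResults.append((url, link))
            if link not in self.processedURLs:
                # Enqueue each unseen URL exactly once in both queues.
                self.processedURLs.add(link)
                self.crawlQueue.append(link)
                self.renderQueue.append(link)

    def on_render_result(self, url, image_url):
        self.renderResults[url] = image_url

    def next_tasks(self, slots):
        """Pop up to `slots` tasks in FCFS order, alternating between queues."""
        tasks = []
        while len(tasks) < slots and (self.crawlQueue or self.renderQueue):
            if self.crawlQueue:
                tasks.append(("crawl", self.crawlQueue.popleft()))
            if len(tasks) < slots and self.renderQueue:
                tasks.append(("render", self.renderQueue.popleft()))
        return tasks
```

A real scheduler would call next_tasks from its resource-offer callback and the two on_*_result methods from its framework-message callback.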