crawler - A DSL for web crawling in Scala

The purpose of this project is to provide a nice DSL wrapper around the cumbersome HtmlUnit Java library. Here is an example taken from a unit test in this package:

import crawler._

class TestCrawler(output: java.io.OutputStream) extends Crawler {
  var result = ""
  def crawl = {
    navigateTo("http://www.google.com") {
      in(form having id("tsf")) {
        in(textField having id("lst-ib")) {
          typeIn("bplawler")
        }
        in(submit having name("btnK")) {
          click ==>
        }
      }
    }
    onCurrentPage {
      result = from(div having id("resultStats")) getTextContent
      val url = from(
        anchor having xPath("//div[@id='field_timetable_file-wrapper']/a")
      ).getAttributes.getNamedItem("href").getTextContent
      download(url).writeTo(output)
    }
  }
}

This TestCrawler class defines a crawl that navigates to Google, finds the form whose id is tsf, types a search term into its text field, and clicks the submit button named btnK. The click takes us to a new page (the search results), where we grab the content of the resultStats div.

It also reads a URL from a link located by XPath and downloads it to the given OutputStream.

Alternatively, you can get the downloaded data as bytes instead of writing it to an OutputStream: download(url).getBytes.

Background

This DSL was created to simplify the code needed to programmatically access web pages and do something meaningful with the content. It is backed by the Java HtmlUnit library, which, according to its web site, provides a "GUI-less browser for Java programs." The library is very good at what it does, but I found that using it generally resulted in code that was pretty difficult to read.

This DSL is also my first attempt to write such a thing in Scala, so really this is an academic project to learn as much as I can about Scala and about writing DSLs in Scala. There are a few brittle areas that could greatly benefit from some clear error handling, but for what I was trying to do, the code here did the trick just fine.

Basic Language Structure

The first part of any web crawl is to provide a starting point. This is done with the navigateTo method. navigateTo takes a string and is followed by the code block that contains the stuff you want to do with this page.

navigateTo("http://www.google.com") { ... }

Inside the code block, you can use the DSL keywords to find individual HTML elements and operate on them. One of the more common keywords is in, which receives as an argument a bit of code that identifies an HTML node on the current page, then opens another code block to do processing within that HTML node. The following excerpt will navigate to the Google home page and find the form element that has an id of tsf.

navigateTo("http://www.google.com") {
  in(form having id("tsf")) { ... }
}

The code block after the in call will operate on the form element that was found. If there was no form with that id on the page, you'll get an error but it will be the one that is generated by HtmlUnit - I have not yet made any effort to wrap the errors nicely. Inside the code block for this form, we can do things like access the individual input fields and enter in values.

navigateTo("http://www.google.com") {
  in(form having id("tsf")) {
    in(textField having id("lst-ib")) {
      typeIn("bplawler")
    }
    ...
  }
}

Here we have expanded the example to find the text field on the Google home page and type in a search term. With this typed in, the next thing we will want to do is submit the form and do our search.

navigateTo("http://www.google.com") {
  in(form having id("tsf")) {
    in(textField having id("lst-ib")) {
      typeIn("bplawler")
    }
    in(submit having name("btnK")) {
      click ==>
    }
  }
}

Clicking the button is as easy as finding the submit button in the HTML and calling click. But what is that weird ==> operator? It turns out that this click on our GUI-less browser will take us to a new web page. The ==> operator without an argument signifies that this new page is the next page we will be working with. So rather than having to use navigateTo again, we can simply end this code block and use the onCurrentPage method to start our next code block.

navigateTo("http://www.google.com") { ... }
onCurrentPage {
  result = from(div having id("resultStats")) getTextContent
}

In this example, we use the from keyword to find a particular HTML element (just as with in), but this time we get something out of the element and put that value in a variable. Remember that this DSL is just an extension of Scala, so we could just as easily call out to another method from within here and do some meaningful work.

Another supported keyword is forAll, which receives an element selector (here, an XPath) and a code block that is run once for each matching element.

navigateTo("http://www.google.com") { ... }
onCurrentPage {
  result = from(div having id("resultStats")) getTextContent
  
  forAll(div having xPath("""//ol[@id = "rso"]/li/div[@class = "vsc"]""")) {
    println(from(anchor having xPath("h3/a")) getTextContent)
  }
}

This invocation of forAll will loop through each individual search result and print out the main anchor text for each.
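
For readers curious how syntax like in(form having id("tsf")) { ... } can be expressed in plain Scala, here is a minimal, self-contained sketch. It is a toy model and not the library's actual implementation: Node, Kind, Selector, attr, currentId, note and the log buffer are all hypothetical names invented for this illustration, and the "page" is an in-memory tree rather than an HtmlUnit page. The idea it shows is that a one-argument method (having) can be called infix, and that a second parameter list taking a by-name block lets a method call look like a built-in control structure.

```scala
import scala.collection.mutable.ListBuffer

// Toy model only: nothing here is taken from the crawler library's source.
object DslSketch {
  // Minimal stand-in for an HTML element.
  final case class Node(tag: String,
                        attrs: Map[String, String] = Map.empty,
                        children: List[Node] = Nil) {
    // This node followed by all of its descendants, in document order.
    def subtree: List[Node] = this :: children.flatMap(_.subtree)
  }

  // A test on a node, produced by helpers such as id("tsf").
  type Selector = Node => Boolean
  def attr(name: String, value: String): Selector =
    n => n.attrs.get(name) == Some(value)
  def id(value: String): Selector = attr("id", value)

  // An element kind such as `form`. The one-argument method `having` can be
  // called infix, which lets `form having id("tsf")` read like a phrase.
  final case class Kind(tag: String) {
    def having(extra: Selector): Selector = n => n.tag == tag && extra(n)
  }
  val form = Kind("form")
  val textField = Kind("input")
  val div = Kind("div")

  // The node that the innermost enclosing block operates on.
  private var current: Node = Node("html")
  // Actions recorded by the toy so that its effects can be inspected.
  val log = ListBuffer.empty[String]

  // Make `node` current while `block` runs, then restore the previous node.
  private def withCurrent(node: Node)(block: => Unit): Unit = {
    val saved = current
    current = node
    try { block } finally { current = saved }
  }

  def navigateTo(page: Node)(block: => Unit): Unit = withCurrent(page)(block)

  // Run the block against the first match under the current node.
  def in(selector: Selector)(block: => Unit): Unit = {
    val target = current.subtree.find(selector)
      .getOrElse(throw new NoSuchElementException("no matching element"))
    withCurrent(target)(block)
  }

  // Run the block once for every match under the current node.
  def forAll(selector: Selector)(block: => Unit): Unit =
    current.subtree.filter(selector).foreach(n => withCurrent(n)(block))

  def currentId: String = current.attrs.getOrElse("id", "")
  def note(message: String): Unit = { log += message; () }
  def typeIn(text: String): Unit = note(s"typed '$text' into #$currentId")

  def main(args: Array[String]): Unit = {
    // A hand-built page: one form with a text field, and two result divs.
    val page = Node("html", children = List(
      Node("form", Map("id" -> "tsf"),
        List(Node("input", Map("id" -> "lst-ib")))),
      Node("div", Map("id" -> "r1", "class" -> "vsc")),
      Node("div", Map("id" -> "r2", "class" -> "vsc"))))

    navigateTo(page) {
      in(form having id("tsf")) {
        in(textField having id("lst-ib")) {
          typeIn("bplawler")
        }
      }
      forAll(div having attr("class", "vsc")) {
        note(s"visited #$currentId")
      }
    }
    log.foreach(println)
  }
}
```

Running the sketch prints one line for the typeIn call and one line per div matched by forAll. Keeping the current node in a single mutable variable is the simplest way to let nested blocks see their enclosing element; it also means a toy like this is not thread-safe.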

Releases

0.6.0 (2013.08.16)
- Switched to Scala 2.10.2 for building (with 2.10 binary compatibility).
- Added #download method.
0.5.0 (2012.06.24)
- Bumping the version number to something that is completely unrelated to circupon.
- During crawler construction, it is now possible to set whether CSS is supported (default is false) and whether JavaScript is supported (default is true).
0.3.3 (2012.06.15)
- Added support for mouseOver in the DSL. Works just like click.