crawler - A DSL for web crawling in Scala
The purpose of this project is to provide a nice DSL wrapper around the cumbersome HtmlUnit Java library. Here is an example taken from a unit test in this package:
import crawler._
class TestCrawler(output: java.io.OutputStream) extends Crawler {
  var result = ""
  def crawl = {
    navigateTo("http://www.google.com") {
      in(form having id("tsf")) {
        in(textField having id("lst-ib")) {
          typeIn("bplawler")
        }
        in(submit having name("btnK")) {
          click ==>
        }
      }
    }
    onCurrentPage {
      result = from(div having id("resultStats")) getTextContent
      val url = from(
        anchor having xPath("//div[@id='field_timetable_file-wrapper']/a")
      ).getAttributes.getNamedItem("href").getTextContent
      download(url).writeTo(output)
    }
  }
}
This TestCrawler class defines a crawl that navigates to Google, finds the form whose id is tsf, types a search term into its text field, then clicks the submit button named btnK. That takes us to a new page (the search results), where we grab the content of the resultStats div. It also takes a URL from a link located by XPath and downloads it to the given OutputStream. Alternatively, you can get the bytes directly instead of writing the downloaded data to an OutputStream: download(url).getBytes.
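The relationship between the two ways of consuming a download can be pictured with a small stand-in. This is only a sketch under assumptions: DownloadSketch, Downloaded and viaStream are hypothetical names, and the real type returned by download is not shown in this README.

```scala
import java.io.{ByteArrayOutputStream, OutputStream}

// Hypothetical stand-in; the crawler library's actual download type is not shown here.
object DownloadSketch {
  final class Downloaded(private val bytes: Array[Byte]) {
    // Hands back a copy so callers cannot mutate the stored data.
    def getBytes: Array[Byte] = bytes.clone()

    // Streams the same bytes into any OutputStream (file, socket, buffer).
    def writeTo(out: OutputStream): Unit = {
      out.write(bytes)
      out.flush()
    }
  }

  // Collects what writeTo emits so it can be compared with getBytes.
  def viaStream(d: Downloaded): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    d.writeTo(buffer)
    buffer.toByteArray
  }
}
```

In this model, writing to a ByteArrayOutputStream and calling getBytes yield the same data, so the choice mostly depends on whether the caller already has a stream to write into.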
Background
This DSL was created to simplify the code needed to programmatically access web pages and do something meaningful with the content. It is backed by the Java HtmlUnit library, which, according to its web site, provides a "GUI-less browser for Java programs." The library is very good at what it does, but I found that using it generally resulted in code that was pretty difficult to read.
This DSL is also my first attempt to write such a thing in Scala, so really this is just sort of an academic project to learn as much as I can about Scala and about writing DSLs in Scala. There are a few brittle areas in this thing that could greatly benefit from some clear error handling, but for what I was trying to do, the code here did the trick just fine.
Basic Language Structure
The first part of any web crawl is to provide a starting point. This is done with the navigateTo method. navigateTo takes a string and is followed by the code block that contains the stuff you want to do with this page.
navigateTo("http://www.google.com") { ... }
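The call shape above, a string argument followed by a block, can be expressed in plain Scala with a second, by-name parameter list. The following is a minimal sketch of that idea and not the library's implementation; NavigateSketch, MiniCrawler and currentUrl are hypothetical names, and the "page" is modelled as nothing more than a URL string.

```scala
// Hypothetical toy, not the crawler library: shows how a second,
// by-name parameter list gives the navigateTo(url) { ... } shape.
object NavigateSketch {
  final class MiniCrawler {
    // The "current page" is modelled as just a URL string.
    private var current: Option[String] = None

    // `block` is by-name, so it runs only after the page has been set.
    def navigateTo(url: String)(block: => Unit): Unit = {
      current = Some(url)
      block
    }

    // Runs a block against whatever page is current; fails if none was loaded.
    def onCurrentPage[A](block: => A): A = {
      require(current.isDefined, "no page loaded yet")
      block
    }

    def currentUrl: String = current.getOrElse("")
  }
}
```

Because the block is passed by name, nothing inside the braces is evaluated until navigateTo decides to run it, which is what lets a DSL set up state first.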
Inside the code block, you can use the DSL keywords to find individual HTML elements and operate on them. One of the more common keywords is in, which receives as an argument a bit of code that identifies an HTML node on the current page, then opens another code block to do processing within that HTML node. The following excerpt will navigate to the Google home page and find the form element that has an id of tsf.
navigateTo("http://www.google.com") {
  in(form having id("tsf")) { ... }
}
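An expression such as form having id("tsf") reads like a keyword phrase, but in Scala it can be an ordinary method call written in infix position. The sketch below shows one way this could be modelled; it is an assumption for illustration, not the library's code, and SelectorSketch, Criterion, Selector and ElementKind are hypothetical names.

```scala
// Hypothetical model of `form having id("tsf")`; not the crawler library's types.
object SelectorSketch {
  // A criterion such as id("tsf") or name("btnK").
  sealed trait Criterion
  final case class ById(value: String) extends Criterion
  final case class ByName(value: String) extends Criterion

  def id(value: String): Criterion = ById(value)
  def name(value: String): Criterion = ByName(value)

  // A selector pairs an HTML tag with the criterion used to find it.
  final case class Selector(tag: String, by: Criterion)

  // An element kind such as `form`; `having` is a plain method used infix.
  final class ElementKind(val tag: String) {
    def having(by: Criterion): Selector = Selector(tag, by)
  }

  val form = new ElementKind("form")
  val textField = new ElementKind("input")
}
```

With these definitions in scope, form having id("tsf") evaluates to Selector("form", ById("tsf")), a plain value that a method like in could then resolve against a page.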
The code block after the in call will operate on the form element that was found. If there is no form with that id on the page, you'll get an error, but it will be the one generated by HtmlUnit; I have not yet made any effort to wrap the errors nicely. Inside the code block for this form, we can do things like access the individual input fields and enter values.
navigateTo("http://www.google.com") {
  in(form having id("tsf")) {
    in(textField having id("lst-ib")) {
      typeIn("bplawler")
    }
    ...
  }
}
Here we have expanded the example to find the text field on the Google home page and type in a search term. With this typed in, the next thing we will want to do is submit the form and do our search.
navigateTo("http://www.google.com") {
  in(form having id("tsf")) {
    in(textField having id("lst-ib")) {
      typeIn("bplawler")
    }
    in(submit having name("btnK")) {
      click ==>
    }
  }
}
Clicking the button is as easy as finding the submit button in the HTML and calling click. But what is that weird ==> operator? It turns out that this click on our GUI-less browser will take us to a new web page. The ==> operator without an argument signifies that this new page is the next page we will be working with. So rather than having to use navigateTo again, we can simply end this code block and use the onCurrentPage method to start our next code block.
navigateTo("http://www.google.com") { ... }
onCurrentPage {
  result = from(div having id("resultStats")) getTextContent
}
In this example, we use the from keyword to find a particular HTML element (just as with in), but this time we get something out of the element and put that value in a variable. Remember that this DSL is just an extension of Scala, so we could just as easily call out to another method from within here and do some meaningful work. One other supported keyword is forAll, which receives an XPath-based selector and a subsequent code block that is run once for each matching element.
navigateTo("http://www.google.com") { ... }
onCurrentPage {
  // Dot notation: a postfix getTextContent followed by an expression on the next line would parse as an infix call.
  result = from(div having id("resultStats")).getTextContent
  forAll(div having xPath("""//ol[@id = "rso"]/li/div[@class = "vsc"]""")) {
    println(from(anchor having xPath("h3/a")).getTextContent)
  }
}
This invocation of forAll will loop through each individual search result and print out the main anchor text for each.
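The looping behaviour described above can be pictured as a scope that temporarily switches its current element for each match, so that lookups inside the block are relative to that match. This is a rough sketch under assumptions, not how the library is implemented: ForAllSketch, Node and MiniScope are hypothetical, and matching is done by tag name instead of XPath to keep the example self-contained.

```scala
// Hypothetical toy model of forAll/from; the real DSL resolves XPath via HtmlUnit.
object ForAllSketch {
  // Toy node: a tag, its text, and child nodes.
  final case class Node(tag: String, text: String, children: List[Node] = Nil)

  final class MiniScope(root: Node) {
    // Node that lookups are currently resolved against.
    private var context: Node = root

    // Runs `block` once per matching child, with that child as the context,
    // and restores the previous context afterwards even if the block throws.
    def forAll(tag: String)(block: => Unit): Unit = {
      val saved = context
      val matches = saved.children.filter(_.tag == tag)
      try {
        matches.foreach { child =>
          context = child
          block
        }
      } finally {
        context = saved
      }
    }

    // Text of the first child of the current context with the given tag,
    // or an empty string when there is none.
    def from(tag: String): String =
      context.children.find(_.tag == tag).map(_.text).getOrElse("")
  }
}
```

Restoring the saved context in a finally clause keeps later lookups on the page unaffected by whatever happened inside the loop.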
Releases
- 0.6.0 (2013.08.16)
  - Switched to Scala 2.10.2 for building (with 2.10 binary compatibility).
  - Added #download method.
- 0.5.0 (2012.06.24)
  - Bumping the version number to something that is completely unrelated to circupon.
  - During crawler construction, it is now possible to set whether CSS is supported (default is false) and whether JavaScript is supported (default is true).
- 0.3.3 (2012.06.15)
  - Added support for mouseOver in the DSL. Works just like click.
  - Added support for