Pupa.rb: A Data Scraping Framework

Pupa.rb is a Ruby 2.x fork of Python Pupa. It implements an Extract, Transform and Load (ETL) process to scrape data from online sources, transform it, and write it to a database.

gem install pupa

Usage

You can scrape any sort of data with Pupa.rb using your own models. You can also use Pupa.rb to scrape people, organizations, memberships and posts according to the Popolo open government data specification. This gem is up-to-date with Popolo's 2014-10-28 version.

The cat.rb example shows you how to:

The bill.rb example shows you how to:

The legislator.rb example shows you how to:

The organization.rb example shows you how to:

Scraping method selection

  1. For simple processing, your processor class need only define a single scrape_objects method, which will perform all scraping. See cat.rb for an example.

  2. If you scrape many types of data from the same source, you may want to split the scraping into separate tasks according to the type of data being scraped. See bill.rb for an example.

  3. You may want more control over the method used to perform a scraping task. For example, a legislature may publish legislators before 1997 in one format and legislators after 1997 in another format. In this case, you may want to select the method used to scrape legislators according to the year. See legislator.rb for an example.
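The third option amounts to choosing a method name at run time. The following is a minimal plain-Ruby sketch of that idea only: it does not load or call Pupa.rb, and the class name `LegislatorScraper`, the methods `method_for`, `run`, `scrape_people_old_format` and `scrape_people_new_format`, and the decision to treat 1997 itself as the newer format are all assumptions made for illustration. See legislator.rb for how the gem itself does this.

```ruby
# Hypothetical sketch: selects a scraping method according to the year.
# None of these names come from Pupa.rb.
class LegislatorScraper
  # Cutoff taken from the example in the text. Treating 1997 itself as
  # the newer format is an assumption.
  FORMAT_CHANGE_YEAR = 1997

  def initialize(year)
    @year = year
  end

  # Returns the name of the method that should perform the given task.
  # Only the :people task depends on the year; every other task maps to
  # a conventional "scrape_<task>" name.
  def method_for(task)
    return :"scrape_#{task}" unless task == :people

    @year < FORMAT_CHANGE_YEAR ? :scrape_people_old_format : :scrape_people_new_format
  end

  # Runs the task by calling whichever method was selected.
  def run(task)
    public_send(method_for(task))
  end

  def scrape_people_old_format
    "#{@year}: parsed with the pre-1997 format"
  end

  def scrape_people_new_format
    "#{@year}: parsed with the newer format"
  end
end

puts LegislatorScraper.new(1995).run(:people)
puts LegislatorScraper.new(2001).run(:people)
```

Keeping the selection in one small method means each format's parsing code stays separate, and adding a third format only changes `method_for`.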

Automatic response parsing

JSON parsing is enabled by default. To enable automatic parsing of HTML and XML, require the nokogiri and multi_xml gems.
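As a rough illustration of what "enabled by requiring a gem" can look like, here is a plain-Ruby sketch that parses a response body according to its content type and only attempts HTML or XML parsing when the corresponding library has already been loaded. The helper `parse_body` is hypothetical and is not part of Pupa.rb; how Pupa.rb wires its parsers internally is not shown here.

```ruby
require 'json'

# Hypothetical helper, not part of Pupa.rb: returns a parsed body when a
# suitable parser is available, otherwise returns the raw string.
def parse_body(body, content_type)
  case content_type
  when /\bjson\b/
    # JSON ships with Ruby, so it is always available.
    JSON.parse(body)
  when /\bhtml\b/
    # Parse HTML only if the caller has already required nokogiri.
    defined?(Nokogiri) ? Nokogiri::HTML(body) : body
  when /\bxml\b/
    # Parse XML only if the caller has already required multi_xml.
    defined?(MultiXml) ? MultiXml.parse(body) : body
  else
    body
  end
end

cat = parse_body('{"name": "Mittens"}', 'application/json; charset=utf-8')
puts cat['name']
```

With neither gem required, HTML and XML bodies pass through unchanged as strings; adding `require 'nokogiri'` or `require 'multi_xml'` before the call switches the corresponding branch on.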

Performance

Pupa.rb offers several ways to significantly improve performance. Read the documentation.

Integration with ODMs

Pupa::Model is incompatible with Mongoid::Document. Don't do this:

class Cat
  include Pupa::Model
  include Mongoid::Document
end

Instead, have a simple scraping model that includes Pupa::Model and an app model that includes Mongoid::Document with your app's business logic.

What it tries to solve

Pupa.rb's goal is to make scraping less painful by solving common problems:

In short, Pupa.rb lets you spend more time on the tasks that are unique to your use case, and less time on common tasks like caching, merging and storing data. It also provides helpful features like:

Pupa.rb is extensible, so that you can add your own models, parsers, helpers, actions, etc. It also offers several ways to improve your scraper's performance.

Python Pupa differences

Both Pupa.rb and Python Pupa implement models from the Popolo open government data specification, but Pupa.rb also lets you use your own classes. Pupa.rb stores data in either MongoDB (default) or PostgreSQL (experimental); Python Pupa stores data in PostgreSQL. The PostgreSQL schemas of Pupa.rb and Python Pupa differ.

Testing

DO NOT run this gem's specs if you are using Redis database number 15 on localhost!

Copyright (c) 2013 James McKinney, released under the MIT license