goscrape

goscrape is an extensible structured scraper for Go. What does "structured scraper" mean? In this case, it means that you define what you want to extract from a page in a structured, hierarchical manner, and goscrape then takes care of pagination, splitting the input page, and calling the code that extracts each chunk of data. goscrape is also extensible: you can customize nearly every step of this process.
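To make the "divide the page, then extract named pieces" idea concrete, here is a minimal, standard-library-only sketch of that flow. It is not goscrape's implementation or API: the names Piece, divideByBlankLine, fieldAfter and scrapePage are hypothetical, and the "page" is plain text split on blank lines rather than HTML split by a CSS selector.

```go
package main

import (
	"encoding/json"
	"os"
	"strings"
)

// Piece is a hypothetical stand-in for one named value to pull out of a block.
// It is NOT goscrape's scrape.Piece type; Extract works on plain text here.
type Piece struct {
	Name    string
	Extract func(block string) string
}

// divideByBlankLine is a hypothetical page divider: it splits the page
// wherever a blank line appears and drops empty blocks.
func divideByBlankLine(page string) []string {
	var blocks []string
	for _, b := range strings.Split(page, "\n\n") {
		if b = strings.TrimSpace(b); b != "" {
			blocks = append(blocks, b)
		}
	}
	return blocks
}

// fieldAfter returns an extractor that finds the first line starting with
// prefix and returns the rest of that line, trimmed. It returns "" when no
// line matches.
func fieldAfter(prefix string) func(string) string {
	return func(block string) string {
		for _, line := range strings.Split(block, "\n") {
			if strings.HasPrefix(line, prefix) {
				return strings.TrimSpace(strings.TrimPrefix(line, prefix))
			}
		}
		return ""
	}
}

// scrapePage runs every piece against every block and returns one map per block.
func scrapePage(page string, pieces []Piece) []map[string]string {
	results := []map[string]string{}
	for _, block := range divideByBlankLine(page) {
		row := make(map[string]string, len(pieces))
		for _, p := range pieces {
			row[p.Name] = p.Extract(block)
		}
		results = append(results, row)
	}
	return results
}

func main() {
	// Made-up sample input: two blocks separated by a blank line.
	page := "title: First post\nlink: /a\n\ntitle: Second post\nlink: /b\n"
	pieces := []Piece{
		{Name: "title", Extract: fieldAfter("title:")},
		{Name: "link", Extract: fieldAfter("link:")},
	}
	if err := json.NewEncoder(os.Stdout).Encode(scrapePage(page, pieces)); err != nil {
		os.Exit(1)
	}
}
```

Running the sketch prints a JSON array with one object per block. The point of the structure is that the divider and each extractor are independent values, so either can be swapped without touching the loop that ties them together.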

The architecture of goscrape is roughly as follows:

This all sounds rather complicated, but in practice it's quite simple. Here's a short example of how to get a list of all the latest news articles from Wired and dump them as JSON to the screen:

package main

import (
	"encoding/json"
	"fmt"
	"os"

	"github.com/andrew-d/goscrape"
	"github.com/andrew-d/goscrape/extract"
)

func main() {
	config := &scrape.ScrapeConfig{
		DividePage: scrape.DividePageBySelector("#latest-news li"),

		Pieces: []scrape.Piece{
			{Name: "title", Selector: "h5.exchange-sm", Extractor: extract.Text{}},
			{Name: "byline", Selector: "span.byline", Extractor: extract.Text{}},
			{Name: "link", Selector: "a", Extractor: extract.Attr{Attr: "href"}},
		},
	}

	scraper, err := scrape.New(config)
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error creating scraper: %s\n", err)
		os.Exit(1)
	}

	results, err := scraper.Scrape("http://www.wired.com")
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error scraping: %s\n", err)
		os.Exit(1)
	}

	json.NewEncoder(os.Stdout).Encode(results)
}

As you can see, the entire example, including error handling for creating the scraper and for the scrape itself, takes only 36 lines of code: short and sweet.

For more usage examples, see the examples directory.

Roadmap

Here's the rough roadmap of things that I'd like to add. If you have a feature request, please let me know by opening an issue!

License

MIT