Funkspector
Web page inspector for Elixir.
Funkspector is a web scraper that lets you extract data from web pages and XML or TXT sitemaps.
Usage
Resolving URLs
Pass Funkspector a URL to resolve and it will return the final URL after following any redirects:
iex> { :ok, final_url, _ } = Funkspector.resolve("http://github.com")
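When resolving URLs in bulk it can help to fall back to the input URL if resolution fails. The following is a minimal sketch, not part of Funkspector: `ResolveExample` and its `resolver` parameter are hypothetical, and the fallback clause assumes that any result other than the `:ok` tuple shown above means failure.

```elixir
defmodule ResolveExample do
  # Hypothetical helper, not part of Funkspector.
  # `resolver` is any 1-arity function returning the tuple shown above;
  # it defaults to Funkspector.resolve/1.
  def final_url(url, resolver \\ &Funkspector.resolve/1) do
    case resolver.(url) do
      {:ok, final_url, _rest} -> final_url
      # Assumption: anything else is a failure, so keep the input URL.
      _other -> url
    end
  end
end
```

Passing the resolver as an argument lets you exercise the helper with a stub function instead of a live HTTP request.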
Page Scraping
Simply pass Funkspector the URL of a web page to inspect and it will return its scraped data:
iex> { :ok, document } = Funkspector.page_scrape("https://rocketvalidator.com")
Sitemap Scraping
Funkspector can extract the locations from XML sitemaps, like this:
iex> { :ok, document } = Funkspector.sitemap_scrape("https://rocketvalidator.com/sitemap.xml")
It also supports TXT sitemaps:
iex> { :ok, document } = Funkspector.text_sitemap_scrape("https://rocketvalidator.com/sitemap.txt")
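Since XML and TXT sitemaps use different functions, one way to choose between them is by the file extension in the URL path. This is only a sketch: `SitemapKind` is a hypothetical module, and it assumes the extension reliably reflects the sitemap format, which may not hold for every site.

```elixir
defmodule SitemapKind do
  # Hypothetical helper: returns the name of the Funkspector function
  # that matches the sitemap URL, based only on the path extension.
  def scraper_for(url) do
    # URI.parse/1 returns a nil path for URLs such as "https://example.com".
    path = URI.parse(url).path || ""

    if String.ends_with?(path, ".txt") do
      :text_sitemap_scrape
    else
      :sitemap_scrape
    end
  end
end

# Usage, assuming network access:
# url = "https://rocketvalidator.com/sitemap.txt"
# apply(Funkspector, SitemapKind.scraper_for(url), [url])
```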
Custom options
You can pass in options to customize the timeout and User Agent string.
For example, you could use:
Funkspector.page_scrape("http://github.com", %{recv_timeout: 5_000, user_agent: "My Bot"})
Funkspector.sitemap_scrape("http://rocketvalidator.com/sitemap.xml", %{recv_timeout: 5_000, user_agent: "My Bot"})
Funkspector.text_sitemap_scrape("http://rocketvalidator.com/sitemap.txt", %{recv_timeout: 5_000, user_agent: "My Bot"})
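To avoid repeating the same options map in every call, you could keep shared defaults in one place and override them per call. A minimal sketch, where `ScrapeOptions` is a hypothetical module and the default values are simply the ones from the examples above:

```elixir
defmodule ScrapeOptions do
  # Hypothetical helper: merges per-call overrides into shared defaults.
  @defaults %{recv_timeout: 5_000, user_agent: "My Bot"}

  def build(overrides \\ %{}), do: Map.merge(@defaults, overrides)
end

# Usage, assuming network access:
# Funkspector.page_scrape("http://github.com", ScrapeOptions.build(%{recv_timeout: 10_000}))
```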
Loading a document's contents instead of requesting it
If you already have a document's contents, for example from a previous request or a cache, you can pass them in and Funkspector will skip the HTTP request:
Funkspector.page_scrape("https://example.com", contents: "<html>...</html>")
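A cache lookup could be wired in roughly as follows. This is a sketch under stated assumptions: `CachedScrape` is a hypothetical module, the cache is a plain map of URL to HTML string, and on a miss the helper returns `:miss` so the caller can make a regular request.

```elixir
defmodule CachedScrape do
  # Hypothetical helper, not part of Funkspector.
  # `cache` is a plain map of URL => HTML string.
  # `scraper` defaults to Funkspector.page_scrape/2.
  def scrape_cached(url, cache, scraper \\ &Funkspector.page_scrape/2) do
    case Map.fetch(cache, url) do
      # Cache hit: hand the stored contents over, as in the example above.
      {:ok, html} -> scraper.(url, contents: html)
      # Cache miss: let the caller decide how to fetch the page.
      :error -> :miss
    end
  end
end
```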
Scraped data
For a successful response you'll get a Funkspector.Document with the scraped data, which will depend on the kind of scraper used. All data will be found inside the :data attribute.
Error response
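In case of error, Funkspector will return the original_url and the reason for the failure:
case Funkspector.page_scrape("http://example.com") do
  { :ok, document } ->
    # The scraped data lives under the :data attribute of the document.
    IO.inspect(document.data)
  { :error, url, reason } ->
    # inspect/1 is used because the reason may not be a string.
    IO.puts "Could not scrape #{url} because of #{inspect(reason)}"
end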
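Because the success tuple has two elements and the error tuple has three, it can be convenient to normalize both into a uniform shape before using them in a `with` chain. A minimal sketch, where `ScrapeResult` is a hypothetical module and the tuple shapes are the ones described in this section:

```elixir
defmodule ScrapeResult do
  # Hypothetical helper: turns the result tuples into a uniform
  # {:ok, data} | {:error, message} shape.
  def normalize({:ok, %{data: data}}), do: {:ok, data}

  def normalize({:error, url, reason}) do
    # inspect/1 is used because the reason may not be a string.
    {:error, "Could not scrape #{url}: #{inspect(reason)}"}
  end
end

# Usage, assuming network access:
# "http://example.com" |> Funkspector.page_scrape() |> ScrapeResult.normalize()
```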
Installation
If available in Hex, the package can be installed as:
- Add funkspector to your list of dependencies in mix.exs:
def deps do
  [{:funkspector, "~> 0.10"}]
end
- Ensure funkspector is started before your application:
def application do
  [applications: [:funkspector]]
end