html2rss is a Ruby gem that generates RSS 2.0 feeds from websites, either automatically or, as a fallback, from a feed config.

With the feed config, you provide a URL to scrape and CSS selectors for extracting information (like title, URL, etc.). The gem builds the RSS feed accordingly. Extractors and chainable post processors make information extraction, processing, and sanitizing a breeze. The gem also supports scraping JSON responses and setting HTTP request headers.

Looking for a ready-to-use app to serve generated feeds via HTTP? Check out html2rss-web!

Support the development by sponsoring this project on GitHub. Thank you! 💓

Generating a feed on the CLI

Install Ruby (latest version is recommended) on your machine and run gem install html2rss in your terminal.

After the installation has finished, html2rss help will print usage information.

using automatic generation

html2rss offers an automatic RSS generation feature. Try it with:

html2rss auto https://unmatchedstyle.com/

creating a feed config file and using it

If the results are not to your satisfaction, you can create a feed config file.

Create a file called my_config_file.yml with this sample content:

channel:
  url: https://unmatchedstyle.com
selectors:
  items:
    selector: "article[id^='post-']"
  title:
    selector: h2
  link:
    selector: a
    extractor: href
  description:
    selector: ".post-content"
    post_process:
      - name: sanitize_html

Build the feed from this config with: html2rss feed ./my_config_file.yml.

Generating a feed with Ruby

You can also install it as a dependency in your Ruby project:

  1. Add this line to your Gemfile: gem 'html2rss'
  2. Then execute: bundle
  3. In your code: require 'html2rss'

Here's a minimal working example using Ruby:

require 'html2rss'

rss =
  Html2rss.feed(
    channel: { url: 'https://stackoverflow.com/questions' },
    selectors: {
      items: { selector: '#hot-network-questions > ul > li' },
      title: { selector: 'a' },
      link: { selector: 'a', extractor: 'href' }
    }
  )

puts rss

The feed config and its options

A feed config consists of a channel and a selectors hash. The contents of both hashes are explained below.


The channel

| attribute | required? | type | default | remark |
| --- | --- | --- | --- | --- |
| url | required | String | | |
| title | optional | String | auto-generated | |
| description | optional | String | auto-generated | |
| ttl | optional | Integer | 360 | TTL in minutes |
| time_zone | optional | String | 'UTC' | TimeZone name |
| language | optional | String | 'en' | Language code |
| author | optional | String | | Format: email (Name) |
| headers | optional | Hash | {} | Set HTTP request headers. See notes below. |
| json | optional | Boolean | false | Handle JSON response. See notes below. |

Dynamic parameters in channel attributes

Sometimes there are structurally similar pages with different URLs. In such cases, you can add dynamic parameters to the channel's attributes.

Example of a dynamic id parameter in the channel URLs:

channel:
  url: "http://domainname.tld/whatever/%<id>s.html"

Command line usage example:

bundle exec html2rss feed the_feed_config.yml id=42
<details><summary>See a Ruby example</summary>
config = Html2rss::Config.new({ channel: { url: 'http://domainname.tld/whatever/%<id>s.html' } }, {}, { id: 42 })
Html2rss.feed(config)
</details>

See the more complex formatting options of the sprintf method.
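
The `%<id>s` placeholder uses the named-reference syntax of Ruby's `Kernel#format` (an alias of `sprintf`). The following is a minimal sketch of the substitution, assuming html2rss expands the URL in this way. `expand_url` is a hypothetical helper for illustration and not part of the gem.

```ruby
# Hypothetical helper (not part of html2rss): fills named references
# in a URL template. Keys of `params` must be Symbols to match %<name>.
def expand_url(template, params)
  format(template, params)
end

puts expand_url('http://domainname.tld/whatever/%<id>s.html', { id: 42 })
# => http://domainname.tld/whatever/42.html

# Width and padding flags work as in sprintf, e.g. zero-padding to 5 digits:
puts expand_url('http://domainname.tld/whatever/%<id>05d.html', { id: 42 })
# => http://domainname.tld/whatever/00042.html
```

A missing key raises a `KeyError`, so a typo in a parameter name surfaces early instead of producing a broken URL.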

The selectors

First, you must give an items selector hash, which contains a CSS selector. The selector selects a collection of HTML tags from which the RSS feed items are built. Except for the items selector, all other keys are scoped to each item of the collection.

To build a valid RSS 2.0 item, you need at least a title or a description. You can have both.

Having an items and a title selector is enough to build a simple feed.

Your selectors hash can contain arbitrary named selectors, but only a few will make it into the RSS feed (due to the RSS 2.0 specification):

| RSS 2.0 tag | name in html2rss | remark |
| --- | --- | --- |
| title | title | |
| description | description | Supports HTML. |
| link | link | A URL. |
| author | author | |
| category | categories | See notes below. |
| guid | guid | Default title/description. See notes below. |
| enclosure | enclosure | See notes below. |
| pubDate | updated | An instance of Time. |
| comments | comments | A URL. |
| source | source | Not yet supported. |

Build RSS 2.0 item attributes by specifying selectors

Every named selector (i.e. title, description, see table above) in your selectors hash can have these attributes:

| name | value |
| --- | --- |
| selector | The CSS selector to select the tag with the information. |
| extractor | Name of the extractor. See notes below. |
| post_process | A hash or array of hashes. See notes below. |

Using extractors

Extractors help with extracting the information from the selected HTML tag.

Extractors might need extra attributes on the selector hash. 👉 Read their docs for usage examples.

<details><summary>See a Ruby example</summary>
Html2rss.feed(
  channel: {}, selectors: { link: { selector: 'a', extractor: 'href' } }
)
</details> <details><summary>See a YAML feed config example</summary>
channel:
  # ... omitted
selectors:
  # ... omitted
  link:
    selector: "a"
    extractor: "href"
</details>

Using post processors

Extracted information can be further manipulated with post processors.

| name | description |
| --- | --- |
| gsub | Allows global substitution operations on Strings (Regexp or simple pattern). |
| html_to_markdown | Converts HTML to Markdown, using reverse_markdown. |
| markdown_to_html | Converts Markdown to HTML, using kramdown. |
| parse_time | Parses a String containing a time in a time zone. |
| parse_uri | Parses a String as URL. |
| sanitize_html | Strips unsafe and unneeded HTML and adds security-related attributes. |
| substring | Cuts a part off of a String, starting at a position. |
| template | Creates a new String from a template, filled with the values of other selectors. |

⚠️ Always make use of the sanitize_html post processor for HTML content. Never trust the internet! ⚠️

Chaining post processors

Pass an array to post_process to chain the post processors.

<details><summary>YAML example: build the description from a template String (in Markdown) and convert that Markdown to HTML</summary>
channel:
  # ... omitted
selectors:
  # ... omitted
  price:
    selector: '.price'
  description:
    selector: '.section'
    post_process:
      - name: template
        string: |
          # %{self}

          Price: %{price}
      - name: markdown_to_html
</details>
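
The `%{self}` and `%{price}` placeholders in the template above look like the named references of Ruby's `Kernel#format`. Here is a rough sketch of the template step under that assumption. `render_template` is a hypothetical helper, and the sample values are made up.

```ruby
# Hypothetical helper mimicking the `template` step. `values` maps selector
# names (plus :self for the current selector's own value) to extracted Strings.
def render_template(string, values)
  format(string, values)
end

markdown = render_template(
  "# %{self}\n\nPrice: %{price}",
  { self: 'Espresso machine', price: '199 EUR' }
)
puts markdown
```

In the chained config, the resulting Markdown String would then be handed to the next post processor, markdown_to_html.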
Post processor gsub

The post processor gsub makes use of Ruby's gsub method.

| key | type | required | note |
| --- | --- | --- | --- |
| pattern | String | yes | Can be Regexp or String. |
| replacement | String | yes | Can be a backreference. |
<details><summary>See a Ruby example</summary>
Html2rss.feed(
  channel: {},
  selectors: {
    title: { selector: 'a', post_process: [{ name: 'gsub', pattern: 'foo', replacement: 'bar' }] }
  }
)
</details> <details><summary>See a YAML feed config example</summary>
channel:
  # ... omitted
selectors:
  # ... omitted
  title:
    selector: "a"
    post_process:
      - name: "gsub"
        pattern: "foo"
        replacement: "bar"
</details>
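
Because this post processor builds on Ruby's `String#gsub`, its effect can be previewed in plain Ruby. The sketch below uses a hypothetical `apply_gsub` stand-in. How html2rss converts a pattern String from the config into a Regexp is not covered here.

```ruby
# Hypothetical stand-in for the gsub post processor.
def apply_gsub(value, pattern, replacement)
  value.gsub(pattern, replacement)
end

# Simple String pattern: every occurrence is replaced.
puts apply_gsub('foo and foo', 'foo', 'bar')           # => bar and bar

# Regexp pattern: strip a leading "1. " style counter.
puts apply_gsub('1. The Matrix', /^(\d+\.)\s/, '')     # => The Matrix

# Backreferences in the replacement refer to capture groups.
puts apply_gsub('Doe, Jane', /(\w+), (\w+)/, '\2 \1')  # => Jane Doe
```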

Adding <category> tags to an item

The categories selector takes an array of selector names. Each value of those selectors will become a <category> on the RSS item.

<details> <summary>See a Ruby example</summary>
Html2rss.feed(
  channel: {},
  selectors: {
    genre: {
      # ... omitted
      selector: '.genre'
    },
    branch: { selector: '.branch' },
    categories: %i[genre branch]
  }
)
</details> <details> <summary>See a YAML feed config example</summary>
channel:
  # ... omitted
selectors:
  # ... omitted
  genre:
    selector: ".genre"
  branch:
    selector: ".branch"
  categories:
    - genre
    - branch
</details>

Custom item GUID

By default, html2rss generates a GUID from the title or description.

If this does not work well, you can choose other attributes from which the GUID is built. The principle is the same as for the categories: pass an array of selector names.

In all cases, the GUID is a SHA1-encoded string.

<details><summary>See a Ruby example</summary>
Html2rss.feed(
  channel: {},
  selectors: {
    title: {
      # ... omitted
      selector: 'h1'
    },
    link: { selector: 'a', extractor: 'href' },
    guid: %i[link]
  }
)
</details> <details><summary>See a YAML feed config example</summary>
channel:
  # ... omitted
selectors:
  # ... omitted
  title:
    selector: "h1"
  link:
    selector: "a"
    extractor: "href"
  guid:
    - link
</details>
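
To illustrate why the choice of GUID attributes matters, here is a minimal sketch of deriving a SHA1 string from selected item values. It assumes the values are simply concatenated before hashing, which may differ from what html2rss does internally. `guid_for` is a hypothetical helper.

```ruby
require 'digest/sha1'

# Hypothetical helper: derives a GUID from the item values named in `keys`.
# Assumption: the values are concatenated as-is before hashing.
def guid_for(item, keys)
  Digest::SHA1.hexdigest(keys.map { |key| item.fetch(key).to_s }.join)
end

item = { title: 'Hello world', link: 'https://example.com/posts/1' }

puts guid_for(item, %i[link])        # 40 hex characters, stable across runs
puts guid_for(item, %i[title link])  # different input, different GUID
```

A GUID built from the link stays the same when a title is edited later, so feed readers do not show the item as new again.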

Adding an <enclosure> tag to an item

An enclosure can be any file, e.g. an image, audio, or video (think podcast).

The enclosure selector needs to return a URL of the content to enclose. If the extracted URL is relative, it will be converted to an absolute one using the channel's URL as base.

Since html2rss does no further inspection of the enclosure, its support comes with trade-offs:

  1. The content-type is guessed from the file extension of the URL.
  2. If the content-type guessing fails, it will default to application/octet-stream.
  3. The content-length will always be undetermined and therefore stated as 0 bytes.

Read the RSS 2.0 spec for further information on enclosing content.
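
The trade-offs listed above can be sketched in a few lines of Ruby. This is an illustration only: `build_enclosure` and the small `CONTENT_TYPES` table are hypothetical, and html2rss may use a far more complete MIME type lookup.

```ruby
require 'uri'

# Hypothetical, deliberately tiny extension-to-MIME-type table.
CONTENT_TYPES = {
  '.mp3' => 'audio/mpeg',
  '.mp4' => 'video/mp4',
  '.jpg' => 'image/jpeg',
  '.png' => 'image/png'
}.freeze

# Hypothetical helper: resolves the extracted URL against the channel URL,
# guesses the content type from the file extension, and reports length 0.
def build_enclosure(extracted_url, channel_url)
  url = URI.join(channel_url, extracted_url)
  extension = File.extname(url.path).downcase
  type = CONTENT_TYPES.fetch(extension, 'application/octet-stream')
  { url: url.to_s, type: type, length: 0 }
end

p build_enclosure('/media/episode1.mp3', 'https://example.com/podcast')
p build_enclosure('https://cdn.example.com/download?id=7', 'https://example.com')
```

The second call has no file extension in its path, so the guess falls back to application/octet-stream.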

<details> <summary>See a Ruby example</summary>
Html2rss.feed(
  channel: {},
  selectors: {
    enclosure: { selector: 'audio', extractor: 'attribute', attribute: 'src' }
  }
)
</details> <details> <summary>See a YAML feed config example</summary>
channel:
  # ... omitted
selectors:
  # ... omitted
  enclosure:
    selector: "audio"
    extractor: "attribute"
    attribute: "src"
</details>

Scraping and handling JSON responses

By default, html2rss assumes the URL responds with HTML. However, it can also handle JSON responses. The JSON must return an Array or Hash.

| key | required | default | note |
| --- | --- | --- | --- |
| json | optional | false | If set to true, the response is parsed as JSON. |
| jsonpath | optional | $ | Use JSONPath syntax to select nodes of interest. |
<details><summary>See a Ruby example</summary>
Html2rss.feed(
  channel: { url: 'http://domainname.tld/whatever.json', json: true },
  selectors: { title: { selector: 'foo' } }
)
</details> <details><summary>See a YAML feed config example</summary>
channel:
  url: "http://domainname.tld/whatever.json"
  json: true
selectors:
  title:
    selector: "foo"
</details>

Customization of how requests to the channel URL are sent

By default, html2rss issues a naive HTTP request and extracts information from the response. That is fast and works for many websites.

However, modern websites often do not render much HTML on the server, but evaluate JavaScript on the client to create the HTML. In such cases, the default strategy will not find the "juicy content".

Use Browserless.io

You can use Browserless.io to run a Chrome browser and return the website's source code after the website generated it. For this, you can either run your own Browserless.io instance (Docker image available -- read their license!) or pay them for a hosted instance.

To run a local Browserless.io instance, you can use the following Docker command:

docker run \
  --rm \
  -p 3000:3000 \
  -e "CONCURRENT=10" \
  -e "TOKEN=6R0W53R135510" \
  ghcr.io/browserless/chromium

To make html2rss use your instance,

  1. specify the environment variables accordingly, and
  2. use the browserless strategy for those websites.

When running locally with the Docker command above, you can skip setting the environment variables, because the values shown match the defaults.

BROWSERLESS_IO_WEBSOCKET_URL="ws://127.0.0.1:3000" BROWSERLESS_IO_API_TOKEN="6R0W53R135510" \
  html2rss auto --strategy=browserless https://example.com

When using traditional feed configs, inside your channel config set strategy: browserless.

<details><summary>See a YAML feed config example</summary>
channel:
  url: https://www.imdb.com/user/ur67728460/ratings
  time_zone: UTC
  ttl: 1440
  strategy: browserless
  headers:
    User-Agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
selectors:
  items:
    selector: "li.ipc-metadata-list-summary-item"
  title:
    selector: ".ipc-title__text"
    post_process:
      - name: gsub
        pattern: "/^(\\d+.)\\s/"
        replacement: ""
      - name: template
        string: "%{self} rated with: %{user_rating}"
  link:
    selector: "a.ipc-title-link-wrapper"
    extractor: "href"
  user_rating:
    selector: "[data-testid='ratingGroup--other-user-rating'] > .ipc-rating-star--rating"
</details>

Set any HTTP header in the request

To set HTTP request headers, you can add them to the channel's headers hash. This is useful for APIs that require an Authorization header.

channel:
  url: "https://example.com/api/resource"
  headers:
    Authorization: "Bearer YOUR_TOKEN"
selectors:
  # ... omitted

Or for setting a User-Agent:

channel:
  url: "https://example.com"
  headers:
    User-Agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
selectors:
  # ... omitted
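
Conceptually, each entry of the headers hash becomes one header on the outgoing request. The sketch below shows that idea with Ruby's standard library `Net::HTTP`, which is not necessarily the HTTP client html2rss uses. `build_request` is a hypothetical helper, and no request is actually sent.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds (but does not send) a GET request that
# carries the given headers, similar to a channel's `headers` hash.
def build_request(url, headers)
  Net::HTTP::Get.new(URI(url), headers)
end

request = build_request(
  'https://example.com/api/resource',
  { 'Authorization' => 'Bearer YOUR_TOKEN' }
)

puts request['Authorization'] # header lookup is case-insensitive
```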

Usage with a YAML config file

This step is not required to work with this gem. If you're using html2rss-web and want to create your private feed configs, keep on reading!

First, create a YAML file, e.g. feeds.yml. This file will contain your global config and multiple feed configs under the key feeds.

Example:

headers:
  "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1"
feeds:
  myfeed:
    channel:
    selectors:
  myotherfeed:
    channel:
    selectors:

Your feed configs go below feeds. Everything else is part of the global config.
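
The split between global config and feed configs can be pictured with a short sketch based on Ruby's standard YAML library. `split_feeds_config` and the sample YAML are hypothetical and only mirror the file layout described above, not the gem's internals.

```ruby
require 'yaml'

# Made-up sample file content following the layout described above.
EXAMPLE_FEEDS_YML = <<~YAML
  headers:
    "User-Agent": "ExampleBot/1.0"
  feeds:
    myfeed:
      channel:
        url: https://example.com
      selectors:
        items:
          selector: article
YAML

# Hypothetical helper: returns [global_config, feed_config] for one feed name.
# Everything outside the `feeds` key counts as global config.
def split_feeds_config(yaml_string, feed_name)
  config = YAML.safe_load(yaml_string)
  global_config = config.reject { |key, _value| key == 'feeds' }
  [global_config, config.fetch('feeds').fetch(feed_name)]
end

global_config, feed_config = split_feeds_config(EXAMPLE_FEEDS_YML, 'myfeed')
puts global_config.inspect
puts feed_config.dig('channel', 'url')
```

Using `fetch` means an unknown feed name raises a `KeyError` instead of silently returning nil.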

Find a full example of a feeds.yml at spec/fixtures/feeds.test.yml.

Now you can build your feeds like this:

<details> <summary>Build feeds in Ruby</summary>
require 'html2rss'

myfeed = Html2rss.feed_from_yaml_config('feeds.yml', 'myfeed')
myotherfeed = Html2rss.feed_from_yaml_config('feeds.yml', 'myotherfeed')
</details> <details> <summary>Build feeds on the command line</summary>
html2rss feed feeds.yml myfeed
html2rss feed feeds.yml myotherfeed
</details>

Display the RSS feed nicely in a web browser

To display RSS feeds nicely in a web browser, you can attach a CSS stylesheet or an XSLT stylesheet to the feed.

A web browser will apply these stylesheets and show the contents as described.

In a CSS stylesheet, you'd use element selectors to apply styles.

If you want to do more, you need to create an XSLT stylesheet. XSLT lets you use an HTML template and freely design how the RSS content is presented, including JavaScript and external resources.

You can add as many stylesheets and types as you like. Just add them to your global configuration.

<details> <summary>Ruby: a stylesheet config example</summary>
config = Html2rss::Config.new(
  { channel: {}, selectors: {} }, # omitted
  {
    stylesheets: [
      {
        href: '/relative/base/path/to/style.xsl',
        media: :all,
        type: 'text/xsl'
      },
      {
        href: 'http://example.com/rss.css',
        media: :all,
        type: 'text/css'
      }
    ]
  }
)

Html2rss.feed(config)
</details> <details> <summary>YAML: a stylesheet config example</summary>
stylesheets:
  - href: "/relative/base/path/to/style.xsl"
    media: "all"
    type: "text/xsl"
  - href: "http://example.com/rss.css"
    media: "all"
    type: "text/css"
feeds:
  # ... omitted
</details>
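
In XML documents, stylesheets are usually attached through `xml-stylesheet` processing instructions placed before the root element. The sketch below shows what such an instruction looks like for one stylesheet entry. It assumes the generated feed carries an equivalent instruction. `stylesheet_instruction` is a hypothetical helper, and the paths are examples.

```ruby
# Hypothetical helper: renders one stylesheet entry as an xml-stylesheet
# processing instruction. String#encode(xml: :attr) escapes and quotes values.
def stylesheet_instruction(href:, type:, media: :all)
  attributes = { href: href, type: type, media: media }
               .map { |name, value| "#{name}=#{value.to_s.encode(xml: :attr)}" }
               .join(' ')
  "<?xml-stylesheet #{attributes}?>"
end

puts stylesheet_instruction(href: '/rss.xsl', type: 'text/xsl', media: :all)
puts stylesheet_instruction(href: 'http://example.com/rss.css', type: 'text/css')
```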

Contributing

Find ideas for what to contribute in:

  1. https://github.com/orgs/html2rss/discussions
  2. the issues tracker: https://github.com/html2rss/html2rss/issues

To submit changes:

  1. Fork this repo ( https://github.com/html2rss/html2rss/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Implement and commit your changes (git commit -am 'feat: add XYZ')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request using the GitHub web UI

Development Helpers

  1. bin/setup: installs dependencies and sets up the development environment.
  2. bin/guard: automatically runs rspec, rubocop and reek when a file changes.
  3. For a modern Ruby development experience, install ruby-lsp and integrate it into your IDE, e.g. Ruby in Visual Studio Code.