RSS River for Elasticsearch (PROJECT STOPPED)

Welcome to the RSS River Plugin for Elasticsearch

To install the plugin, run:

bin/plugin -install fr.pilato.elasticsearch.river/rssriver/1.3.0

You need to install a version matching your Elasticsearch version:

Elasticsearch	RSS River Plugin	Docs
master	Build from source	See below
es-1.x	Build from source	1.5.0-SNAPSHOT
es-1.4	Build from source	1.4.0-SNAPSHOT
es-1.3	1.3.0	1.3.0
es-1.1	1.1.0	1.1.0
es-1.0	1.0.0	1.0.0
es-0.90	0.3.0	0.3.0

To use a SNAPSHOT version, build it with Maven and install the resulting archive:

mvn clean install
plugin --install rssriver \
       --url file:target/releases/rssriver-X.X.X-SNAPSHOT.zip

Build Status

Thanks to CloudBees for the build status.

Getting Started

Creating an RSS river

First, we create an index to store all the feed documents:

$ curl -XPUT 'localhost:9200/lemonde/' -d '{}'

We create the river with the following properties:

Feed URL: http://www.lemonde.fr/rss/une.xml

$ curl -XPUT 'localhost:9200/_river/lemonde/_meta' -d '{
  "type": "rss",
  "rss": {
    "feeds" : [ {
        "name": "lemonde",
        "url": "http://www.lemonde.fr/rss/une.xml"
      }
    ]
  }
}'

This RSS feed follows the RSS 2.0 specification and provides a ttl entry, so the update rate is automatically adjusted to that value.

To set your own refresh rate for feeds that do not provide a ttl, use the update_rate option. To force that rate even when a ttl is provided, also set ignore_ttl to true.

For example, we create the river with the following properties:

Feed URL: http://www.lemonde.fr/rss/une.xml
Update Rate: every 15 minutes
Ignore TTL: true

$ curl -XPUT 'localhost:9200/_river/lemonde/_meta' -d '{
  "type": "rss",
  "rss": {
    "feeds" : [ {
        "name": "lemonde",
        "url": "http://www.lemonde.fr/rss/une.xml",
        "update_rate": "15m",
        "ignore_ttl": true
      }
    ]
  }
}'
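The way ttl, update_rate and ignore_ttl interact can be summarized in a few lines. The following Python sketch is illustrative only and is not the plugin's code: the function name effective_refresh_minutes is hypothetical, and it assumes the feed's ttl is expressed in minutes, as the RSS 2.0 specification defines it.

```python
from typing import Optional


def effective_refresh_minutes(update_rate_minutes: int,
                              feed_ttl_minutes: Optional[int],
                              ignore_ttl: bool = False) -> int:
    """Return the refresh interval, in minutes, as described above.

    Hypothetical helper, not part of the plugin:
    - the feed's ttl wins when it is present and ignore_ttl is false;
    - otherwise the configured update_rate is used.
    """
    if feed_ttl_minutes is not None and not ignore_ttl:
        return feed_ttl_minutes
    return update_rate_minutes


# Feed provides ttl=20 and ignore_ttl is false: the ttl is used.
print(effective_refresh_minutes(15, 20))
# Same feed with ignore_ttl set: the configured 15 minutes is forced.
print(effective_refresh_minutes(15, 20, ignore_ttl=True))
# Feed without a ttl: the configured rate applies.
print(effective_refresh_minutes(30, None))
```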

To fetch multiple feeds, add each of them to the feeds array:

Feed1

URL: http://www.lemonde.fr/rss/une.xml
Update Rate: every 15 minutes (will be overridden by the TTL provided by the feed)

Feed2

URL: http://rss.lefigaro.fr/lefigaro/laune
Update Rate: every 30 minutes
Ignore TTL: true

$ curl -XPUT 'localhost:9200/actus/' -d '{}'

$ curl -XPUT 'localhost:9200/_river/actus/_meta' -d '{
  "type": "rss",
  "rss": {
    "feeds" : [ {
        "name": "lemonde",
        "url": "http://www.lemonde.fr/rss/une.xml",
        "update_rate": "15m"
      }, {
        "name": "lefigaro",
        "url": "http://rss.lefigaro.fr/lefigaro/laune",
        "update_rate": "30m",
        "ignore_ttl": true
      }
    ]
  }
}'
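When the list of feeds grows, it can be easier to generate the river settings than to write them by hand. The Python sketch below builds the same JSON body as the curl command above. The helper name build_river_meta is hypothetical, and the script only prints the body rather than sending it to Elasticsearch.

```python
import json


def build_river_meta(feeds):
    """Build the body for PUT /_river/<name>/_meta from a list of feed dicts.

    Hypothetical helper: each feed dict uses the option names shown above
    (name, url, update_rate, ignore_ttl). Options that are left out are
    omitted from the output, so the river falls back to its defaults.
    """
    allowed = ("name", "url", "update_rate", "ignore_ttl")
    cleaned = []
    for feed in feeds:
        if "url" not in feed:
            raise ValueError("each feed needs a url")
        cleaned.append({key: feed[key] for key in allowed if key in feed})
    return {"type": "rss", "rss": {"feeds": cleaned}}


meta = build_river_meta([
    {"name": "lemonde", "url": "http://www.lemonde.fr/rss/une.xml",
     "update_rate": "15m"},
    {"name": "lefigaro", "url": "http://rss.lefigaro.fr/lefigaro/laune",
     "update_rate": "30m", "ignore_ttl": True},
])
print(json.dumps(meta, indent=2))
```

The printed JSON can be passed as the -d argument of the curl -XPUT 'localhost:9200/_river/actus/_meta' command.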

Indexing raw encoded content

By default, any encoded content provided in content:encoded is indexed under a raw.TYPE field, where TYPE depends on the type of the encoded content (for example, html).

To disable this, for example to save disk space, set the raw setting to false:

$ curl -XPUT 'localhost:9200/_river/actus/_meta' -d '{
  "type": "rss",
  "rss": {
    "raw" : false,
    "feeds" : [ {
        "url": "http://www.lemonde.fr/rss/une.xml"
      }
    ]
  }
}'

Working with mappings

If you don't define an explicit mapping before starting the RSS river, the following default mapping is created:

{
  "page" : {
    "properties" : {
      "feedname" : {
        "type" : "string",
        "index" : "not_analyzed"
      },
      "title" : {
        "type" : "string"
      },
      "author" : {
        "type" : "string"
      },
      "description" : {
        "type" : "string"
      },
      "link" : {
        "type" : "string",
        "index" : "no"
      },
      "source" : {
        "type" : "string"
      },
      "publishedDate" : {
        "type" : "date",
        "format" : "dateOptionalTime",
        "store" : "yes"
      },
      "location" : {
        "type" : "geo_point"
      },
      "categories" : {
        "type" : "string",
        "index" : "not_analyzed"
      },
      "enclosures" : {
        "properties" : {
          "url" : {
            "type" : "string",
            "index" : "no"
          },
          "type" : {
            "type" : "string",
            "index" : "not_analyzed"
          },
          "length" : {
            "type" : "long",
            "index" : "no"
          }
        }
      },
      "medias" : {
        "properties" : {
          "type" : {
            "type" : "string",
            "index" : "not_analyzed"
          },
          "reference" : {
            "type" : "string",
            "index" : "no"
          },
          "language" : {
            "type" : "string",
            "index" : "not_analyzed"
          },
          "title" : {
            "type" : "string"
          },
          "description" : {
            "type" : "string"
          },
          "duration" : {
            "type" : "long",
            "index" : "no"
          },
          "width" : {
            "type" : "long",
            "index" : "no"
          },
          "height" : {
            "type" : "long",
            "index" : "no"
          }
        }
      },
      "raw" : {
        "properties" : {
          "html" : {
            "type" : "string",
            "index" : "no"
          }
        }
      }
    }
  }
}

If you want to define your own mapping, push it to Elasticsearch before the RSS river starts:

$ curl -XPUT 'http://localhost:9200/lefigaro/' -d '{}'

$ curl -XPUT 'http://localhost:9200/lefigaro/page/_mapping' -d '{
    "page" : {
        "properties" : {
            "feedname" : {"type" : "string"},
            "title" : {"type" : "string", "analyzer" : "french"},
            "description" : {"type" : "string", "analyzer" : "french"},
            "author" : {"type" : "string"},
            "link" : {"type" : "string"}
        }
    }
}'

Your feed will then use this mapping when you create the river:

$ curl -XPUT 'localhost:9200/_river/lefigaro/_meta' -d '{
  "type": "rss",
  "rss": {
    "feeds" : [ {
        "url": "http://rss.lefigaro.fr/lefigaro/laune"
      }
    ]
  }
}'

Bulk settings

By default, documents are sent in bulk requests every 25 feed documents or every 5 seconds, into an index named after the river, under a page type. You can change those settings when creating the river:

$ curl -XPUT 'localhost:9200/_river/lemonde/_meta' -d '{
  "type": "rss",
  "rss": {
    "feeds" : [ {
        "name": "lemonde",
        "url": "http://www.lemonde.fr/rss/une.xml"
      }
    ]
  },
  "index": {
    "index": "myindexname",
    "type": "mycontent",
    "bulk_size": 100,
    "flush_interval": "30s"
  }
}'
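The two thresholds can be pictured as a buffer that is flushed when either limit is reached. This is a minimal Python sketch of that policy, not the river's implementation: the BulkBuffer class and its flush callback are hypothetical, and for simplicity the elapsed time is only checked when a document is added rather than by a background timer.

```python
import time


class BulkBuffer:
    """Collect documents and flush them by count or by elapsed time."""

    def __init__(self, flush, bulk_size=25, flush_interval=5.0,
                 clock=time.monotonic):
        self._flush = flush                    # callback receiving a list of docs
        self._bulk_size = bulk_size            # plays the role of "bulk_size"
        self._flush_interval = flush_interval  # seconds, like "flush_interval"
        self._clock = clock                    # injectable for testing
        self._docs = []
        self._last_flush = clock()

    def add(self, doc):
        """Buffer one document and flush if either threshold is reached."""
        self._docs.append(doc)
        too_many = len(self._docs) >= self._bulk_size
        too_old = self._clock() - self._last_flush >= self._flush_interval
        if too_many or too_old:
            self.flush()

    def flush(self):
        """Hand the buffered documents to the callback and reset the timer."""
        if self._docs:
            self._flush(self._docs)
            self._docs = []
        self._last_flush = self._clock()


batches = []
bulk = BulkBuffer(batches.append, bulk_size=3, flush_interval=30.0)
for i in range(7):
    bulk.add({"title": f"item {i}"})
bulk.flush()  # send whatever is left in the buffer
print([len(batch) for batch in batches])  # prints [3, 3, 1]
```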

Behind the scenes

The RSS river downloads each RSS feed at the interval set by update_rate and checks whether there are new messages.

First, the RSS river looks at the <channel> tag. It reads the optional <pubDate> tag and stores it in Elasticsearch so that it can be compared on the next run.
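The channel-level check described above can be sketched as follows. This is only an illustration under stated assumptions: the function channel_has_changed is hypothetical, the previously stored date is passed in as a plain value rather than read from Elasticsearch, and a channel without a <pubDate> is always treated as changed.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional
import xml.etree.ElementTree as ET


def channel_has_changed(rss_xml: str, last_seen: Optional[datetime]) -> bool:
    """Compare the channel's <pubDate> with the date stored on the last run."""
    channel = ET.fromstring(rss_xml).find("channel")
    pub_date = channel.findtext("pubDate") if channel is not None else None
    if pub_date is None or last_seen is None:
        return True  # nothing to compare against: process the feed
    # RSS 2.0 dates use the RFC 822 format, which email.utils can parse.
    return parsedate_to_datetime(pub_date) > last_seen


feed = """<rss version="2.0"><channel>
  <title>Example</title>
  <pubDate>Tue, 10 Jun 2014 09:41:01 GMT</pubDate>
</channel></rss>"""

stored = parsedate_to_datetime("Tue, 10 Jun 2014 04:00:00 GMT")
print(channel_has_changed(feed, stored))  # prints True
```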

Then, for each <item> tag, the RSS river creates a new document with the following properties:

XML Path	ES Mapping
`/title`	title
`/description`	description
`/content:encoded`	raw.html
`/author`	author
`/link`	link
`/category`	categories
`/geo:lat` `/geo:long`	location
`/enclosure[@url]`	enclosures.url
`/enclosure[@type]`	enclosures.type
`/enclosure[@length]`	enclosures.length
`/media:content[@height]`	medias.height
`/media:content[@width]`	medias.width
`/media:content[@url]`	medias.reference
`/media:content[@type]`	medias.type
`/media:content[@duration]`	medias.duration
`/media:content[@lang]`	medias.language
`/media:description`	medias.description
`/media:title`	medias.title
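To make the table concrete, here is a small Python sketch that turns one <item> into a document with some of the fields listed above. It is an illustration, not the river's parser: item_to_document is a hypothetical name, only a subset of the fields is handled, categories are stored under the field name used in the default mapping, and the namespace URIs are the ones commonly used for the content and geo modules.

```python
import xml.etree.ElementTree as ET

# Commonly used namespace URIs for two of the prefixes that appear in the table.
NS = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
}


def item_to_document(item: ET.Element) -> dict:
    """Map a subset of an RSS <item> to the document fields in the table."""
    doc = {
        "title": item.findtext("title"),
        "description": item.findtext("description"),
        "author": item.findtext("author"),
        "link": item.findtext("link"),
        "categories": [c.text for c in item.findall("category")],
    }
    encoded = item.findtext("content:encoded", namespaces=NS)
    if encoded is not None:
        doc["raw"] = {"html": encoded}
    lat = item.findtext("geo:lat", namespaces=NS)
    lon = item.findtext("geo:long", namespaces=NS)
    if lat is not None and lon is not None:
        doc["location"] = {"lat": float(lat), "lon": float(lon)}
    return doc


rss = """<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
  <channel>
    <item>
      <title>Hello</title>
      <description>A short summary</description>
      <link>http://example.org/hello</link>
      <category>news</category>
      <content:encoded><![CDATA[<p>Full text</p>]]></content:encoded>
      <geo:lat>48.85</geo:lat>
      <geo:long>2.35</geo:long>
    </item>
  </channel>
</rss>"""

for item in ET.fromstring(rss).iter("item"):
    print(item_to_document(item))
```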

The content of the <content:encoded> tag is stored in the raw object. If it is HTML content, it is stored as raw.html.

The document ID is generated from the description using a UUID generator, so each message is indexed only once.
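Deriving the ID from the content is what prevents duplicates: the same description always yields the same ID, so an item that is fetched again overwrites its earlier copy. A minimal Python sketch of the idea, assuming a name-based (MD5, version 3) UUID; the exact generator used by the river is not specified here, and document_id is a hypothetical name.

```python
import hashlib
import uuid


def document_id(description: str) -> str:
    """Derive a stable, name-based UUID (version 3) from the description."""
    digest = hashlib.md5(description.encode("utf-8")).digest()
    # uuid.UUID sets the version and variant bits on the 16-byte digest.
    return str(uuid.UUID(bytes=digest, version=3))


first = document_id("Some item description")
again = document_id("Some item description")
other = document_id("Another description")
print(first == again, first == other)  # prints True False
```

A side effect of this choice is that two distinct items with identical descriptions would share one ID.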

Read the RSS 2.0 Specification for more details about RSS channels.

To Do List

Many things remain to do:

As the <pubDate> tag is optional, check that the RSS river still works without it and parses each feed message
Support more RSS <channel> sub-elements, such as <category>, <skipDays>, <skipHours>
Support more RSS <item> sub-elements, such as <pubDate>
Support for multi-channel (one per language for instance)
Use <guid> as the text encoded to generate the ID

License

This software is licensed under the Apache 2 license, quoted below.

Copyright 2011-2014 David Pilato

Licensed under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.