RSS River for Elasticsearch (PROJECT STOPPED)

Welcome to the RSS River Plugin for Elasticsearch

In order to install the plugin, run:

bin/plugin -install fr.pilato.elasticsearch.river/rssriver/1.3.0

You need to install a version matching your Elasticsearch version:

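| Elasticsearch | RSS River Plugin  | Docs           |
|---------------|-------------------|----------------|
| master        | Build from source | See below      |
| es-1.x        | Build from source | 1.5.0-SNAPSHOT |
| es-1.4        | Build from source | 1.4.0-SNAPSHOT |
| es-1.3        | 1.3.0             | 1.3.0          |
| es-1.1        | 1.1.0             | 1.1.0          |
| es-1.0        | 1.0.0             | 1.0.0          |
| es-0.90       | 0.3.0             | 0.3.0          |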
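If you script your installations, the table above can be expressed as a small lookup. This is only a sketch: the names `RELEASES` and `install_command` are made up for illustration, and the command format is the one shown at the top of this page.

```python
# Plugin releases per Elasticsearch branch, copied from the table above.
# None means no release is published and the plugin must be built from source.
RELEASES = {
    "master": None,
    "es-1.x": None,
    "es-1.4": None,
    "es-1.3": "1.3.0",
    "es-1.1": "1.1.0",
    "es-1.0": "1.0.0",
    "es-0.90": "0.3.0",
}


def install_command(branch):
    """Return the install command for a branch, or None if it must be built."""
    version = RELEASES[branch]
    if version is None:
        return None
    return "bin/plugin -install fr.pilato.elasticsearch.river/rssriver/" + version


print(install_command("es-1.1"))
```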

To use a SNAPSHOT version, build the plugin with Maven and install the resulting archive:

mvn clean install
plugin --install rssriver \
       --url file:target/releases/rssriver-X.X.X-SNAPSHOT.zip

Build Status

Thanks to CloudBees for the build status.

Getting Started

Creating an RSS river

First, create an index to store all the feed documents:

$ curl -XPUT 'localhost:9200/lemonde/' -d '{}'

Then create the river with the following properties:

$ curl -XPUT 'localhost:9200/_river/lemonde/_meta' -d '{
  "type": "rss",
  "rss": {
    "feeds" : [ {
    	"name": "lemonde",
    	"url": "http://www.lemonde.fr/rss/une.xml"
    	}
    ]
  }
}'
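The same request can be prepared from a script. The following is a minimal sketch, assuming a node on localhost:9200 as in the curl example; `river_request` is a hypothetical helper name, and the request is only built here, not sent.

```python
import json
import urllib.request


def river_request(river_name, feeds, host="http://localhost:9200"):
    """Build (but do not send) the PUT request that registers an RSS river."""
    body = {"type": "rss", "rss": {"feeds": feeds}}
    return urllib.request.Request(
        url="{}/_river/{}/_meta".format(host, river_name),
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
    )


request = river_request(
    "lemonde",
    [{"name": "lemonde", "url": "http://www.lemonde.fr/rss/une.xml"}],
)
print(request.get_method(), request.full_url)
# To actually create the river against a running node:
# urllib.request.urlopen(request)
```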

This feed follows the RSS 2.0 specification and provides a ttl entry, so the update rate is adjusted automatically to match that value.

To set your own refresh rate for feeds that provide no ttl, use the update_rate option. To enforce that rate even when the feed provides a ttl, also set ignore_ttl to true:

$ curl -XPUT 'localhost:9200/_river/lemonde/_meta' -d '{
  "type": "rss",
  "rss": {
    "feeds" : [ {
    	"name": "lemonde",
    	"url": "http://www.lemonde.fr/rss/une.xml",
    	"update_rate": "15m",
    	"ignore_ttl": true
    	}
    ]
  }
}'
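How ttl, update_rate and ignore_ttl interact can be sketched as follows. This is an illustration of the behaviour described above, not the plugin's actual code: the function names are hypothetical, only the s, m and h units are handled, and what the plugin does when neither value is available is not covered here, so the sketch simply raises an error in that case.

```python
import re

# Subset of Elasticsearch-style time units, in seconds (assumption: s, m, h only).
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(value):
    """Convert a string such as "15m" into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", value)
    if match is None:
        raise ValueError("unsupported duration: {!r}".format(value))
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def effective_refresh_seconds(ttl_minutes, update_rate, ignore_ttl=False):
    """Pick the refresh interval for one feed.

    ttl_minutes: the feed's <ttl> value (RSS 2.0 expresses it in minutes), or None.
    update_rate: the configured update_rate string, or None.
    """
    if ttl_minutes is not None and not ignore_ttl:
        # The feed's own ttl wins unless it is explicitly ignored.
        return ttl_minutes * 60
    if update_rate is not None:
        return parse_duration(update_rate)
    raise ValueError("no ttl and no update_rate to fall back on")


print(effective_refresh_seconds(60, "15m"))                   # ttl is used
print(effective_refresh_seconds(60, "15m", ignore_ttl=True))  # update_rate is forced
```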

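To fetch several feeds with a single river, list each of them in the feeds array. The following example indexes two feeds, lemonde and lefigaro, into an actus index: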
$ curl -XPUT 'localhost:9200/actus/' -d '{}'

$ curl -XPUT 'localhost:9200/_river/actus/_meta' -d '{
  "type": "rss",
  "rss": {
    "feeds" : [ {
        "name": "lemonde",
        "url": "http://www.lemonde.fr/rss/une.xml",
        "update_rate": "15m"
      }, {
        "name": "lefigaro",
        "url": "http://rss.lefigaro.fr/lefigaro/laune",
        "update_rate": "30m",
        "ignore_ttl": true
      }
    ]
  }
}'

Indexing raw encoded content

By default, any encoded content provided in content:encoded is indexed under a raw.TYPE field, where TYPE depends on the type of the encoded content (for example, html).

To disable this, for example to save disk space, set the raw option to false:

$ curl -XPUT 'localhost:9200/_river/actus/_meta' -d '{
  "type": "rss",
  "rss": {
    "raw" : false,
    "feeds" : [ {
			"url": "http://www.lemonde.fr/rss/une.xml"
    	}
    ]
  }
}'

Working with mappings

If you don't define an explicit mapping before starting RSS river, one will be created by default:

{
  "page" : {
    "properties" : {
      "feedname" : {
        "type" : "string",
        "index" : "not_analyzed"
      },
      "title" : {
        "type" : "string"
      },
      "author" : {
        "type" : "string"
      },
      "description" : {
        "type" : "string"
      },
      "link" : {
        "type" : "string",
        "index" : "no"
      },
      "source" : {
        "type" : "string"
      },
      "publishedDate" : {
        "type" : "date",
        "format" : "dateOptionalTime",
        "store" : "yes"
      },
      "location" : {
        "type" : "geo_point"
      },
      "categories" : {
        "type" : "string",
        "index" : "not_analyzed"
      },
      "enclosures" : {
        "properties" : {
          "url" : {
            "type" : "string",
            "index" : "no"
          },
          "type" : {
            "type" : "string",
            "index" : "not_analyzed"
          },
          "length" : {
            "type" : "long",
            "index" : "no"
          }
        }
      },
      "medias" : {
        "properties" : {
          "type" : {
            "type" : "string",
            "index" : "not_analyzed"
          },
          "reference" : {
            "type" : "string",
            "index" : "no"
          },
          "language" : {
            "type" : "string",
            "index" : "not_analyzed"
          },
          "title" : {
            "type" : "string"
          },
          "description" : {
            "type" : "string"
          },
          "duration" : {
            "type" : "long",
            "index" : "no"
          },
          "width" : {
            "type" : "long",
            "index" : "no"
          },
          "height" : {
            "type" : "long",
            "index" : "no"
          }
        }
      },
      "raw" : {
        "properties" : {
          "html" : {
            "type" : "string",
            "index" : "no"
          }
        }
      }
    }
  }
}
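Because feedname is not_analyzed in this mapping, it can be matched exactly with a term filter, and publishedDate can be used for sorting. The search body below is a sketch using the Elasticsearch 1.x query DSL. It assumes that the feed's name ends up in feedname, and the search word "election" is only a placeholder.

```python
import json

# Query DSL as used by Elasticsearch 1.x ("filtered" was removed in later versions).
search_body = {
    "query": {
        "filtered": {
            # Full-text match on the analyzed title field (placeholder term).
            "query": {"match": {"title": "election"}},
            # Exact match on the not_analyzed feedname field.
            "filter": {"term": {"feedname": "lemonde"}},
        }
    },
    # Most recent items first.
    "sort": [{"publishedDate": {"order": "desc"}}],
    "size": 10,
}

# Send it with, for example:
#   curl -XPOST 'localhost:9200/lemonde/page/_search' -d '<json printed below>'
print(json.dumps(search_body, indent=2))
```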

If you want to define your own mapping, push it to Elasticsearch before the RSS river starts:

$ curl -XPUT 'http://localhost:9200/lefigaro/' -d '{}'

$ curl -XPUT 'http://localhost:9200/lefigaro/page/_mapping' -d '{
    "page" : {
        "properties" : {
            "feedname" : {"type" : "string"},
            "title" : {"type" : "string", "analyzer" : "french"},
            "description" : {"type" : "string", "analyzer" : "french"},
            "author" : {"type" : "string"},
            "link" : {"type" : "string"}
        }
    }
}'

Your feed will then use this mapping when you create the river:

$ curl -XPUT 'localhost:9200/_river/lefigaro/_meta' -d '{
  "type": "rss",
  "rss": {
    "feeds" : [ {
        "url": "http://rss.lefigaro.fr/lefigaro/laune"
      }
    ]
  }
}'

Bulk settings

By default, documents are indexed in bulk every 25 feed documents or every 5 seconds, into an index named after the river and under a page type. You can change these settings when creating the river:

$ curl -XPUT 'localhost:9200/_river/lemonde/_meta' -d '{
  "type": "rss",
  "rss": {
    "feeds" : [ {
    	"name": "lemonde",
    	"url": "http://www.lemonde.fr/rss/une.xml"
    	}
    ]
  },
  "index": {
    "index": "myindexname",
    "type": "mycontent",
    "bulk_size": 100,
    "flush_interval": "30s"
  }
}'
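The effect of bulk_size and flush_interval can be pictured with a small buffer that flushes either when enough documents have accumulated or when enough time has passed. This is a simplified illustration, not the plugin's implementation: `BulkBuffer` is a hypothetical class, and the time check only happens when `tick()` is called rather than on a background timer.

```python
import time


class BulkBuffer:
    """Collect documents and flush by count or by elapsed time."""

    def __init__(self, flush, bulk_size=25, flush_interval=5.0, clock=time.monotonic):
        self._flush = flush                    # callable receiving a list of documents
        self._bulk_size = bulk_size
        self._flush_interval = flush_interval  # seconds
        self._clock = clock
        self._docs = []
        self._last_flush = clock()

    def add(self, doc):
        self._docs.append(doc)
        if len(self._docs) >= self._bulk_size:
            self.flush()

    def tick(self):
        """Call periodically; flushes pending documents once the interval has elapsed."""
        if self._docs and self._clock() - self._last_flush >= self._flush_interval:
            self.flush()

    def flush(self):
        if self._docs:
            self._flush(self._docs)
            self._docs = []
        self._last_flush = self._clock()


batches = []
bulk = BulkBuffer(batches.append, bulk_size=3, flush_interval=5.0)
for n in range(7):
    bulk.add({"title": "item {}".format(n)})
bulk.flush()  # push the remainder
print([len(batch) for batch in batches])  # [3, 3, 1]
```

A larger bulk_size means fewer, bigger requests. A longer flush_interval means new items can take longer to become searchable.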

Behind the scenes

The RSS river downloads each feed at the interval given by update_rate and checks whether there are new messages.

First, the RSS river looks at the <channel> tag. It reads the optional <pubDate> tag and stores it in Elasticsearch so that it can be compared on the next launch.
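The pubDate comparison can be sketched like this. The function name is hypothetical, and treating a missing date as "process the feed anyway" is an assumption of this example rather than documented plugin behaviour. RSS 2.0 dates use the RFC 822 format, which Python's standard library can parse.

```python
from email.utils import parsedate_to_datetime


def channel_has_updates(pub_date, last_pub_date):
    """Return True when the feed should be processed again.

    pub_date:      the channel's <pubDate> text (RFC 822 format), or None.
    last_pub_date: the value stored on the previous run, or None.
    """
    if pub_date is None or last_pub_date is None:
        # <pubDate> is optional, so without both values there is nothing to compare.
        return True
    return parsedate_to_datetime(pub_date) > parsedate_to_datetime(last_pub_date)


print(channel_has_updates("Tue, 10 Jun 2014 09:41:01 GMT",
                          "Tue, 10 Jun 2014 04:00:00 GMT"))
```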

Then, for each <item> tag, RSS river creates a new document with the following properties:

| XML Path                  | ES Mapping         |
|---------------------------|--------------------|
| /title                    | title              |
| /description              | description        |
| /content:encoded          | raw.html           |
| /author                   | author             |
| /link                     | link               |
| /category                 | category           |
| /geo:lat /geo:long        | location           |
| /enclosures[@url]         | enclosures.url     |
| /enclosures[@type]        | enclosures.type    |
| /enclosures[@length]      | enclosures.length  |
| /media:content[@height]   | medias.height      |
| /media:content[@width]    | medias.width       |
| /media:content[@url]      | medias.reference   |
| /media:content[@type]     | medias.type        |
| /media:content[@duration] | medias.duration    |
| /media:content[@lang]     | medias.language    |
| /media:description        | medias.description |
| /media:title              | medias.title       |

The <content:encoded> tag is stored in the raw object. HTML content is stored as raw.html.
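To make the mapping concrete, here is a rough sketch that turns one <item> into a document for a few of the fields listed above, including the raw.html handling and the raw switch. It is not the plugin's code: `item_to_document` and the sample feed are invented for illustration, and only a subset of the fields is covered.

```python
import xml.etree.ElementTree as ET

# Namespace URIs commonly used for these RSS extensions.
NS = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
}


def item_to_document(item, feedname, raw=True):
    """Map one <item> element to a dict, for a subset of the documented fields."""
    doc = {"feedname": feedname}
    for tag in ("title", "description", "author", "link"):
        text = item.findtext(tag)
        if text is not None:
            doc[tag] = text
    lat = item.findtext("geo:lat", namespaces=NS)
    lon = item.findtext("geo:long", namespaces=NS)
    if lat is not None and lon is not None:
        doc["location"] = {"lat": float(lat), "lon": float(lon)}
    encoded = item.findtext("content:encoded", namespaces=NS)
    if raw and encoded is not None:
        # Encoded HTML content goes under raw.html unless raw indexing is disabled.
        doc["raw"] = {"html": encoded}
    return doc


SAMPLE = """<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
  <channel>
    <item>
      <title>Hello</title>
      <description>First post</description>
      <link>http://example.org/1</link>
      <geo:lat>48.85</geo:lat>
      <geo:long>2.35</geo:long>
      <content:encoded><![CDATA[<p>Hello</p>]]></content:encoded>
    </item>
  </channel>
</rss>"""

item = ET.fromstring(SAMPLE).find("channel/item")
print(item_to_document(item, "example"))
```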

The document ID is generated from the description using a UUID generator, so each message is indexed only once.
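The idea of a content-derived ID can be illustrated with a name-based UUID: the same description always yields the same ID, so indexing an item again overwrites the existing document instead of creating a duplicate. The exact algorithm the plugin uses may differ. The namespace and the function name below are assumptions made for this example only.

```python
import uuid

# Hypothetical namespace, chosen only for this illustration.
FEED_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "http://example.org/rssriver")


def document_id(description):
    """Derive a stable ID from an item's description (name-based UUID)."""
    return str(uuid.uuid5(FEED_NAMESPACE, description))


first = document_id("First post")
again = document_id("First post")
other = document_id("Second post")
print(first == again, first == other)
```

One consequence of deriving the ID from the description is that two items sharing an identical description would map to the same document.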

See the RSS 2.0 Specification for more details about RSS channels.

To Do List

Many, many things to do.

License

This software is licensed under the Apache 2 license, quoted below.

Copyright 2011-2014 David Pilato

Licensed under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.