Netsearch

Abstract

Netarchive is an open-source Maven project that can process a very large number of arc/warc files (Web ARChive file format) and make the content of the archive searchable in a Solr server cluster (SolrCloud). The search results can then be shown in the WebArchive viewer.

Scalability

The solution scales with growing index size without reducing search performance. More specifically, we require searches without faceting or grouping to complete in under 200 ms, and searches with faceting or grouping to complete in under 2000 ms.

Software components

  1. Archon: WAR application that handles the bookkeeping for the arc/warc files. Uses a database for persistence.
  2. Arctika: Java program that builds a given index (shard) and manages a pool of workers, each of which processes an arc/warc file and submits the extracted metadata to Solr. The workers use webarchive-discovery (https://github.com/ukwa/webarchive-discovery) to read the arc files, with Tika for text extraction.
  3. A SolrCloud cluster to which you can add new servers (shards). Each index is put into a Solr server instance (shard) in the cluster. A ZooKeeper ensemble monitors the cluster.
  4. Front-end server for searching and showing the results. We use the open source project SolrAjax for this. This will likely be replaced with a better front-end solution later.
  5. WebArchive server (Front-end server for displaying the websites)
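
The worker-pool pattern described for Arctika can be illustrated with a minimal sketch. This is not Arctika's actual code: the class name `WorkerPoolSketch`, the method `indexAll`, and the `indexer` callback (standing in for "parse one arc/warc file and submit its documents to Solr") are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToIntFunction;

// Illustrative only: class and method names are hypothetical, not Arctika's.
class WorkerPoolSketch {

    // Runs one job per arc/warc file on a fixed-size pool and returns the
    // total number of documents reported by the supplied per-file indexer.
    static long indexAll(List<String> files, int workers, ToIntFunction<String> indexer) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (String file : files) {
                // Assumption: the indexer returns the number of documents it submitted.
                Callable<Integer> job = () -> indexer.applyAsInt(file);
                pending.add(pool.submit(job));
            }
            long total = 0;
            for (Future<Integer> result : pending) {
                total += result.get(); // blocks until that job has finished
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for jobs", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A job failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A caller would pass the list of files assigned to the shard being built, the desired number of workers, and a function that does the real per-file work. Keeping the pool size fixed bounds how many files are parsed at once, which is one simple way to keep memory use and load on Solr predictable.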

Hardware configuration

  1. Index-builder server
  2. Solr-Cluster server(s)

Arc files/index ratio

100,000 arc/warc files (100 MB each) produce an optimized index of ~1 TB.
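
As a rough planning aid, the figure above works out to about 10 TB of input (100,000 × 100 MB) per 1 TB of index, a ratio of roughly 0.1. The sketch below applies that ratio to other collection sizes. It assumes decimal units (1 TB = 1,000,000 MB) and that the ratio stays roughly constant as the collection grows, which the source does not state; the class and method names are hypothetical.

```java
// Illustrative only: a back-of-the-envelope estimate, not a measured model.
class IndexSizeEstimate {

    // Ratio implied by the figures above: ~1 TB of index per ~10 TB of input.
    static final double INDEX_TO_INPUT_RATIO = 0.1;

    // Estimates the optimized index size in TB, assuming decimal units
    // (1 TB = 1,000,000 MB) and a constant index-to-input ratio.
    static double estimateIndexTerabytes(long fileCount, double avgFileMegabytes) {
        double inputTerabytes = fileCount * avgFileMegabytes / 1_000_000.0;
        return inputTerabytes * INDEX_TO_INPUT_RATIO;
    }
}
```

Calling `IndexSizeEstimate.estimateIndexTerabytes(100_000, 100.0)` reproduces the ~1 TB figure; actual index size will depend on the content of the archive.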

Netsearch on GitHub

https://github.com/netarchivesuite/netsearch

Releases

For a full install of Arctika, Archon, SolrCloud and ZooKeeper, download the full release package here: https://github.com/netarchivesuite/netsearch/tree/master/releases/version1.0

Each folder has an install guide. You only need to git clone and build the warc-indexer project to get a jar file; the rest is included in the release package.

Validation tool

The sub-module netarchive-warcindexvalidationtool can validate an Arc/Warc file. It can also check that indexing a given Arc/Warc file has created the correct number of documents in the Solr index. See the README.TXT in this module.

Performance test

https://plus.google.com/+TokeEskildsen/posts/4yPvzrQo8A7

Thomas Egense 2014-06-06