Hoover is a search tool for large collections of documents. It glues together proven open-source technologies like elasticsearch and Apache Tika to aid the work of investigative journalists.
Searching is done through a user-friendly web interface that leverages Lucene's rich query syntax. Hoover also provides an API to run queries using the elasticsearch query DSL.
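As a rough illustration of what an elasticsearch query DSL body looks like, the sketch below builds a query_string query in Python. The helper name build_query and the field names in the sample query are assumptions made for the example. The Hoover API endpoint that accepts such a body is not described here, so the sketch only constructs and prints the JSON.

```python
import json


def build_query(text, size=10):
    """Build an elasticsearch query DSL body (hypothetical helper).

    The `query_string` query accepts Lucene-style query syntax.
    """
    return {
        "query": {"query_string": {"query": text}},
        "size": size,
    }


if __name__ == "__main__":
    # The field names below are illustrative, not taken from Hoover's mapping.
    body = build_query('subject:"annual report" AND from:example.org')
    print(json.dumps(body, indent=2))
```

The resulting dictionary can be serialized and sent with whatever HTTP client you prefer once you know the endpoint of your installation.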
Installation
Development
There is a test suite. Run it on the hoover-search container with:
./run testsuite
Running in production
Waitress is installed as part of the dependencies. It's a production-quality threaded WSGI server. Pick a port number, say 8888, and start it with the command below. It doesn't daemonize, so you can run it under supervisor or another modern daemon manager:
./run server --host=127.0.0.1 --port=8888
Then you will probably want to set up a reverse proxy in front of the app. Here is a minimal nginx config:
location / {
    proxy_pass http://localhost:8888;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Proto $scheme;
}
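Before pointing the reverse proxy at the app, it can help to confirm that something is accepting connections on the chosen port. The sketch below is a generic TCP check written with the Python standard library. It is not part of Hoover, and the function name is_listening is an assumption made for the example.

```python
import socket


def is_listening(host="127.0.0.1", port=8888, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    print("up" if is_listening() else "down")
```

A successful connection only shows that the port is open. It does not verify that the application behind it is answering requests correctly.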
Configuration
To customize Hoover's behaviour, you can set the following Django settings in hoover/site/settings/local.py:
HOOVER_HYPOTHESIS_EMBED_URL: The URL to embed the Hypothesis client, e.g. https://hypothes.is/embed.js
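For instance, the setting could appear in local.py as follows. This is a minimal sketch that assumes local.py is an ordinary Python module of Django settings. Anything else your local.py needs, such as imports from a base settings module, is not shown.

```python
# hoover/site/settings/local.py
# Assumption: settings are plain module-level names, as is usual in Django.

# URL of the Hypothesis client script to embed.
HOOVER_HYPOTHESIS_EMBED_URL = "https://hypothes.is/embed.js"
```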
Snoop and external collections
For a large dataset, it's not practical to upload files through the admin UI, so you can use hoover-snoop. It's a tool for pre-processing a collection, extracting metadata from emails and documents, and accessing the contents of archives and email attachments. Snoop comes as a standalone Django app. It listens on an HTTP port, where it serves document previews and raw documents, and it handles indexing of documents in elasticsearch by itself.
To use it with hoover-search, first set up the snoop service, analyze the data, and send it to elasticsearch. Then go back to hoover-search and create a new collection of type External with the following options:
{
    "documents": "http://localhost:8001/doc",
    "renderDocument": true
}
The documents URL is composed of the URL of hoover-snoop (http://localhost:8001 in this example) followed by /doc. renderDocument tells hoover-search to use the new doc.html view from hoover-ui to render the document preview pages. If you're not using hoover-ui, omit this flag.
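The rules for these options can be sketched in a few lines of Python. The helper external_collection_options is hypothetical and not part of Hoover. It only shows how the documents URL is derived from the hoover-snoop URL and when the renderDocument flag is left out.

```python
import json


def external_collection_options(snoop_url, render_document=True):
    """Return the options for an External collection (hypothetical helper)."""
    # The documents URL is the hoover-snoop URL followed by /doc.
    options = {"documents": snoop_url.rstrip("/") + "/doc"}
    if render_document:
        # Set the flag only when hoover-ui is in use; otherwise omit it.
        options["renderDocument"] = True
    return options


if __name__ == "__main__":
    print(json.dumps(external_collection_options("http://localhost:8001"), indent=2))
```

Passing render_document=False produces options with only the documents key, which matches the case where hoover-ui is not used.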
Run tests locally
Install the drone CLI binary from the Drone website and put it on your PATH. Install the latest version of Docker CE.
Then run ./run-tests with the arguments you'd normally pass to py.test, like this:
./run-tests -vvv -x -k ratelimits
During the tests, a docker-setup directory will be created. Make sure to delete it after running the tests with sudo rm -r docker-setup.