OSIRRC Docker Image for PISA

Antonio Mallia, Michał Siedlaczek, Joel Mackenzie, and Torsten Suel

This is the Docker image for PISA: Performant Indexes and Search for Academia (v0.6.6), conforming to the OSIRRC jig for the Open-Source IR Replicability Challenge (OSIRRC) at SIGIR 2019. This image is available on Docker Hub and has been tested with the jig at commit f6c6ef4 (19/6/2019).

Quick Start

The following jig command can be used to index TREC disks 4/5 for robust04:

python run.py prepare \
  --repo osirrc2019/pisa \
  --tag v0.1.1 \
  --collections Robust04=/data/collections/disk45

The following jig command can be used to perform a retrieval run on the indexed collection with the robust04 topics and relevance judgments:

python run.py search \
  --repo osirrc2019/pisa \
  --tag v0.1.1 \
  --collection Robust04 \
  --topic topics/topics.robust04.txt \
  --output $(pwd)/output \
  --qrels $(pwd)/qrels/qrels.robust04.txt

Retrieval Methods

The PISA image supports the following retrieval methods:

Runtime Options

The default search configuration can be changed: a few different index compression schemes and search algorithms are available. These options are supplied using --opts [option]=[value].
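As a rough illustration of how [option]=[value] pairs might be turned into a lookup table inside a hook script, here is a minimal sketch. The function name parse_opts and the option names in the usage line are hypothetical and are not taken from this image; the real scripts may handle options differently.

```python
def parse_opts(pairs):
    """Turn ["option=value", ...] into a dict, splitting on the first '='."""
    opts = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        # Reject entries with no '=' or with an empty option name.
        if not sep or not key:
            raise ValueError(f"expected option=value, got {pair!r}")
        opts[key] = value
    return opts


if __name__ == "__main__":
    # Hypothetical option names and values, for illustration only.
    print(parse_opts(["compression=example_codec", "algorithm=example_algo"]))
```

Splitting on the first '=' only means a value may itself contain '=' without being truncated.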

Index

Search

Supported collections

For indexing, the corpus name defines the indexing configuration. The following values are supported:

A note on default configuration

As discussed above, the default configuration is as follows:

Since variable-sized blocks depend on a parameter, lambda, we searched offline for suitable values of lambda and hardcoded them into the lamb() method within the index call. We chose values of lambda that result in a mean block size of 40 elements per block, with a tolerance of plus or minus 0.5 elements. Please note that these lambda values only apply to the default parsing and indexing setup, and would need to be searched for again if the settings are changed (for example, if a different stemmer were used).
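The README does not say how the offline search was carried out. One plausible approach is a bisection over lambda, sketched below under the assumption that the mean block size does not decrease as lambda grows. The names find_lambda and mean_block_size are hypothetical; in practice mean_block_size would have to build (or simulate) the variable-sized blocks for a given lambda and report the resulting mean.

```python
def find_lambda(mean_block_size, target=40.0, tolerance=0.5,
                lo=0.0, hi=100.0, max_iter=100):
    """Bisect for a lambda whose mean block size is within tolerance of target.

    Assumes mean_block_size(lam) is non-decreasing in lam on [lo, hi].
    """
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        size = mean_block_size(mid)
        if abs(size - target) <= tolerance:
            return mid
        if size < target:
            lo = mid  # blocks too small: try a larger lambda
        else:
            hi = mid  # blocks too large: try a smaller lambda
    raise RuntimeError("no lambda found within tolerance")


if __name__ == "__main__":
    # Toy stand-in for the real measurement, for illustration only.
    lam = find_lambda(lambda x: 4.0 * x)
    print(lam, 4.0 * lam)
```

Because the real mean block size changes in discrete steps, a search like this can fail to land inside the tolerance window, which is why the sketch raises an error instead of returning its last guess.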

Expected Results

robust04

| BM25 | MAP | P@30 | NDCG@20 |
| --- | --- | --- | --- |
| TREC 2004 Robust Track Topics | 0.2534 | 0.3120 | 0.4221 |

core17

| BM25 | MAP | P@30 | NDCG@20 |
| --- | --- | --- | --- |
| TREC 2017 Common Core Track Topics | 0.2078 | 0.4260 | 0.3898 |

core18

| BM25 | MAP | P@30 | NDCG@20 |
| --- | --- | --- | --- |
| TREC 2018 Common Core Track Topics | 0.2384 | 0.3500 | 0.3927 |

gov2

| BM25 | MAP | P@30 | NDCG@20 |
| --- | --- | --- | --- |
| TREC 2004 Terabyte Track: Topics 701-750 | 0.2638 | 0.4776 | 0.4070 |
| TREC 2005 Terabyte Track: Topics 751-800 | 0.3305 | 0.5487 | 0.5073 |
| TREC 2006 Terabyte Track: Topics 801-850 | 0.2950 | 0.4680 | 0.4925 |

cw09b

| BM25 | MAP | P@30 | NDCG@20 |
| --- | --- | --- | --- |
| TREC 2010 Web Track: Topics 51-100 | 0.1009 | 0.2521 | 0.1509 |
| TREC 2011 Web Track: Topics 101-150 | 0.1093 | 0.2507 | 0.2177 |
| TREC 2012 Web Track: Topics 151-200 | 0.1054 | 0.2100 | 0.1311 |

cw12b

| BM25 | MAP | P@30 | NDCG@20 |
| --- | --- | --- | --- |
| TREC 2013 Web Track: Topics 201-250 | 0.0449 | 0.1940 | 0.1529 |
| TREC 2014 Web Track: Topics 251-300 | 0.0217 | 0.1240 | 0.1484 |

Implementation

The following is a quick breakdown of what happens in each of the scripts in this repo.

Dockerfile

The Dockerfile derives from the official PISA Docker image. Additionally, it installs dependencies (python3, etc.), copies the scripts to the root dir, and sets the working dir to /work.

init

The init script is empty since all the initialization is executed during Docker image building.

index

The index script reads a JSON string (see here) containing at least one collection to index (including the name, path, and format). Each collection is indexed and placed in a directory with the same name as the collection in the working dir (e.g., /work/robust04). At this point, the jig takes a snapshot and the indexed collections are persisted for the search hook.
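To make the shape of this step concrete, the sketch below parses such a JSON string and works out where each index would be placed. The field names ("collections", "name", "path", "format"), the example paths, and the function name plan_indexes are assumptions made for illustration; the jig documentation defines the actual schema.

```python
import json
import posixpath


def plan_indexes(json_string, work_dir="/work"):
    """Return (name, source_path, format, index_dir) for each collection."""
    config = json.loads(json_string)
    collections = config.get("collections", [])
    if not collections:
        raise ValueError("expected at least one collection to index")
    plans = []
    for coll in collections:
        # The index directory shares its name with the collection.
        index_dir = posixpath.join(work_dir, coll["name"])
        plans.append((coll["name"], coll["path"], coll["format"], index_dir))
    return plans


if __name__ == "__main__":
    # Hypothetical input, for illustration only.
    example = json.dumps({"collections": [
        {"name": "robust04", "path": "/input/robust04", "format": "trectext"}
    ]})
    for plan in plan_indexes(example):
        print(plan)
```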

search

The search script reads a JSON string (see here) containing the collection name (to map back to the index directory from the index hook) and topic path, among other options. The retrieval run is performed (using additional --opts params, see above) and output is placed in /output for the jig to evaluate using trec_eval.
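A comparable sketch for the search hook is shown below: it maps the collection name back to the index directory and collects the topic path and options. The field names ("collection", "topic", "opts"), the run file name run.trec, and the function name plan_search are assumptions for illustration and are not confirmed by this README.

```python
import json
import posixpath


def plan_search(json_string, work_dir="/work", output_dir="/output"):
    """Resolve the index directory, topic file, and run file for a search."""
    config = json.loads(json_string)
    name = config["collection"]["name"]
    return {
        # Same directory that the index hook wrote to.
        "index_dir": posixpath.join(work_dir, name),
        "topic_path": config["topic"]["path"],
        "opts": config.get("opts", {}),
        # Hypothetical run file name; the jig evaluates what lands in /output.
        "run_file": posixpath.join(output_dir, "run.trec"),
    }


if __name__ == "__main__":
    # Hypothetical input, for illustration only.
    example = json.dumps({
        "collection": {"name": "robust04"},
        "topic": {"path": "/input/topics/topics.robust04.txt"},
    })
    print(plan_search(example))
```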