OSIRRC Docker Image for PISA
Antonio Mallia, Michał Siedlaczek, Joel Mackenzie, and Torsten Suel
This is the Docker image for PISA: Performant Indexes and Search for Academia (v0.6.6), conforming to the OSIRRC jig for the Open-Source IR Replicability Challenge (OSIRRC) at SIGIR 2019. This image is available on Docker Hub and has been tested with the jig at commit f6c6ef4 (19/6/2019).
- Supported test collections: robust04, core17, core18 (newswire); gov2, cw09b, cw12b (web)
- Supported hooks: init, index, search
Quick Start
The following jig command can be used to index TREC disks 4/5 for robust04:
python run.py prepare \
--repo osirrc2019/pisa \
--tag v0.1.1 \
--collections Robust04=/data/collections/disk45
The following jig command can be used to perform a retrieval run on the robust04 test collection:
python run.py search \
--repo osirrc2019/pisa \
--tag v0.1.1 \
--collection Robust04 \
--topic topics/topics.robust04.txt \
--output $(pwd)/output \
--qrels $(pwd)/qrels/qrels.robust04.txt
Retrieval Methods
The PISA image supports the following retrieval methods:
- BM25: k1=0.9, b=0.4 (Robertson et al., 1995)
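
To make the role of the two parameters concrete, here is a minimal sketch of a textbook BM25 scoring function using k1=0.9 and b=0.4. The exact BM25 variant implemented inside PISA is not described here, so the IDF smoothing and the function names below are assumptions for illustration only.

```python
import math

K1 = 0.9  # term-frequency saturation, as listed above
B = 0.4   # document-length normalization, as listed above


def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=K1, b=B):
    """Score one query term for one document with a textbook BM25 formula."""
    # Smoothed IDF (an assumed variant); the 1.0 inside the log keeps it non-negative.
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: longer-than-average documents are penalized via b.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)


def bm25_score(query_terms, doc_tfs, doc_len, avg_doc_len, num_docs, doc_freqs):
    """Sum the per-term scores over the query terms that occur in the document."""
    return sum(
        bm25_term_score(doc_tfs[t], doc_len, avg_doc_len, num_docs, doc_freqs[t])
        for t in query_terms
        if doc_tfs.get(t, 0) > 0
    )
```

A lower k1 makes repeated occurrences of a term saturate sooner, and a lower b weakens the penalty applied to long documents.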
Runtime Options
The default search configuration can be changed: several index compression schemes and search algorithms are supported.
These options are supplied using --opts [option]=[value].
Index
- stemmer can be either krovetz or porter2; the default is porter2.
- compressor can be opt (Partitioned Elias-Fano), block_interpolative (Interpolative), block_simdbp (SIMD-BP128), or block_optpfor (OPT-PFor Delta); block_simdbp is the default. Multiple compressors can be passed using a comma delimiter, e.g. --opts compressor="block_simdbp,opt".
- block_type can be either fixed or variable; the default is variable. If fixed is used, block_size must also be supplied, where block_size is a positive integer.
- skip_reordering=1: if provided, document reordering will not be performed.
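
The rules for the index options can be summarized as a small validation routine. This is a sketch only: check_index_opts and its constants are hypothetical names introduced for illustration and are not part of the image's scripts. The allowed values and defaults are taken from the list above.

```python
STEMMERS = {"krovetz", "porter2"}
COMPRESSORS = {"opt", "block_interpolative", "block_simdbp", "block_optpfor"}
BLOCK_TYPES = {"fixed", "variable"}
DEFAULTS = {"stemmer": "porter2", "compressor": "block_simdbp", "block_type": "variable"}


def check_index_opts(opts):
    """Return (resolved, problems) for a dict of index-time options (hypothetical helper)."""
    resolved = {**DEFAULTS, **opts}
    problems = []
    if resolved["stemmer"] not in STEMMERS:
        problems.append(f"unknown stemmer: {resolved['stemmer']}")
    # Several compressors may be given as one comma-delimited string.
    resolved["compressor"] = [c.strip() for c in resolved["compressor"].split(",")]
    for name in resolved["compressor"]:
        if name not in COMPRESSORS:
            problems.append(f"unknown compressor: {name}")
    if resolved["block_type"] not in BLOCK_TYPES:
        problems.append(f"unknown block_type: {resolved['block_type']}")
    elif resolved["block_type"] == "fixed":
        # Fixed-size blocks need an explicit positive integer block_size.
        size = str(resolved.get("block_size", ""))
        if not size.isdigit() or int(size) <= 0:
            problems.append("block_type=fixed requires a positive integer block_size")
        else:
            resolved["block_size"] = int(size)
    return resolved, problems
```

For example, check_index_opts({"compressor": "block_simdbp,opt"}) resolves the compressor list to two entries and reports no problems, while check_index_opts({"block_type": "fixed"}) reports the missing block_size.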
Search
- stemmer is the same as above, and is used for stemming queries.
- compressor is the same as above, and should match a compressor used during indexing. However, only a single compressor can be provided at a time.
- algorithm can be wand, maxscore, or block_max_wand; block_max_wand is the default.
- block_type is the same as above, and should match the block_type used during indexing. For search, block_size is not required.
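
The constraints between search-time and index-time options can be checked mechanically. The sketch below assumes the caller already knows which compressors and block type were used at indexing time. check_search_opts is a hypothetical helper, not a function provided by the image.

```python
ALGORITHMS = {"wand", "maxscore", "block_max_wand"}


def check_search_opts(search_opts, indexed_compressors, indexed_block_type):
    """Return a list of problems with the search options (hypothetical helper)."""
    problems = []
    # Defaults as documented above.
    algorithm = search_opts.get("algorithm", "block_max_wand")
    compressor = search_opts.get("compressor", "block_simdbp")
    block_type = search_opts.get("block_type", "variable")
    if algorithm not in ALGORITHMS:
        problems.append(f"unknown algorithm: {algorithm}")
    if "," in compressor:
        # Unlike indexing, search accepts exactly one compressor.
        problems.append("only a single compressor can be provided for search")
    elif compressor not in indexed_compressors:
        problems.append(f"compressor {compressor} was not used during indexing")
    if block_type != indexed_block_type:
        problems.append(f"block_type {block_type} does not match the index")
    return problems
```

An empty list means the search options are consistent with the index under these assumptions.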
Supported collections
For indexing, the corpus name defines the indexing configuration. The following values are supported:
- robust04 - TREC Disks 4&5.
- core17 - the New York Times corpus.
- core18 - the TREC Washington Post (WAPO) corpus.
- gov2 - the TREC GOV2 corpus.
- cw09b - the TREC ClueWeb09 corpus.
- cw12b - the TREC ClueWeb12 corpus.
A note on default configuration
As discussed above, the default configuration is as follows:
- Porter 2 Stemming
- SIMD-BP128 compression
- Variable-sized blocks and Block-Max WAND, leading to the "Variable BMW" algorithm
Since variable-sized blocks depend on a parameter, lambda, we searched offline for
suitable values of lambda and hardcoded them into the lamb() method within the index
call. The chosen values of lambda result in a mean block size of 40 elements per block,
with a tolerance of plus/minus 0.5 elements. Please note that these lambda values
apply only to the default parsing and indexing setup, and would need to be searched for
again if settings are changed (for example, if a different stemmer were used).
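
How the offline search was carried out is not described here. One plausible way to redo it after changing settings is a bisection over lambda, sketched below. It assumes the mean block size does not decrease as lambda grows. Both find_lambda and the toy function standing in for a real index build are hypothetical.

```python
def find_lambda(mean_block_size, target=40.0, tolerance=0.5, lo=0.0, hi=1000.0, max_iter=100):
    """Bisect for a lambda whose mean block size is within tolerance of target.

    mean_block_size is a callable lambda -> mean elements per block and is
    assumed to be non-decreasing in lambda.
    """
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        size = mean_block_size(mid)
        if abs(size - target) <= tolerance:
            return mid
        if size < target:
            lo = mid  # blocks too small: raise lambda
        else:
            hi = mid  # blocks too large: lower lambda
    raise ValueError("no lambda found within tolerance")


def toy_mean_block_size(lam):
    """Toy stand-in for partitioning a real index; not PISA behaviour."""
    return 8.0 * lam ** 0.5


lam = find_lambda(toy_mean_block_size)
print(lam, toy_mean_block_size(lam))
```

In a real setting each evaluation of mean_block_size would require re-partitioning the posting lists, so the number of bisection steps matters more than it does in this toy example.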
Expected Results
robust04
BM25 | MAP | P@30 | NDCG@20 |
---|---|---|---|
TREC 2004 Robust Track Topics | 0.2534 | 0.3120 | 0.4221 |
core17
BM25 | MAP | P@30 | NDCG@20 |
---|---|---|---|
TREC 2017 Common Core Track Topics | 0.2078 | 0.4260 | 0.3898 |
core18
BM25 | MAP | P@30 | NDCG@20 |
---|---|---|---|
TREC 2018 Common Core Track Topics | 0.2384 | 0.3500 | 0.3927 |
gov2
BM25 | MAP | P@30 | NDCG@20 |
---|---|---|---|
TREC 2004 Terabyte Track: Topics 701-750 | 0.2638 | 0.4776 | 0.4070 |
TREC 2005 Terabyte Track: Topics 751-800 | 0.3305 | 0.5487 | 0.5073 |
TREC 2006 Terabyte Track: Topics 801-850 | 0.2950 | 0.4680 | 0.4925 |
cw09b
BM25 | MAP | P@30 | NDCG@20 |
---|---|---|---|
TREC 2010 Web Track: Topics 51-100 | 0.1009 | 0.2521 | 0.1509 |
TREC 2011 Web Track: Topics 101-150 | 0.1093 | 0.2507 | 0.2177 |
TREC 2012 Web Track: Topics 151-200 | 0.1054 | 0.2100 | 0.1311 |
cw12b
BM25 | MAP | P@30 | NDCG@20 |
---|---|---|---|
TREC 2013 Web Track: Topics 201-250 | 0.0449 | 0.1940 | 0.1529 |
TREC 2014 Web Track: Topics 251-300 | 0.0217 | 0.1240 | 0.1484 |
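
To check a reproduction against the tables above, the summary rows of trec_eval output can be compared with the expected values. The sketch below uses the robust04 figures from the first table. The mapping of the table columns to the trec_eval measure names map, P_30, and ndcg_cut_20 is an assumption, as are the helper names.

```python
EXPECTED_ROBUST04 = {"map": 0.2534, "P_30": 0.3120, "ndcg_cut_20": 0.4221}


def parse_trec_eval(text):
    """Parse 'measure qid value' lines, keeping the numeric 'all' summary rows."""
    scores = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1] == "all":
            try:
                scores[parts[0]] = float(parts[2])
            except ValueError:
                continue  # skip non-numeric summary rows such as runid
    return scores


def mismatches(scores, expected, tolerance=5e-5):
    """Return the measures that are missing or differ by more than tolerance."""
    return {
        measure: scores.get(measure)
        for measure, value in expected.items()
        if measure not in scores or abs(scores[measure] - value) > tolerance
    }
```

An empty dictionary from mismatches indicates that the run matches the expected values to four decimal places.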
Implementation
The following is a quick breakdown of what happens in each of the scripts in this repo.
Dockerfile
The Dockerfile derives from the official PISA docker image. Additionally, it installs dependencies (python3, etc.), copies scripts to the root dir, and sets the working dir to /work.
init
The init script is empty since all the initialization is executed during Docker image building.
index
The index script reads a JSON string (see here) containing at least one collection to index (including the name, path, and format).
The collection is indexed and placed in a directory, with the same name as the collection, in the working dir (e.g., /work/robust04).
At this point, jig takes a snapshot and the indexed collections are persisted for the search hook.
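
The mapping from the JSON input to index directories can be sketched as follows. The exact JSON schema is defined by the jig and is not reproduced here, so the key names (collections, name, path, format), the example path, and the example format value are assumptions. plan_index_jobs is a hypothetical helper.

```python
import json
import posixpath

WORK_DIR = "/work"  # working dir set by the Dockerfile


def plan_index_jobs(json_string, work_dir=WORK_DIR):
    """Map each collection in the JSON payload to its index directory."""
    payload = json.loads(json_string)
    jobs = []
    for coll in payload["collections"]:  # key name is an assumption
        jobs.append({
            "name": coll["name"],
            "source": coll["path"],
            "format": coll["format"],
            # The index lives in a directory named after the collection.
            "index_dir": posixpath.join(work_dir, coll["name"]),
        })
    return jobs


# Hypothetical payload; the path and format values are placeholders.
example = '{"collections": [{"name": "robust04", "path": "/input/robust04", "format": "trectext"}]}'
print(plan_index_jobs(example)[0]["index_dir"])
```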
search
The search script reads a JSON string (see here) containing the collection name (to map back to the index directory from the index hook) and topic path, among other options.
The retrieval run is performed (using additional --opts params, see above) and output is placed in /output for the jig to evaluate using trec_eval.
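
trec_eval reads runs in the standard six-column TREC format (query id, the literal Q0, document id, rank, score, run tag). How the PISA scripts produce this file is not shown here, so the sketch below only illustrates the format. format_run and the run tag pisa are hypothetical.

```python
def format_run(results, tag="pisa"):
    """Render {qid: [(docid, score), ...]} in the six-column TREC run format."""
    lines = []
    for qid, hits in results.items():
        # Rank documents by descending score within each query.
        ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
        for rank, (docid, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} Q0 {docid} {rank} {score:.4f} {tag}")
    return "".join(line + "\n" for line in lines)
```

Writing the returned string to a file under /output would give the jig a run it can pass to trec_eval along with the qrels.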
Reviews
- Documentation reviewed at commit 8f88235 (2019-06-17) by Ryan Clancy