OSIRRC Docker Image for JASSv2
This readme is heavily based on (i.e. copied from) the Anserini readme.
This is the docker image for JASSv2 conforming to the OSIRRC jig for the Open-Source IR Replicability Challenge (OSIRRC) at SIGIR 2019. This image is available on Docker Hub. The OSIRRC 2019 image library contains a log of successful executions of this image.
Although JASSv2 is a fully standalone search system with its own indexer and search engine, this build relies on ATIRE for indexing. The JASSv2 indexer is currently experimental (as of June 2019) and has not been tested at scale.
- Supported test collections: robust04 and core17.
- Supported hooks: init, index, search.
Points of Difference
ATIRE is a fully standalone indexer and search engine that supports a large number of parameters and many experiments. It normally creates a Term-Frequency ordered index and searches using a Score-at-a-Time paradigm. It can be coerced into generating an Impact ordered index and so does support full Score-at-a-Time searching.
JASS was the search engine written by Andrew Trotman and Jimmy Lin for the "Anytime" paper, a demonstration of the utility of Score-at-a-Time early termination based on a time limit. JASS was a hack from the start and remains so.
JASSv2 is a total rewrite of ATIRE and JASS. It is faster at searching, supports parallel search, and is currently in development. For the fastest search, generate a quantised index using ATIRE, convert it into a JASS index using JASS, and then search with JASSv2 (setting max postings to 10 percent of the collection size).
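To make the "max postings" setting concrete, here is a toy sketch of a Score-at-a-Time traversal that stops once a postings budget is spent. It is not the JASSv2 code: the function names, the segment layout, and the assumption that "collection size" is measured in documents are all illustrative.

```python
from collections import defaultdict


def postings_budget(collection_size, percent=10):
    """Postings budget as a percentage of the collection size (assumed: number of documents)."""
    return collection_size * percent // 100


def saat_search(segments, max_postings, top_k=10):
    """Toy Score-at-a-Time search over impact-ordered segments.

    segments: (impact, doc_ids) pairs pooled from all query terms (hypothetical layout).
    max_postings: stop before processing would exceed this many postings.
    """
    accumulators = defaultdict(int)
    processed = 0
    # Visit the highest-impact segments first, regardless of which term owns them.
    for impact, doc_ids in sorted(segments, key=lambda s: s[0], reverse=True):
        if processed + len(doc_ids) > max_postings:
            break  # budget spent: return the best ranking found so far
        for doc_id in doc_ids:
            accumulators[doc_id] += impact
        processed += len(doc_ids)
    # Rank by accumulated score, breaking ties by document id.
    ranked = sorted(accumulators.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


toy_segments = [(3, [1, 2]), (2, [2, 3]), (1, [1, 4, 5])]
budget = postings_budget(40)  # toy collection of 40 documents
print(saat_search(toy_segments, budget))
```

With the toy data the lowest-impact segment is skipped because processing it would exceed the budget of 4 postings, which is the effect the 10 percent setting trades for speed.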
Quick Start
The following jig command can be used to index TREC disks 4/5 for robust04:
python3 run.py prepare \
--repo osirrc2019/atire \
--tag v0.1.0 \
--collections robust04=/path/to/disk45=trectext
e.g. python3 run.py prepare --repo jassv2/osirrc2019 --tag v0.1.0 --collections robust04=/Users/andrew/programming/JASSv2/docker/osirrc2019/robust04=trectext
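The value passed to --collections packs three fields into one string, name=path=format. The helper below is a small sketch (the function name prepare_command is hypothetical, not part of the jig) that assembles the prepare command shown above as an argument list, which can be handy when scripting several runs.

```python
def prepare_command(repo, tag, name, path, fmt):
    """Build the argument list for the jig's prepare step (illustrative helper)."""
    # The jig expects the collection as a single name=path=format string.
    spec = f"{name}={path}={fmt}"
    return [
        "python3", "run.py", "prepare",
        "--repo", repo,
        "--tag", tag,
        "--collections", spec,
    ]


cmd = prepare_command("jassv2/osirrc2019", "v0.1.0",
                      "robust04", "/path/to/disk45", "trectext")
print(" ".join(cmd))
```

The list form can be passed directly to subprocess.run from the jig's directory, which avoids shell quoting problems when the collection path contains spaces.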
The following jig command can be used to perform a retrieval run on the robust04 test collection:
python3 run.py search \
--repo osirrc2019/atire \
--tag v0.1.0 \
--output out/atire \
--qrels qrels/qrels.robust04.txt \
--topic topics/topics.robust04.txt \
--collection robust04 \
--top_k 100
e.g. python3 run.py search --repo jassv2/osirrc2019 --tag v0.1.0 --collection robust04 --topic topics/topics.robust04.txt --top_k 100 --output /Users/andrew/programming/osirrc2019/jass-docker/output --qrels qrels/qrels.robust04.txt
Retrieval Methods
This instance of JASSv2 uses BM25 from ATIRE with the default parameters. JASSv2 requires an impact ordered index, which is generated by ATIRE and then converted into the JASS index format.
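As a rough illustration of how a BM25 score becomes an impact, the sketch below computes one common BM25 formulation with k1=0.9 and b=0.4 and then maps the score linearly onto an integer range. Both functions are assumptions for illustration: ATIRE's exact BM25 variant and its quantisation scheme may differ, and the 255 levels are a placeholder.

```python
import math


def bm25_term_score(tf, df, doc_len, avg_doc_len, num_docs, k1=0.9, b=0.4):
    """One common BM25 formulation (assumed, not necessarily ATIRE's exact variant)."""
    idf = math.log(num_docs / df)
    # Length normalisation: longer-than-average documents are penalised.
    norm = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)
    return idf * ((k1 + 1.0) * tf) / (norm + tf)


def quantise(score, lo, hi, levels=255):
    """Map a score in [lo, hi] linearly onto an integer impact in 1..levels (illustrative)."""
    if hi == lo:
        return levels
    return 1 + int((score - lo) / (hi - lo) * (levels - 1))


score = bm25_term_score(tf=3, df=10, doc_len=120, avg_doc_len=100, num_docs=1000)
print(score, quantise(score, lo=0.0, hi=10.0))
```

Storing small integer impacts instead of floating-point scores is what allows postings to be grouped and processed in decreasing impact order.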
Expected Results
The following numbers should be reproducible using the scripts provided by the jig.
robust04
TREC 2004 Robust Track Topics.
- BM25: k1=0.9, b=0.4 (Robertson et al., 1995)
Metric | Score |
---|---|
MAP | 0.1984 |
P@30 | 0.2991 |
core17
TREC 2017 Common Core Track Topics.
- BM25: k1=0.9, b=0.4 (Robertson et al., 1995)
Metric | Score |
---|---|
MAP | 0.1415 |
P@30 | 0.4080 |
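The scores in the tables above come from trec_eval. For readers unfamiliar with the metrics, the following is a minimal sketch of their standard definitions for a single topic; MAP is the mean of the per-topic average precision. The function names and the toy document ids are illustrative only.

```python
def precision_at_k(ranked, relevant, k=30):
    """Fraction of the top k results that are relevant."""
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant document found."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(per_topic_ap):
    """MAP over a list of per-topic average precision values."""
    return sum(per_topic_ap) / len(per_topic_ap)


ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d9"}
print(precision_at_k(ranked, relevant, k=2), average_precision(ranked, relevant))
```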
Implementation
The following is a quick breakdown of what happens in each of the scripts in this repo.
Dockerfile
The Dockerfile installs dependencies (python3, etc.), copies scripts to the root dir, and sets the working dir to /work.
init
The init script is straightforward: it is simply a shell script (via the #!/usr/bin/env sh shebang) that downloads and builds ATIRE and JASS.
index
The index Python script (via the #!/usr/bin/python3 shebang) reads a JSON string (see here) containing at least one collection to index (including the name, path, and format). The collection is indexed and placed in the current working directory (i.e., /work).
At this point, jig takes a snapshot and the indexed collections are persisted for the search hook.
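The first step of the index hook, reading the collections out of the JSON string, might look like the sketch below. The key names ("collections", "name", "path", "format") and the function name are assumptions based on the description above, not a copy of the actual script or of the jig's schema.

```python
import json


def parse_index_args(json_string):
    """Extract (name, path, format) for each collection (key names are assumed)."""
    args = json.loads(json_string)
    collections = args.get("collections", [])
    if not collections:
        raise ValueError("expected at least one collection to index")
    return [(c["name"], c["path"], c["format"]) for c in collections]


example = json.dumps({
    "collections": [
        {"name": "robust04", "path": "/path/to/disk45", "format": "trectext"}
    ]
})
for name, path, fmt in parse_index_args(example):
    print(f"indexing {name} ({fmt}) from {path}")
```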
search
The search script reads a JSON string (see here) containing the collection name (to map back to the index directory from the index hook) and topic path, among other options. The retrieval run is performed and output is placed in /output for the jig to evaluate using trec_eval.
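For trec_eval to score a run, each result is written as one line in the standard TREC run format: topic id, the literal Q0, document id, rank, score, and a run tag. The sketch below formats such lines; the function name, the run tag, and the document ids are illustrative and are not taken from the actual search script.

```python
def format_run(topic_id, ranked, run_tag="jassv2"):
    """Format (doc_id, score) pairs, best first, as TREC run lines."""
    lines = []
    for rank, (doc_id, score) in enumerate(ranked, start=1):
        # Columns: topic, Q0, document id, rank, score, run tag.
        lines.append(f"{topic_id} Q0 {doc_id} {rank} {score} {run_tag}")
    return lines


results = [("DOC-A", 12.5), ("DOC-B", 9.25)]  # hypothetical document ids
for line in format_run(301, results):
    print(line)
```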
References
- S. E. Robertson, S. Walker, M. Hancock-Beaulieu, M. Gatford, and A. Payne (1995), Okapi at TREC-4, TREC
- A. Trotman, X.-F. Jia, and M. Crane (2012), Towards an Efficient and Effective Search Engine, Proceedings of the SIGIR 2012 Workshop on Open Source Information Retrieval, pp. 40-47
- Y. Lv and CX. Zhai (2011), Lower-Bounding Term Frequency Normalization, Proceedings of CIKM'11, pp. 7-16
- J. Lin and A. Trotman (2015), Anytime Ranking for Impact-Ordered Indexes, Proceedings of the 2015 International Conference on The Theory of Information Retrieval (ICTIR 2015), pp. 301-304
- A. Trotman and M. Crane (2019), Micro and Macro Optimization of SAAT Search, Software: Practice and Experience, 49(5):942-950
Reviews
- Documentation reviewed at commit dffe8bf (2019-06-16) by Ryan Clancy.