OSIRRC 2019 Jig

This is the jig for the SIGIR 2019 Open-Source IR Replicability Challenge (OSIRRC 2019). Check out the OSIRRC 2019 image library for a list of images that have been contributed to this exercise.

What's a jig?

To get started, clone the jig, then download and compile trec_eval (inside the jig directory) with the following command:

git clone https://github.com/usnistgov/trec_eval.git && make -C trec_eval

Make sure the Docker Python package is installed (via pip, conda, etc.):

pip install -r requirements.txt

Make sure the Docker daemon is running.

For common test collections, topics and qrels are already checked into this repo.

To test the jig with an Anserini image using default parameters, try:

python run.py prepare \
    --repo osirrc2019/anserini \
    --collections [name]=[path]=[format] ...

then

python run.py search \
    --repo osirrc2019/anserini \
    --collection [name] \
    --topic /path/to/topic \
    --output /path/to/output \
    --qrels /path/to/qrels

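Replace the placeholders ([name], [path], [format], /path/to/topic, /path/to/output, and /path/to/qrels) with the values for your setup.

The run files will be written to the directory passed to --output. The full command-line parameters are listed below.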

To run a container (from a saved image) that you can interact with, try:

python run.py interact \
    --repo osirrc2019/anserini \
    --tag latest

Collections

The following collections are supported:

| Name | URL |
| --- | --- |
| core17 | https://catalog.ldc.upenn.edu/LDC2008T19 |
| core18 | https://trec.nist.gov/data/wapost/ |
| cw09b | http://lemurproject.org/clueweb09.php/ |
| cw12b | http://lemurproject.org/clueweb12/ClueWeb12-CreateB13.php |
| gov2 | http://ir.dcs.gla.ac.uk/test_collections/gov2-summary.htm |
| robust04 | https://trec.nist.gov/data_disks.html |

Command Line Options

Options with none as the default are required.

Command Line Options - prepare

python run.py prepare <options>

| Option Name | Type | Default | Example | Description |
| --- | --- | --- | --- | --- |
| --repo | string | none | --repo osirrc2019/anserini | the repo on Docker Hub |
| --tag | string | latest | --tag latest | the tag on Docker Hub |
| --collections | [name]=[path]=[format] ... | none | --collections robust04=/path/to/robust04=trectext ... | the collections to index |
| --save_to_snapshot | string | save | --save_to_snapshot robust04-exp1 | used to determine the tag of the snapshotted image after indexing |
| --opts | [key]=[value] ... | none | --opts index_args="-storeRawDocs" | extra options passed to the index script |
| --version | string | none | --version 3b16584a7e3e7e3b93642a95675fc38396581bdf | the version string passed to the init script |
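
Each --collections value packs three fields into one string. The following is a minimal sketch of how such a value could be split into the name, path, and format fields; the function names are hypothetical and this is not necessarily how run.py parses the argument. It assumes that none of the fields contains an = character.

```python
def parse_collection_spec(spec):
    """Split one `[name]=[path]=[format]` value into its three fields."""
    parts = spec.split("=")
    # Assumption: exactly three non-empty fields, none containing "=".
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected [name]=[path]=[format], got: {!r}".format(spec))
    name, path, fmt = parts
    return {"name": name, "path": path, "format": fmt}


def parse_collections(specs):
    """Parse every value given to --collections, preserving order."""
    return [parse_collection_spec(spec) for spec in specs]


if __name__ == "__main__":
    print(parse_collections(["robust04=/path/to/robust04=trectext"]))
```

The resulting dictionaries use the same keys (name, path, format) as the collection entries passed to the index hook, although the path seen inside the container may differ from the host path.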

Command Line Options - search

python run.py search <options>

| Option Name | Type | Default | Example | Description |
| --- | --- | --- | --- | --- |
| --repo | string | none | --repo osirrc2019/anserini | the repo on Docker Hub |
| --tag | string | latest | --tag latest | the tag on Docker Hub |
| --collection | string | none | --collection robust04 | the name of the collection to search |
| --load_from_snapshot | string | save | --load_from_snapshot robust04-exp1 | used to determine the tag of the snapshotted image to search from |
| --topic | string | none | --topic topics/topics.robust04.301-450.601-700.txt | the path of the topic file |
| --topic_format | string | trec | --topic_format trec | the format of the topic file |
| --top_k | int | 1000 | --top_k 500 | the number of results for top-k retrieval |
| --output | string | none | --output $(pwd)/output | the output path for run files |
| --qrels | string | none | --qrels $(pwd)/qrels/qrels.robust2004.txt | the qrels file for evaluation |
| --opts | [key]=[value] ... | none | --opts search_args="-bm25" | extra options passed to the search script |
| --timings | flag | false | --timings | print timing info (requires the time package, or bash, to be installed in Dockerfile) |
| --measures | string ... | "num_q map P.30" | --measures recall.1000 map | the measures for trec_eval |
| --gpu | boolean | False | --gpu True | flag to launch docker with nvidia runtime |
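
The --measures values correspond to trec_eval measure names, which trec_eval accepts through its repeatable -m flag. The sketch below assembles such a command line as an argument list; the function name is hypothetical, the binary path trec_eval/trec_eval assumes the build location produced by make -C trec_eval, and jig itself may invoke trec_eval differently.

```python
def trec_eval_command(qrels, run_file, measures="num_q map P.30",
                      binary="trec_eval/trec_eval"):
    """Build an argv list that evaluates run_file against qrels."""
    cmd = [binary]
    for measure in measures.split():
        # One -m flag per requested measure, e.g. -m map -m P.30
        cmd += ["-m", measure]
    cmd += [qrels, run_file]
    return cmd


if __name__ == "__main__":
    # Prints the command instead of running it, so no binary is needed.
    print(" ".join(trec_eval_command("qrels/qrels.robust2004.txt", "output/run.txt")))
```

The list form can be passed directly to subprocess.run without shell quoting.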

Command Line Options - train

python run.py train <options>

| Option Name | Type | Default | Example | Description |
| --- | --- | --- | --- | --- |
| --repo | string | none | --repo osirrc2019/anserini | the repo on Docker Hub |
| --tag | string | latest | --tag latest | the tag on Docker Hub |
| --load_from_snapshot | string | save | --load_from_snapshot robust04-exp1 | used to determine the tag of the snapshotted image to train from |
| --topic | string | none | --topic topics/topics.robust04.301-450.601-700.txt | the path of the topic file |
| --topic_format | string | trec | --topic_format trec | the format of the topic file |
| --test_split | string | none | --test_split $(pwd)/sample_training_validation_query_ids/robust04_test.txt | the path to the file with the query ids to use for testing (the docker image is expected to compute the training topic ids, which are all topic ids except those listed in the test and validation ids files) |
| --validation_split | string | none | --validation_split $(pwd)/sample_training_validation_query_ids/robust04_validation.txt | the path to the file with the query ids to use for model validation (the docker image is expected to compute the training topic ids, which are all topic ids except those listed in the test and validation ids files) |
| --model_folder | string | none | --model_folder $(pwd)/output | the folder to save the model trained by the docker image |
| --qrels | string | none | --qrels $(pwd)/qrels/qrels.robust2004.txt | the qrels file for evaluation |
| --gpu | boolean | False | --gpu True | flag to launch docker with nvidia runtime |
| --opts | [key]=[value] ... | none | --opts epochs=10 | extra options passed to the train script |
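
Since the image is expected to derive the training topic ids itself, a train script needs a step along the following lines. This is a minimal sketch: the function names are hypothetical, and the one-id-per-line layout of the split files is an assumption that should be checked against the sample files.

```python
def read_ids(path):
    """Read topic ids from a split file (assumed: one id per line)."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def training_ids(all_ids, test_ids, validation_ids):
    """Training ids are all topic ids minus the test and validation ids."""
    return sorted(set(all_ids) - set(test_ids) - set(validation_ids))


if __name__ == "__main__":
    topics = ["301", "302", "303", "304"]
    print(training_ids(topics, {"302"}, {"304"}))
```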

Command Line Options - interact

| Option Name | Type | Default | Example | Description |
| --- | --- | --- | --- | --- |
| --repo | string | none | --repo osirrc2019/anserini | the repo on Docker Hub |
| --tag | string | latest | --tag latest | the tag on Docker Hub |
| --load_from_snapshot | string | save | --load_from_snapshot robust04-exp1 | used to determine the tag of the snapshotted image to interact with |
| --exit_jig | string | false | --exit_jig true | determines whether jig exits after starting the container |
| --opts | [key]=[value] ... | none | --opts interact_args="localhost:5000" | extra options passed to the interact script |

Docker Container Contract

Currently we support five hooks: init, index, train, search, and interact. We expect search or interact to be called after init and index. We also expect these executables to be located in the root directory of the container.

Each script is executed with the interpreter determined by its shebang, so you can use #!/usr/bin/env bash, #!/usr/bin/env python3, etc. Just make sure your Dockerfile is built with the appropriate base image or that the required dependencies are installed.

init

The purpose of the init hook is to do any preparation needed for the run - this could be downloading + compiling code, downloading a pre-built artifact, or downloading external resources (pre-trained models, knowledge graphs, etc.).

The script will be executed as ./init --json <json> where the JSON string has the following format:

{
  "opts": { // extra options passed to the init script
    "<key>": "<value>"
  }
}
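
As an illustration of the calling convention, here is a minimal sketch of an init hook written in Python. It only decodes the --json argument and reports the options; the function names are hypothetical, and real preparation work (downloads, compilation) would replace the print statement.

```python
#!/usr/bin/env python3
"""Hypothetical ./init hook: decode --json and report the extra options."""
import argparse
import json


def parse_hook_args(argv=None):
    """Return the decoded JSON payload passed as `--json <json>`."""
    parser = argparse.ArgumentParser()
    # json.loads turns the argument string into a dict; bad JSON is reported by argparse.
    parser.add_argument("--json", type=json.loads, required=True)
    return parser.parse_args(argv).json


def main(argv=None):
    payload = parse_hook_args(argv)
    opts = payload.get("opts", {})  # extra options; may be absent or empty
    for key, value in sorted(opts.items()):
        print("init option {}={}".format(key, value))
    return opts


if __name__ == "__main__":
    main()
```

The same parse_hook_args pattern applies to the other hooks, since they all receive a single --json argument.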

index

The purpose of the index hook is to build the indexes required for the run.

Before the hook is run, we will mount the document collections at a path passed to the script.

The script will be executed as: ./index --json <json> where the JSON string has the following format:

{
  "collections": [
    {
      "name": "<name>",              // the collection name
      "path": "/path/to/collection", // the collection path
      "format": "<format>"           // the collection format (trectext, trecweb, json, warc)
    },
    ...
  ],
  "opts": { // extra options passed to the index script
    "<key>": "<value>"
  }
}
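
A sketch of how an index hook might walk the collections array is shown below. The function names are hypothetical, the set of formats is taken from the comment in the JSON above, and the actual indexing call is left out because it depends on the retrieval system in the image.

```python
#!/usr/bin/env python3
"""Hypothetical ./index hook: validate the payload and list the work to do."""
import argparse
import json

KNOWN_FORMATS = {"trectext", "trecweb", "json", "warc"}


def plan_indexing(payload):
    """Return (name, path, format) tuples for each collection in the payload."""
    plan = []
    for collection in payload["collections"]:
        if collection["format"] not in KNOWN_FORMATS:
            raise ValueError("unsupported format: {}".format(collection["format"]))
        plan.append((collection["name"], collection["path"], collection["format"]))
    return plan


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--json", type=json.loads, required=True)
    payload = parser.parse_args(argv).json
    for name, path, fmt in plan_indexing(payload):
        # A real hook would invoke the system's indexer here.
        print("would index {} at {} as {}".format(name, path, fmt))


if __name__ == "__main__":
    main()
```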

train

The purpose of the train hook is to train a retrieval model.

The script will be executed as: ./train --json <json> where the JSON string has the following format:

{
  "topic": {
    "path": "/path/to/topic", // the path to the topic file
    "format": "trec"          // the format of the topic file
  },
  "qrels": {
    "path": "/path/to/qrel"   // the path to the qrel file
  },
  "model_folder": {
    "path": "/output"  // the path (in the docker image) where the output model folder (passed to the jig) is mounted
  },
  "opts": { // extra options passed to the train script
    "<key>": "<value>"
  }
}

search

The purpose of the search hook is to perform an ad-hoc retrieval run - multiple runs can be performed by calling jig multiple times with different --opts parameters.

The run files are expected to be placed in the /output directory such that they can be evaluated externally by jig using trec_eval.

The script will be executed as ./search --json <json> where the JSON string has the following format:

{
  "collection": {
    "name": "<name>"          // the collection name
  },
  "opts": { // extra options passed to the search script
    "<key>": "<value>"
  },
  "topic": {
    "path": "/path/to/topic", // the path to the topic file
    "format": "trec"          // the format of the topic file
  },
  "top_k": <int>              // the num of retrieval results for top-k retrieval
}

Note: If you're using the --timings option for the search hook, ensure that the time package (or bash) is installed in your Dockerfile.
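
For trec_eval to score the results, the files written to /output need to be in the standard TREC run format (topic id, the literal Q0, document id, rank, score, run tag). Below is a minimal sketch of writing such a file while honouring top_k; the function name, the run file name, and the run tag are hypothetical.

```python
import os


def write_run(results, output_dir, run_name, top_k, tag="sketch"):
    """Write results (topic id -> list of (docid, score)) as a TREC run file."""
    os.makedirs(output_dir, exist_ok=True)
    run_path = os.path.join(output_dir, run_name)
    with open(run_path, "w") as out:
        for topic_id in sorted(results):
            # Highest score first, truncated to the requested depth.
            ranked = sorted(results[topic_id], key=lambda pair: pair[1], reverse=True)
            for rank, (docid, score) in enumerate(ranked[:top_k], start=1):
                out.write("{} Q0 {} {} {:.4f} {}\n".format(
                    topic_id, docid, rank, score, tag))
    return run_path


if __name__ == "__main__":
    import tempfile
    demo = {"301": [("D1", 1.0), ("D2", 2.0), ("D3", 0.5)]}
    # A temporary directory stands in for /output in this demo.
    with tempfile.TemporaryDirectory() as tmp:
        with open(write_run(demo, tmp, "run.sketch", top_k=2)) as handle:
            print(handle.read(), end="")
```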

interact

The purpose of the interact hook is to prepare for user interaction, assuming that any process started by init or index is gone.

The script will be executed as ./interact --json <json> where the JSON string has the following format:

{
  "opts": { // extra options passed to the interact script
    "<key>": "<value>"
  }
}

Note: If you need a port accessible, ensure you EXPOSE the port in your Dockerfile.

Azure Script

Run the script as follows: ./azure.sh --disk-name <disk_name> --resource-group <group> --vm-name <vm_name> --vm-size <vm_size> --run-file <file.json> --ssh-pubkey-path <path> --subscription <id>

The runs are defined in a JSON file; see azure.json for an example. Values in [] (e.g., [COLLECTION_PATH]) are replaced with the appropriate values defined in the file.
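
The placeholder substitution can be pictured with the following sketch. It is an illustration only: azure.sh is a shell script and may implement this differently, the function name is hypothetical, and the /mnt/collections value is made up for the example.

```python
import re


def fill_placeholders(template, values):
    """Replace [KEY] markers in template with values[KEY] where defined."""
    def replace(match):
        key = match.group(1)
        # Unknown placeholders are left untouched rather than removed.
        return values.get(key, match.group(0))
    return re.sub(r"\[([A-Z_]+)\]", replace, template)


if __name__ == "__main__":
    command = "--collections robust04=[COLLECTION_PATH]/robust04=trectext"
    print(fill_placeholders(command, {"COLLECTION_PATH": "/mnt/collections"}))
```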

Notes

Python 3.5 or higher is required to run jig. Nvidia-docker is required to run images with GPU support; see https://github.com/NVIDIA/nvidia-docker for more details.