OpenNIR
An end-to-end neural ad-hoc ranking pipeline.
Quick start
OpenNIR requires Python 3.6 (not tested with other versions). Java 11 is required (for Anserini).
- OpenNIR can also be run in Docker; you can find instructions here.
Install dependencies
pip install -r requirements.txt
Train and validate a model (here, ConvKNRM on ANTIQUE):
scripts/pipeline.sh config/conv_knrm config/antique
(Performance on the test set can be obtained by adding pipeline.test=True)
Grid search for BM25 over ANTIQUE for comparison with neural model performance:
scripts/pipeline.sh config/grid_search config/antique
(Performance on the test set can be obtained by adding pipeline.test=True)
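To illustrate what a grid search over BM25 does conceptually, here is a minimal Python sketch. It is not OpenNIR's implementation: the function name grid_search, the candidate values for the BM25 parameters k1 and b, and the toy scoring function are all assumptions made for illustration. In practice the scoring function would be a validation metric computed from a retrieval run.

```python
import itertools


def grid_search(evaluate, k1_values, b_values):
    """Try every (k1, b) pair and return the best score and its parameters.

    `evaluate` is a hypothetical callable that returns a validation
    score (higher is better) for a given BM25 setting.
    """
    best_score, best_params = float("-inf"), None
    for k1, b in itertools.product(k1_values, b_values):
        score = evaluate(k1=k1, b=b)
        if score > best_score:
            best_score, best_params = score, {"k1": k1, "b": b}
    return best_score, best_params


def toy_validation_score(k1, b):
    # Stand-in for a real validation metric; peaks at k1=1.2, b=0.75.
    return -((k1 - 1.2) ** 2) - ((b - 0.75) ** 2)


best_score, best_params = grid_search(
    toy_validation_score,
    k1_values=[0.6, 0.9, 1.2, 1.5],   # assumed candidate values
    b_values=[0.25, 0.5, 0.75, 1.0],  # assumed candidate values
)
print(best_params)
```

Selecting parameters on validation data and reporting test performance only once keeps the BM25 baseline comparable with the neural models.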
Models, datasets, and vocabularies will be saved in ~/data/onir/. This can be overridden by setting data_dir=~/some/other/place/ as a command line argument, in a configuration file, or in the ONIR_ARGS environment variable.
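As a rough sketch of how these key=value settings might be assembled when launching the pipeline from Python, the helpers below build the argument list and environment for scripts/pipeline.sh. The helper names build_pipeline_command and build_env are hypothetical, and the assumption that ONIR_ARGS holds space-separated key=value pairs is not confirmed by this README.

```python
import os


def build_pipeline_command(configs, overrides=None):
    """Return the argument list: script, config files, then key=value overrides."""
    argv = ["scripts/pipeline.sh", *configs]
    for key, value in (overrides or {}).items():
        argv.append(f"{key}={value}")
    return argv


def build_env(overrides, base_env=None):
    """Return a copy of the environment with ONIR_ARGS set.

    Assumption: ONIR_ARGS accepts space-separated key=value pairs.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["ONIR_ARGS"] = " ".join(f"{k}={v}" for k, v in overrides.items())
    return env


# Option 1: pass settings as command line arguments.
cmd = build_pipeline_command(
    ["config/conv_knrm", "config/antique"],
    {"pipeline.test": True, "data_dir": "~/some/other/place/"},
)
print(" ".join(cmd))

# Option 2: pass data_dir through the ONIR_ARGS environment variable.
env = build_env({"data_dir": "~/some/other/place/"})
print(env["ONIR_ARGS"])
```

The resulting cmd and env could be handed to subprocess.run(cmd, env=env) from the repository root.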
Features
Rankers
- DRMM: ranker=drmm (paper)
- Duet (local model): ranker=duetl (paper)
- MatchPyramid: ranker=matchpyramid (paper)
- KNRM: ranker=knrm (paper)
- PACRR: ranker=pacrr (paper)
- ConvKNRM: ranker=conv_knrm (paper)
- Vanilla BERT: config/vanilla_bert (paper)
- CEDR models: config/cedr/[model] (paper)
- MatchZoo models (source)
  - MatchZoo's KNRM: ranker=mz_knrm
  - MatchZoo's ConvKNRM: ranker=mz_conv_knrm
Datasets
- TREC Robust 2004: config/robust/fold[x]
- MS-MARCO: config/msmarco
- ANTIQUE: config/antique
- TREC CAR: config/car
- New York Times: config/nyt -- for content-based weak supervision
- TREC Arabic, Mandarin, and Spanish: config/multiling/* -- for zero-shot multilingual transfer learning (instructions)
Evaluation Metrics
New: Any measure from the ir-measures package.
- map (from trec_eval)
- ndcg (from trec_eval)
- ndcg@X (from trec_eval, gdeval)
- p@X (from trec_eval)
- err@X (from gdeval)
- mrr (from trec_eval)
- rprec (from trec_eval)
- judged@X (implemented in Python)
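Since judged@X is the one measure listed as implemented in Python, a small sketch may help show what it reports. This assumes the usual definition (the fraction of the top X ranked documents that have any relevance judgment); the function name judged_at_k and the toy data are illustrative and not taken from OpenNIR's code.

```python
def judged_at_k(ranked_doc_ids, judged_doc_ids, k):
    """Fraction of the top-k ranked documents that have a relevance judgment.

    A document counts as judged if it appears in the qrels at all,
    regardless of whether it was judged relevant or non-relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_doc_ids[:k]
    judged = sum(1 for doc_id in top_k if doc_id in judged_doc_ids)
    return judged / k


# Toy example: qrels for one query (doc id -> relevance label).
qrels = {"d1": 1, "d2": 0, "d9": 2}
ranking = ["d3", "d1", "d7", "d2"]

print(judged_at_k(ranking, qrels, k=4))  # 2 of the top 4 are judged -> 0.5
```

A low judged@X suggests that other metrics for the same run may be underestimated, because unjudged documents are typically treated as non-relevant.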
Vocabularies
- Binary term matching: vocab=binary (i.e., changes the interaction matrix from cosine similarity to binary indicators)
- Pretrained word vectors: vocab=wordvec
  - vocab.source=fasttext
    - vocab.variant=wiki-news-300d-1M, vocab.variant=crawl-300d-2M
    - (information about FastText variants can be found here)
  - vocab.source=glove
    - vocab.variant=cc-42b-300d, vocab.variant=cc-840b-300d
    - (information about GloVe variants can be found here)
  - vocab.source=convknrm
    - vocab.variant=knrm-bing, vocab.variant=knrm-sogou, vocab.variant=convknrm-bing, vocab.variant=convknrm-sogou
    - (information about ConvKNRM word embedding variants can be found here)
  - vocab.source=bionlp
    - vocab.variant=pubmed-pmc
    - (information about BioNLP variants can be found here)
- Pretrained word vectors w/ single UNK vector for unknown terms: vocab=wordvec_unk
  - (with above word embedding sources)
- Pretrained word vectors w/ hash-based random selection for unknown terms: vocab=wordvec_hash (default)
  - (with above word embedding sources)
- BERT contextualized embeddings: vocab=bert
  - Core models (from HuggingFace): vocab.bert_base=bert-base-uncased (default), vocab.bert_base=bert-large-uncased, vocab.bert_base=bert-base-cased, vocab.bert_base=bert-large-cased, vocab.bert_base=bert-base-multilingual-uncased, vocab.bert_base=bert-base-multilingual-cased, vocab.bert_base=bert-base-chinese, vocab.bert_base=bert-base-german-cased, vocab.bert_base=bert-large-uncased-whole-word-masking, vocab.bert_base=bert-large-cased-whole-word-masking, vocab.bert_base=bert-large-uncased-whole-word-masking-finetuned-squad, vocab.bert_base=bert-large-cased-whole-word-masking-finetuned-squad, vocab.bert_base=bert-base-cased-finetuned-mrpc
  - SciBERT: vocab.bert_base=scibert-scivocab-uncased, vocab.bert_base=scibert-scivocab-cased, vocab.bert_base=scibert-basevocab-uncased, vocab.bert_base=scibert-basevocab-cased
  - BioBERT: vocab.bert_base=biobert-pubmed-pmc, vocab.bert_base=biobert-pubmed, vocab.bert_base=biobert-pmc
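The difference between the word-vector and binary vocabularies can be sketched with a toy query-document interaction matrix. The code below is an illustration only, not OpenNIR's implementation: the two-dimensional embeddings, the pool of eight random vectors, and the use of an MD5 digest to pick a vector for unknown terms are all assumptions chosen to keep the example small and deterministic.

```python
import hashlib
import math
import random

# Toy pretrained vectors (assumed values, 2 dimensions for readability).
EMBEDDINGS = {"cat": [1.0, 0.0], "kitten": [0.8, 0.6], "sat": [0.0, 1.0]}


def make_hash_pool(pool_size, dim, seed=0):
    """Fixed pool of random vectors shared by all unknown terms."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(pool_size)]


HASH_POOL = make_hash_pool(pool_size=8, dim=2)


def lookup(term):
    """Return the pretrained vector, or a hash-selected random one if unknown."""
    if term in EMBEDDINGS:
        return EMBEDDINGS[term]
    digest = hashlib.md5(term.encode("utf-8")).hexdigest()
    return HASH_POOL[int(digest, 16) % len(HASH_POOL)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cosine_matrix(query_terms, doc_terms):
    """Interaction matrix of cosine similarities between term vectors."""
    return [[cosine(lookup(q), lookup(d)) for d in doc_terms] for q in query_terms]


def binary_matrix(query_terms, doc_terms):
    """Interaction matrix of exact-match indicators (1.0 if terms are equal)."""
    return [[1.0 if q == d else 0.0 for d in doc_terms] for q in query_terms]


query, doc = ["cat", "kitten"], ["cat", "sat"]
print(cosine_matrix(query, doc))
print(binary_matrix(query, doc))
```

Because the hash is computed from the term itself, an unknown term maps to the same random vector every time it appears, so it still matches itself strongly in the cosine matrix. A single shared UNK vector, by contrast, would make all unknown terms look identical to one another.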
Citing OpenNIR
If you use OpenNIR, please cite the following WSDM demonstration paper:
@InProceedings{macavaney:wsdm2020-onir,
author = {MacAvaney, Sean},
title = {{OpenNIR}: A Complete Neural Ad-Hoc Ranking Pipeline},
booktitle = {{WSDM} 2020},
year = {2020}
}
Acknowledgements
I gratefully acknowledge support for this work from the ARCS Endowment Fellowship. I thank Andrew Yates, Arman Cohan, Luca Soldaini, Nazli Goharian, and Ophir Frieder for valuable feedback on the manuscript and/or code contributions to OpenNIR.