PyTerrier_ANCE

This is the PyTerrier plugin for the ANCE dense passage retriever.

Installation

This repository can be installed using pip:

pip install --upgrade git+https://github.com/terrierteam/pyterrier_ance.git

You will also need FAISS (CPU or GPU version) installed:

On Colab:

!pip install faiss-cpu 

On Anaconda:

# CPU-only version
$ conda install -c pytorch faiss-cpu

# GPU(+CPU) version
$ conda install -c pytorch faiss-gpu

For ANCE, the CPU version is sufficient.
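
Before indexing, it can be useful to confirm that FAISS is visible to the Python environment you are running in. The following is a minimal sketch using only the standard library; the helper name `faiss_available` is made up for illustration and is not part of this plugin. It relies on both `faiss-cpu` and `faiss-gpu` exposing a module named `faiss`.

```python
import importlib.util


def faiss_available() -> bool:
    """Return True if a module named 'faiss' can be found on the import path."""
    # find_spec only locates the module; it does not import it
    return importlib.util.find_spec("faiss") is not None


print("faiss importable:", faiss_available())
```

If this prints `False`, install one of the FAISS packages above into the same environment and try again.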

Indexing

You will need a pre-trained ANCE checkpoint. There are several available from the ANCE repository.

Then, indexing is as easy as instantiating the indexer, pointing it at the (unzipped) checkpoint and the directory in which you wish to create the index:


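import pyterrier as pt
import pyterrier_ance

# Load the Vaswani test collection through ir_datasets
dataset = pt.get_dataset("irds:vaswani")
# Encode the corpus with the ANCE checkpoint and write the index to disk
indexer = pyterrier_ance.ANCEIndexer("/path/to/checkpoint", "/path/to/anceindex")
indexer.index(dataset.get_corpus_iter())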
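
The indexer consumes an iterator of documents, so you are not limited to built-in datasets. As a rough sketch, a custom corpus can be exposed as a generator of dictionaries. This assumes the indexer reads a `docno` field and a `text` field from each dictionary, which matches what `dataset.get_corpus_iter()` yields for Vaswani; the function name `toy_corpus_iter` and the documents are invented for illustration.

```python
from typing import Dict, Iterator


def toy_corpus_iter() -> Iterator[Dict[str, str]]:
    """Yield one dictionary per document, with 'docno' and 'text' keys."""
    docs = [
        ("d1", "dense retrieval encodes queries and passages as vectors"),
        ("d2", "an inverted index maps terms to posting lists"),
        ("d3", "approximate nearest neighbour search trades recall for speed"),
    ]
    for docno, text in docs:
        yield {"docno": docno, "text": text}


for doc in toy_corpus_iter():
    print(doc["docno"], "->", doc["text"])
```

Under that assumption, `toy_corpus_iter()` could be passed to `indexer.index(...)` in place of `dataset.get_corpus_iter()`.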

Retrieval

You can instantiate the retrieval transformer, again by specifying the checkpoint location and the index location:

anceretr = pyterrier_ance.ANCERetrieval("/path/to/checkpoint", "/path/to/anceindex")

Thereafter, you can use it in the normal PyTerrier way, for instance in an experiment:

pt.Experiment(
    [anceretr], 
    dataset.get_topics(), 
    dataset.get_qrels(), 
    eval_metrics=["map"]
)

You can also use ANCE to score text (e.g., as a re-ranker) using ANCETextScorer:

ance_text_scorer = pyterrier_ance.ANCETextScorer("/path/to/checkpoint")
# You'll need to use this in a retrieval pipeline that includes the document text, e.g.:
# bm25 >> pt.text.get_text(dataset, 'text') >> ance_text_scorer
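
To make the re-ranking step concrete, here is a small pure-Python sketch of scoring candidates by the dot product between a query vector and each passage vector, which is the similarity ANCE uses. It does not call this plugin: the vectors are made-up toy values standing in for encoder outputs, and `dot` and `rerank` are hypothetical helper names.

```python
from typing import Dict, List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rerank(query_vec: List[float],
           candidates: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Score every candidate against the query and sort by descending score."""
    scored = [(docno, dot(query_vec, vec)) for docno, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings; a real encoder would produce much higher-dimensional vectors
query_vec = [1.0, 0.0, 2.0]
candidates = {
    "d1": [0.5, 1.0, 0.0],
    "d2": [0.0, 0.0, 1.0],
    "d3": [1.0, 1.0, 1.0],
}

ranking = rerank(query_vec, candidates)
print(ranking)
```

In the real pipeline, the first-stage retriever (such as BM25) supplies the candidates and the text scorer handles encoding and scoring.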

Documents longer than Passages

If your documents are longer than passages, you should split them into passages before indexing, and aggregate the passage scores back to documents (for example, using max passage) during retrieval:


# indexing
import pyterrier as pt
import pyterrier_ance
dataset = pt.get_dataset("irds:vaswani")
indexer = pt.text.sliding("text", prepend_attr=None) >> pyterrier_ance.ANCEIndexer("/path/to/checkpoint", "/path/to/anceindex")
indexer.index(dataset.get_corpus_iter())

# retrieval 

ance_maxp = pyterrier_ance.ANCERetrieval("/path/to/checkpoint", "/path/to/anceindex") >> pt.text.max_passage()
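
The idea behind this pipeline can be sketched in plain Python: cut each document into overlapping token windows, score each window, then keep the best window score per document. This is an illustrative approximation, not PyTerrier's implementation. The function names are made up, whitespace tokenisation is a simplification, and the `%p` passage-id suffix and the 150/75 window defaults are assumptions about what `pt.text.sliding` does.

```python
from typing import Dict, List, Tuple


def sliding_passages(docno: str, text: str,
                     length: int = 150, stride: int = 75) -> List[Tuple[str, str]]:
    """Split text into overlapping windows of `length` tokens, `stride` apart."""
    tokens = text.split()
    passages = []
    for i, start in enumerate(range(0, max(len(tokens), 1), stride)):
        window = tokens[start:start + length]
        passages.append((f"{docno}%p{i}", " ".join(window)))
        if start + length >= len(tokens):
            break  # this window already reaches the end of the document
    return passages


def max_passage(passage_scores: Dict[str, float]) -> Dict[str, float]:
    """Keep, for each document, the highest score among its passages."""
    best: Dict[str, float] = {}
    for passage_id, score in passage_scores.items():
        docno = passage_id.rsplit("%p", 1)[0]
        if docno not in best or score > best[docno]:
            best[docno] = score
    return best


print(sliding_passages("d1", "a b c d e", length=3, stride=2))
print(max_passage({"d1%p0": 0.2, "d1%p1": 0.9, "d2%p0": 0.5}))
```

Max passage is only one way to aggregate. It suits collections where a single highly relevant passage is enough to make the whole document relevant.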

Examples

Check out the notebooks, which can also be run on Colab.

The Terrier data repository contains ANCE indices for several corpora, including Vaswani and MSMARCO Passage v1.

Implementation Details

We use a forked copy of ANCE that makes it pip-installable and addresses other quibbles.

References

Credits