
ART (Autoencoding-based Retriever Training) is a retriever training algorithm developed for the task of passage retrieval.

<p align="center"> <img src="images/art-model.png"> </p>

ART trains the dense retriever by treating the language model's question reconstruction score, conditioned on each retrieved passage, as a soft label for the retriever's passage likelihood. Colored blocks indicate trainable parameters; red arrows show gradient flow during backpropagation.
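For intuition, here is a minimal PyTorch-style sketch of this soft-label objective. It is an illustration only, not the repository's training code; the tensor shapes, the temperature, and the assumption that passage embeddings are precomputed and the PLM is frozen are simplifications.

<pre>
# Illustrative sketch of the ART objective described above (not the actual
# training code in this repo). Shapes, temperature, and top-K handling are
# assumptions made for clarity.
import torch
import torch.nn.functional as F

def art_loss(question_emb, passage_embs, lm_log_likelihoods, tau=1.0):
    # question_emb:       [d]    output of the trainable question encoder
    # passage_embs:       [K, d] embeddings of the top-K retrieved passages
    # lm_log_likelihoods: [K]    frozen PLM scores log p(question | passage_k)

    # Retriever likelihood over the K retrieved passages.
    retriever_scores = passage_embs @ question_emb          # [K]
    log_p_retriever = F.log_softmax(retriever_scores, dim=-1)

    # Soft labels from the PLM's question-reconstruction scores (no gradient).
    soft_labels = F.softmax(lm_log_likelihoods.detach() / tau, dim=-1)

    # KL(soft labels || retriever): gradients flow only into the retriever.
    return F.kl_div(log_p_retriever, soft_labels, reduction="sum")
</pre>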

<a id="setup"></a>

Setup

<pre> sudo docker run --ipc=host --gpus all -it --rm -v /mnt/disks:/mnt/disks nvcr.io/nvidia/pytorch:22.01-py3 bash </pre>

where /mnt/disks is the host directory to be mounted inside the container.
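Optionally, once inside the container you can run a quick sanity check that the GPUs and the mounted directory are visible (this is not a required setup step):

<pre>
import os
import torch

# Optional sanity check inside the container.
print("CUDA devices :", torch.cuda.device_count())
print("Mount present:", os.path.isdir("/mnt/disks"))
</pre>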

<a id="downloading-data-and-checkpoints"></a>

Downloading Data and Checkpoints

We have provided datasets and initial retriever checkpoints to train models for dense retrieval.

We have also provided a script download_data.sh that will download all the required datasets. Run this script by passing a directory path as the first argument.

<pre> bash examples/helper-scripts/download_data.sh DIRNAME </pre>

These files can also be downloaded separately by using the wget command-line utility and the links provided below.

Required data files for training

The BERT pre-tokenized evidence file(s) can also be obtained by the command:

<pre>
python tools/create_evidence_indexed_dataset.py \
    --input /mnt/disks/project/data/wikipedia-split/psgs_w100.tsv \
    --tsv-keys text title \
    --tokenizer-type BertWordPieceLowerCase \
    --vocab-file /mnt/disks/project/bert-vocab/bert-large-uncased-vocab.txt \
    --output-prefix wikipedia-evidence-bert \
    --workers 25
</pre>

The T0/T5 pre-tokenized evidence file(s) can also be obtained by the command:

<pre>
python tools/create_evidence_indexed_dataset_t0.py \
    --input /mnt/disks/project/data/wikipedia-split/psgs_w100.tsv \
    --tsv-keys text title \
    --output-prefix wikipedia-evidence-t0 \
    --workers 25
</pre>

Required checkpoints and pre-computed evidence embeddings

The evidence embeddings for a retriever checkpoint can be computed and evaluated with the command

<pre> bash examples/indexer-scripts/create_evidence_embeddings_and_evaluate.sh RETRIEVER_CHECKPOINT_PATH </pre>

Please make sure to update the data paths in this script.

For example, to compute the Wikipedia evidence embeddings for the above MSS retriever checkpoint and evaluate it on the NQ-Open dev and test sets, run:

<pre> bash examples/indexer-scripts/create_evidence_embeddings_and_evaluate.sh mss-retriever-base/iter_0082000 </pre>
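The evaluation reports top-k retrieval accuracy, i.e. the fraction of questions for which at least one of the top-k retrieved passages contains a gold answer string. A minimal sketch of that metric follows; the `results` structure is hypothetical and shown only for illustration (the actual script uses the repo's own answer-matching code):

<pre>
# Sketch of top-k retrieval accuracy; the `results` format is hypothetical.
def top_k_accuracy(results, k=20):
    hits = 0
    for example in results:
        passages = example["retrieved_passages"][:k]  # ranked passage texts
        answers = example["answers"]                  # gold answer strings
        if any(a.lower() in p.lower() for p in passages for a in answers):
            hits += 1
    return hits / len(results)

# top_k_accuracy(results, 20) and top_k_accuracy(results, 100) correspond to
# the Top-20 / Top-100 numbers reported in the tables below.
</pre>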

<a id="training"></a>

Training

To train ART on NQ-Open questions with either T0 (3B) or T5-lm-adapted (11B) as the cross-attention PLM, run one of:

<pre>
bash examples/zero-shot-retriever-training/art-nq-T0-3B.sh 2>&1 | tee art-training-T0-3B-log.txt
bash examples/zero-shot-retriever-training/art-nq-t5-lm-adapted-11B.sh 2>&1 | tee art-training-T5-lm-adapted-11B-log.txt
</pre>

To extract the retriever weights from a trained ART checkpoint (for indexing and evaluation), run:

<pre>
RETRIEVER_CHECKPOINT_PATH=${CHECKPOINT_PATH}"-tmp"
python tools/save_art_retriever.py --load ${CHECKPOINT_PATH} --save ${RETRIEVER_CHECKPOINT_PATH} --submodel-name "retriever"
</pre>
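Conceptually, this extraction step keeps only the retriever parameters from the full ART checkpoint; a rough sketch of the idea is below (the checkpoint layout, file names, and key prefix are assumptions, not the actual structure handled by save_art_retriever.py):

<pre>
import torch

# Illustration only: keep the retriever submodel's weights and save them
# as a standalone checkpoint. File names and key prefix are hypothetical.
state = torch.load("model_checkpoint.pt", map_location="cpu")
retriever_state = {k: v for k, v in state["model"].items()
                   if k.startswith("retriever.")}
torch.save({"model": retriever_state}, "retriever_only.pt")
</pre>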

<a id="pre-trained-checkpoints"></a>

Pre-trained Checkpoints

To evaluate a downloaded pre-trained retriever checkpoint, compute its evidence embeddings and run the evaluation with:

<pre> bash examples/indexer-scripts/create_evidence_embeddings_and_evaluate.sh RETRIEVER_CHECKPOINT_PATH/iter_000xxyy </pre>

Please make sure to update the data paths in this script.

Top-20 / top-100 accuracy when trained using questions from each dataset.

| Retriever | Cross-Attention PLM | SQuAD-Open | TriviaQA | NQ-Open | WebQ |
|---|---|---|---|---|---|
| ART | T5-lm-adapt (11B) | 74.2 / 84.3 (url) | 82.5 / 86.6 (url) | 80.2 / 88.4 (url) | 74.4 / 82.7 (url) |
| ART-Multi (url) | T5-lm-adapt (11B) | 72.8 / 83.2 | 82.2 / 86.6 | 81.5 / 88.5 | 74.8 / 83.7 |
| ART | T0 (3B) | 75.3 / 85.0 (url) | 82.9 / 87.1 (url) | 81.6 / 89.0 (url) | 75.7 / 84.3 (url) |
| ART-Multi (url) | T0 (3B) | 74.7 / 84.5 | 82.9 / 87.0 | 82.0 / 88.9 | 76.6 / 85.0 |

Top-20 / top-100 accuracy when trained using all the questions released in the Natural Questions dataset (NQ-Full) and / or MS MARCO.

| Training Questions | Checkpoint | Cross-Attention PLM | SQuAD-Open | TriviaQA | NQ-Open | WebQ |
|---|---|---|---|---|---|---|
| NQ-Full | url | T5-lm-adapt (11B) | 67.3 / 79.0 | 79.4 / 84.9 | 81.7 / 88.8 | 73.4 / 82.9 |
| NQ-Full | url | T0 (3B) | 69.4 / 81.1 | 80.3 / 85.7 | 82.0 / 88.9 | 74.3 / 83.9 |
| MS MARCO | url | T0 (3B) | 68.4 / 80.4 | 78.0 / 84.1 | 77.8 / 86.2 | 74.8 / 83.2 |
| MS MARCO + NQ-Full | url | T0 (3B) | 69.6 / 81.1 | 80.7 / 85.7 | 82.3 / 89.1 | 75.3 / 84.5 |

Scaling up ART training to the large retriever configuration

| Evaluation Split | Config | Cross-Attention PLM | NQ-Open | TriviaQA |
|---|---|---|---|---|
| Dev | Base | T0 (3B) | 80.6 / 87.4 (url) | 83.6 / 87.4 (url) |
| Dev | Large | T0 (3B) | 81.0 / 87.8 (url) | 83.7 / 87.5 (url) |

| Evaluation Split | Config | Cross-Attention PLM | NQ-Open | TriviaQA |
|---|---|---|---|---|
| Test | Base | T0 (3B) | 81.6 / 89.0 | 82.9 / 87.1 |
| Test | Large | T0 (3B) | 82.1 / 88.8 | 83.6 / 87.6 |

BEIR Benchmark Experiments

On the BEIR benchmark, ART obtains results competitive with BM25, showcasing its effectiveness on ad-hoc retrieval tasks. Please see Table 9 in the paper for a full discussion of the results. To reproduce ART's results in Table 9, please follow these steps.

<p align="center"> <img src="images/beir-benchmark-results.png"> </p>

Download Required Data and MSMARCO Checkpoint

We have provided a script download_data_beir.sh that will download all the required datasets and checkpoints. Run this script by passing a directory path as the first argument.

<pre> bash examples/beir/download_data_beir.sh DIRNAME </pre>

These files can also be downloaded individually.
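As an optional alternative, individual BEIR datasets can also be fetched with the official beir Python package; note that the evaluation scripts in this repo expect the preprocessed files from download_data_beir.sh, so the snippet below is only a convenience for inspecting a dataset. The dataset name and output directory are examples.

<pre>
# Optional: fetch and inspect a single BEIR dataset with the `beir` package
# (pip install beir). The dataset name and output directory are examples.
from beir import util
from beir.datasets.data_loader import GenericDataLoader

url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip"
data_path = util.download_and_unzip(url, "DIRNAME/beir-datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
print(len(corpus), "passages,", len(queries), "queries")
</pre>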

Evaluation Scripts

To evaluate the MS MARCO-trained checkpoint on the BEIR datasets, and separately on the CQADupStack subsets, run:

<pre>
bash examples/beir/runner_beir.sh /mnt/disks/project/checkpoints/msmarco-mss-base-init-bs512-topk4-epochs10 2>&1 | tee beir-eval-using-msmarco-chkpt.txt
bash examples/beir/runner_cqadupstack.sh /mnt/disks/project/checkpoints/msmarco-mss-base-init-bs512-topk4-epochs10 2>&1 | tee cqadupstack-eval-using-msmarco-chkpt.txt
</pre>

Helper Scripts

<pre>
python tools/create_evidence_indexed_dataset.py \
    --input /mnt/disks/project/data/dpr/wikipedia_split/psgs_w100.tsv \
    --tsv-keys text title \
    --tokenizer-type BertWordPieceLowerCase \
    --vocab-file /mnt/disks/project/bert-vocab/bert-large-uncased-vocab.txt \
    --output-prefix wikipedia-evidence \
    --workers 25
</pre>

<a id="issues"></a>

Issues

For any errors or bugs in the codebase, please either open a new issue or send an email to Devendra Singh Sachan (sachan.devendra@gmail.com).

<a id="citation"></a>

Citation

If you find this code useful, please consider citing our paper:

<pre>
@article{sachan2021questions,
    title={Questions Are All You Need to Train a Dense Passage Retriever},
    author={Devendra Singh Sachan and Mike Lewis and Dani Yogatama and Luke Zettlemoyer and Joelle Pineau and Manzil Zaheer},
    journal={Transactions of the Association for Computational Linguistics},
    year={2022},
    url={https://arxiv.org/abs/2206.10658}
}
</pre>