RobustQA-ACL23-Data

This repo describes the details of the RobustQA (ACL'23 Findings) benchmark, which consists of datasets in 8 domains.

| Domain | Dataset | Description | Adapted/Annotated? |
|---|---|---|---|
| Web Search | SearchQA | Jeopardy! QA based on the Google search engine | Adapted |
| Biomedical | BioASQ | Open-domain QA based on PubMed documents | Adapted |
| Finance | FiQA | Financial QA based on microblogs, reports, and news | Annotated |
| Lifestyle | LoTTE | QA regarding lifestyle based on original IR data in search and forum | Annotated |
| Recreation | LoTTE | QA regarding recreation based on original IR data in search and forum | Annotated |
| Technology | LoTTE | QA regarding technology based on original IR data in search and forum | Annotated |
| Science | LoTTE | QA regarding science based on original IR data in search and forum | Annotated |
| Writing | LoTTE | QA regarding writing based on original IR data in search and forum | Annotated |

Disclaimers

We've included links to the license for each of the raw datasets. We only distribute some of RobustQA's datasets in a specific format; we do not vouch for their quality or fairness, nor do we claim that you have a license to use them. It remains your responsibility as a user to determine whether you have permission to use a dataset under its license and to cite its rightful owner.

Citation

@Inproceedings{Han2023,
 author = {Rujun Han and Peng Qi and Yuhao Zhang and Lan Liu and Juliette Burger and William Wang and Zhiheng Huang and Bing Xiang and Dan Roth},
 title = {RobustQA: Benchmarking the robustness of domain adaptation for open-domain question answering},
 year = {2023},
 url = {https://www.amazon.science/publications/robustqa-benchmarking-the-robustness-of-domain-adaptation-for-open-domain-question-answering},
 booktitle = {ACL Findings 2023},
}

Raw Data & Annotations

We only provide our new annotations, not the raw data. You can find them in data/. All files in this folder are tracked by Git LFS.

For the rest of the data, we provide instructions below to download the raw data and process it into a uniform format for RobustQA. In general, after data processing you can expect to have the following data files.

The passage file passages.jsonl and the aggregated QA file qrel.jsonl are needed for the experiments in the paper.
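
As a quick sanity check after processing, you can peek at a couple of records from each file with a few lines of Python. This is only an illustrative sketch: the field names it mentions (id/title/text for passages, question/answers for QA pairs) are assumptions, not a specification of the processed format.

import json

# Illustrative sketch only: prints the keys of the first few records so you can
# inspect the processed files. Any specific field names mentioned here
# (id/title/text, question/answers) are assumptions, not guarantees.
def peek_jsonl(path, n=2):
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= n:
                break
            print(sorted(json.loads(line).keys()))

peek_jsonl("data/fiqa/passages.jsonl")  # hypothetical path to a processed passage file
peek_jsonl("data/fiqa/qrel.jsonl")      # hypothetical path to an aggregated QA file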

FiQA

LoTTE

BioASQ

We only provide detailed data reproduction instructions and code below, to avoid any potential issues under the dataset's license. You will have to acquire the raw data on your own and run the data processing code below.

SearchQA

We only provide detailed data reproduction instructions and code below to avoid potential issues under the dataset's license. However, we may provide the final processed data upon request, since the license doesn't prohibit redistribution.

Experiment - Passage Retrieval

DPR

Follow the instructions here: https://github.com/facebookresearch/DPR to install the DPR package and download the NQ data and trained models. By default,

HOME=~/robustqa
OUTDIR=~/DPR/downloads/data/

# options: searchqa bioasq fiqa lifestyle recreation science technology writing
dataset=fiqa
cd ${OUTDIR}
mkdir ${dataset}_split

cd $HOME
python code/convert_to_dpr.py --data ${dataset} --output_dir ${OUTDIR}

Then follow the instructions on the same webpage to generate embeddings and retrieve passages.
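
For reference, DPR's retrieval pipeline reads passages from a TSV file with id, text, and title columns. The sketch below only illustrates that kind of conversion; it is not the repo's code/convert_to_dpr.py, and the field names assumed for passages.jsonl as well as the file paths are hypothetical.

import csv
import json

# Hedged illustration of converting RobustQA passages into DPR's TSV layout
# (columns: id, text, title). The passages.jsonl field names and both paths
# are assumptions; the real code/convert_to_dpr.py may behave differently.
def to_dpr_tsv(passages_jsonl, out_tsv):
    with open(passages_jsonl) as fin, open(out_tsv, "w", newline="") as fout:
        writer = csv.writer(fout, delimiter="\t")
        writer.writerow(["id", "text", "title"])
        for line in fin:
            p = json.loads(line)
            writer.writerow([p["id"], p["text"], p.get("title", "")])

to_dpr_tsv("data/fiqa/passages.jsonl", "fiqa_split/psgs.tsv")  # hypothetical paths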

BM25 + CE

Refer to these instructions for details on installing the BEIR package: https://github.com/beir-cellar/beir.

HOME=~/robustqa
# options: searchqa bioasq fiqa lifestyle recreation science technology writing
dataset=fiqa

cd $HOME
pip install beir
python code/convert_to_beir.py --${dataset}

Set up BM25 (Elasticsearch),

wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512
tar -xzf elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
sudo chown -R daemon:daemon elasticsearch-7.9.2/
shasum -a 512 -c elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512
sudo -H -u daemon elasticsearch-7.9.2/bin/elasticsearch

Run the models,

python code/run_beir_models.py --data ${dataset} --model bm25 --reindex
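
For orientation, the BEIR API makes the BM25 step fairly transparent. The sketch below is not the repo's run_beir_models.py; it is a minimal illustration of a BM25 retrieval-and-evaluation pass over BEIR-formatted data, and the data folder and index name are assumptions.

from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.lexical import BM25Search as BM25

# Load the corpus/queries/qrels produced by convert_to_beir.py.
# The folder path and index name below are assumptions.
corpus, queries, qrels = GenericDataLoader(data_folder="beir_data/fiqa").load(split="test")

# BM25 backed by the locally running Elasticsearch instance started above.
retriever = EvaluateRetrieval(BM25(index_name="fiqa", hostname="localhost", initialize=True))
results = retriever.retrieve(corpus, queries)

# Standard BEIR metrics at the default cutoffs.
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg, recall)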

ColBERTv2

Detailed instructions for running ColBERTv2 can be found here: https://github.com/stanford-futuredata/ColBERT; we do not repeat them. After setting up the ColBERTv2 directories and environment,

COLBERT=~/ColBERT
HOME=~/robustqa

cp colbert_scripts/* $COLBERT/

# options: searchqa bioasq fiqa lifestyle recreation science technology writing
dataset=fiqa

python code/convert_to_colbert.py --data ${dataset} --output_dir $COLBERT/data

Running ColBERTv2 consists of four steps,

  1. Download the ColBERTv2 checkpoint to $COLBERT/downloads/colbertv2.0
  2. Index passages
python colbert_scripts/run_colbert.py \
    --dataroot ${COLBERT}/data/ \
    --dataset ${dataset} \
    --model $COLBERT/downloads/colbertv2.0 \
    --index
  3. Search top passages
python colbert_scripts/run_colbert.py \
    --dataroot ${COLBERT}/data/ \
    --dataset ${dataset} \
    --model $COLBERT/downloads/colbertv2.0 \
    --search

This step will save a *ranking.tsv file into $COLBERT/experiments. Locate this file's path (ranking_file_path).

  4. Compute performance and save retrieval results
python colbert_scripts/run_colbert.py \
    --dataroot ${COLBERT}/data/ \
    --dataset ${dataset} \
    --model $COLBERT/downloads/colbertv2.0 \
    --eval \
    --ranking_file ${ranking_file_path}

This step will save a retrieved passage file {dataset}_from_colbert_{split}.json under ${COLBERT}/output/. This file is in the same format as the retrieved passage file from DPR above, and can be used directly as the input to the extractive QA model.
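
If you want to inspect the retrieval results before running the reader, a few lines of Python suffice. This is a hedged sketch: it assumes the DPR-style retriever-results layout (a list of entries with a question, its answers, and a ctxs list of retrieved passages); check an actual output file for the exact keys.

import json

# Hedged sketch: assumes DPR-style retriever results, i.e. a JSON list of entries
# shaped roughly like {"question": ..., "answers": [...], "ctxs": [{"title": ..., "text": ...}, ...]}.
# The file name is hypothetical; verify the keys against an actual output file.
with open("fiqa_from_colbert_test.json") as f:
    entries = json.load(f)

for entry in entries[:3]:
    print(entry["question"])
    for ctx in entry["ctxs"][:2]:
        print("   ", ctx.get("title", ""), "|", ctx["text"][:80])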

Experiment - Question Answering

DPR's Extractive QA

To run extractive QA inference, download the best QA model checkpoint from https://github.com/facebookresearch/DPR and use the following script,

python train_extractive_reader.py \
  prediction_results_file={path to a file to write the results to} \
  eval_top_docs=[10,20,40,50,80,100] \
  dev_files={path to the retriever results file to evaluate} \
  model_file={path to the reader checkpoint} \
  train.dev_batch_size=80 \
  passages_per_question_predict=100 \
  encoder.sequence_length=350

Since we use ColBERTv2 as the default retriever for the paper, dev_files needs to be set to the path of the {dataset}_from_colbert_{split}.json files from ColBERTv2. See details above.

Atlas

The Atlas model does not require ColBERTv2 to provide retrieved passages since it has its own dense retriever. Follow the instructions here: https://github.com/facebookresearch/atlas to set up the project repo and install the environment. The model checkpoints we experimented with in the paper are atlas-xxl_nq and atlas-base_nq.

Convert RobustQA data into Atlas format,

ATLAS=~/atlas
HOME=~/robustqa

# options: searchqa bioasq fiqa lifestyle recreation science technology writing
dataset=fiqa

python code/convert_to_atlas.py --data ${dataset} --output_dir $ATLAS/data
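
For orientation, Atlas's qa task reads JSONL evaluation files whose entries carry a question and a list of reference answers, alongside a passages JSONL. The sketch below only illustrates that kind of conversion for the QA side; it is not the repo's code/convert_to_atlas.py, and the field names assumed on both sides are hypothetical.

import json

# Hedged illustration of converting aggregated QA pairs into Atlas-style JSONL
# entries ({"question": ..., "answers": [...]}). Input field names and both
# paths are assumptions; the real code/convert_to_atlas.py may differ.
def to_atlas_qa(qrel_jsonl, out_jsonl):
    with open(qrel_jsonl) as fin, open(out_jsonl, "w") as fout:
        for line in fin:
            qa = json.loads(line)
            record = {"question": qa["question"], "answers": qa["answers"]}
            fout.write(json.dumps(record) + "\n")

to_atlas_qa("data/fiqa/qrel.jsonl", "fiqa-test.jsonl")  # hypothetical paths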

Run inference,

cd $ATLAS
export NGPU=8
model_size=base
n_cxt=40
split=test

python -m torch.distributed.launch --nproc_per_node=${NGPU} evaluate.py \
    --name run_atlas_nq_${model_size}_${n_cxt}_${dataset} \
    --generation_max_length 16 \
    --gold_score_mode "pdist" \
    --precision bf16 \
    --per_gpu_embedder_batch_size 128 \
    --reader_model_type google/t5-${model_size}-lm-adapt \
    --text_maxlength 200 \
    --model_path $ATLAS/models/atlas_nq/${model_size} \
    --eval_data $ATLAS/data/${dataset}-${split}.jsonl \
    --per_gpu_batch_size 1 \
    --n_context ${n_cxt} --retriever_n_context ${n_cxt} \
    --main_port -1 \
    --index_mode "flat"  \
    --task "qa" \
    --passages $ATLAS/data/${dataset}-passages.jsonl

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.