# RobustQA-ACL23-Data
This repo describes the details of the RobustQA (ACL'23 Findings) benchmark, which consists of datasets in 8 domains.
Domain | Dataset | Description | Adapted/Annotated?
---|---|---|---
Web Search | SearchQA | Jeopardy! QA based on Google search engine | Adapted |
Biomedical | BioASQ | Open-domain QA based on PubMed documents | Adapted |
Finance | FiQA | Financial QA based on microblogs, reports, news | Annotated |
Lifestyle | LoTTE | QA regarding lifestyle based on original IR data in search and forum | Annotated
Recreation | LoTTE | QA regarding recreation based on original IR data in search and forum | Annotated |
Technology | LoTTE | QA regarding technology based on original IR data in search and forum | Annotated |
Science | LoTTE | QA regarding science based on original IR data in search and forum | Annotated |
Writing | LoTTE | QA regarding writing based on original IR data in search and forum | Annotated |
## Disclaimers
We've included links to the license for each of the raw datasets. We only distribute some of RobustQA's datasets, in a specific format, and we do not vouch for their quality or fairness, nor do we claim that you have a license to use them. It remains the user's responsibility to determine whether they have permission to use a dataset under its license, and to cite the dataset's rightful owner.
## Citation

```bibtex
@inproceedings{Han2023,
    author = {Rujun Han and Peng Qi and Yuhao Zhang and Lan Liu and Juliette Burger and William Wang and Zhiheng Huang and Bing Xiang and Dan Roth},
    title = {RobustQA: Benchmarking the robustness of domain adaptation for open-domain question answering},
    year = {2023},
    url = {https://www.amazon.science/publications/robustqa-benchmarking-the-robustness-of-domain-adaptation-for-open-domain-question-answering},
    booktitle = {ACL Findings 2023},
}
```
## Raw Data & Annotations

We only provide our new annotations, without the raw data. You can find them in `data/`. All files in this folder are tracked by Git LFS.

For the rest of the data, we provide instructions to download the raw data and process it into the uniform RobustQA format. In general, after data processing, you can expect to have the following files and fields:

- `documents.jsonl`: original document pool. Data fields:
  - `doc_id`: document id
  - `title`: document title
  - `text`: document text
  - `meta_data`: optional
- `annotations.jsonl`: extractive QA annotations in the original documents. Data fields:
  - `qid`: question id
  - `question`: question text
  - `documents`:
    - `answers`: answer spans annotated in the document
    - `doc_id`: same as above
    - `title`: same as above
    - `text`: same as above
- `passages.jsonl`: the document texts in `documents.jsonl` split into passages of 100 words (based on whitespace). Data fields are the same as in `documents.jsonl`, except for `pid = doc_id-k`, where `k` is the k-th split of a document (0-based).
- `qrel.jsonl`: aggregates all answers per question. Data fields:
  - `qid`: question id
  - `question`: original question
  - `answers`: aggregated answers across different documents

The passage file `passages.jsonl` and the aggregated QA file `qrel.jsonl` are needed for the experiments in the paper.
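To make the passage splitting concrete, below is a minimal Python sketch of the 100-word split and the `pid = doc_id-k` convention. It is illustrative only: the example record is made up, and `code/process_raw.py` remains the authoritative implementation.

```python
import json

def split_document(doc, passage_len=100):
    """Split one documents.jsonl record into 100-word passages.

    Passage ids follow the pid = doc_id-k convention, where k is the
    0-based index of the split.
    """
    words = doc["text"].split()  # whitespace-based word boundaries
    return [
        {
            "pid": f"{doc['doc_id']}-{i // passage_len}",
            "title": doc["title"],
            "text": " ".join(words[i:i + passage_len]),
        }
        for i in range(0, max(len(words), 1), passage_len)
    ]

# Made-up record in the documents.jsonl format described above.
doc = {"doc_id": "d0", "title": "Example", "text": "lorem ipsum " * 120}
with open("passages.jsonl", "w") as f:
    for passage in split_document(doc):
        f.write(json.dumps(passage) + "\n")
```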
## FiQA

- License: there is no data license specified at https://sites.google.com/view/fiqa/home. We only keep `doc_id` and `qid` in the published annotation files.
- Download the raw corpus `FiQA_train_doc_final.tsv` and the question file `FiQA_train_question_final.tsv` into `data/fiqa` from https://drive.google.com/file/d/1BlWaV-qVPfpGyJoWQJU9bXQgWCATgxEP/view.
- To replicate `documents.jsonl` and `annotations.jsonl`, run `python code/process_raw.py --dataset fiqa`.
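As a rough picture of what this conversion does, the sketch below reads the corpus TSV into `documents.jsonl`. The TSV column names are assumptions, so treat `code/process_raw.py` as the source of truth.

```python
import csv
import json

# Assumption: the corpus TSV has a header row with an id column and a
# text column; the names below are hypothetical, and code/process_raw.py
# remains the authoritative converter.
with open("data/fiqa/FiQA_train_doc_final.tsv", newline="") as tsv, \
        open("data/fiqa/documents.jsonl", "w") as out:
    for row in csv.DictReader(tsv, delimiter="\t"):
        out.write(json.dumps({
            "doc_id": row["docid"],  # hypothetical column name
            "title": "",             # FiQA posts carry no titles
            "text": row["doc"],      # hypothetical column name
        }) + "\n")
```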
## LoTTE

- Download the raw data from https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz into `data/lotte`.
- Annotations: there is no data license specified at https://github.com/stanford-futuredata/ColBERT/blob/main/LoTTE.md. We only keep `doc_id` and `qid` in the published annotation files.
- To replicate `documents.jsonl` and `annotations.jsonl`, run `python code/process_raw.py --dataset {lifestyle|recreation|technology|science|writing} --split {test|dev}`.
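Since the published annotation files keep only `doc_id` and `qid`, the document text has to be joined back in from the raw corpus. A minimal sketch of that join, assuming the processed files live under `data/lotte/`:

```python
import json

# Index the raw document pool by doc_id.
docs = {}
with open("data/lotte/documents.jsonl") as f:
    for line in f:
        record = json.loads(line)
        docs[record["doc_id"]] = record

# Re-attach title and text to each published annotation via doc_id.
annotations = []
with open("data/lotte/annotations.jsonl") as f:
    for line in f:
        annotation = json.loads(line)
        for d in annotation["documents"]:
            d.update(docs[d["doc_id"]])  # fills in title and text
        annotations.append(annotation)
```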
## BioASQ

Per the following license, we only provide detailed data reproduction instructions and code below, to avoid any potential issues. You will have to acquire the raw data on your own and then run the data processing code.

- License: https://creativecommons.org/licenses/by/2.5/
- Register an account at http://bioasq.org/.
- Download the document collection `allMeSH_2021.zip` and run `unzip allMeSH_2021.zip`.
- Download the test annotations `{2-9}B{1-5}_golden.json`.
- Move both the documents and the annotations to `data/bioasq/`.
- Run `python code/process_raw.py --dataset bioasq`.
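The pattern `{2-9}B{1-5}_golden.json` expands to 40 files. A small sanity-check sketch (assuming the files sit directly in `data/bioasq/`) before running the processing script:

```python
from pathlib import Path

bioasq_dir = Path("data/bioasq")
expected = [f"{year}B{batch}_golden.json"
            for year in range(2, 10)   # BioASQ 2B through 9B
            for batch in range(1, 6)]  # test batches 1 through 5
missing = [name for name in expected if not (bioasq_dir / name).is_file()]
print(f"{len(expected) - len(missing)}/{len(expected)} golden files present")
if missing:
    print("missing:", ", ".join(missing))
```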
## SearchQA

We only provide detailed data reproduction instructions and code below to avoid potential issues per the following license. However, we may provide the final processed data upon request, since the data license doesn't prohibit distribution.

- License: https://github.com/nyu-dl/dl4ir-searchQA/blob/master/LICENSE
- The raw data `{train|val|test}.zip` can be downloaded from https://drive.google.com/drive/u/2/folders/1kBkQGooNyG0h8waaOJpgdGtOnlb1S649. Then, for each split:

```bash
mkdir -p data/searchqa/{train|val|test}
mv {train|val|test}.zip data/searchqa/{train|val|test}
unzip data/searchqa/{train|val|test}.zip
python code/process_raw.py --dataset searchqa
```
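`{train|val|test}` above is a placeholder, so the commands have to be repeated once per split. An equivalent Python sketch that unpacks all three splits in one go (assuming the zip files are in the current directory):

```python
import zipfile
from pathlib import Path

for split in ("train", "val", "test"):
    target = Path("data/searchqa") / split
    target.mkdir(parents=True, exist_ok=True)
    # Assumes {split}.zip was downloaded into the current directory.
    with zipfile.ZipFile(f"{split}.zip") as zf:
        zf.extractall(target)
```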
## Experiment - Passage Retrieval

### DPR

Follow the instructions at https://github.com/facebookresearch/DPR to install the DPR package and to download the NQ data and the trained models. By default:

- Passages are saved under `DPR/downloads/data/`, and QA annotations are saved under `DPR/downloads/data/retriever/qas/`.
- Model checkpoints are saved under `DPR/downloads/checkpoint/retriever/`. We use the `single-adv-hn` model for the retrieval experiments.

Convert RobustQA data into DPR format:

```bash
HOME=~/robustqa
OUTDIR=~/DPR/downloads/data/
# options: searchqa bioasq fiqa lifestyle recreation science technology writing
dataset=fiqa
cd ${OUTDIR}
mkdir ${dataset}_split
cd $HOME
python code/convert_to_dpr.py --data ${dataset} --output_dir ${OUTDIR}
```

Then follow the instructions on the same page to generate embeddings and retrieve passages.
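For reference, DPR's retriever consumes question/answer pairs as a tab-separated file whose second column is a serialized answer list. A sketch of producing such a file from `qrel.jsonl` (the output file name here is hypothetical; `code/convert_to_dpr.py` defines the real naming):

```python
import csv
import json

# DPR's qas format: one question per line, TAB, then the answer list
# serialized as a quoted list literal.
with open("data/fiqa/qrel.jsonl") as f, \
        open("fiqa-test.qa.csv", "w", newline="") as out:  # hypothetical name
    writer = csv.writer(out, delimiter="\t")
    for line in f:
        record = json.loads(line)
        writer.writerow([record["question"], json.dumps(record["answers"])])
```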
### BM25 + CE

Refer to https://github.com/beir-cellar/beir for details on installing the BEIR package, then convert RobustQA data into BEIR format:

```bash
HOME=~/robustqa
# options: searchqa bioasq fiqa lifestyle recreation science technology writing
dataset=fiqa
cd $HOME
pip install beir
python code/convert_to_beir.py --data ${dataset}
```

Set up BM25:

```bash
wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512
shasum -a 512 -c elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512
tar -xzf elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
sudo chown -R daemon:daemon elasticsearch-7.9.2/
sudo -H -u daemon elasticsearch-7.9.2/bin/elasticsearch
```

Run the models:

```bash
python code/run_beir_models.py --data ${dataset} --model bm25 --reindex
```
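`code/run_beir_models.py` wraps BEIR; if you prefer to call BEIR directly against the Elasticsearch instance started above, a minimal BM25 retrieval-and-evaluation sketch looks like this (the dataset path and index name are assumptions):

```python
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.lexical import BM25Search

# Load a BEIR-format dataset (corpus.jsonl, queries.jsonl, qrels/);
# the path is whatever convert_to_beir.py produced (assumed here).
corpus, queries, qrels = GenericDataLoader("data/fiqa-beir").load(split="test")

# BM25 against the local Elasticsearch started above; initialize=True
# (re)builds the index, analogous to the --reindex flag.
model = BM25Search(index_name="robustqa-fiqa", hostname="localhost:9200",
                   initialize=True)
retriever = EvaluateRetrieval(model)
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results,
                                                   retriever.k_values)
```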
### ColBERTv2

Detailed instructions for running ColBERTv2 can be found at https://github.com/stanford-futuredata/ColBERT; we do not repeat them here. After setting up the ColBERT directories and environment, move our scripts into the ColBERT folder and convert RobustQA data into ColBERT format:

```bash
COLBERT=~/ColBERT
HOME=~/robustqa
cp colbert_scripts/* $COLBERT/
# options: searchqa bioasq fiqa lifestyle recreation science technology writing
dataset=fiqa
python code/convert_to_colbert.py --data ${dataset} --output_dir $COLBERT/data
```

Running ColBERTv2 consists of four steps:

1. Download the ColBERTv2 checkpoint to `$COLBERT/downloads/colbertv2.0`.
2. Index the passages:

```bash
python colbert_scripts/run_colbert.py \
    --dataroot ${COLBERT}/data/ \
    --dataset ${dataset} \
    --model $COLBERT/downloads/colbertv2.0 \
    --index
```

3. Search for the top passages:

```bash
python colbert_scripts/run_colbert.py \
    --dataroot ${COLBERT}/data/ \
    --dataset ${dataset} \
    --model $COLBERT/downloads/colbertv2.0 \
    --search
```

This step saves a `*ranking.tsv` file into `$COLBERT/experiments`. Locate this file's path (`ranking_file_path`).

4. Compute performance and save the retrieval results:

```bash
python colbert_scripts/run_colbert.py \
    --dataroot ${COLBERT}/data/ \
    --dataset ${dataset} \
    --model $COLBERT/downloads/colbertv2.0 \
    --eval \
    --ranking_file ${ranking_file_path}
```

This step saves a retrieved passage file `{dataset}_from_colbert_{split}.json` under `${COLBERT}/output/`. This file is in the same format as the retrieved passage file from DPR above and can be used directly as input to the extractive QA model.
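ColBERT's ranking files are plain TSVs. Assuming the usual `qid`, `pid`, `rank`, `score` column order, you can peek at the top-ranked passage per query like so:

```python
import csv

# Assumed ranking.tsv columns: qid, pid, rank, score (tab-separated).
top1 = {}
with open("ranking.tsv") as f:  # substitute the located ranking_file_path
    for qid, pid, rank, *rest in csv.reader(f, delimiter="\t"):
        if int(rank) == 1:
            top1[qid] = pid
print(f"{len(top1)} queries have a top-1 passage")
```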
## Experiment - Question Answering

### DPR's Extractive QA

To run extractive QA model inference, download the best QA model checkpoint from https://github.com/facebookresearch/DPR and use the following script:

```bash
python train_extractive_reader.py \
    prediction_results_file={path to a file to write the results to} \
    eval_top_docs=[10,20,40,50,80,100] \
    dev_files={path to the retriever results file to evaluate} \
    model_file={path to the reader checkpoint} \
    train.dev_batch_size=80 \
    passages_per_question_predict=100 \
    encoder.sequence_length=350
```

Since we use ColBERTv2 as the default retriever for the paper, `dev_files` needs to be set to the path of the `{dataset}_from_colbert_{split}.json` files from ColBERTv2 (see the details above).
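The retriever results file is a JSON list of questions with retrieved contexts. Assuming the DPR-style shape sketched in the comment below, one can, for example, report top-1 retrieval accuracy (the file name is illustrative):

```python
import json

# DPR-style retriever results, as consumed by the reader:
# [{"question": ..., "answers": [...],
#   "ctxs": [{"id": ..., "title": ..., "text": ...,
#             "score": ..., "has_answer": true}, ...]}, ...]
with open("fiqa_from_colbert_test.json") as f:  # example file name
    data = json.load(f)

hits = sum(1 for rec in data if rec["ctxs"] and rec["ctxs"][0]["has_answer"])
print(f"top-1 retrieval accuracy: {hits / len(data):.3f}")
```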
### Atlas

The Atlas model does not require ColBERTv2 to provide retrieved passages, since it has its own dense retriever. Follow the instructions at https://github.com/facebookresearch/atlas to set up the project repo and install the environment. The model checkpoints we experimented with in the paper are `atlas-xxl_nq` and `atlas-base_nq`.

Convert RobustQA data into Atlas format:

```bash
ATLAS=~/atlas
HOME=~/robustqa
# options: searchqa bioasq fiqa lifestyle recreation science technology writing
dataset=fiqa
python code/convert_to_atlas.py --data ${dataset} --output_dir $ATLAS/data
```

Run inference:

```bash
cd $ATLAS
export NGPU=8
model_size=base
n_cxt=40
split=test
python -m torch.distributed.launch --nproc_per_node=${NGPU} evaluate.py \
    --name run_atlas_nq_${model_size}_${n_cxt}_${dataset} \
    --generation_max_length 16 \
    --gold_score_mode "pdist" \
    --precision bf16 \
    --per_gpu_embedder_batch_size 128 \
    --reader_model_type google/t5-${model_size}-lm-adapt \
    --text_maxlength 200 \
    --model_path $ATLAS/models/atlas_nq/${model_size} \
    --eval_data $ATLAS/data/${dataset}-${split}.jsonl \
    --per_gpu_batch_size 1 \
    --n_context ${n_cxt} --retriever_n_context ${n_cxt} \
    --main_port -1 \
    --index_mode "flat" \
    --task "qa" \
    --passages $ATLAS/data/${dataset}-passages.jsonl
```
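For a quick sanity check of the converted eval data, the sketch below assumes Atlas's qa task reads one JSON object per line with `question` and `answers` fields; the exact field set emitted by `code/convert_to_atlas.py` may be richer.

```python
import json

# Assumed per-line shape for Atlas's qa task:
# {"question": "...", "answers": ["...", ...]}
count = 0
with open("data/fiqa-test.jsonl") as f:  # path produced by convert_to_atlas.py
    for line in f:
        record = json.loads(line)
        assert "question" in record and "answers" in record
        count += 1
print(count, "examples look well-formed")
```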
## Security

See CONTRIBUTING for more information.
## License

This project is licensed under the Apache-2.0 License.