Awesome

XOR QA: Cross-lingual Open-Retrieve Question Answering

Introduction

XOR-TyDi QA brings together for the first time information-seeking questions, open-retrieval QA, and multilingual QA to create a multilingual open-retrieval QA dataset that enables cross-lingual answer retrieval. It consists of questions written by information-seeking native speakers in 7 typologically diverse languages and answer annotations that are retrieved from multilingual document collections.

The Tasks

There are three sub-tasks: XOR-Retrieve, XOR-EnglishSpan, and XOR-Full.

task_overview

XOR-Retrieve

XOR-Retrieve is a cross-lingual retrieval task where a question is written in the target language (e.g., Japanese) and a system is required to retrieve English document that answers the question.

XOR-EnglishSpan

XOR-English Span is a cross-lingual retrieval task where a question is written in the target language (e.g., Japanese) and a system is required to output a short answer in English.

XOR-Full

XOR-Full is a cross-lingual retrieval task where a question is written in the target language (e.g., Japanese) and a system is required to output a short answer in the target language.

Download the Dataset

You can download the data at the following URLs.

The datasets below include question and short answer information only. If you need the long answer information for supervised training of retrievers or reader, please download the GoldParagraph data.

We also ask you to use Wikipedia 2019-0201 dump, which can be downloaded the link from TyDiQA's source data list for the 7 languages + English.

Note (April 12, 2021): Please note that we modified the XOR-TyDi QA data, and released a new version as XOR-TyDi (v1.1). All of the data you can download are from here is v1.1 and the leaderboard results are based on v1.1.

Data for XOR tasks

For XOR-Retrieve and XOR-English Span:

For XOR-Full:

Additional resources

Gold Paragraph Data

Gold Paragraph data (similar to TyDi GP): The gold paragraph data includes annotated passage answers (gold paragraph) and short answers in the same format as SQuAD. As in TyDi QA Gold Passage Task, you can directly recycle your SQuAD QA codes.
Reading Comprehension data with no answer and yes/no: With a slightly different from the original SQuAD format, the data includes all "long answer only" & "yes-no" questions, as in NQ and TyDi QA full task.

Question translation data

We also make the human annotated 30k question translation data publicly available. As the translation data is only used for annotation and we do not expect use this oracle translation, we release the translation for train data only.

The translation data for each language pair (English-{Arabic, Bengali, Finnish, Japanese, Korean, Russian, Telugu}) is represented as a pair of text file, in which each line include one sentence corresponding to the translated English question, following common MT corpora.

The list of the links to parallel corpora (L_i-to-English) is below:

Building a baseline system

Our baseline includes: Dense Passage Retriever (Karpukhin et al., 2020), Path Retriever (Asai et al., 2020), BM25 (implementations are based on ElasticSearch)+multilingual QA models.

Please see baselines/README.md for more information.

Evaluation

To evaluate your modes' predictions on development data, please run the commands below. Please see the details of the prediction file format and make sure your prediction results follow the format.

You also needs to install MeCab and NLTK before running evaluation -- they are used to tokenization for XOR-Retrieve and for evaluations on Japanese answers.

pip install mecab-python3
pip install unidic-lite
pip install nltk

XOR Retrieve (metrics: R@2kt, R@5kt)

python3 evals/eval_xor_retrieve.py \
    --data_file <path_to_input_data> \
    --pred_file <path_to_predictions>

XOR-Full (metrics: F1, EM)

python3 evals/eval_xor_engspan.py \
    --data_file <path_to_input_data> \
    --pred_file <path_to_predictions>

XOR-English Span (metrics: F1, EM, BLEU)

python3 evals/eval_xor_full.py \
    --data_file <path_to_input_data> \
    --pred_file <path_to_predictions>

Prediction file format

To evaluate your model's predictions, you need to format the predicted results in specific formats.

XOR-Retrieve

Note: Our evaluation script evaluate if the correct answers are included in the first 2,000 tokens and 5,000 tokens for R@2kt and R@5kt, respectively. Please make sure the total token numbers of your retrieved document would be larger than those tokens; otherwise your scores might be underestimated. See the detailed definition of those metrics in our paper.

The XOR-Retrieve file should be output as follows:

["id": 12345, "lang": "ja, ctxs": ["Tokyo (東京) is the capital and most populous prefecture of Japan.", "Located at the head of Tokyo Bay, the prefecture forms part of the Kantō region on the central Pacific coast of Japan's main island, Honshu. " ... ]
]

XOR-Full, XOR-English Span

For those two tasks, your prediction file should be a json file of a dictionary, whose keys are question ids and values are the predicted short answers.

e.g.,

{"12345": "東京", "67890": "Москва", ...}

Submission guide

If you want to submit to our leaderboard, please create the prediction files on our test data for your target task, and email it to Akari Asai (akari[at]cs.washington.edu).

Please make sure you include the following information in the email.

test prediction file in our prediction file format
the task name
the name of the model
whether you use external black box API (e.g., Google Translate / Google Custom Search / Bing Search) which cannot be reproduced on your side. To coordinate open research, we encourage you to use reproducible components, and the system with those external (unreproducible) components will be populated below the entries w/o APIs.
[optional] the institution. You can update those information later.
[optional] link to the paper and code. You can update those information later.

Notes

The models will be sorted by F1 score for XOR-English Span and XOR-Full, and by R@5kt for XOR-Retrieve.
Please perform any model selection or ablation necessary to improve model performance on the dev set only. We limit the number of submissions to be 20 per year and 3 per month.
If you plan to have your model officially evaluated, please plan 1 weeks in advance to allow sufficient time for your model results to be on the leaderboard.

Updates

[October 23, 2020]: We released the initial version of TyDi-XOR dataset and preprint paper.
[January 18, 2021]: We released codes and trained models. Please see the details at baselines.
[April 13, 2021]: Our papaer's camera-ready version is now available at Arxiv. We also did some minor changes to XOR-TyDi's evaluation data and released new version of TyDi-XOR as TYDi-XOR v1.1. The changes are: (1) we filtered out a few yes/no answer annotations (e.g., yes/no answers to factoid questions) that are potentially incorrect, and (2) we added some answer translations that are not appropriately included in previous XOR QA's full evaluations due to postprocessing issues.

Citation and Contact

If you find this codebase is useful or use the data in your work, please cite our paper.

@inproceedings{xorqa,
    title   = {{XOR} {QA}: Cross-lingual Open-Retrieval Question Answering},
    author  = {Akari Asai and Jungo Kasai and Jonathan H. Clark and Kenton Lee and Eunsol Choi and Hannaneh Hajishirzi},
    booktitle={NAACL-HLT},
    year    = {2021}
}

If you use TyDi-XOR QA data, please also make sure to cite the original TyDi QA paper, which we built TyDI-XOR off of:

@article{tydiqa,
title   = {TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages},
author  = {Jonathan H. Clark and Eunsol Choi and Michael Collins and Dan Garrette and Tom Kwiatkowski and Vitaly Nikolaev and Jennimaria Palomaki}
journal = {TACL},
year    = {2020}
}

Please contact Akari Asai (@AkariAsai, akari[at]cs.washington.edu) for questions and suggestions.