MRQA 2019 Shared Task on Generalization

Overview

The MRQA 2019 Shared Task focuses on generalization in question answering. An effective question answering system should do more than merely interpolate from the training set to answer test examples drawn from the same distribution: it should also be able to extrapolate to out-of-distribution examples — a significantly harder challenge.

The format of the task is extractive question answering. Given a question and context passage, systems must find the word or phrase in the document that best answers the question. While this format is somewhat restrictive, it allows us to leverage many existing datasets, and its simplicity helps us focus on out-of-domain generalization, instead of other important but orthogonal challenges.

We release an official training dataset containing examples from existing extractive QA datasets, and evaluate submitted models on ten hidden test datasets. Both train and test datasets share the format described above, but may differ in their underlying distributions, e.g., in the source of passages, the style of questions, or the relationship between questions and passages.

Each participant will submit a single QA system trained on the provided training data. We will then privately evaluate each system on the hidden test data.

This repository contains resources for accessing the official training and development data. If you are interested in participating, please fill out this form! We will e-mail participants who sign up about any important announcements regarding the shared task.

Datasets

Updated 7/12/2019 to correct for minor exact-match discrepancies (See #11 for details.)

Updated 6/13/2019 to correct for duplicate context in HotpotQA (See #7 for details.)

Updated 5/29/2019 to correct for truncated detected_answers field (See #5 for details.)

We have adapted several existing datasets from their original formats and settings to conform to our unified extractive setting.

A span is judged to be an exact match if it matches the answer string after performing normalization consistent with the SQuAD dataset: lowercasing, and stripping punctuation, articles (a, an, the), and extra whitespace.
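For reference, this normalization can be sketched in Python. This mirrors the lower/punctuation/article/whitespace steps of the SQuAD-style evaluation; the function name is illustrative, not taken from the official scripts:

```python
import re
import string

def normalize_answer(s):
    """SQuAD-style normalization: lowercase, strip punctuation,
    drop articles (a/an/the), and collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())
```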

Training Data

| Dataset | Download | MD5SUM | Examples |
| --- | --- | --- | --- |
| SQuAD | Link | efd6a551d2697c20a694e933210489f8 | 86,588 |
| NewsQA | Link | 182f4e977b849cb1dbfb796030b91444 | 74,160 |
| TriviaQA | Link | e18f586152612a9358c22f5536bfd32a | 61,688 |
| SearchQA | Link | 612245315e6e7c4d8446e5fcc3dc1086 | 117,384 |
| HotpotQA | Link | d212c7b3fc949bd0dc47d124e8c34907 | 72,928 |
| NaturalQuestions | Link | e27d27bf7c49eb5ead43cef3f41de6be | 104,071 |

Development Data

In-Domain

| Dataset | Download | MD5SUM | Examples |
| --- | --- | --- | --- |
| SQuAD | Link | 05f3f16c5c31ba8e46ff5fa80647ac46 | 10,507 |
| NewsQA | Link | 5c188c92a84ddffe2ab590ac7598bde2 | 4,212 |
| TriviaQA | Link | 5c9fdc633dfe196f1b428c81205fd82f | 7,785 |
| SearchQA | Link | 9217ad3f6925c384702f2a4e6d520c38 | 16,980 |
| HotpotQA | Link | 125a96846c830381a8acff110ff6bd84 | 5,904 |
| NaturalQuestions | Link | c0347eebbca02d10d1b07b9a64efe61d | 12,836 |

Note: This in-domain data may be used for helping develop models. The final testing, however, will only contain out-of-domain data.

Out-of-Domain

| Dataset | Download | MD5SUM | Examples |
| --- | --- | --- | --- |
| BioASQ | Link | 70752a39beb826a022ab21353cb66e54 | 1,504 |
| DROP | Link | 070eb2ac92d2b2fc1b99abeda97ac37a | 1,503 |
| DuoRC | Link | b325c0ad2fa10e699136561ee70c5ddd | 1,501 |
| RACE | Link | ba8063647955bbb3ba63e9b17d82e815 | 674 |
| RelationExtraction | Link | 266be75954fcb31b9dbfa9be7a61f088 | 2,948 |
| TextbookQA | Link | 8b52d21381d841f8985839ec41a6c7f7 | 1,503 |

Note: As previously mentioned, the out-of-domain datasets have been modified from their original settings to fit the unified MRQA Shared Task paradigm (see MRQA Format). Once again, at a high level, the following two major modifications have been made:

  1. All QA-context pairs are extractive. That is, the answer is selected from the context and not via, e.g., multiple-choice.
  2. All contexts are capped at a maximum of 800 tokens. As a result, for longer contexts like Wikipedia articles, we only consider examples where the answer appears in the first 800 tokens.

As a result, some splits are harder than the original datasets (e.g., removal of multiple-choice in RACE), while some are easier (e.g., restricted context length in NaturalQuestions --- we use the short answer selection). Thus one should expect different performance ranges if comparing to previous work on these datasets.
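The 800-token cap described above can be sketched as follows. `truncate_example` and `MAX_TOKENS` are illustrative names for the idea, not the official preprocessing code:

```python
MAX_TOKENS = 800

def truncate_example(example, max_tokens=MAX_TOKENS):
    """Keep only the first `max_tokens` context tokens and drop questions
    whose detected answer spans fall outside the truncated window."""
    example = dict(example)
    example["context_tokens"] = example["context_tokens"][:max_tokens]
    kept = []
    for qa in example["qas"]:
        # keep a question if any detected answer ends inside the window
        if any(span[1] < max_tokens
               for ans in qa["detected_answers"]
               for span in ans["token_spans"]):
            kept.append(qa)
    example["qas"] = kept
    return example
```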

Auxiliary Data

For additional sources of training data, we are whitelisting some non-QA datasets that may be helpful for multi-task learning or pretraining. If you have any other dataset in mind, please raise an issue or send us an email at mrforqa@gmail.com.

Whitelist:

Download Scripts

We have provided convenience scripts to download all of the released training and development data.

To download the training data, run:

./download_train.sh path/to/store/downloaded/directory

To download the development data of the training datasets (in-domain), run:

./download_in_domain_dev.sh path/to/store/downloaded/directory

To download the out-of-domain development data, run:

./download_out_of_domain_dev.sh path/to/store/downloaded/directory

MRQA Format

All of the datasets for this task have been adapted to follow a unified format. They are stored as compressed JSONL files (with file extension .jsonl.gz).

The general format is:

{
  "header": {
    "dataset": <dataset name>,
    "split": <train|dev|test>,
  }
}
...
{
  "context": <context text>,
  "context_tokens": [(token_1, offset_1), ..., (token_l, offset_l)],
  "qas": [
    {
      "qid": <uuid>,
      "question": <question text>,
      "question_tokens": [(token_1, offset_1), ..., (token_q, offset_q)],
      "detected_answers": [
        {
          "text": <answer text>,
          "char_spans": [[<start_1, end_1>], ..., [<start_n, end_n>]],
          "token_spans": [[<start_1, end_1>], ..., [<start_n, end_n>]],
        },
        ...
      ],
      "answers": [<answer_text_1>, ..., <answer_text_m>]
    },
    ...
  ]
}

Note that it is permissible to download the original datasets and use them as you wish. However, this is the format that the test data will be presented in.

Fields

The fields in the format above are:

  - header: the dataset name and split (train, dev, or test).
  - context: the raw passage text.
  - context_tokens: the tokenized context, as a list of (token, character offset) pairs.
  - qas: the questions about this context. Each has a unique qid, the question text, its tokenization (question_tokens), the detected_answers found in the context (with character-level char_spans and token-level token_spans), and the list of accepted answer strings (answers) used for evaluation.

Visualization

To view examples in the terminal, first install the requirements (pip install -r requirements.txt) and then run:

python visualize.py path/or/url

The script argument may be either a URL or a local file path. For example:

python visualize.py https://s3.us-east-2.amazonaws.com/mrqa/release/train/SQuAD.jsonl.gz

Evaluation

Answers are evaluated using exact match and token-level F1 metrics. The mrqa_official_eval.py script is used to evaluate predictions on a given dataset:

python mrqa_official_eval.py <url_or_filename> <predictions_file>

The predictions file must be a valid JSON file of qid, answer pairs:

{
  "qid_1": "answer span text 1",
  ...
  "qid_n": "answer span text N"
}

The final score for the MRQA shared task will be the macro-average across all test datasets.
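For intuition, token-level F1 between a single prediction and a single gold answer can be sketched as below. This is a simplified version: the official mrqa_official_eval.py also normalizes the strings first and scores against all gold answers:

```python
from collections import Counter

def token_f1(prediction, gold):
    """Token-level F1 between a predicted and a gold answer string,
    where tokens are whitespace-split words."""
    pred_toks = prediction.split()
    gold_toks = gold.split()
    common = Counter(pred_toks) & Counter(gold_toks)  # multiset overlap
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```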

Baseline Model

An implementation of a simple multi-task BERT-based baseline model is available in the baseline directory.

Below are our baseline results, reported as EM / F1 (I = in-domain, O = out-of-domain):

| Dataset | Multi-Task BERT-Base | Multi-Task BERT-Large |
| --- | --- | --- |
| (I) SQuAD | 78.5 / 86.7 | 80.3 / 88.4 |
| (I) HotpotQA | 59.8 / 76.6 | 62.4 / 79.0 |
| (I) TriviaQA Web | 65.6 / 71.6 | 68.2 / 74.7 |
| (I) NewsQA | 50.8 / 66.8 | 49.6 / 66.3 |
| (I) SearchQA | 69.5 / 76.7 | 71.8 / 79.0 |
| (I) NaturalQuestions | 65.4 / 77.4 | 67.9 / 79.8 |
| (O) DROP | 25.7 / 34.5 | 34.6 / 43.8 |
| (O) RACE | 30.4 / 41.4 | 31.3 / 42.5 |
| (O) BioASQ | 47.1 / 62.7 | 51.9 / 66.8 |
| (O) TextbookQA | 44.9 / 53.9 | 47.4 / 55.7 |
| (O) RelationExtraction | 72.6 / 83.8 | 72.7 / 85.2 |
| (O) DuoRC | 44.8 / 54.6 | 46.8 / 58.0 |

Submission

Submission will be handled through the Codalab platform: see these instructions.

Note that submissions should start a local server that accepts POST requests of single JSON objects in our standard format, and returns a JSON prediction object. The official predict_server.py script (in this directory) will query this server to get predictions. The baseline directory includes an example implementation in serve.py. We have chosen this format so that we can create interactive demos for all submitted models.
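As an illustration of this contract, a toy server might look like the following. The `predict` function, port, and handler are placeholders for your model; the exact request and response details are defined by predict_server.py, and serve.py in the baseline directory is the reference implementation:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(example):
    """Placeholder model: answer every question with the first context
    token. A real submission would run its QA model here."""
    first_token = example["context_tokens"][0][0]
    return {qa["qid"]: first_token for qa in example["qas"]}

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # read one JSON example, respond with a {qid: answer} JSON object
        length = int(self.headers["Content-Length"])
        example = json.loads(self.rfile.read(length))
        body = json.dumps(predict(example)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

def run(port=8888):
    """Start the server; predict_server.py would then POST examples to it."""
    HTTPServer(("localhost", port), PredictHandler).serve_forever()
```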

Results

Codalab results for all models submitted to the shared task are available in the results directory. These files include the dev and test EM and F1 scores for every model and every dataset.

Citation

@inproceedings{fisch2019mrqa,
    title={{MRQA} 2019 Shared Task: Evaluating Generalization in Reading Comprehension},
    author={Adam Fisch and Alon Talmor and Robin Jia and Minjoon Seo and Eunsol Choi and Danqi Chen},
    booktitle={Proceedings of the 2nd Workshop on Machine Reading for Question Answering ({MRQA}) at EMNLP},
    year={2019},
}