Reference QA Benchmarking Dataset

This repository contains factoid-curated, a reference question dataset for benchmarking Question Answering systems, as used e.g. by the YodaQA system.

The complete dataset is a combination of two sub-datasets (irc/ and trec/) and consists of three files:

(A small portion of questions may be left out of the splits; these are included in large2470-train, though.)

Ideally, humans should be doing all stages of evaluation instead of just using regex matches, as from time to time an unforeseen legitimate answer pops up and, on the other hand, the regex is sometimes unintentionally over-permissive. For a similar reason (to allow e.g. a leading the/a and other variations), we declare the answer as correct when the regex matches any substring.
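As a minimal sketch of the substring semantics described above (the function name, case-insensitivity flag, and example pattern here are assumptions for illustration, not the exact scorer used in this repository):

```python
import re

def answer_matches(answer_text, gold_regex):
    """Return True if the gold-standard regex matches any substring of the answer.

    Using re.search (rather than re.fullmatch) gives the substring semantics
    described above, so an answer like "the Danube" still matches a gold
    pattern such as "Danube".
    """
    # IGNORECASE is an assumption here; the actual evaluation flags may differ.
    return re.search(gold_regex, answer_text, flags=re.IGNORECASE) is not None

# A leading article does not prevent a match:
print(answer_matches("the Danube", r"Danube"))  # True
```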

Using This Dataset

As explained above, please use the test dataset only for performance reporting, not for question-by-question error analysis. Always report the version of the dataset you used; this is the v2 dataset.

To make results comparable, it is not enough to use the same set of questions; we should also strive to use the same or similar knowledge bases. We realize that, especially as time goes on, this might not be practical, but some degree of effort would be appreciated. We also expect the primary public YodaQA endpoints to track the latest version of this dataset, so these could be reused. We use:

Aims of This Dataset

We want to build a dataset of questions which are:

We may want to relax either requirement, but at that point we should start tagging the questions to still keep a set of "simpler" ones. The motivation is to give a chance even to simple, focused systems.

Large Variant of the Dataset

An extra variant of the dataset is provided, which is a strict superset of the curated dataset. It is not curated, i.e. it is a lot noisier and may not fulfill the above constraints. The aim of this variant is to examine the behavior of QA systems when they are given larger, noisier training data, exploring their generalization capabilities.

The large2470 dataset is built by adding TREC 1999, 2000, and 2001 data to the curated dataset. We also added questions asked on live.ailao.eu with user feedback until Feb 25, 2016. Unmodified questions were shuffled and 80% were added to the train split, 20% to the test split. The gold standard was revised via Mechanical Turk.
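The shuffle-and-split step can be illustrated by the sketch below. The seed, function name, and the use of plain Python lists are assumptions made for the example; the tooling actually used to produce the published split is not described here.

```python
import random

def split_questions(questions, train_fraction=0.8, seed=1234):
    """Shuffle questions and split them into train/test portions.

    The seed is arbitrary (the original split's seed is not documented);
    fixing one just makes this illustration reproducible.
    """
    rng = random.Random(seed)
    shuffled = questions[:]  # keep the input list unmodified
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = split_questions(list(range(100)))
print(len(train), len(test))  # 80 20
```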

The dataset is called large2470; the number refers to the total number of questions it contains. We may build even larger datasets of a similar nature in the future (e.g. including WebQuestions, TREC years 2004+, the QALD challenges, or such).

Discussion: How much data is appropriate?

It might seem obvious to simply use the largest dataset possible, but there are three issues with that: