WikiTableQuestions Dataset

Version 1.0.2 (October 4, 2016)

Introduction

The WikiTableQuestions dataset supports question answering over semi-structured HTML tables, as presented in the paper:

Panupong Pasupat, Percy Liang.
Compositional Semantic Parsing on Semi-Structured Tables
Association for Computational Linguistics (ACL), 2015.

More details about the project: https://nlp.stanford.edu/software/sempre/wikitable/

TSV Format

Many files in this dataset are stored as tab-separated values (TSV) with the following special constructs:

Questions and Answers

The data/ directory contains the questions, the answers, and the IDs of the tables that the questions refer to.

Each portion of the dataset is stored as a TSV file where each line contains one example.
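As a minimal sketch of the one-example-per-line layout, the Python function below splits each line on tabs and yields the raw fields. The name read_examples is hypothetical and not part of the dataset tooling. The sketch makes no assumption about whether a file starts with a header row, and it leaves any escape sequences inside the fields untouched.

```python
def read_examples(lines):
    """Yield one list of raw field strings per non-empty line.

    `lines` is any iterable of text lines, such as an open file.
    Fields are split on tabs only; no unescaping is attempted.
    """
    for line in lines:
        line = line.rstrip("\r\n")  # drop the line terminator only
        if not line:
            continue  # skip blank lines
        yield line.split("\t")
```

A file from the data/ directory could be passed in as an open file object, for example one opened with encoding="utf-8" (the encoding is an assumption here).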

Field descriptions:

Dataset Splits: We split the 22,033 examples into multiple sets:

For our ACL 2015 paper:

Supplementary Files:

Tables

The csv/ directory contains the extracted tables, while the page/ directory contains the raw HTML data of the whole web page.

Table Formats:

The conversion from HTML to CSV and TSV was done using table-to-tsv.py. Its dependency is in the weblib/ directory.
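As a rough sketch of reading one extracted table from the csv/ directory, the Python function below returns the header row and the remaining rows. The name load_table is hypothetical. The sketch assumes the first row is the column header and the file is UTF-8. It also defaults to the standard dialect of Python's csv module, which this README does not confirm, so extra keyword arguments are forwarded to csv.reader in case the files use different quoting or escaping rules.

```python
import csv


def load_table(path, **fmtparams):
    """Return (header, rows) for the CSV table stored at `path`.

    Assumes the first row is the header. `fmtparams` is passed through to
    csv.reader (for example escapechar or doublequote) so the dialect can
    be adjusted to match the actual files.
    """
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f, **fmtparams))
    if not rows:
        return [], []  # empty file: no header, no data rows
    return rows[0], rows[1:]
```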

CoreNLP Tagged Files

Questions and tables are tagged using CoreNLP 3.5.2. The annotation is not perfect (e.g., it cannot detect the date "13-12-1989"), but it is usually good enough.

Evaluator

evaluator.py is the official evaluator.

Usage: evaluator.py <tagged_dataset_path> <prediction_path>

Note that the resulting scores will differ from those SEMPRE produces, because SEMPRE also requires the prediction to have the same type as the target value, whereas the official evaluator is more lenient.
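To illustrate why a type-strict comparison can give a lower score than a lenient one, here is a toy Python sketch. It is not the logic of evaluator.py or of SEMPRE: the functions normalize, strict_match, and lenient_match, and the normalization rule itself, are assumptions made up for this illustration.

```python
def normalize(value):
    """Map a value to a canonical form: a float if it reads as a number,
    otherwise a stripped, lowercased string."""
    text = str(value).strip()
    try:
        return float(text)
    except ValueError:
        return text.lower()


def strict_match(prediction, target):
    """Require the same Python type and an equal value."""
    return type(prediction) is type(target) and prediction == target


def lenient_match(prediction, target):
    """Compare canonical forms only, ignoring the original types."""
    return normalize(prediction) == normalize(target)
```

Under these toy rules, the string "2016" and the number 2016 fail the strict check but pass the lenient one.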

Version History

1.0 - Fixed various bugs in datasets (encoding issues, number normalization issues)

0.5 - Added evaluator

0.4 - Added annotated logical forms of the first 300 examples / Renamed CoreNLP tagged data as tagged to avoid confusion

0.3 - Repaired table headers / Added raw HTML tables / Added CoreNLP tagged data

0.2 - Initial release

For questions and comments, please contact Ice Pasupat <ppasupat@cs.stanford.edu>