Home

Awesome

<!-- This should be the location of the title of the repository, normally the short name -->

AIT-QA: Question Answering Dataset over Complex Tables in the Airline Industry

<!-- Build Status, is a great thing to have at the top of your repository, it shows that you take your CI/CD as first class citizens --> <!-- [![Build Status](https://travis-ci.org/jjasghar/ibm-cloud-cli.svg?branch=master)](https://travis-ci.org/jjasghar/ibm-cloud-cli) --> <!-- Not always needed, but a scope helps the user understand in a short sentance like below, why this repo exists -->

Abstract

Recent advances in transformers have enabled Table Question Answering (Table QA) systems to achieve high accuracy and SOTA results on open domain datasets like WikiTableQuestions and WikiSQL. Such transformers are frequently pre-trained on open-domain content such as Wikipedia, where they effectively encode questions and corresponding tables from Wikipedia as seen in Table QA dataset. However, web tables in Wikipedia are notably flat in their layout, with the first row as the sole column header. The layout lends to a relational view of tables where each row is a tuple. Whereas, tables in domain-specific business or scientific documents often have a much more complex layout, including hierarchical row and column headers, in addition to having specialized vocabulary terms from that domain. To address this problem, we introduce the domain-specific Table QA dataset AITQA (Airline Industry Table QA). The dataset consists of 515 questions authored by human annotators on 116 tables extracted from public U.S. SEC filings (SEC Filings publicly available at: https://www.sec.gov/edgar.shtml) of major airline companies for the fiscal years 2017-2019. We also provide annotations pertaining to the nature of questions, marking those that require hierarchical headers, domain-specific terminology, and paraphrased forms. Our zero-shot baseline evaluation of three transformer-based SOTA Table QA methods - TaPAS (end-to-end), TaBERT (semantic parsing-based), and RCI (row-column encoding-based) - clearly exposes the limitation of these methods in this practical setting, with the best accuracy at just 51.8% (RCI). We also present pragmatic table pre-processing steps used to pivot and project these complex tables into a layout suitable for the SOTA Table QA models.

Dataset

This repository contains the data corresponding to the paper. Please consider refering to the paper draft if you are using them, as below:

@misc{katsis2021aitqa,
      title={AIT-QA: Question Answering Dataset over Complex Tables in the Airline Industry}, 
      author={Yannis Katsis and Saneem Chemmengath and Vishwajeet Kumar and Samarth Bharadwaj and Mustafa Canim and Michael Glass and Alfio Gliozzo and Feifei Pan and Jaydeep Sen and Karthik Sankaranarayanan and Soumen Chakrabarti},
      year={2021},
      eprint={2106.12944},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Detailed information about this dataset and initial experiments can be found here: https://arxiv.org/abs/2106.12944

Content and format

Inside the raw_data folder you will find aitqa_questions.jsonl and aitqa_tables.jsonl. Both files can be read line by line, where each line is a serialized JSON object. The test questions and tables will be released upon acceptance of the dataset paper in submission (details above).

Question, answers, table_id, and other annotations

The instances in aitqa_questions.jsonl looks like the follwing:

{
  "id": "q-0",
  "table_id": "tab-0",
  "question": "How much money did United spend for aircraft fuel in 2016?",
  "answers": ["$5,813"],
  "type": "KPI-driven",
  "row_hierarchy_needed": "No",
  "paraphrase_group": "para-5"
}

The fields represent the follwing:

Tables

The instances in aitqa_tables.jsonl looks like the follwing:

{  
  "id": "tab-5",
  "column_header": [
    ["At December 31,", "2018"],
    ["At December 31,", "2017 (a)"]
  ],
  "row_header": [
    ["Current assets:", "Cash and cash equivalents"],
    ["Current assets:", "Short-term investments"],
    ["Current assets:", "Receivables, less allowance for doubtful accounts 2018—$8; 2017—$7)"],
    ["Current assets:", "Aircraft fuel, spare parts and supplies, less obsolescence allowance (2018—$412; 2017—$354)"],
    ["Current assets:", "Prepaid expenses and other"],
    ["Current assets:", "Total current assets"],
    ["Owned-", "Operating property and equipment:", "Flight equipment"],
    ["Owned-", "Operating property and equipment:", "Other property and equipment"],
    ["Owned-", "Operating property and equipment:", "Total owned property and equipment"],
    ["Owned-", "Operating property and equipment:", "Less-Accumulated depreciation and amortization"],
    ["Owned-", "Operating property and equipment:", "Total owned property and equipment, net"],
    ["Owned-", "Operating property and equipment:", "Purchase deposits for flight equipment"],
    ["Capital leases-", "Flight equipment"],
    ["Capital leases-", "Other property and equipment"],
    ["Capital leases-", "Total capital leases"],
    ["Capital leases-", "Less-Accumulated amortization"],
    ["Capital leases-", "Total capital leases, net"],
    ["Capital leases-", "Total operating property and equipment, net"],
    ["Other assets:", "Goodwill"],
    ["Other assets:", "Intangibles, less accumulated amortization (2018-$1,380; 2017-$1,313)"],
    ["Other assets:", "Restricted cash"],
    ["Other assets:", "Notes receivable, net"],
    ["Other assets:", "Investments in affiliates and other, net"],
    ["Other assets:", "Total other assets"],
    ["Total assets"]
  ],
  "data": [
    ["$1,694", "$1,482"],
    ["2,256", "2,316"],
    ["1,346", "1,340"],
    ["985", "924"],
    ["913", "1,071"],
    ["7,194", "7,133"],
    ["31,607", "28,692"],
    ["7,919", "6,946"],
    ["39,526", "35,638"],
    ["(12,760)", "(11,159)"],
    ["26,766", "24,479"],
    ["1,177", "1,344"],
    ["1,029", "1,151"],
    ["11", "11"],
    ["1,040", "1,162"],
    ["(654)", "(777)"],
    ["386", "385"],
    ["28,329", "26,208"],
    ["4,523", "4,523"],
    ["3,159", "3,539"],
    ["105", "91"],
    ["516", "46"],
    ["966", "806"],
    ["9,269", "9,005"],
    ["$44,792", "$42,346"]
  ]
}

The fields represent the following:

<!-- A more detailed Usage or detailed explaination of the repository here -->

Notes

<!-- A Changelog allows you to track major changes and things that happen, https://github.com/github-changelog-generator/github-changelog-generator can help automate the process -->

Any changes or improvement to the dataset is logged in CHANGELOG.md

<!-- A notes section is useful for anything that isn't covered in the Usage or Scope. Like what we have below. --> <!-- Questions can be useful but optional, this gives you a place to say, "This is how to contact this project maintainers or create PRs -->

If you have any questions please yse tge discussion feature and not issues. Create a new [issue here][issues] to flag dataset inconsistancies or improvements generally after a discussion.

We will allow Contributions to the repo as per CONTRIBUTING.md