TableBank

TableBank is a new image-based table detection and recognition dataset built with novel weak supervision from Word and Latex documents on the internet. It contains 417K high-quality labeled tables.

Introduction

To address the need for a standard open-domain table benchmark dataset, we propose a novel weak supervision approach to automatically create TableBank, which is orders of magnitude larger than existing human-labeled datasets for table analysis. Unlike traditional weakly supervised training sets, our approach yields training data that is both large-scale and high-quality.

Nowadays, a great number of electronic documents are available on the web, such as Microsoft Word (.docx) and Latex (.tex) files. By nature, these online documents contain mark-up tags for tables in their source code. Intuitively, we can manipulate this source code by adding bounding boxes using the mark-up language within each document. For Word documents, the internal Office XML code can be modified so that the borderline of each table is identified. For Latex documents, the tex code can likewise be modified so that the bounding boxes of tables are recognized. In this way, high-quality labeled data is created for a variety of domains such as business documents, official filings, and research papers, which is tremendously beneficial for large-scale table analysis tasks.
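As a rough illustration of the Word side of this idea, the sketch below gives every table element in a WordprocessingML string an outer border in a distinctive color, so that the table outline could later be located in a rendered page image. It is not the pipeline used to build TableBank: the function name `mark_table_borders`, the marker color, and the border width are all assumptions made for this example, and the strict child ordering that the Office Open XML schema expects inside `w:tblPr` is not handled.

```python
import xml.etree.ElementTree as ET

# Standard WordprocessingML namespace used inside word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def w(name: str) -> str:
    """Return a namespace-qualified WordprocessingML tag or attribute name."""
    return f"{{{W_NS}}}{name}"


def mark_table_borders(document_xml: str, color: str = "FF0000"):
    """Give every <w:tbl> an outer border in a marker color (hypothetical helper).

    Returns the modified XML string and the number of tables that were marked.
    """
    root = ET.fromstring(document_xml)
    tables = list(root.iter(w("tbl")))  # snapshot before mutating the tree
    for tbl in tables:
        tbl_pr = tbl.find(w("tblPr"))
        if tbl_pr is None:
            # Table properties are expected as the first child of <w:tbl>.
            tbl_pr = ET.Element(w("tblPr"))
            tbl.insert(0, tbl_pr)
        # Replace any existing border definition with the marker border.
        old = tbl_pr.find(w("tblBorders"))
        if old is not None:
            tbl_pr.remove(old)
        borders = ET.SubElement(tbl_pr, w("tblBorders"))
        for side in ("top", "left", "bottom", "right"):
            ET.SubElement(
                borders,
                w(side),
                {w("val"): "single", w("sz"): "8", w("color"): color},
            )
    return ET.tostring(root, encoding="unicode"), len(tables)
```

In a real .docx file the XML would first have to be read from `word/document.xml` inside the zip archive and written back afterwards; that step is omitted here.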

In total, the TableBank dataset consists of 417,234 high-quality labeled tables, together with their original documents, in a variety of domains.

Statistics of TableBank

Based on the number of tables

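| Task                        | Word    | Latex   | Word+Latex |
|-----------------------------|---------|---------|------------|
| Table detection             | 163,417 | 253,817 | 417,234    |
| Table structure recognition | 56,866  | 88,597  | 145,463    |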

Based on the number of images

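| Task                        | Word   | Latex   | Word+Latex |
|-----------------------------|--------|---------|------------|
| Table detection             | 78,399 | 200,183 | 278,582    |
| Table structure recognition | 56,866 | 88,597  | 145,463    |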

Statistics on Train/Val/Test sets of Table Detection

| Source | Train   | Val    | Test  |
|--------|---------|--------|-------|
| Latex  | 187,199 | 7,265  | 5,719 |
| Word   | 73,383  | 2,735  | 2,281 |
| Total  | 260,582 | 10,000 | 8,000 |

Statistics on Train/Val/Test sets of Table Structure Recognition

| Source | Train   | Val    | Test  |
|--------|---------|--------|-------|
| Latex  | 79,486  | 6,075  | 3,036 |
| Word   | 50,977  | 3,925  | 1,964 |
| Total  | 130,463 | 10,000 | 5,000 |

Task Definition

Table Detection

Table detection aims to locate tables in a document using bounding boxes. Given a document page in image format, the task is to generate bounding boxes that represent the locations of the tables on that page.

Table Structure Recognition

Table structure recognition aims to identify the row and column layout of tables, especially in non-digital document formats such as scanned images. Given a table in image format, the task is to generate an HTML tag sequence that represents the arrangement of rows and columns as well as the type of each table cell.
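To make the target of this task more concrete, the sketch below turns a small table, given as a grid of cell strings, into a flat tag sequence that records row boundaries and whether each cell is filled or empty. The tag names (`<table>`, `<td_filled>`, `<td_empty>`) and the function `grid_to_tag_sequence` are illustrative assumptions, not the token vocabulary used in the TableBank annotations.

```python
from typing import List


def grid_to_tag_sequence(grid: List[List[str]]) -> List[str]:
    """Encode a table grid as a structure-only tag sequence (hypothetical tags).

    Each row is wrapped in <tr> ... </tr>; each cell becomes one token that
    only records its type: filled (has text) or empty.
    """
    tokens = ["<table>"]
    for row in grid:
        tokens.append("<tr>")
        for cell in row:
            # Whitespace-only cells are treated as empty.
            tokens.append("<td_filled>" if cell.strip() else "<td_empty>")
        tokens.append("</tr>")
    tokens.append("</table>")
    return tokens


if __name__ == "__main__":
    example = [["Task", "Word"], ["Detection", ""]]
    print(" ".join(grid_to_tag_sequence(example)))
```

A recognition model would be trained to emit such a sequence from the table image alone, so the cell text itself never appears in the output.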

Baselines

To verify the effectiveness of TableBank, we build several strong baselines using state-of-the-art end-to-end deep neural network models. The table detection model is based on the Faster R-CNN [Ren et al., 2015] architecture with different settings. The table structure recognition model is based on the encoder-decoder framework for image-to-text.

Data and Metrics

To evaluate table detection, we sample 18,000 document images from Word and Latex documents, of which 10,000 images are used for validation and 8,000 for testing. Each sampled image contains at least one table. We also evaluate our model on the ICDAR 2013 dataset to verify the effectiveness of TableBank. To evaluate table structure recognition, we sample 15,000 table images from Word and Latex documents, of which 10,000 images are used for validation and 5,000 for testing. For table detection, we calculate precision, recall and F1 in the way described in our paper, where the metrics over all documents are computed by summing up the areas of overlap, prediction and ground truth. For table structure recognition, we use the 4-gram BLEU score with a single reference as the evaluation metric.
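The area-based detection metric can be sketched as follows: precision is the summed overlap area divided by the summed predicted area, recall is the summed overlap area divided by the summed ground-truth area, and F1 is their harmonic mean. This is a minimal reading of the description above, not the official evaluation script. The function names and the `(x1, y1, x2, y2)` box format are assumptions, and the overlap is approximated by summing pairwise intersections, which is only exact when boxes within the same set do not overlap each other.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection_area(a: Box, b: Box) -> float:
    """Area shared by two axis-aligned boxes."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0.0


def detection_scores(
    pages: Sequence[Tuple[List[Box], List[Box]]]
) -> Tuple[float, float, float]:
    """Compute area-based precision, recall and F1 over all pages.

    `pages` holds one (predicted_boxes, ground_truth_boxes) pair per page.
    Areas are summed over the whole collection before any ratio is taken.
    """
    overlap = predicted = truth = 0.0
    for preds, gts in pages:
        predicted += sum(area(p) for p in preds)
        truth += sum(area(g) for g in gts)
        overlap += sum(intersection_area(p, g) for p in preds for g in gts)
    precision = overlap / predicted if predicted else 0.0
    recall = overlap / truth if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Because the areas are accumulated before dividing, pages with large tables weigh more than pages with small ones, which differs from averaging per-page scores.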

Table Detection

We use the open-source framework Detectron2 [Wu et al., 2019] to train models on TableBank. Detectron2 is a high-quality and high-performance codebase for object detection research, which supports many state-of-the-art algorithms. In this task, we use the Faster R-CNN algorithm with ResNeXt [Xie et al., 2016] as the backbone network architecture, where the parameters are pre-trained on the ImageNet dataset. All baselines are trained on 4 NVIDIA V100 GPUs using data-parallel synchronous SGD with a minibatch size of 20 images. For other parameters, we use the default values in Detectron2. During testing, the confidence threshold for generating bounding boxes is set to 90%.
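The effect of the 90% test-time threshold can be illustrated without any detection framework: only predictions whose score exceeds the threshold are kept as table boxes. The `Detection` type and `filter_detections` function below are hypothetical names for this sketch, and the strict comparison is a choice made here. In Detectron2 itself, a cutoff of this kind is normally set through the `MODEL.ROI_HEADS.SCORE_THRESH_TEST` config key rather than applied by hand.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    """One predicted table box with its confidence score (hypothetical type)."""
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    score: float


def filter_detections(
    detections: List[Detection], threshold: float = 0.9
) -> List[Detection]:
    """Keep only detections whose score is strictly above the threshold."""
    return [d for d in detections if d.score > threshold]


if __name__ == "__main__":
    raw = [
        Detection((40, 60, 520, 310), 0.97),
        Detection((45, 400, 500, 430), 0.42),
    ]
    for det in filter_detections(raw):
        print(det.box, det.score)
```

A high threshold like this trades recall for precision, which is worth keeping in mind when reading the results for models evaluated outside their training domain.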

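| Models           | Word Precision | Word Recall | Word F1 | Latex Precision | Latex Recall | Latex F1 | Word+Latex Precision | Word+Latex Recall | Word+Latex F1 |
|------------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| X101(Word)       | 0.9352 | 0.9398 | 0.9375 | 0.9905 | 0.5851 | 0.7356 | 0.9579 | 0.7474 | 0.8397 |
| X152(Word)       | 0.9418 | 0.9415 | 0.9416 | 0.9912 | 0.6882 | 0.8124 | 0.9641 | 0.8041 | 0.8769 |
| X101(Latex)      | 0.8453 | 0.9335 | 0.8872 | 0.9819 | 0.9799 | 0.9809 | 0.9159 | 0.9587 | 0.9368 |
| X152(Latex)      | 0.8476 | 0.9264 | 0.8853 | 0.9816 | 0.9814 | 0.9815 | 0.9173 | 0.9562 | 0.9364 |
| X101(Word+Latex) | 0.9178 | 0.9363 | 0.9270 | 0.9827 | 0.9784 | 0.9806 | 0.9526 | 0.9592 | 0.9559 |
| X152(Word+Latex) | 0.9229 | 0.9266 | 0.9247 | 0.9837 | 0.9752 | 0.9795 | 0.9557 | 0.9530 | 0.9543 |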

Table Structure Recognition

For table structure recognition, we use the open-source framework OpenNMT [Klein et al., 2017] to train the image-to-text model. OpenNMT is mainly designed for neural machine translation and supports many encoder-decoder frameworks. In this task, we train our model using the image-to-text method in OpenNMT. The model is also trained on 4 NVIDIA V100 GPUs with a learning rate of 1 and a batch size of 24. For other parameters, we use the default values in OpenNMT.

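| Models                     | Word  | Latex | Word+Latex |
|----------------------------|-------|-------|------------|
| Image-to-Text (Word)       | 59.18 | 69.76 | 65.75      |
| Image-to-Text (Latex)      | 51.45 | 71.63 | 63.08      |
| Image-to-Text (Word+Latex) | 69.93 | 77.94 | 74.54      |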
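For readers unfamiliar with the metric, the following is a minimal sketch of 4-gram BLEU with a single reference, applied to one predicted tag sequence: the geometric mean of the modified 1-gram to 4-gram precisions, multiplied by a brevity penalty. It is not the scoring script behind the numbers above. The function names are assumptions, no smoothing is applied, and the score is computed per sequence on a 0 to 1 scale rather than over a whole corpus.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate: List[str], reference: List[str]) -> float:
    """Unsmoothed 4-gram BLEU of one candidate against one reference."""
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, 5):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precision_sum += math.log(matched / total) / 4
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_precision_sum)


if __name__ == "__main__":
    reference = "a b c d e f g h".split()
    candidate = "a b c d e f g x".split()
    print(round(bleu4(candidate, reference), 4))  # prints 0.8409
```

In the example, the four precisions are 7/8, 6/7, 5/6 and 4/5, whose product is 0.5, so the score is the fourth root of 0.5.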

Model Zoo

The trained models are available for download in the TableBank Model Zoo.


Get Data and Leaderboard


**Please DO NOT re-distribute our data.**

If you use the corpus in published work, please cite it as described in the "Paper and Citation" section.

The annotations and original document pictures of the TableBank dataset can be downloaded from HuggingFace.


Paper and Citation

https://arxiv.org/abs/1903.01949

@misc{li2019tablebank,
    title={TableBank: A Benchmark Dataset for Table Detection and Recognition},
    author={Minghao Li and Lei Cui and Shaohan Huang and Furu Wei and Ming Zhou and Zhoujun Li},
    year={2019},
    eprint={1903.01949},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

References