Awesome Document Understanding

A curated list of resources on Document Understanding (DU), a topic related to Intelligent Document Processing (IDP), which in turn relates to Robotic Process Automation (RPA) from unstructured data, especially from Visually Rich Documents (VRDs).

Note 1: bolded entries are more important than the others.

Note 2: due to the novelty of the field, this list is under construction - contributions are welcome (thank you in advance!). Please remember to use the following convention:

<br/><br/>

<p align="center"> <a href="https://openreview.net/forum?id=rNs2FvJGDK"> <img src="images/du_example.png"> </a> </p> <br/><br/>

Table of contents

  1. Introduction
  2. Research topics
    1. Key Information Extraction (KIE)
    2. Document Layout Analysis (DLA)
    3. Document Question Answering (DQA)
    4. Scientific Document Understanding (SDU)
    5. Optical Character Recognition (OCR)
    6. Related
      1. General
      2. Tabular Data Comprehension (TDC)
      3. Robotic Process Automation (RPA)
  3. Others
    1. Resources
      1. Datasets for Pre-training Language Models
      2. PDF processing tools
    2. Conferences / workshops
    3. Blogs
    4. Solutions
  4. Examples
    1. Visually Rich Documents (VRDs)
    2. Key Information Extraction (KIE)
    3. Document Layout Analysis (DLA)
    4. Document Question Answering (DQA)
  5. Inspirations

Introduction

Documents are a core part of many businesses in many fields such as law, finance, and technology among others. Automatic understanding of documents such as invoices, contracts, and resumes is lucrative, opening up many new avenues of business. The fields of natural language processing and computer vision have seen tremendous progress through the development of deep learning such that these methods have started to become infused in contemporary document understanding systems. source

Papers

2023

2022

2021

2020

2018

Older

Research topics

Others

Resources

Back to top

Datasets for Pre-training Language Models

  1. The RVL-CDIP Dataset - dataset consists of 400,000 grayscale images in 16 classes, with 25,000 images per class
  2. The Industry Documents Library - a portal to millions of documents created by industries that influence public health, hosted by the UCSF Library
  3. Color Document Dataset - from the Intelligent Sensory Information Systems, University of Amsterdam
  4. The IIT CDIP Collection - around 7 million documents from the states' lawsuit against the tobacco industry in the 1990s

PDF processing tools

  1. borb - a pure Python library to read, write, and manipulate PDF documents. It represents a PDF document as a JSON-like data structure of nested lists, dictionaries, and primitives (numbers, strings, booleans, etc.).
  2. pawls - PDF Annotations with Labels and Structure is software that makes it easy to collect a series of annotations associated with a PDF document
  3. pdfplumber - Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging
  4. Pdfminer.six - Pdfminer.six is a community maintained fork of the original PDFMiner. It is a tool for extracting information from PDF documents. It focuses on getting and analyzing text data
  5. Layout Parser - Layout Parser is a deep learning based tool for document image layout analysis tasks
  6. Tabulo - Table extraction from images
  7. OCRmyPDF - OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted
  8. PDFBox - The Apache PDFBox library is an open source Java tool for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents
  9. PdfPig - This project allows users to read and extract text and other content from PDF files. In addition the library can be used to create simple PDF documents containing text and geometrical shapes. This project aims to port PDFBox to C#
  10. parsing-prickly-pdfs - Resources and worksheet for the NICAR 2016 workshop of the same name
  11. pdf-text-extraction-benchmark - PDF tools benchmark
  12. Born digital pdf scanner - checks whether a PDF is born-digital
  13. OpenContracts - Apache2-licensed PDF annotation platform for visually rich documents that preserves the original layout and exports x,y positional data for tokens as well as span starts and stops. Based on PAWLS, but with a Python-based backend and readily deployable on your local machine, company intranet, or the web via Docker Compose.
  14. deepdoctection - a Python library that orchestrates document extraction and document layout analysis tasks for images and PDF documents using deep learning models. It does not implement models itself but lets you build pipelines from widely used libraries for object detection, OCR, and selected NLP tasks, and provides an integrated framework for fine-tuning, evaluating, and running models.
  15. pydoxtools - an AI-composition library for document analysis. It features an extensive toolset for building complex document analysis pipelines and recognizes most document formats out of the box. It supports typical NLP tasks such as keyword extraction, summarization, and question answering, features a high-quality, low-CPU/memory table extraction algorithm, and makes NLP batch operations on a cluster easy.

Conferences, workshops

General / Business / Finance

  1. International Conference on Document Analysis and Recognition (ICDAR) [2021, 2019, 2017]
  2. Workshop on Document Intelligence (DI) [2021, 2019]
  3. Financial Narrative Processing Workshop (FNP) [2021, 2020, 2019]
  4. Workshop on Economics and Natural Language Processing (ECONLP) [2021, 2019, 2018]
  5. International Workshop on Document Analysis Systems (DAS) [2020, 2018, 2016]
  6. ACM International Conference on AI in Finance (ICAIF)
  7. The AAAI-21 Workshop on Knowledge Discovery from Unstructured Data in Financial Services
  8. CVPR 2020 Workshop on Text and Documents in the Deep Learning Era
  9. KDD Workshop on Machine Learning in Finance (KDD MLF 2020)
  10. FinIR 2020: The First Workshop on Information Retrieval in Finance
  11. 2nd KDD Workshop on Anomaly Detection in Finance (KDD 2019)
  12. Document Understanding Conference (DUC 2007)

Scientific Document Understanding

  1. The AAAI-21 Workshop on Scientific Document Understanding (SDU 2021)
  2. First Workshop on Scholarly Document Processing (SDProc 2020)
  3. International Workshop on SCIentific DOCument Analysis (SCIDOCA) [2020, 2018, 2017]

Blogs

  1. A Survey of Document Understanding Models, 2021
  2. Document Form Extraction, 2021
  3. How to automate processes with unstructured data, 2021
  4. A Comprehensive Guide to OCR with RPA and Document Understanding, 2021
  5. Information Extraction from Receipts with Graph Convolutional Networks, 2021
  6. How to extract structured data from invoices, 2021
  7. Extracting Structured Data from Templatic Documents, 2020
  8. To apply AI for good, think form extraction, 2020
  9. UiPath Document Understanding Solution Architecture and Approach, 2020
  10. How Can I Automate Data Extraction from Complex Documents?, 2020
  11. LegalTech: Information Extraction in legal documents, 2020

Solutions

Big companies:

  1. ABBYY
  2. Accenture
  3. Amazon
  4. Google
  5. Microsoft
  6. UiPath

Smaller:

  1. Applica.ai
  2. Base64.ai
  3. Docstack
  4. Element AI
  5. Indico
  6. Instabase
  7. Konfuzio
  8. Metamaze
  9. Nanonets
  10. Rossum
  11. Silo

Examples

Visually Rich Documents

In VRDs, layout information is crucial to understanding the whole document correctly (this is the case with almost all business documents). For humans, spatial information improves readability and speeds up document understanding.

Invoice / Resume / Job Ad

<p align="center"> <a href="https://arxiv.org/pdf/2005.11017.pdf"> <img src="images/vrd_examples_2v2.png"> </a> </p> <br/><br/>

NDA / Annual reports

<p align="center"> <a href="https://arxiv.org/abs/2003.02356"> <img src="images/vrd_examples_1.png"> </a> </p> <br/><br/>

Key Information Extraction

The aim of this task is to extract texts of a number of key fields from a given collection of documents containing similar key entities.
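
As a rough illustration of the task, the sketch below pulls three key fields (company, date, total) out of OCR text with regular expressions. The receipt text, the field names, and the patterns are all made up for this example; practical KIE systems typically learn to locate fields from text and layout features rather than relying on hand-written rules.

```python
import re

# Hypothetical OCR output of a scanned receipt (illustrative, not taken from any dataset).
RECEIPT_TEXT = """\
ACME STORE SDN BHD
12 Example Road, Springfield
Date: 2019-03-14
Item A        4.50
Item B       10.00
TOTAL        14.50
"""

# One hand-written pattern per key field (an assumption made for this sketch).
FIELD_PATTERNS = {
    # First non-empty line of the receipt.
    "company": re.compile(r"\A\s*(?P<value>.+)"),
    # ISO-style date following a "Date:" label.
    "date": re.compile(r"Date:\s*(?P<value>\d{4}-\d{2}-\d{2})"),
    # Amount on a line that starts with "TOTAL".
    "total": re.compile(r"^TOTAL\s+(?P<value>\d+\.\d{2})\s*$", re.MULTILINE),
}


def extract_key_fields(text: str) -> dict:
    """Map each field name to its extracted text, or None if the field is not found."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group("value").strip() if match else None
    return fields


if __name__ == "__main__":
    print(extract_key_fields(RECEIPT_TEXT))
```

Returning `None` for a missing field keeps the output schema identical across documents, which makes it easier to compare extractions over a whole collection.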

<br/>

Scanned Receipts

<p align="center"> <a href="https://medium.com/analytics-vidhya/extracting-structured-data-from-invoice-96cf5e548e40"> <img src="images/kie_examples_1.png"> </a> </p> <br/><br/>

NDA / Annual reports

Examples of real business applications and data for the Kleister datasets (key entities are in blue).

<p align="center"> <a href="https://arxiv.org/abs/2003.02356"> <img src="images/kie_examples_2.png"> </a> </p> <br/><br/>

Multimedia Online Flyers

An example of a commercial real estate flyer and manually entered listing information © ProMaker Commercial Real Estate LLC, © BrokerSavant Inc.

<p align="center"> <a href="https://www.aclweb.org/anthology/N15-1032.pdf"> <img src="images/kie_examples_3.png"> </a> </p> <br/><br/>

Value-added tax invoice

<p align="center"> <a href="https://arxiv.org/pdf/1903.11279.pdf"> <img src="images/kie_examples_4.png"> </a> </p> <br/><br/>

Webpages

<p align="center"> <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/StructedDataExtraction_SIGIR2011.pdf"> <img src="images/kie_examples_5.png"> </a> </p> <br/><br/>

Document Layout Analysis

In computer vision or natural language processing, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order. Detection and labeling of the different zones (or blocks) as text body, illustrations, math symbols, and tables embedded in a document is called geometric layout analysis. But text zones play different logical roles inside the document (titles, captions, footnotes, etc.) and this kind of semantic labeling is the scope of the logical layout analysis. (https://en.wikipedia.org/wiki/Document_layout_analysis)
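
The two steps mentioned above, separating text zones from non-textual ones and arranging them in reading order, can be sketched in a few lines. This is a minimal sketch under strong assumptions: zones are already detected and labeled, the page has a single column, and the `Region` class, the `line_tolerance` parameter, and the sample page are hypothetical names and data introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A detected zone; coordinates are in pixels, origin at the top-left corner."""
    label: str      # geometric label, e.g. "text", "figure", "table"
    x: int          # left edge
    y: int          # top edge
    text: str = ""  # recognized content, empty for non-textual zones


def reading_order(regions, line_tolerance=10):
    """Drop non-textual zones, then sort text zones top-to-bottom and left-to-right.

    Top edges are bucketed into bands of `line_tolerance` pixels, so zones whose
    tops fall into the same band are treated as one row and ordered by x.
    """
    text_zones = [r for r in regions if r.label == "text"]
    return sorted(text_zones, key=lambda r: (r.y // line_tolerance, r.x))


# Hypothetical page with zones listed in arbitrary detection order.
PAGE = [
    Region("text", x=300, y=104, text="Right paragraph"),
    Region("figure", x=50, y=200),
    Region("text", x=50, y=400, text="Footnote"),
    Region("text", x=50, y=100, text="Left paragraph"),
    Region("text", x=50, y=20, text="Title"),
]

if __name__ == "__main__":
    for region in reading_order(PAGE):
        print(region.text)
```

Multi-column layouts need a column-detection step before this kind of sort, which is one reason learned layout models are used in practice.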

Scientific publication

<p align="center"> <a href="https://arxiv.org/pdf/1908.07836.pdf"> <img src="images/dla_examples_1.png"> </a> </p> <br/><br/> <p align="center"> <a href="https://arxiv.org/pdf/2006.01038.pdf"> <img src="images/dla_examples_2.png"> </a> </p> <br/><br/>

Historical newspapers

<p align="center"> <a href="https://primaresearch.org/www/assets/papers/ICDAR2015_Clausner_ENPDataset.pdf"> <img src="images/dla_examples_3.png"> </a> </p> <br/><br/>

Business documents

Red: text block, Blue: figure.

<p align="center"> <a href="http://personal.psu.edu/duh188/papers/ICDAR2017_DAFANG.pdf"> <img src="images/dla_examples_4.png"> </a> </p> <br/><br/>

Document Question Answering

DocVQA example

<p align="center"> <a href="https://arxiv.org/pdf/2007.00398.pdf"> <img src="images/dqa_example_2.png"> </a> </p> <br/><br/>

TILT model demo

<p align="center"> <a href="https://arxiv.org/pdf/2102.09550.pdf"> <img src="images/dqa_example_1.gif"> </a> </p> <br/><br/>

Inspirations

Domain

  1. https://github.com/kba/awesome-ocr
  2. https://github.com/Liquid-Legal-Institute/Legal-Text-Analytics
  3. https://github.com/icoxfog417/awesome-financial-nlp
  4. https://github.com/BobLd/DocumentLayoutAnalysis
  5. https://github.com/bikash/DocumentUnderstanding
  6. https://github.com/harpribot/awesome-information-retrieval
  7. https://github.com/roomylee/awesome-relation-extraction
  8. https://github.com/caufieldjh/awesome-bioie
  9. https://github.com/HelloRusk/entity-related-papers
  10. https://github.com/pliang279/awesome-multimodal-ml
  11. https://github.com/thunlp/LegalPapers
  12. https://github.com/heartexlabs/awesome-data-labeling

General AI/DL/ML

  1. https://github.com/jsbroks/awesome-dataset-tools
  2. https://github.com/EthicalML/awesome-production-machine-learning
  3. https://github.com/eugeneyan/applied-ml
  4. https://github.com/awesomedata/awesome-public-datasets
  5. https://github.com/keon/awesome-nlp
  6. https://github.com/thunlp/PLMpapers
  7. https://github.com/jbhuang0604/awesome-computer-vision#awesome-lists
  8. https://github.com/papers-we-love/papers-we-love
  9. https://github.com/BAILOOL/DoYouEvenLearn
  10. https://github.com/hibayesian/awesome-automl-papers