Awesome

Document Layout Analysis repos for development with PdfPig.

From Wikipedia: Document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order. Detection and labeling of the different zones (or blocks) as text body, illustrations, math symbols, and tables embedded in a document is called geometric layout analysis. But text zones play different logical roles inside the document (titles, captions, footnotes, etc.) and this kind of semantic labeling is the scope of the logical layout analysis.

Related projects

PdfPig - Read text content from PDFs in C# (port of PdfBox)
camelot-sharp (port of camelot) - Extract tables from PDF files
tabula-sharp (port of tabula-java) - Extract tables from PDF files
PublayNetSharp - Extract and convert PubLayNet data to PageXml format
PublayNet-maskrcnn-mlnet - Using a MaskRCNN model trained on the PubLayNet dataset with ML.Net in C# / .Net for document layout analysis and page segmentation tasks.
PdfPig MLNet Block Classifier - Proof of concept of training a simple Region Classifier using PdfPig and ML.NET (LightGBM).
PdfPig SVM Region Classifier - Proof of concept of a simple SVM Region Classifier using PdfPig and Accord.Net.
simple-docstrum - A step-by-step implementation of the Docstrum algorithm for pdf documents

Cited by

LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis | Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, and Weining Li | website | github

Resources

Text extraction

High precision text extraction from PDF documents | Øyvind Raddum Berg
User-Guided Information Extraction from Print-Oriented Documents | Tamir Hassan
Combining Linguistic and Spatial Information for Document Analysis | Aiello, Monz and Todoran
New Methods for Metadata Extraction from Scientific Literature | Dominika Tkaczyk
A System for Converting PDF Documents into Structured XML Format | Hervé Déjean, Jean-Luc Meunier
Layout and Content Extraction for PDF Documents | Hui Chao, Jian Fan
DocParser: Hierarchical Structure Parsing of Document Renderings | J. Rausch, O. Martinez, F. Bissig, C. Zhang, and S. Feuerriegel

Word segmentation

An Efficient Word Segmentation Technique for Historical and Degraded Machine-Printed Documents | M. Makridis, N. Nikolaou, B. Gatos
Word Extraction Using Area Voronoi Diagram | Zhe Wang, Yue Lu, Chew Lim Tan
A word extraction algorithm for machine-printed documents using a 3D neighborhood graph model | Young-Jung Yu, Hwan-Gue Cho
Recognition of Multi-Oriented, Multi-Sized, and Curved Text | Yao-Yi Chiang, Craig A. Knoblock

example

Page segmentation

Performance Comparison of Six Algorithms for Page Segmentation | Faisal Shafait, Daniel Keysers, and Thomas M. Breuel
A Fast Algorithm for Bottom-Up Document Layout Analysis | Anikó Simon, Jean-Christophe Pret, and A. Peter Johnson
Empirical Performance Evaluation Methodology and its Application to Page Segmentation Algorithms: A Review | Pinky Gather, Avininder Singh
Layout Analysis based on Text Line Segment Hypotheses | Thomas M. Breuel
Hybrid Page Layout Analysis via Tab-Stop Detection | presentation | Ray Smith
Extending the Page Segmentation Algorithms of the Ocropus Document Layout Analysis System | Amy Alison Winder
Object-Level Document Analysis of PDF Files | Tamir Hassan
Document Image Segmentation as a Spectral Partitioning Problem | Dasigi, Jain and Jawahar
Benchmarking Page Segmentation Algorithms | S. Randriamasy, L. Vincent

Recursive XY Cut

The X-Y cut segmentation algorithm, also referred to as recursive X-Y cuts (RXYC) algorithm, is a tree-based top-down algorithm. The root of the tree represents the entire document page. All the leaf nodes together represent the final segmentation. The RXYC algorithm recursively splits the document into two or more smaller rectangular blocks which represent the nodes of the tree. At each step of the recursion, the horizontal and vertical projection profiles of each node are computed. Then, the valleys along the horizontal and vertical directions, VX and VY, are compared to corresponding predefined thresholds TX and TY. If the valley is larger than the threshold, the node is split at the mid-point of the wider of VX and VY into two children nodes. The process continues until no leaf node can be split further. Then, noise regions are removed using noise removal thresholds TnX and TnY. source example
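The recursion described above can be sketched roughly as follows. This is a minimal illustration, not the implementation from the cited source: it works on bounding boxes `(x0, y0, x1, y1)` instead of pixel projection profiles, it omits the noise removal step (TnX, TnY), and the names `widest_gap`, `xy_cut`, `tx` and `ty` are made up for this example. When both directions can be cut, the wider valley wins, and ties go to the vertical cut.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def widest_gap(intervals: List[Tuple[float, float]]) -> Tuple[float, Optional[float]]:
    """Return (width, mid-point) of the widest empty gap between 1D intervals.

    Returns (0.0, None) when the intervals leave no gap between them.
    """
    ordered = sorted(intervals)
    best_width, best_mid = 0.0, None
    reach = ordered[0][1]  # right-most coordinate covered so far
    for lo, hi in ordered[1:]:
        if lo - reach > best_width:
            best_width, best_mid = lo - reach, (lo + reach) / 2
        reach = max(reach, hi)
    return best_width, best_mid


def xy_cut(boxes: List[Box], tx: float, ty: float) -> List[List[Box]]:
    """Recursively split boxes into blocks; each returned list is one leaf node."""
    if len(boxes) <= 1:
        return [boxes] if boxes else []

    # Valleys of the projections onto the x axis (VX) and the y axis (VY).
    vx, cut_x = widest_gap([(b[0], b[2]) for b in boxes])
    vy, cut_y = widest_gap([(b[1], b[3]) for b in boxes])
    can_x, can_y = vx > tx, vy > ty

    if not can_x and not can_y:
        return [boxes]  # no valley exceeds its threshold: this node is a leaf

    if can_x and (not can_y or vx >= vy):
        # Vertical cut at the mid-point of the widest horizontal valley.
        first = [b for b in boxes if b[2] <= cut_x]
        second = [b for b in boxes if b[0] >= cut_x]
    else:
        # Horizontal cut at the mid-point of the widest vertical valley.
        first = [b for b in boxes if b[3] <= cut_y]
        second = [b for b in boxes if b[1] >= cut_y]
    return xy_cut(first, tx, ty) + xy_cut(second, tx, ty)
```

For a page with two columns of two lines each, for instance `[(0, 0, 40, 10), (0, 15, 40, 25), (60, 0, 100, 10), (60, 15, 100, 25)]`, calling `xy_cut(boxes, tx=10, ty=6)` yields one block per column, because the 20-unit gutter exceeds `tx` while the 5-unit line spacing stays below `ty`.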

Recursive X-Y Cut using Bounding Boxes of Connected Components | Jaekyu Ha, Robert M. Haralick and Ihsin T. Phillips

Docstrum

The Docstrum algorithm by O'Gorman is a bottom-up approach based on nearest-neighbor clustering of connected components extracted from the document image. After noise removal, the connected components are separated into two groups, one with dominant characters and another one with characters in titles and section headings, using a character size ratio factor fd. Then, K nearest neighbors are found for each connected component. Then, text-lines are found by computing the transitive closure on within-line nearest neighbor pairings using a threshold ft. Finally, text-lines are merged to form text blocks using a parallel distance threshold fpa and a perpendicular distance threshold fpe. source example or example
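The nearest-neighbor and transitive-closure steps can be illustrated with a small sketch. It is a simplification under stated assumptions: components are reduced to their centroids, the within-line test is a fixed distance and angle cut-off (`max_dist`, `max_angle_deg`) rather than the ft factor applied to an estimated character spacing, and the size grouping (fd) and block merging (fpa, fpe) steps are left out. All function and parameter names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def k_nearest(points: List[Point], k: int) -> List[List[int]]:
    """For each point, return the indices of its k nearest neighbours (brute force)."""
    result = []
    for i, p in enumerate(points):
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        result.append([j for _, j in ranked[:k]])
    return result


def text_lines(points: List[Point], k: int = 2,
               max_dist: float = 15.0, max_angle_deg: float = 30.0) -> List[List[int]]:
    """Group component centroids into text lines.

    A neighbour pair counts as within-line when it is close enough and roughly
    horizontal; the transitive closure of those pairs is taken with union-find.
    """
    parent = list(range(len(points)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, neighbours in enumerate(k_nearest(points, k)):
        for j in neighbours:
            dx = points[j][0] - points[i][0]
            dy = points[j][1] - points[i][1]
            angle = abs(math.degrees(math.atan2(dy, dx)))
            angle = min(angle, 180.0 - angle)  # fold into [0, 90] degrees
            if math.hypot(dx, dy) <= max_dist and angle <= max_angle_deg:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

With two rows of three centroids spaced 10 units apart horizontally and 20 units apart vertically, the default thresholds group each row into its own line. A real implementation would derive the thresholds from the distance and angle histograms of the neighbour pairs, which is what the name Docstrum (document spectrum) refers to.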

The Document Spectrum for Page Layout Analysis | Lawrence O'Gorman
Document Structure and Layout Analysis | Anoop M. Namboodiri and Anil K. Jain
Document Layout Analysis | Garrett Hoch

Voronoi

The Voronoi-diagram based segmentation algorithm by Kise et al. is also a bottom-up algorithm. In the first step, it extracts sample points from the boundaries of the connected components using a sampling rate sr. Then, noise removal is done using a maximum noise zone size threshold nm, in addition to width, height, and aspect ratio thresholds. After that the Voronoi diagram is generated using sample points obtained from the borders of the connected components. Superfluous Voronoi edges are deleted using a criterion involving the area ratio threshold ta, and the inter-line spacing margin control factor fr. Since we evaluate all algorithms on document pages with Manhattan layouts, a modified version of the algorithm is used to generate rectangular zones. source

Voronoi++: A Dynamic Page Segmentation approach based on Voronoi and Docstrum features | Mudit Agrawal and David Doermann

Constrained text-line detection

The layout analysis approach by Breuel finds text-lines as a two-step process:

Find tall whitespace rectangles and evaluate them as candidates for gutters, column separators, etc. The algorithm for finding maximal empty whitespace is described in Breuel. The whitespace rectangles are returned in order of decreasing quality and are allowed a maximum overlap of Om.
The whitespace rectangles representing the columns are used as obstacles in a robust least-squares, globally optimal text-line detection algorithm. Then, the bounding box of all the characters making the text-line is computed. The method was merely intended by its author as a demonstration of the application of two geometric algorithms, and not as a complete layout analysis system; nevertheless, we included it in the comparison because it has already proven useful in some applications. It is also nearly parameter free and resolution independent. source
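The first step, finding maximal empty whitespace, is commonly described as a branch-and-bound search over sub-rectangles of the page. The sketch below illustrates that idea under simplifying assumptions: rectangles are axis-aligned `(x0, y0, x1, y1)` tuples, the quality of a rectangle is simply its area (the column-separator use case would favour tall rectangles instead), only the single best rectangle is returned, and the overlap limit Om is not modelled. The names `largest_whitespace`, `area` and `overlaps` are hypothetical.

```python
import heapq
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def area(r: Rect) -> float:
    """Area of a rectangle; degenerate or inverted rectangles count as 0."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def overlaps(a: Rect, b: Rect) -> bool:
    """True when the interiors of a and b intersect (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def largest_whitespace(bound: Rect, obstacles: List[Rect]) -> Optional[Rect]:
    """Return the largest-area rectangle inside `bound` that overlaps no obstacle."""
    # Max-heap on area (negated, since heapq is a min-heap). The area of a
    # candidate is an upper bound on any empty rectangle it contains.
    heap = [(-area(bound), bound, [o for o in obstacles if overlaps(o, bound)])]
    while heap:
        _, r, obs = heapq.heappop(heap)
        if not obs:
            return r  # best remaining candidate is already empty
        # Pivot: the obstacle whose centre is closest to the candidate's centre.
        cx, cy = (r[0] + r[2]) / 2, (r[1] + r[3]) / 2
        pivot = min(obs, key=lambda o: ((o[0] + o[2]) / 2 - cx) ** 2
                                       + ((o[1] + o[3]) / 2 - cy) ** 2)
        # Any empty rectangle must lie entirely on one side of the pivot.
        candidates = [
            (r[0], r[1], pivot[0], r[3]),  # left of the pivot
            (pivot[2], r[1], r[2], r[3]),  # right of the pivot
            (r[0], r[1], r[2], pivot[1]),  # smaller-y side of the pivot
            (r[0], pivot[3], r[2], r[3]),  # larger-y side of the pivot
        ]
        for c in candidates:
            if area(c) > 0:
                heapq.heappush(heap, (-area(c), c, [o for o in obs if overlaps(o, c)]))
    return None
```

For a 100 by 100 page whose two text columns occupy `(0, 0, 45, 100)` and `(55, 0, 100, 100)`, the search returns the gutter `(45, 0, 55, 100)`. Returning several rectangles in order of decreasing quality would require continuing the search and treating the rectangles already found as additional obstacles, subject to the permitted overlap.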

Two Geometric Algorithms for Layout Analysis | Thomas M. Breuel
High precision text extraction from PDF documents | Øyvind Raddum Berg
High Performance Document Layout Analysis | Thomas M. Breuel

PDF/A standard

PDF/A-1a compliant documents make the following information available:

Language specification
Hierarchical document structure
Tagged text spans and descriptive text for images and symbols
Character mappings to Unicode

Zone classification/extraction & Reading order

Page Segmentation and Zone Classification: The State of the Art | O. Okun, D. Doermann, M. Pietikainen
Looking Beyond Text: Extracting Figures, Tables and Captions from Computer Science Papers | C. Clark, S. Divvala
PDFFigures 2.0: Mining Figures from Research Papers | C. Clark, S. Divvala
Document image zone classification: A simple high-performance approach | D. Keysers, F. Shafait, T. M. Breuel
Document-Zone Classification using Partial Least Squares and Hybrid Classifiers | W. Abd-Almageed, M. Agrawal, W. Seo, D. Doermann
The Zonemap Metric for Page Segmentation and Area Classification in Scanned Documents | O. Galibert, J. Kahn and I. Oparin
Layout analysis and content classification in digitized books | A. Corbelli, L. Baraldi, F. Balducci, C. Grana, R. Cucchiara

Reading order

Unsupervised document structure analysis of digital scientific articles | S. Klampfl, M. Granitzer, K. Jack, R. Kern
- Categorization of text blocks - Decorations
- Reading order
Document understanding for a broad class of documents | M. Aiello, C. Monz, L. Todoran, M. Worring
A Data Mining Approach to Reading Order Detection | M. Ceci, M. Berardi, G. A. Porcelli

Chart and diagram

FigureSeer: Parsing Result-Figures in Research Papers | N. Siegel, Z. Horvitz, R. Levin, S. Divvala, and A. Farhadi
Extraction, layout analysis and classification of diagrams in PDF documents | Robert P. Futrelle, Mingyan Shao, Chris Cieslik and Andrea Elaina Grimes
Graphics Recognition in PDF documents | Mingyan Shao and Robert P. Futrelle
A Study on the Document Zone Content Classification Problem | Yalin Wang, Ihsin T. Phillips, and Robert M. Haralick
Text/Figure Separation in Document Images Using Docstrum Descriptor and Two-Level Clustering | Valery Anisimovskiy, Ilya Kurilin, Andrey Shcherbinin, Petr Pohl
CHART-Synthetic
Looking Beyond Text: Extracting Figures, Tables and Captions from Computer Science Papers | Christopher Clark and Santosh Divvala | website
Metrics for Evaluating Data Extraction from Charts | Adobe Research | github

Mathematical expression

A Font Setting Based Bayesian Model to Extract Mathematical Expression in PDF Files | Xing Wang, Jyh-Charn Liu
Mathematical Formula Identification in PDF Documents | Xiaoyan Lin, Liangcai Gao, Zhi Tang, Xiaofan Lin
Faithful Mathematical Formula Recognition from PDF Documents | Josef B. Baker, Alan P. Sexton and Volker Sorge
Extracting Precise Data from PDF Documents for Mathematical Formula Recognition | Josef B. Baker, Alan P. Sexton and Volker Sorge
Mathematical formula identification and performance evaluation in PDF documents | Xiaoyan Lin, Liangcai Gao, Zhi Tang, Josef Baker, Volker Sorge

Margins recognition

Finding blocks of text in an image using Python, OpenCV and numpy
Notes on the margins: how to extract them using image segmentation, Google Vision API, and R
A mixed approach to auto-detection of page body | Liangcai Gao, Zhi Tang, Ruiheng Qiu
Header and Footer Extraction by Page-Association | Xiaofan Lin
A System for Converting PDF Documents into Structured XML Format | Hervé Déjean, Jean-Luc Meunier

NLP & ML

A Graphical Approach to Document Layout Analysis | J. Wang, M. Krumdick, B. Tong, H. Halim, M. Sokolov, V. Barda, D. Vendryes, C. Tanner
Chargrid: Towards Understanding 2D Documents | A. R. Katti, C. Reisswig, C. Guder, S. Brarda, S. Bickel, J. Höhne, J. B. Faddoul | medium
Chargrid-OCR: End-to-end trainable Optical Character Recognition through Semantic Segmentation and Object Detection | C. Reisswig, A. R. Katti, M. Spinaci, J. Höhne | slides
BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding | Timo I. Denk, Christian Reisswig | slides
LayoutLM: Pre-Training of Text and Layout for Document Image Understanding | Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou | github
Detect2Rank: Combining Object Detectors Using Learning to Rank | S. Karaoglu, Y. Liu, T. Gevers
DocParser: Hierarchical Structure Parsing of Document Renderings | J. Rausch, O. Martinez, F. Bissig, C. Zhang, and S. Feuerriegel | github | medium
LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis | Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, and Weining Li | website | github

Pre-trained models

Workshops

Workshop on Document Intelligence (DI 2019) at NeurIPS 2019

Datasets

DocBank: A Benchmark Dataset for Document Layout Analysis | M. Li, Y. Xu, L. Cui, S. Huang, F. Wei, Z. Li, M. Zhou | github
PubLayNet: largest dataset ever for document layout analysis | Zhong, Tang and Yepes | github | ibm article
DocParser: Hierarchical Structure Parsing of Document Renderings | J. Rausch, O. Martinez, F. Bissig, C. Zhang, and S. Feuerriegel
TableBank: Table Benchmark for Image-based Table Detection and Recognition | M. Li, L. Cui, S. Huang, F. Wei, M. Zhou and Z. Li
Document Image Datasets | Jonathan DeGange

Output file format

hOCR: hocr spec
ALTO XML: alto schema
TEI: tei-ocr | schema
PAGE: PAGE-XML

Validate and transform between OCR file formats (hOCR, ALTO, PAGE, FineReader)

Pdf page to image converter

A Pdf page to image converter is available to help in the research process. It relies on the mupdf library, available in the sumatra pdf reader.

Pdf layout analysis viewer

A Pdf layout analysis viewer is also available. It likewise relies on the mupdf library.
