PdfPig SVM Region Classifier

A proof of concept of a simple Support Vector Machine (SVM) region classifier built with PdfPig and Accord.NET. The model was trained on a subset of the PubLayNet dataset. See their license here.

The objective is to use machine learning to classify each text block in a PDF document page as one of title, text, list, table or image.

The annotations from the dataset (see sample here) were converted to the PAGE XML format; the PageXmlConverter converts the JSON file into PAGE XML files. Images from the dataset were not used, because the features are computed directly from the PDF documents. You will need to download the PDF documents separately.

Labels

As in the PubLayNet methodology, the following categories are available:

Label	id (svm)
title	0
text	1
list	2
table	3
image	4
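
For reference when reading the model's output, the table above can be captured as a small enumeration. This is a sketch in Python rather than the project's C#; the names `RegionLabel` and `label_name` are hypothetical and not taken from the repository.

```python
from enum import IntEnum


class RegionLabel(IntEnum):
    """Hypothetical enum mirroring the label/id table above."""
    TITLE = 0
    TEXT = 1
    LIST = 2
    TABLE = 3
    IMAGE = 4


def label_name(svm_id: int) -> str:
    """Map an SVM class id back to its lowercase label name."""
    return RegionLabel(svm_id).name.lower()
```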

Features

Text

Character count
Percentage of numeric characters
Percentage of alphabetical characters
Percentage of symbolic characters
Percentage of bullet characters
Average difference between the block's glyph heights and the page's average glyph height
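
The text features listed above could be computed along the following lines. This is a minimal Python sketch, not the project's FeatureHelper: the bullet glyph set, the choice to ignore whitespace, the use of fractions in [0, 1] rather than percentages, and the reading of the glyph height feature as "mean block height minus mean page height" are all assumptions.

```python
# Assumed set of bullet glyphs; the project's own list may differ.
BULLET_CHARS = frozenset("•‣◦▪●○■")


def text_features(text: str) -> dict:
    """Character-class ratios for one block, as fractions in [0, 1]."""
    chars = [c for c in text if not c.isspace()]  # whitespace ignored (assumption)
    n = len(chars)
    if n == 0:
        return {"char_count": 0, "numeric": 0.0, "alphabetical": 0.0,
                "symbolic": 0.0, "bullet": 0.0}
    numeric = sum(c.isdigit() for c in chars)
    alphabetical = sum(c.isalpha() for c in chars)
    bullet = sum(c in BULLET_CHARS for c in chars)
    # Anything that is not a digit, a letter or a bullet counts as a symbol.
    symbolic = n - numeric - alphabetical - bullet
    return {
        "char_count": n,
        "numeric": numeric / n,
        "alphabetical": alphabetical / n,
        "symbolic": symbolic / n,
        "bullet": bullet / n,
    }


def glyph_height_delta(block_heights, page_heights):
    """Mean of (glyph height - page average height) over the block's glyphs."""
    page_avg = sum(page_heights) / len(page_heights)
    return sum(h - page_avg for h in block_heights) / len(block_heights)
```

A block such as "• Item 42:" has a high alphabetical ratio and a non-zero bullet ratio, which is the kind of signal that could help separate list items from plain text.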

Paths

Path count
Percentage of Bezier curve paths
Percentage of horizontal paths
Percentage of vertical paths
Percentage of oblique paths
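
As an illustration of the path features above, the sketch below classifies straight segments by orientation and reports each category as a fraction of the path count. It is written in Python under stated assumptions: the `(kind, coords)` tuple representation, the function names and the `tol` tolerance are hypothetical, and the project may define a path and its orientation differently.

```python
def segment_orientation(x1, y1, x2, y2, tol=0.5):
    """Classify a straight segment; `tol` is an assumed tolerance in page units."""
    if abs(y2 - y1) <= tol:
        return "horizontal"  # zero-length segments also land here
    if abs(x2 - x1) <= tol:
        return "vertical"
    return "oblique"


def path_features(paths):
    """`paths` is a list of (kind, coords) tuples, kind being "line" or "bezier"."""
    counts = {"bezier": 0, "horizontal": 0, "vertical": 0, "oblique": 0}
    for kind, coords in paths:
        if kind == "bezier":
            counts["bezier"] += 1
        else:
            # For lines, coords is (x1, y1, x2, y2).
            counts[segment_orientation(*coords)] += 1
    n = len(paths)
    features = {"path_count": n}
    for name, count in counts.items():
        features[name] = count / n if n else 0.0
    return features
```

A block dominated by horizontal and vertical paths is a plausible hint of table rulings, while Bezier curves are more typical of figures.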

Images

Image count
Average area covered by images

Code

The GenerateData class generates a CSV file of features from the PDF documents and their respective PAGE XML ground truth (one XML document per page). The FeatureHelper class builds the feature vector for a single block.

Results (in sample)

Accuracy

Model accuracy = 90.898%

Normalised confusion matrix

Confusion matrix

	title	text	list	table	image
title	9312	1592	19	3	135
text	1166	37136	988	820	32
list	0	1	32	0	0
table	0	16	4	1092	3
image	0	0	0	0	154
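
The accuracy figure can be recomputed from the counts above as the sum of the diagonal divided by the total. The Python sketch below does this and also normalises each row to sum to 1; the function names are illustrative, and dividing by row totals is an assumption, since the normalisation axis used by the project is not stated here.

```python
LABELS = ["title", "text", "list", "table", "image"]

# Counts copied from the confusion matrix above, row by row.
CONFUSION = [
    [9312, 1592, 19, 3, 135],
    [1166, 37136, 988, 820, 32],
    [0, 1, 32, 0, 0],
    [0, 16, 4, 1092, 3],
    [0, 0, 0, 0, 154],
]


def accuracy(matrix):
    """Correctly classified blocks (diagonal) over all blocks."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def normalise_rows(matrix):
    """Divide each row by its total so that it sums to 1 (assumed axis)."""
    normalised = []
    for row in matrix:
        row_total = sum(row)
        normalised.append([v / row_total if row_total else 0.0 for v in row])
    return normalised
```

With these counts, 47726 of 52505 blocks lie on the diagonal, which matches the reported accuracy.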

Precision, Recall and F1 score

	Precision	Recall	F1 score
title	0.842	0.889	0.865
text	0.925	0.958	0.941
list	0.970	0.031	0.059
table	0.979	0.570	0.721
image	1.000	0.475	0.644
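
The scores above can be reproduced from the confusion matrix counts. The Python sketch below treats rows as predicted labels and columns as actual labels; that orientation is an assumption, chosen because it is the one that matches the reported precision and recall. The function name is illustrative.

```python
LABELS = ["title", "text", "list", "table", "image"]

# Counts copied from the confusion matrix section.
CONFUSION = [
    [9312, 1592, 19, 3, 135],
    [1166, 37136, 988, 820, 32],
    [0, 1, 32, 0, 0],
    [0, 16, 4, 1092, 3],
    [0, 0, 0, 0, 154],
]


def per_class_scores(matrix, labels):
    """Return {label: (precision, recall, f1)}, rows taken as predictions."""
    scores = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        predicted = sum(matrix[i])              # row total (assumed: predicted)
        actual = sum(row[i] for row in matrix)  # column total (assumed: actual)
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores
```

Under this reading, the list class is predicted only 33 times against 1043 actual list blocks, which explains its high precision and very low recall.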

Code

The Trainer class trains and evaluates the model. After training, the SVM model is saved as a gzip-compressed file.

Usage

Once training is finished, you can test the classification on a new PDF document by using either DocstrumBoundingBoxes or RecursiveXYCut to generate the text blocks, and then classifying each block. See SvmZoneClassifier for a demo implementation. The trained SVM model is available here.
