PdfPig SVM Region Classifier
Proof of concept of a simple Support Vector Machine Region Classifier using PdfPig and Accord.Net. The model was trained on a subset of the PubLayNet dataset. See their license here.
The objective is to use machine learning to classify each text block in a PDF document page as one of title, text, list, table or image.
The annotations from the dataset (see sample here) were converted to the PAGE XML format. See the PageXmlConverter to convert the JSON file into PAGE XML files. Images from the dataset were not used: the features are computed directly from the PDF documents, so you will need to download the PDF documents separately.
Labels
Following the PubLayNet methodology, these categories are available:
Label | id (svm) |
---|---|
title | 0 |
text | 1 |
list | 2 |
table | 3 |
image | 4 |
Features
Text
- Character count
- Percentage of numeric characters
- Percentage of alphabetical characters
- Percentage of symbolic characters
- Percentage of bullet characters
- Average difference between the block's glyph heights and the page's average glyph height
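The text features listed above can be sketched as follows. This is a minimal illustration in Python (the project itself is C#), not the FeatureHelper implementation. The bullet character set, the choice to ignore whitespace, the mutually exclusive categories, and all names are assumptions. Percentages are returned as fractions between 0 and 1.

```python
# Assumed set of bullet glyphs; the project's actual list may differ.
BULLET_CHARS = frozenset("•‣◦▪●")


def text_features(text, glyph_heights, page_avg_glyph_height):
    """Return illustrative text features for one block (hypothetical helper)."""
    # Assumption: whitespace is not counted as a character.
    chars = [c for c in text if not c.isspace()]
    n = len(chars)

    def share(predicate):
        # Fraction of characters satisfying the predicate (0.0 for empty blocks).
        return sum(1 for c in chars if predicate(c)) / n if n else 0.0

    def is_symbol(c):
        # Assumption: a symbol is anything that is not a digit, letter or bullet.
        return not (c.isdigit() or c.isalpha() or c in BULLET_CHARS)

    # Signed difference of each glyph height to the page-level average.
    deltas = [h - page_avg_glyph_height for h in glyph_heights]

    return {
        "char_count": n,
        "pct_numeric": share(str.isdigit),
        "pct_alpha": share(str.isalpha),
        "pct_symbol": share(is_symbol),
        "pct_bullet": share(lambda c: c in BULLET_CHARS),
        "avg_height_delta": sum(deltas) / len(deltas) if deltas else 0.0,
    }
```

For example, `text_features("• Item 12", [10.0, 12.0], 10.0)` counts seven non-whitespace characters, of which one is a bullet, four are letters and two are digits.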
Paths
- Path count
- Percentage of Bezier curve paths
- Percentage of horizontal paths
- Percentage of vertical paths
- Percentage of oblique paths
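A rough sketch of how the path features above could be derived, assuming each path is either a Bezier curve or a straight line given by its two endpoints. The `Line` and `Bezier` types, the `tol` tolerance and the feature names are hypothetical, and the project's C# code may classify paths differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    x1: float
    y1: float
    x2: float
    y2: float


@dataclass(frozen=True)
class Bezier:
    """Placeholder for a curve; control points are omitted in this sketch."""


def path_features(paths, tol=0.5):
    """Count paths and return the share of each kind (hypothetical helper)."""
    counts = {"bezier": 0, "horizontal": 0, "vertical": 0, "oblique": 0}
    for p in paths:
        if isinstance(p, Bezier):
            counts["bezier"] += 1
        elif abs(p.y2 - p.y1) <= tol:
            # Nearly constant y: treat as horizontal.
            counts["horizontal"] += 1
        elif abs(p.x2 - p.x1) <= tol:
            # Nearly constant x: treat as vertical.
            counts["vertical"] += 1
        else:
            counts["oblique"] += 1

    n = len(paths)
    features = {"path_count": n}
    for kind, count in counts.items():
        features[f"pct_{kind}"] = count / n if n else 0.0
    return features
```

A tolerance is used rather than exact equality because coordinates extracted from PDF content streams are floating-point values.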
Images
- Image count
- Average area covered by images
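One possible reading of the image features, sketched under assumptions: rectangles are `(left, bottom, right, top)` tuples, and "average area covered" is taken as the mean share of the block's area overlapped by each image. The source does not spell out the exact definition, so treat this as illustrative only.

```python
def rect_area(rect):
    left, bottom, right, top = rect
    return max(0.0, right - left) * max(0.0, top - bottom)


def intersection(a, b):
    # Overlapping rectangle of a and b (may be empty, giving zero area).
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def image_features(block, images):
    """Image count and mean covered share of the block (hypothetical helper)."""
    block_area = rect_area(block)
    if not images or block_area == 0:
        return {"image_count": len(images), "avg_image_area": 0.0}
    shares = [rect_area(intersection(block, img)) / block_area for img in images]
    return {"image_count": len(images), "avg_image_area": sum(shares) / len(shares)}
```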
Code
See the GenerateData class to generate a CSV file of features from the PDF documents and their respective PAGE XML ground truth (one XML document per page). See the FeatureHelper class to generate the feature vector for a block.
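To illustrate the kind of file this step produces, here is a minimal Python sketch that writes one row per block, with the feature values followed by the numeric label id from the Labels table. The column names and the `write_feature_csv` helper are hypothetical; the actual layout written by GenerateData may differ.

```python
import csv
import io

# Label ids as listed in the Labels table.
LABEL_IDS = {"title": 0, "text": 1, "list": 2, "table": 3, "image": 4}


def write_feature_csv(rows, out):
    """Write (feature_dict, label_name) pairs as CSV to a text stream."""
    rows = list(rows)
    if not rows:
        return
    # Assumption: every row carries the same feature names as the first one.
    fieldnames = list(rows[0][0]) + ["label"]
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for features, label in rows:
        writer.writerow({**features, "label": LABEL_IDS[label]})


buffer = io.StringIO()
write_feature_csv(
    [
        ({"char_count": 12, "image_count": 0}, "title"),
        ({"char_count": 340, "image_count": 0}, "text"),
    ],
    buffer,
)
csv_text = buffer.getvalue()
```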
Results (in-sample)
Accuracy
Model accuracy = 90.898%
Normalised confusion matrix
Confusion matrix
Rows are predicted labels and columns are actual labels.
| | title | text | list | table | image |
---|---|---|---|---|---|
title | 9312 | 1592 | 19 | 3 | 135 |
text | 1166 | 37136 | 988 | 820 | 32 |
list | 0 | 1 | 32 | 0 | 0 |
table | 0 | 16 | 4 | 1092 | 3 |
image | 0 | 0 | 0 | 0 | 154 |
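The reported accuracy can be recomputed from these counts. The Python sketch below reads rows as predicted labels and columns as actual labels, which is the orientation that reproduces the precision and recall figures in this document. Normalising each column by its actual-class total is an assumption about how the normalised matrix was produced.

```python
LABELS = ["title", "text", "list", "table", "image"]

# Counts copied from the confusion matrix: rows = predicted, columns = actual.
CONFUSION = [
    [9312, 1592, 19, 3, 135],
    [1166, 37136, 988, 820, 32],
    [0, 1, 32, 0, 0],
    [0, 16, 4, 1092, 3],
    [0, 0, 0, 0, 154],
]


def accuracy(matrix):
    # Correct predictions sit on the diagonal.
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def normalise_by_actual(matrix):
    # Divide each cell by its column total, so every actual class sums to 1.
    size = len(matrix)
    col_totals = [sum(row[j] for row in matrix) for j in range(size)]
    return [
        [row[j] / col_totals[j] if col_totals[j] else 0.0 for j in range(size)]
        for row in matrix
    ]


model_accuracy = accuracy(CONFUSION)
normalised = normalise_by_actual(CONFUSION)
```

With this normalisation, the diagonal of `normalised` holds the per-class recall.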
Precision, Recall and F1 score
Label | Precision | Recall | F1 score |
---|---|---|---|
title | 0.842 | 0.889 | 0.865 |
text | 0.925 | 0.958 | 0.941 |
list | 0.970 | 0.031 | 0.059 |
table | 0.979 | 0.570 | 0.721 |
image | 1.000 | 0.475 | 0.644 |
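The per-class scores above can be recomputed from the confusion matrix counts. This Python sketch assumes rows are predicted labels and columns are actual labels, so precision divides by the row total and recall by the column total. The function name is illustrative.

```python
LABELS = ["title", "text", "list", "table", "image"]

# Counts copied from the confusion matrix: rows = predicted, columns = actual.
CONFUSION = [
    [9312, 1592, 19, 3, 135],
    [1166, 37136, 988, 820, 32],
    [0, 1, 32, 0, 0],
    [0, 16, 4, 1092, 3],
    [0, 0, 0, 0, 154],
]


def per_class_scores(matrix, labels):
    """Return {label: (precision, recall, f1)} from a square confusion matrix."""
    scores = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        predicted = sum(matrix[i])                 # row total
        actual = sum(row[i] for row in matrix)     # column total
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


scores = per_class_scores(CONFUSION, LABELS)
```

The low recall for list, table and image comes from the actual-class totals: many blocks of those classes are predicted as title or text, even though the few blocks predicted as those classes are almost always correct.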
Code
See the Trainer class to train and evaluate the model.
After training, the SVM model is saved as a gzip-compressed file.
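As a language-neutral illustration of storing a trained model in gzip-compressed form, the Python sketch below serialises an object and compresses it in one step. It does not reproduce the Accord.Net serialisation used by the Trainer class: `pickle` and both function names are stand-ins chosen for this example.

```python
import gzip
import pickle


def save_model(model, path):
    # Serialise the object and gzip-compress it while writing.
    with gzip.open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    # Decompress and deserialise; only load files from a trusted source.
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)
```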
Usage
Once training is finished, you can test the classification on a new PDF document by using either DocstrumBoundingBoxes or RecursiveXYCut to generate the text blocks, and then classifying each block.
See SvmZoneClassifier for a demo implementation. The trained SVM model is available here.