Digital-born PDF Scanner

Genesis

Many of the PDF files that we have downloaded are digital-born, that is, they contain an easily accessible text layer that PDF viewers use to display text. Some are plainly scanned documents that have no text layer at all, and some are searchable OCR-processed scans that contain a lot of hidden text.

Since we want to tell these categories apart, we need a tool to detect them. Hence this tool.

Usage

Run the tool with java -jar:

java -jar digital-born-pdf-scanner-0.0.1-SNAPSHOT-jar-with-dependencies.jar

Invoking the jar without parameters produces the following output:

  Options:
    -f, --filename
      Filename to check in single file mode
    -d, --input-dir
      Input directory to look for PDF files
    -o, --output-file-name
      File to write results to. Supported extensions are *.tsv, *.csv
      Default: results.tsv
    -r, --recursive
      Whether to search for PDF files recursively
      Default: false
    --sort
      Whether to sort file name in results.
      Default: false
    -v, --verbose
      Whether to print processed file names.
      Default: false

There are two modes of operation: single-file mode (-f) and directory mode (-d).

The latter can be combined with a recursive directory scan (-r), which also searches subdirectories for PDF files.

The output is written to the TAB-separated file results.tsv unless a different file name is provided with -o. The separator depends on the file extension: TAB for *.tsv, semicolon for *.csv.
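For downstream analysis, the results file can be loaded with a few lines of Python. The sketch below is not part of the tool: the function name read_results is hypothetical, and it assumes the file starts with a header row carrying the column names and is encoded as UTF-8.

```python
import csv
from pathlib import Path


def read_results(path):
    """Read scanner output into a list of dicts keyed by column name.

    Assumptions (not confirmed by the tool): the first row is a header
    and the file is UTF-8 encoded.
    """
    path = Path(path)
    # *.csv output is semicolon-separated; everything else is treated as TAB-separated.
    delimiter = ";" if path.suffix.lower() == ".csv" else "\t"
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))
```

Calling read_results("results.tsv") would then return one dictionary per scanned PDF file, with all values as strings.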

Handling error log

Since failed files can produce a lot of error messages, it makes sense to redirect the error log to a file, for example:

java -jar digital-born-pdf-scanner-0.0.1-SNAPSHOT-jar-with-dependencies.jar -d dir-with-pdfs -r --sort -v 2> error.log

Tracking progress

Currently the only way to track processing progress is to enable verbose output (-v) and combine it with file sorting (--sort). This produces color output showing which files were processed successfully and which failed. See the example above.

Output interpretation

The output file consists of the following columns:

| Column Name | Description |
| --- | --- |
| File Name | Path to PDF file |
| Has Hidden Text | Is hidden text present in a document |
| Visible Text Len | Length of visible text in a document |
| Hidden Text Len | Length of hidden text in a document |
| Creator | Name of a software that created a document (if any) |
| Producer | Name of a software library used to produce a document (if any) |
| Page Count | Number of pages in a document |
| Max Covered Area Ratio | Maximal ratio of the largest image area to page area (often greater than 1...) |
| Avg Covered Area Ratio | Average ratio of the largest image area to page area (often greater than 1...) |
| Image Count | Number of images in a document |
| Object Count | Number of objects in a document |
| PDF Version | Version of PDF standard |
| Has Outlines | Whether document contains outlines |
| Is Tagged | Does document have tag structure |
| Lang | Content language |
| Conformance Level | Document conformance level |
| Has Page Labels | Whether there are page labels in a document |
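To make the two covered-area columns concrete, here is a minimal Python sketch of one plausible reading of their definitions: for each page, divide the area of the largest image by the page area, then report the maximum and the mean over all pages. The function name, the input shape, and the choice that a page without images contributes 0 are all illustrative assumptions; the tool's actual computation may differ.

```python
def covered_area_ratios(pages):
    """Return (max_ratio, avg_ratio) over all pages.

    pages: list of (page_width, page_height, images), where images is a
    list of (image_width, image_height) tuples in the same units.
    This input shape is hypothetical and not taken from the tool.
    """
    ratios = []
    for page_w, page_h, images in pages:
        page_area = page_w * page_h
        # Area of the largest image on the page; 0 if the page has no images.
        largest = max((w * h for w, h in images), default=0.0)
        ratios.append(largest / page_area)
    if not ratios:
        return 0.0, 0.0
    return max(ratios), sum(ratios) / len(ratios)
```

In this sketch a ratio above 1 appears whenever an image is larger than the page it is placed on.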

At the time of writing, the tool does not tell you whether a document is scanned, searchable scanned, or digital-born. However, certain heuristics can be deduced from the output.

The question is what counts as "a lot of text". We have to check the data to know.
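As a starting point for such experiments, the three categories described under Genesis can be turned into a small Python rule set over one row of the output. This is only a sketch, not the tool's logic: the function name classify, the rule order, and the default threshold of 1000 characters are placeholders, and it assumes the two text length columns hold plain integers.

```python
def classify(row, min_text_len=1000):
    """Guess a document category from one output row (a dict of strings).

    min_text_len is a hypothetical stand-in for "a lot of text" and
    should be tuned against real data.
    """
    visible = int(row["Visible Text Len"])
    hidden = int(row["Hidden Text Len"])
    if visible == 0 and hidden == 0:
        # No text layer at all: most likely a plain scan.
        return "scanned"
    if hidden >= min_text_len and hidden > visible:
        # Mostly hidden text: most likely an OCR-processed scan.
        return "searchable scan"
    if visible >= min_text_len:
        # Plenty of visible text: most likely digital-born.
        return "digital-born"
    return "undetermined"
```

Running classify over all rows with several values of min_text_len, and checking a sample of files by hand, is one way to find out what "a lot of text" should mean for a given collection.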