bit.ly/parsing-prickly-pdfs

Resources and worksheet for the NICAR 2016 workshop of the same name. Instructors: Jacob Fenton (jsfenfen@gmail.com) and Jeremy Singer-Vine (jsvine@gmail.com).

What We'll Cover

The One Weird Thing You Need To Know About PDFs

Optical Character Recognition

Optical Character Recognition (OCR) tools try to extract text from image-based PDFs. Depending on the software, the tool may either (a) convert your image-based PDF into an embedded-text PDF, or (b) simply output the text it has found. Some popular options:

GUI (Free)

GUI (Paid)

Command-Line

Extracting Structured Data From PDFs

Tabula

Tabula helps you extract data tables from PDFs. It's free and open-source, and uses an intuitive graphical interface.

pdftotext

pdftotext is the simplest way to convert a PDF to raw text.
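A minimal sketch of driving pdftotext from Python, assuming the Poppler tools are installed and on your PATH. The helper names are made up for illustration; -layout, -f and -l are pdftotext's own options for keeping the physical layout and choosing the first and last page.

```python
import subprocess


def pdftotext_command(pdf_path, first_page=None, last_page=None, layout=True):
    """Build the argument list for a pdftotext call that writes to stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # keep columns roughly where they sit on the page
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    # "-" as the output file name tells pdftotext to print to stdout
    cmd += [pdf_path, "-"]
    return cmd


def pdf_to_text(pdf_path, **options):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        pdftotext_command(pdf_path, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Usage (hypothetical file name):
#   text = pdf_to_text("report.pdf", first_page=1, last_page=2)
```

Keeping -layout on is usually the better default for tables, since the column spacing in the output is what later regexes or fixed-width parsing will key on.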

pdftohtml

pdftohtml comes from the same Poppler suite of tools as pdftotext.

tabula-java

tabula-java is the Java library underpinning Tabula, and a command-line tool that lets you automate table extraction.

PDFPlumber

PDFPlumber is a Python library and command-line tool for extracting information from PDFs. Both tools provide granular information about each character, rectangle, and line. The library also has Tabula-style features for extracting tables and text.
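A rough sketch of the library side, assuming a recent pdfplumber release installed with pip. pdfplumber.open, pdf.pages and page.extract_words are pdfplumber calls; the line-grouping helper, its tolerance value and the function names are illustrative assumptions, not part of the library.

```python
def words_by_line(words, tolerance=3):
    """Group word dicts (keys "text", "x0", "top") into lines of text.

    Words whose "top" is within `tolerance` points of the first word of the
    current line are treated as belonging to that line.
    """
    lines = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1]["top"]) <= tolerance:
            lines[-1]["words"].append(word)
        else:
            lines.append({"top": word["top"], "words": [word]})
    return [
        " ".join(w["text"] for w in sorted(line["words"], key=lambda w: w["x0"]))
        for line in lines
    ]


def first_page_lines(pdf_path):
    """Extract the words on page one of a PDF and return them line by line."""
    import pdfplumber  # third-party; imported here so the helper above works without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[0]
        # extract_words() returns dicts that include text, x0, x1, top, bottom
        return words_by_line(page.extract_words())
```

The grouping helper only needs plain dicts, so the same function can be reused on word boxes that come from somewhere other than pdfplumber.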

Structured information from Tesseract

Preprocessing

This is the simplified version: see full details in the examples/WFLX dir.

Tesseract operates on image files, so you'll need to convert PDFs to images first. The simplest way is probably to use ImageMagick. For installation see here.

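For examples/WFLX/sample_contract.pdf, convert the first page with: convert -density 300 ./sample_contract.pdf[1] ./sample_contract_p1.png ; the result is here. (ImageMagick counts pages from zero, so if the output turns out to be the PDF's second page, use [0] instead.)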

Get structured info from that file

Pay attention to Tesseract versions. 3.04 is current; 3.01 is needed for bounding-box output.

Run tesseract with this one-line config file, which tells it to output files in the hOCR format: tesseract sample_contract_p1.png p1_hocr ./configfile
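For more than one page, the two commands above can be scripted. This is only a sketch: the function names and the output naming scheme are assumptions, and the config file is assumed to hold the single line tessedit_create_hocr 1 (check the linked file rather than trusting this).

```python
import subprocess
from pathlib import Path


def page_commands(pdf_path, page_index, config_path="./configfile"):
    """Return the (convert, tesseract) argument lists for one PDF page.

    page_index is passed straight to ImageMagick's [n] selector, which is
    zero-based.
    """
    pdf = Path(pdf_path)
    png = pdf.with_name(f"{pdf.stem}_p{page_index}.png")
    hocr_base = pdf.with_name(f"p{page_index}_hocr")  # tesseract appends the extension
    convert_cmd = ["convert", "-density", "300", f"{pdf}[{page_index}]", str(png)]
    tesseract_cmd = ["tesseract", str(png), str(hocr_base), config_path]
    return convert_cmd, tesseract_cmd


def ocr_pages(pdf_path, page_indexes, config_path="./configfile"):
    """Rasterize and OCR each requested page, stopping on the first failure."""
    for index in page_indexes:
        for cmd in page_commands(pdf_path, index, config_path):
            subprocess.run(cmd, check=True)


# Usage: ocr_pages("sample_contract.pdf", range(0, 3))
```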

There's a hack here to convert the resulting file examples/WFLX/p1_hocr.hocr to CSV.

$ python convert_hocr.py examples/WFLX/p1_hocr.hocr examples/WFLX/p1_hocr.csv

The output will be examples/WFLX/p1_hocr.csv, which has word-level bounding boxes.
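To give a sense of what such a conversion involves, here is a standard-library sketch. It is not the convert_hocr.py script referenced above. It assumes the hOCR marks each word as a span with class ocrx_word and a title attribute of the form "bbox x0 y0 x1 y1; ...", which is what Tesseract's hOCR output generally looks like; the column names are my own.

```python
import csv
import io
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, x0, y0, x1, y1) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span we are currently inside
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(n) for n in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans contain no nested <span>; <strong>/<em> are fine.
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text,) + self._bbox)
            self._bbox = None


def hocr_to_rows(hocr_text):
    """Parse hOCR markup and return a list of word tuples."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


def rows_to_csv(rows):
    """Serialize word tuples to CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["text", "x0", "y0", "x1", "y1"])
    writer.writerows(rows)
    return out.getvalue()
```

hOCR coordinates are in image pixels with the origin at the top-left corner, so y grows downward.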

This is starting to look a lot like a 'regular' PDF

With a little bit of tooling, we've managed to make image-based PDFs look like regular PDFs: a CSV (or JSON) file of words and their bounding boxes. This is pretty significant because we can use the same strategy to parse image-based PDFs as text-based PDFs, with a few caveats:

  1. The font / fontsize information isn't available
  2. Lower text quality requires more fuzzy matching
  3. Alignment / image quality issues loom larger
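The second caveat can be sketched with Python's difflib. The row format (dicts with a "text" key), the function name and the 0.8 threshold are assumptions to tune against your own documents.

```python
from difflib import SequenceMatcher


def fuzzy_find(rows, target, threshold=0.8):
    """Return (row, score) pairs for words that roughly match `target`.

    OCR often swaps look-alike characters (0 for o, 1 for l), so an exact
    string comparison would miss words a reader would call identical.
    """
    target = target.lower()
    hits = []
    for row in rows:
        score = SequenceMatcher(None, row["text"].lower(), target).ratio()
        if score >= threshold:
            hits.append((row, score))
    return hits
```

A lower threshold catches more OCR damage but also more false positives, so it helps to eyeball the matches on a few pages before trusting it in bulk.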

The longer view: making it visual

What do you do with bounding box data? You can stare at it in Excel (I have), but it's easier to understand visually. This is a bit of a larger project I'm working on that just reads the .pdf and .hocr files.

Link to stripped down viewer.

VAPORWARE ALERT

That viewer is just part of a project I'm doing to allow the extraction of structured data from repetitive PDFs. There are always limits to this stuff, but you can be smart about handling them.

The viewer just shows a single document, but a web-app backed version of this has a database, so it can show you every position that the word 'contract' appears on a page, or every page where the word 'contract' appears at the top in the middle.
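The "top in the middle" kind of query can be approximated without a database. This sketch is not the project's code: it assumes rows are dicts with page, text, x0, y0, x1 and y1 keys in hOCR-style pixel coordinates (origin at top-left), and it defines "top middle" as the top third and middle third of the page, which is an arbitrary choice.

```python
def in_top_middle(row, page_width, page_height):
    """True if the word's center lies in the top third and middle third of the page."""
    center_x = (row["x0"] + row["x1"]) / 2
    center_y = (row["y0"] + row["y1"]) / 2
    in_middle_column = page_width / 3 <= center_x <= 2 * page_width / 3
    in_top_band = center_y <= page_height / 3
    return in_middle_column and in_top_band


def pages_with_word_at_top_middle(rows, word, page_width, page_height):
    """Return the sorted page numbers where `word` appears at the top middle."""
    wanted = word.lower()
    return sorted({
        row["page"]
        for row in rows
        if row["text"].lower() == wanted
        and in_top_middle(row, page_width, page_height)
    })
```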

If this is something that's of interest to you, sign up here to get notified about this project: bit.ly/whatwordwhere -- I'll be beta testing in a month or so.

Splitting, Merging, and Rotating PDFs

It's easy enough to split/merge/rotate documents in your favorite PDF viewer. But when you've got a giant batch of PDFs, or want to perform a complex manipulation, command-line tools are handy. A couple of free tools and libraries:

Coherent PDF / cpdf

PDFtk
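A small sketch of batch use, assuming pdftk is installed and on your PATH. The function names are illustrative; the cat and output keywords, page ranges such as 1-3, and rotation suffixes such as 1-endeast are pdftk syntax (the compass-word suffixes need a reasonably recent pdftk).

```python
import subprocess


def merge_command(inputs, output):
    """Concatenate several PDFs, in the order given, into one file."""
    return ["pdftk", *inputs, "cat", "output", output]


def extract_pages_command(pdf_path, first, last, output):
    """Copy an inclusive, 1-based page range into a new PDF."""
    return ["pdftk", pdf_path, "cat", f"{first}-{last}", "output", output]


def rotate_all_command(pdf_path, output, direction="east"):
    """Rotate every page; "east" means 90 degrees clockwise."""
    return ["pdftk", pdf_path, "cat", f"1-end{direction}", "output", output]


def run(cmd):
    """Execute a pdftk command, raising if it exits with an error."""
    subprocess.run(cmd, check=True)


# Usage (hypothetical file names):
#   run(merge_command(["part1.pdf", "part2.pdf"], "merged.pdf"))
```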

Additional Resources

More REGEXes:

Freefcc includes PDF parsers for about 14 broadcast TV stations. One way to do it is lotsa regexes.
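A toy illustration of the regex approach, not taken from Freefcc. The "Label: $amount" line format and the field names are invented; real station contracts will need their own patterns, usually one set per layout.

```python
import re
from decimal import Decimal

# Matches lines such as "  Gross Amount:      $12,345.67" (hypothetical layout)
AMOUNT_RE = re.compile(
    r"^\s*(?P<label>[A-Za-z][A-Za-z ]*?)\s*:\s*\$(?P<amount>[\d,]+\.\d{2})\s*$"
)


def parse_amount_lines(text):
    """Return {label: Decimal amount} for every line that fits the pattern."""
    results = {}
    for line in text.splitlines():
        match = AMOUNT_RE.match(line)
        if match:
            amount = Decimal(match.group("amount").replace(",", ""))
            results[match.group("label")] = amount
    return results
```

Lines that don't fit the pattern are silently skipped here; in practice it is worth logging them, since a skipped line is often the first sign that a station changed its template.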

Sunlightlabs' U.S. House Disbursements parser; Senate disbursements parser. Sample page

PDF liberation hackathon

Yes, there actually was one. github, web

Handwritten forms

Handwriting is a whole different beast. ABBYY FineReader does OK with it, but don't count on it for accuracy.

Academics

You may hear that OCR is a 'solved' problem in the computer science domain (although there are some implementation details that can be improved).

There is also work being done in 'layout analysis', and groups like UMD's language and media processing lab do weird research into documents. Papers on the sexy stuff in this field are showcased yearly at the International Conference on Document Analysis and Recognition (in Johannesburg for 2016). UMD's David Doermann may now be at DARPA.

Tesseract bindings

Check these out; YMMV.

https://github.com/jflesch/pyocr

https://github.com/meh/ruby-tesseract-ocr

https://pypi.python.org/pypi/pytesseract

Node: https://github.com/creatale/node-dv which is supposed to work with this form extraction thing: https://github.com/creatale/node-fv

Learn More About the PDF Format