bit.ly/parsing-prickly-pdfs
Resources and worksheet for the NICAR 2016 workshop of the same name. Instructors: Jacob Fenton (jsfenfen@gmail.com) and Jeremy Singer-Vine (jsvine@gmail.com).
What We'll Cover
- The One Weird Thing You Need To Know About PDFs
- Optical Character Recognition
- Extracting Structured Data From PDFs
- Splitting, Merging, and Rotating PDFs
- Additional Resources
The One Weird Thing You Need To Know About PDFs
It's a print format: the point is to tell a printer what things to print where. There's not much more order than that. But for our purposes we'll talk about three types of PDFs:
- "Text-based PDFs": The document has characters, fonts and font sizes, and information about where to place them stored in the document in a format that (a program) can read. Created by programs that can save to PDF directly. Dragging the mouse will highlight text.
- "Image-based PDFs": The document has images of pages, but not words themselves. Typically the result of scanning. You can't highlight text.
- "Embedded-text PDFs": The document has images of pages, but there's invisible text 'attached' to the document, so you can select text. Typically created by scanners that also run OCR. Sticky wicket: should you assume the attached text is better than what you'd get by running Tesseract? Not necessarily (but it probably is...)
Optical Character Recognition
Optical Character Recognition (OCR) tools try to extract text from image-based PDFs. Depending on the software, the tool may either (a) convert your image-based PDF into an embedded-text PDF, or (b) simply output the text it has found. Some popular options:
GUI (Free)
- DocumentCloud (if you're willing to wait in queue)
- Overview will now give you searchable pdfs back.
- CometDocs (web-based, freemium)
GUI (Paid)
- ABBYY FineReader ($120 Mac / $170 Windows). Here's a scanned doc released by an Oregon county gov; it's much better with ABBYY.
- Adobe Acrobat DC (free trial / $15/mo / $450 purchase)
- Cogniview PDF2XL: Windows only. Free trial. OCR edition: $299; "Enterprise": $399.
- Datawatch Monarch: Specialty tool for structured data extraction; dunno if OCR is included. Free "personal" edition with 'limited imports and exports'; 30-day free trial for enterprise edition. They list a Classic Edition ($895 per user per year) and a Complete Edition ($1,595 per user per year).
- ABBYY FlexiCapture (~$24K). Price varies by country; we believe ABBYY charges less in other countries, and there's an industry of document conversion services in other countries.
Command-Line
- Tesseract
  - Description: Tesseract is free / open source OCR software originally developed by HP, now by Google. Support for many languages is available. In general the quality is lower than paid OCR, but it's great for processing giant volumes of documents when desktop processing isn't feasible.
  - Installation: See homepage. On Macs, you can use `brew install tesseract`; on Ubuntu, try `sudo apt-get install tesseract-ocr tesseract-ocr-eng`.
  - Example usage:
    - `tesseract imagefile.png convertedimage` will extract English text from imagefile.png and write it to convertedimage.txt.
    - `tesseract imagefile.png convertedimage /path/to/configfile` will run the conversion with options specified in the configfile; one useful option (see simple configfile) is making it output in the hOCR format, which includes bounding boxes.
    - `tesseract imagefile.png image_pdf_with_embedded_text pdf` will OCR imagefile.png and create a new PDF file called image_pdf_with_embedded_text.pdf that has text embedded in it (aka it's a searchable PDF). This is only available in v 3.03 and higher.
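For giant volumes of documents you'll want to script those commands. Here's a minimal Python sketch, assuming a folder of .png page images; the function names and the folder layout are made up for illustration, and only the `tesseract` arguments shown in the examples above are used.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, output_base, searchable_pdf=False):
    """Build the argument list for one tesseract run.

    Tesseract adds the extension itself (.txt or .pdf), so output_base
    is passed without one.
    """
    command = ["tesseract", str(image_path), str(output_base)]
    if searchable_pdf:
        # The trailing "pdf" argument asks for a searchable PDF
        # (tesseract 3.03 or higher).
        command.append("pdf")
    return command


def ocr_directory(image_dir, output_dir, searchable_pdf=False):
    """Run tesseract on every .png in image_dir (hypothetical layout)."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for image_path in sorted(Path(image_dir).glob("*.png")):
        output_base = output_dir / image_path.stem
        # check=True stops the batch on the first failed page.
        subprocess.run(
            tesseract_command(image_path, output_base, searchable_pdf),
            check=True,
        )
```

Calling `ocr_directory("scans", "ocr_output", searchable_pdf=True)` would write one searchable PDF per page image, assuming `tesseract` is on your PATH.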
Extracting Structured Data From PDFs
Tabula
Tabula helps you extract data tables from PDFs. It's free and open-source, and uses an intuitive graphical interface.
pdftotext
`pdftotext` is the simplest way to convert a PDF to raw text.
- Installation: `brew install poppler` (OSX) / `sudo apt-get install poppler-utils` (Ubuntu)
- Tips:
  - It can help to use `-f [first page]` and `-l [last page]` to split a long document into more readable pieces (which can be parsed in order later).
  - It's usually best to use the `-layout` flag to make `pdftotext` use whitespace to approximate the document's physical layout. This does a decent job of approximating linebreaks, though multi-column text can mess things up.
  - In combination with regular expressions, you can parse surprisingly complex documents.
  - The `-bbox` option spits out word-level bounding box information, but not fonts. This is a quick-and-dirty way to pull this info for scanned documents (where the fonts probably aren't that reliable anyways).
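To make the regex tip concrete, here's a rough Python sketch. The sample text, column names, and pattern are invented for illustration; real documents will need their own patterns. It assumes `-layout` output where columns are separated by runs of two or more spaces.

```python
import re
import subprocess


def pdftotext_layout(pdf_path, first_page, last_page):
    """Return the -layout text for a page range ("-" sends output to stdout)."""
    command = [
        "pdftotext", "-layout",
        "-f", str(first_page), "-l", str(last_page),
        str(pdf_path), "-",
    ]
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout


# Hypothetical excerpt of -layout output; not taken from a real document.
SAMPLE_LAYOUT_TEXT = """\
PAYEE                         DATE          AMOUNT
ACME OFFICE SUPPLY            01/15/2016    1,204.50
SMITH, JOHN Q                 01/19/2016       87.00
Page 1 of 12
"""

ROW_PATTERN = re.compile(
    r"^(?P<payee>\S.*?)\s{2,}"
    r"(?P<date>\d{2}/\d{2}/\d{4})\s{2,}"
    r"(?P<amount>[\d,]+\.\d{2})\s*$"
)


def parse_rows(layout_text):
    """Keep only lines that look like data rows; skip headers and footers."""
    rows = []
    for line in layout_text.splitlines():
        match = ROW_PATTERN.match(line)
        if match is None:
            continue
        row = match.groupdict()
        row["amount"] = float(row["amount"].replace(",", ""))
        rows.append(row)
    return rows
```

`parse_rows(SAMPLE_LAYOUT_TEXT)` keeps the two data lines and drops the header and the page footer. On a real file you'd feed it the output of `pdftotext_layout(...)` instead.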
pdftohtml
`pdftohtml` comes from the same Poppler suite of tools as `pdftotext`.
- Homepage
- Installation: `brew install poppler` (OSX) / `sudo apt-get install poppler-utils` (Ubuntu)
- Tips:
  - Use the `-xml` flag to include location information for text blocks (i.e. the number of pixels to the top of the page and to the left of the page; the width and height of the text box; the font face and size can be figured out from the styling). This can be especially helpful when each cell of a chart is represented as a text block.
  - You can parse the result like you'd parse any other HTML/XML document. (Example.)
  - The `-c` option creates .html pages, one for each page, which can come in handy for dead-simple display. It's not quite as useful as the `-xml` output for analysis; it includes the distance to the top and left of the page, but omits the text block width and height.
- Real-world example
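As one way to parse the `-xml` output, here's a small Python sketch using only the standard library. The sample XML is a hand-written fragment in the shape of `pdftohtml -xml` output (values invented), and the row-grouping tolerance is an assumption you'd tune per document.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like `pdftohtml -xml` output.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="60" height="14" font="0">Name</text>
<text top="100" left="300" width="70" height="14" font="0">Amount</text>
<text top="120" left="50" width="90" height="14" font="0"><b>Jane Doe</b></text>
<text top="121" left="300" width="50" height="14" font="0">42.00</text>
</page>
</pdf2xml>"""


def text_blocks(xml_string):
    """Yield one dict per <text> element, with its position and size."""
    root = ET.fromstring(xml_string)
    for page in root.iter("page"):
        for element in page.iter("text"):
            yield {
                "page": int(page.get("number")),
                "top": int(element.get("top")),
                "left": int(element.get("left")),
                "width": int(element.get("width")),
                "height": int(element.get("height")),
                # itertext() also picks up text nested in <b> or <i> tags.
                "text": "".join(element.itertext()).strip(),
            }


def group_into_rows(blocks, tolerance=3):
    """Group blocks whose top coordinates are within `tolerance` pixels."""
    rows = []
    for block in sorted(blocks, key=lambda b: (b["page"], b["top"], b["left"])):
        last = rows[-1] if rows else None
        same_row = (
            last is not None
            and last[0]["page"] == block["page"]
            and abs(last[0]["top"] - block["top"]) <= tolerance
        )
        if same_row:
            last.append(block)
        else:
            rows.append([block])
    # Order each row left to right and keep only the text.
    return [[b["text"] for b in sorted(row, key=lambda b: b["left"])] for row in rows]
```

Here `group_into_rows(text_blocks(SAMPLE_XML))` treats blocks at top=120 and top=121 as the same table row, which is the kind of slack slightly misaligned cells tend to need.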
tabula-java
`tabula-java` is the Java library underpinning Tabula, and a command-line tool that lets you automate table extraction.
- Homepage
- Requirements: `java` and `mvn`
- Installation and usage
PDFPlumber
PDFPlumber is a Python library and command-line tool for extracting information from PDFs. Both the library and the command-line tool provide granular information about each character, rectangle, and line. The library also has Tabula-style features for extracting tables and text.
Structured information from Tesseract
Preprocessing
This is the simplified version: see full details in the examples/WFLX dir.
Tesseract operates on image files, so you'll need to convert PDFs to images first. The simplest way is probably to use ImageMagick. For installation see here.
For examples/WFLX/sample_contract.pdf, convert the first page with: `convert -density 300 ./sample_contract.pdf[0] ./sample_contract_p1.png` (ImageMagick numbers pages from zero, so `[0]` selects the first page); the result is here.
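ImageMagick's `[n]` page selector counts from zero, which is easy to trip over when your file names count from one. A tiny Python sketch that keeps the two straight (the function name is made up; the `convert` arguments are the ones used above):

```python
import subprocess


def convert_page_command(pdf_path, page_number, png_path, density=300):
    """Build an ImageMagick command for one page; page_number counts from 1."""
    if page_number < 1:
        raise ValueError("page_number counts from 1")
    # ImageMagick's [n] selector is zero-based, so page 1 becomes [0].
    source = "{}[{}]".format(pdf_path, page_number - 1)
    return ["convert", "-density", str(density), source, str(png_path)]


def convert_page(pdf_path, page_number, png_path, density=300):
    """Run the conversion; assumes ImageMagick's convert is on the PATH."""
    subprocess.run(
        convert_page_command(pdf_path, page_number, png_path, density),
        check=True,
    )
```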
Get structured info from that file
Pay attention to tesseract versions. 3.04 is current; 3.01 is needed for bounding box stuff.
Run tesseract with this one-line config file, which tells it to output files in the hOCR format: `tesseract sample_contract_p1.png p1_hocr ./configfile`.
There's a hack here to convert the resulting file examples/WFLX/p1_hocr.hocr to csv.
$ python convert_hocr.py examples/WFLX/p1_hocr.hocr examples/WFLX/p1_hocr.csv
The output will be examples/WFLX/p1_hocr.csv which has word level bounding boxes.
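If you're curious what such a conversion involves, here's a standalone Python sketch. It is not the repo's convert_hocr.py; it's a simplified illustration that assumes Tesseract-style hOCR, where each word is a span with class `ocrx_word` and a `title` attribute containing `bbox x0 y0 x1 y1`. The sample markup is hand-written.

```python
import csv
import io
import re
from html.parser import HTMLParser

BBOX_PATTERN = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

# Hypothetical hOCR fragment; values are invented.
SAMPLE_HOCR = """
<div class='ocr_page' id='page_1' title='bbox 0 0 2550 3300; ppageno 0'>
 <span class='ocr_line' id='line_1_1' title='bbox 300 200 900 250'>
  <span class='ocrx_word' id='word_1_1' title='bbox 300 200 520 250; x_wconf 91'>STATION</span>
  <span class='ocrx_word' id='word_1_2' title='bbox 540 200 900 250; x_wconf 88'><strong>CONTRACT</strong></span>
 </span>
</div>
"""


class HocrWordParser(HTMLParser):
    """Collect (text, x0, y0, x1, y1) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span we're inside, if any
        self._text = []
        self._depth = 0    # tags opened inside the current word span

    def handle_starttag(self, tag, attrs):
        if self._bbox is not None:
            self._depth += 1  # e.g. <strong> or <em> inside a word
            return
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_PATTERN.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(value) for value in match.groups())
                self._text = []
                self._depth = 0

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        text = "".join(self._text).strip()
        if text:
            self.words.append((text,) + self._bbox)
        self._bbox = None


def hocr_to_csv(hocr_text):
    """Return CSV text with one row per word and its bounding box."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["word", "x0", "y0", "x1", "y1"])
    writer.writerows(parser.words)
    return buffer.getvalue()
```

Coordinates are in pixels of the image you fed to Tesseract, so they depend on the `-density` you used when converting the PDF.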
This is starting to look a lot like a 'regular' PDF
With a little bit of tooling, we've managed to make image-based PDFs look like regular PDFs: a csv (or json file) of words and their bounding boxes. This is pretty significant because we can use the same strategy to parse image-based pdfs as text-based PDFs, with a few caveats:
- The font / fontsize information isn't available
- Lower text quality requires more fuzzy matching
- Alignment / image quality issues loom larger
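On the fuzzy matching point: Python's standard library gets you surprisingly far. A minimal sketch using `difflib` (the function name and the 0.8 cutoff are arbitrary choices for illustration, not a recommendation):

```python
import difflib


def find_fuzzy(target, ocr_words, cutoff=0.8):
    """Return OCR'd words that approximately match target, best match first."""
    # Compare case-insensitively but return the words as they were OCR'd.
    by_lowercase = {}
    for word in ocr_words:
        by_lowercase.setdefault(word.lower(), word)
    matches = difflib.get_close_matches(
        target.lower(), list(by_lowercase), n=5, cutoff=cutoff
    )
    return [by_lowercase[match] for match in matches]
```

For example, `find_fuzzy("contract", ["C0NTRACT", "STATION", "AGREEMENT"])` picks up the word where OCR read the letter O as a zero. Loosen the cutoff and you'll start matching real but different words, so check the results by eye.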
The longer view: making it visual
What do you do with bounding box data? You can stare at it in Excel (I have), but it's easier to understand visually. This is a bit of a larger project I'm working on that just reads the .pdf and .hocr files.
Link to stripped down viewer.
VAPORWARE ALERT
That viewer is just part of a project I'm doing to allow the extraction of structured data from repetitive pdfs. There are always limits to this stuff, but you can be smart about handling them.
The viewer just shows a single document, but a web-app backed version of this has a database, so it can show you every position that the word 'contract' appears on a page, or every page where the word 'contract' appears at the top in the middle.
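To give a feel for the "at the top in the middle" kind of query, here's a toy Python sketch over word bounding boxes. It is not code from the project; the dict keys, the sample words, and the region fractions are all assumptions made for illustration.

```python
def words_in_region(words, page_width, page_height, x_range, y_range):
    """Return words whose bounding-box center falls inside a page region.

    x_range and y_range are fractions of the page, e.g. (1 / 3, 2 / 3).
    """
    hits = []
    for word in words:
        center_x = (word["x0"] + word["x1"]) / 2 / page_width
        center_y = (word["y0"] + word["y1"]) / 2 / page_height
        inside_x = x_range[0] <= center_x <= x_range[1]
        inside_y = y_range[0] <= center_y <= y_range[1]
        if inside_x and inside_y:
            hits.append(word)
    return hits


# Hypothetical words from a 2550 x 3300 pixel page image.
SAMPLE_WORDS = [
    {"text": "CONTRACT", "x0": 1100, "y0": 200, "x1": 1450, "y1": 260},
    {"text": "contract", "x0": 300, "y0": 2000, "x1": 520, "y1": 2040},
]

# Middle third horizontally, top quarter vertically.
top_middle = words_in_region(
    SAMPLE_WORDS, 2550, 3300, x_range=(1 / 3, 2 / 3), y_range=(0, 0.25)
)
```

Only the first sample word lands in `top_middle`; the second is on the left side, well down the page.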
If this is something that's of interest to you, sign up here to get notified about this project: bit.ly/whatwordwhere -- I'll be beta testing in a month or so.
Splitting, Merging, and Rotating PDFs
It's easy enough to split/merge/rotate documents in your favorite PDF viewer. But when you've got a giant batch of PDFs, or want to perform a complex manipulation, command-line tools are handy. A couple of free tools and libraries:
Coherent PDF / cpdf
- Homepage
- User guide
- Example usage:
  - `cpdf -split original.pdf -o original-split-%%%.pdf -chunk 10`. Splits original.pdf into 10-page chunks, titled original-split-000.pdf, original-split-001.pdf, and so on.
  - `cpdf -merge original-split-*.pdf -o original-merged.pdf`. Rejoins all PDFs matching the pattern original-split-*.pdf into a single file.
  - `cpdf -rotateby 90 original.pdf 2-5,12-15 -o original-rotated.pdf`. Rotates pages 2-5 and 12-15 by 90 degrees clockwise.
PDFtk
- Homepage
- User guide. On Mac OS 10.11 see here.
- Example usage:
  - `pdftk original.pdf burst output original-split-%03d.pdf`. Splits original.pdf into single-page PDFs, titled original-split-001.pdf, original-split-002.pdf, and so on.
  - `pdftk original-split-*.pdf cat output original-merged.pdf`. Rejoins all PDFs matching the pattern original-split-*.pdf into a single file.
  - `pdftk original.pdf cat 1 2-5right 6-11 12-15right 16-end output original-rotated.pdf`. Rotates pages 2-5 and 12-15 by 90 degrees clockwise.
  - `pdftk infile.pdf cat 1 output output_p1.pdf`. Puts just the first page in output_p1.pdf. Use `cat 2-4` to put pages 2 through 4, for example.
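Writing out the page ranges for a `cat` rotation by hand is error-prone (it's easy to repeat or skip a page at the boundaries). Here's a small Python sketch that builds them; the function names are made up, and it assumes a PDFtk version that accepts the `right` keyword used above.

```python
def page_range(first, last):
    """Format a 1-based inclusive page range the way pdftk expects."""
    return str(first) if first == last else "{}-{}".format(first, last)


def pdftk_rotate_args(rotate_ranges, rotation="right", total_pages=None):
    """Build the page arguments for `pdftk in.pdf cat ... output out.pdf`.

    rotate_ranges is a sorted, non-overlapping list of (first, last)
    tuples. Pages outside those ranges pass through unrotated. Give
    total_pages if the last rotated range might reach the final page.
    """
    args = []
    next_page = 1
    for first, last in rotate_ranges:
        if first < next_page or last < first:
            raise ValueError("ranges must be sorted and non-overlapping")
        if first > next_page:
            args.append(page_range(next_page, first - 1))
        args.append(page_range(first, last) + rotation)
        next_page = last + 1
    # Add the unrotated tail, unless we already know there isn't one.
    if total_pages is None or next_page <= total_pages:
        args.append("{}-end".format(next_page))
    return args
```

`pdftk_rotate_args([(2, 5), (12, 15)])` gives `["1", "2-5right", "6-11", "12-15right", "16-end"]`, which you can drop between `cat` and `output` in a `subprocess.run` call.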
Additional Resources
More REGEXes:
Freefcc includes pdf parsers for about 14 broadcast tv stations. One way to do it is lotsa regexes.
Sunlightlabs' U.S. House Disbursements parser; Senate disbursements parser. Sample page
PDF liberation hackathon
Yes, there actually was one. github, web
Handwritten forms
Handwriting is a whole different beast. ABBYY FineReader does OK with it, but don't count on it for accuracy.
- Captricity Will do document conversions of handwritten stuff for lots of money. Former Code for America folks--ask them for a discount. Gives (or gave?) a break to nonprofits.
Academics
You may hear that OCR is a 'solved' problem in the computer science domain (although there are some implementation details that can be improved).
There is also work being done in 'layout analysis', and groups like UMD's language and media processing lab do weird research into documents. Papers on the sexy stuff in this field are showcased yearly at the International Conference on Document Analysis and Recognition (in Johannesburg for 2016). UMD's David Doermann now at DARPA?
Tesseract bindings
Check these out; YMMV.
https://github.com/jflesch/pyocr
https://github.com/meh/ruby-tesseract-ocr
https://pypi.python.org/pypi/pytesseract
Node: https://github.com/creatale/node-dv which is supposed to work with this form extraction thing: https://github.com/creatale/node-fv
Learn More About the PDF Format
- PDF Explained (O'Reilly)