pdftables - a library for extracting tables from PDF files

This post on the ScraperWiki blog describes the algorithms used in pdftables, and something of its genesis. This README gives more technical information.

pdftables uses pdfminer to get information on the locations of text elements in a PDF document. pdfminer was chosen as a base because it provides information on the full range of page elements in PDF files, including graphical elements such as lines. Although the current algorithms do not use these graphical elements, we plan to use them in future work. As a pure Python library, pdfminer is very portable. Its downside is that it is slow, perhaps an order of magnitude slower than alternative C-based libraries.

Usage

First we get a file handle to a PDF:

import os

filepath = os.path.join(PDF_TEST_FILES, SelectedPDF)
fh = open(filepath, 'rb')

Then we use our get_pdf_page function to select a single page from the document, and pass that page to page_to_tables:

pdf_page = get_pdf_page(fh, pagenumber)
table, diagnosticData = page_to_tables(pdf_page, extend_y=False, hints=hints, atomise=False)

Setting the optional extend_y parameter to True extends the grid used to extract the table to the full height of the page. The optional hints parameter is a two-element array of strings: the first element should contain unique text from the top of the table, and the second should contain unique text from the bottom row of the table. Setting the optional atomise parameter to True converts all the text to individual characters; this is slower, but will sometimes split closely spaced columns.
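To illustrate how a pair of hints can bound a table vertically, here is a minimal sketch. It is not the code used inside pdftables: the TextBox tuple, the crop_to_hints function, and the convention that y increases towards the top of the page (as in PDF coordinates) are all assumptions made for this example.

```python
from collections import namedtuple

# Hypothetical stand-in for a positioned text element; not a pdftables type.
TextBox = namedtuple("TextBox", ["text", "bottom", "top"])


def crop_to_hints(boxes, hints):
    """Keep only the boxes lying between the two hint strings.

    hints[0] is text found at the top of the table, hints[1] is text
    found in its bottom row. Assumes PDF-style coordinates, where y
    increases towards the top of the page.
    """
    top_hint, bottom_hint = hints
    # Upper limit: top edge of the first box containing the top hint.
    upper = next(b.top for b in boxes if top_hint in b.text)
    # Lower limit: bottom edge of the first box containing the bottom hint.
    lower = next(b.bottom for b in boxes if bottom_hint in b.text)
    return [b for b in boxes if b.bottom >= lower and b.top <= upper]


boxes = [
    TextBox("Annual report", 700, 720),
    TextBox("Country", 600, 612),
    TextBox("France", 580, 592),
    TextBox("Total", 560, 572),
    TextBox("Page 3", 40, 52),
]
cropped = crop_to_hints(boxes, ["Country", "Total"])
print([b.text for b in cropped])
```

The page title and the page footer fall outside the range spanned by the two hints, so only the three table rows remain. This is why each hint needs to be text that is unique on the page.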

table is a list of lists of strings. diagnosticData is an object containing diagnostic information which can be displayed using the plotpage function:

fig,ax1 = plotpage(diagnosticData)
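Because table is a plain list of lists of strings, it can be passed straight to standard Python tools. As a sketch (written for Python 3, with a hypothetical helper name table_to_csv and made-up sample data), this is one way to turn such a table into CSV text:

```python
import csv
import io


def table_to_csv(table):
    """Serialise a list of lists of strings as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(table)
    return buffer.getvalue()


# Sample data in the same shape as the table described above.
table = [["Country", "Population"], ["France", "66,000,000"]]
csv_text = table_to_csv(table)
print(csv_text)
```

The csv module quotes cells that contain commas, so values such as formatted numbers survive a round trip.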

Files and Folders

 .
 |-fixtures
 |---actual_output
 |---expected_output
 |---sample_data
 |-pdftables
 |-test

fixtures contains test fixtures. In particular, the sample_data directory contains PDF files, which are installed from a different repository by running the download_test_data.sh script.

The actual_output and expected_output directories are currently unused.

test contains the tests.

pdftables contains the core code files:

pdftables.py - this is the core of the pdftables library. It contains two entry point functions (page_to_tables and get_tables). page_to_tables handles a single page of a document and allows the use of options in finding the table. get_tables takes a file handle and returns a list of all the tables in the document.

pdftables can also be run from the command line:

pdftables.py <file.pdf>

This converts all the tables found in <file.pdf> to a string format.

counter.py - implements collections.Counter for the benefit of Python 2.6
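collections.Counter was added to the standard library in Python 2.7, which is why a substitute is needed on 2.6. The sketch below shows what a small subset of such a substitute could look like. It is an illustration only, not the contents of counter.py, and the MiniCounter name is hypothetical.

```python
class MiniCounter(dict):
    """Tiny subset of collections.Counter: counting and most_common."""

    def __init__(self, iterable=()):
        dict.__init__(self)
        for item in iterable:
            self[item] = self.get(item, 0) + 1

    def __missing__(self, key):
        # Like collections.Counter, report zero for unseen keys.
        return 0

    def most_common(self, n=None):
        # Sort by descending count. Ties are broken by key here, which
        # differs from collections.Counter but keeps the result stable.
        ranked = sorted(self.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked if n is None else ranked[:n]


counts = MiniCounter([10, 20, 10, 30, 10, 20])
print(counts.most_common(2))
```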

display.py - prettily prints a table by implementing the to_string function
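As a rough idea of what pretty-printing a table involves, here is a minimal sketch that pads each column to a common width. The function name sketch_to_string and the pipe-separated output format are assumptions for this example; the to_string function in display.py may behave differently.

```python
def sketch_to_string(table):
    """Render a list of lists of strings as aligned, pipe-separated text."""
    if not table:
        return ""
    n_cols = max(len(row) for row in table)
    # Pad short rows so every row has the same number of cells.
    padded = [list(row) + [""] * (n_cols - len(row)) for row in table]
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in padded) for i in range(n_cols)]
    lines = []
    for row in padded:
        cells = [cell.ljust(width) for cell, width in zip(row, widths)]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


rendered = sketch_to_string([["Country", "Population"], ["France", "66"], ["Chad"]])
print(rendered)
```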

numpy_subset.py - partially implements numpy.diff, numpy.arange and numpy.average to avoid a large dependency on numpy.
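The three functions named above are simple enough to write in pure Python for one-dimensional inputs. The following sketch shows one way to do so; it is not the code in numpy_subset.py, and it covers only the basic behaviour (no axis, dtype or weights arguments).

```python
import math


def diff(values):
    """Differences between consecutive items, like numpy.diff on 1-D input."""
    return [b - a for a, b in zip(values, values[1:])]


def arange(start, stop, step=1):
    """Evenly spaced values in [start, stop), like numpy.arange."""
    count = max(0, int(math.ceil((stop - start) / float(step))))
    return [start + i * step for i in range(count)]


def average(values):
    """Arithmetic mean, like numpy.average without weights."""
    return sum(values) / float(len(values))


print(diff([1, 4, 9, 16]))
print(arange(0, 5, 2))
print(average([1, 2, 3, 4]))
```

These return Python lists rather than numpy arrays, so callers cannot rely on array arithmetic on the results.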

pdf_document.py - implements PDFDocument to abstract away the underlying PDF class, easing any future move from pdfminer to a different underlying PDF library

pdftables_analysis.py - uses the matplotlib library to make visualisations of the elements found in PDF documents, and of features of the table analysis algorithm

runtables.py - my scientist-style harness to run pdftables, likely to be deprecated by my more software-engineering-minded colleagues!

tree.py - implements the structure which holds the PDF document elements on which pdftables operates.

Installing test set files

Files used in testing are stored in a separate repository and can be installed by executing the script:

download_test_data.sh