Data extractor for PDF invoices - invoice2data

Read the documentation at https://invoice2data.readthedocs.io/

A command line tool and Python library that automates the extraction of key information from invoices to support your accounting process. The library is very flexible and can be used on other types of business documents as well.

In essence, invoice2data simplifies the process of getting data from invoices by:

- Automating text extraction: No more manual copying and pasting.
- Using templates for structure: Handles different invoice layouts.
- Providing structured output: Makes the data ready for analysis or further processing.

This makes it a valuable tool for businesses and developers dealing with a large volume of invoices, saving time and reducing errors associated with manual data entry.

  1. Extracts text from PDF files using one of several techniques: pdftotext, text, ocrmypdf, pdfminer, pdfplumber, or OCR via tesseract or gvision (Google Cloud Vision).
  2. Searches the extracted text with regular expressions defined in a YAML or JSON-based template system.
  3. Saves the results as CSV, JSON or XML, or renames the PDF files to match their content.
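
The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: `SAMPLE_TEXT` stands in for the text a PDF reader would return, and `FIELDS`, `search_fields` and `to_csv` are hypothetical names invented for this example.

```python
import csv
import io
import re

# Step 1 stand-in: in practice this text would come from a PDF reader.
SAMPLE_TEXT = """Amazon Web Services, Inc.
Invoice Number: 42183017
TOTAL AMOUNT DUE ON August 3, 2014 $4.11
"""

# Step 2: hypothetical field patterns, as a template might define them.
FIELDS = {
    "invoice_number": r"Invoice Number:\s+(\d+)",
    "amount": r"TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)",
}


def search_fields(text, fields):
    """Return the first captured group of each pattern that matches."""
    result = {}
    for name, pattern in fields.items():
        match = re.search(pattern, text)
        if match:
            result[name] = match.group(1)
    return result


def to_csv(rows):
    """Step 3: serialise a list of dicts as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=sorted(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


extracted = search_fields(SAMPLE_TEXT, FIELDS)
csv_text = to_csv([extracted])
print(csv_text)
```

The captured values stay strings in this sketch; converting amounts and dates to typed values is left out.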

With the flexible template system you can:

Go from PDF files to this:

{'date': (2014, 5, 7), 'invoice_number': '30064443', 'amount': 34.73, 'desc': 'Invoice 30064443 from QualityHosting', 'lines': [{'price': 42.0, 'desc': u'Small Business StandardExchange 2010\nGrundgeb\xfchr pro Einheit\nDienst: OUDJQ_office\n01.05.14-31.05.14\n', 'pos': u'7', 'qty': 1.0}]}
{'date': (2014, 6, 4), 'invoice_number': 'EUVINS1-OF5-DE-120725895', 'amount': 35.24, 'desc': 'Invoice EUVINS1-OF5-DE-120725895 from Amazon EU'}
{'date': (2014, 8, 3), 'invoice_number': '42183017', 'amount': 4.11, 'desc': 'Invoice 42183017 from Amazon Web Services'}
{'date': (2015, 1, 28), 'invoice_number': '12429647', 'amount': 101.0, 'desc': 'Invoice 12429647 from Envato'}

Usage

Basic usage. Process PDF files and write result to CSV. Please see the Command-line Reference for details.

Choose any of the following input readers:

Choose any of the following output formats:

Save output file with custom name or a specific folder

invoice2data --output-format csv --output-name myinvoices/invoices.csv invoice.pdf

Note: You must specify --output-format for --output-name to take effect.

Specify a folder with YAML templates (e.g. for your suppliers)

invoice2data --template-folder ACME-templates invoice.pdf

Only use your own templates and exclude built-ins

invoice2data --exclude-built-in-templates --template-folder ACME-templates invoice.pdf

Processes a folder of invoices and copies renamed invoices to new folder.

invoice2data --copy new_folder folder_with_invoices/*.pdf

Processes a single file and dumps whole file for debugging (useful when adding new templates in templates.py)

invoice2data --debug my_invoice.pdf

Recognize test invoices

invoice2data invoice2data/test/pdfs/* --debug

Use as Python Library

You can easily add invoice2data to your own Python scripts as a library.

from invoice2data import extract_data
result = extract_data('path/to/my/file.pdf')

Using in-house templates

from invoice2data import extract_data
from invoice2data.extract.loader import read_templates

templates = read_templates('/path/to/your/templates/')
result = extract_data('path/to/my/file.pdf', templates=templates)
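
For batch jobs, the library call can be wrapped in a small loop. The sketch below is illustrative only: `extract_many` and `fake_extract` are hypothetical helpers, and the loop assumes the extractor returns a falsy value when nothing could be extracted. The extractor is passed in as an argument so the helper can be tried without any PDFs at hand.

```python
def extract_many(paths, extract):
    """Run `extract` on each path and keep only the non-empty results."""
    results = {}
    for path in paths:
        data = extract(path)
        if data:  # assumption: a falsy result means nothing was extracted
            results[path] = data
    return results


def fake_extract(path):
    """Stand-in extractor used only to demonstrate the loop."""
    known = {"a.pdf": {"invoice_number": "30064443", "amount": 34.73}}
    return known.get(path, {})


batch = extract_many(["a.pdf", "b.pdf"], fake_extract)
print(batch)
```

With invoice2data installed, passing `extract_data` instead of `fake_extract` would be the intended use.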

Template system

See invoice2data/extract/templates for existing templates. Just extend the list to add your own. If deployed by a bigger organisation, there should be an interface to edit templates for new suppliers (80-20 rule). For a short tutorial on how to add new templates, see tutorial.md.

Templates are written in YAML or JSON. Each template defines one or more keywords to find the right template, one or more exclude_keywords to narrow the match further, and regular expressions for the fields to be extracted. A field can also be a static value, like the full company name.

Template files are tried in alphabetical order.
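
Template selection as described above can be pictured roughly as follows. This is a sketch under assumptions, not the library's matching code: it assumes every keyword must appear in the text, that any exclude_keyword rules a template out, and that plain substring checks are enough. The file names and the `matches` and `pick_template` helpers are hypothetical.

```python
def matches(template, text):
    """True if all keywords occur in the text and no exclude_keyword does."""
    has_all = all(k in text for k in template["keywords"])
    has_excluded = any(k in text for k in template.get("exclude_keywords", []))
    return has_all and not has_excluded


def pick_template(templates, text):
    """Try templates in alphabetical order of file name; return the first hit."""
    for name in sorted(templates):
        if matches(templates[name], text):
            return name
    return None


# Hypothetical template files, keyed by file name.
TEMPLATES = {
    "com.example.acme.yml": {"keywords": ["ACME"]},
    "com.amazon.aws.yml": {
        "keywords": ["Amazon Web Services"],
        "exclude_keywords": ["San Jose"],
    },
}

print(pick_template(TEMPLATES, "Amazon Web Services, Inc. Seattle"))
print(pick_template(TEMPLATES, "Amazon Web Services, San Jose"))
```

Because the first match wins in this sketch, file naming decides which of two overlapping templates is used.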

We may extend them to feature options to be used during invoice processing.

Example:

    issuer: Amazon Web Services, Inc.
    keywords:
    - Amazon Web Services
    exclude_keywords:
    - San Jose
    fields:
      amount: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
      amount_untaxed: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
      date: Invoice Date:\s+([a-zA-Z]+ \d+ , \d+)
      invoice_number: Invoice Number:\s+(\d+)
      partner_name: (Amazon Web Services, Inc\.)
    options:
      remove_whitespace: false
      currency: HKD
      date_formats:
        - '%d/%m/%Y'
    lines:
        start: Detail
        end: \* May include estimated US sales tax
        first_line: ^    (?P<description>\w+.*)\$(?P<price_unit>\d+\.\d+)
        line: (.*)\$(\d+\.\d+)
        skip_line: Note
        last_line: VAT \*\*
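
To get a feel for the start, end, line and skip_line settings in the example, here is a rough sketch of how they could interact. How invoice2data itself combines them is not shown here; the sample text, the `extract_lines` helper and the `desc` and `price` keys are assumptions made for illustration.

```python
import re

# Patterns copied from the example template above.
LINES = {
    "start": r"Detail",
    "end": r"\* May include estimated US sales tax",
    "line": r"(.*)\$(\d+\.\d+)",
    "skip_line": r"Note",
}

# Hypothetical invoice text.
SAMPLE_TEXT = """Summary
Detail
EC2 compute    $3.10
Note: usage is billed hourly $0.00
S3 storage     $1.01
* May include estimated US sales tax
Footer $9.99
"""


def extract_lines(text, rules):
    """Collect line items found between the start and end markers."""
    start = re.search(rules["start"], text)
    if not start:
        return []
    # Look for the end marker only after the start marker.
    end = re.compile(rules["end"]).search(text, start.end())
    if not end:
        return []
    rows = []
    for raw in text[start.end():end.start()].splitlines():
        if not raw.strip() or re.search(rules["skip_line"], raw):
            continue
        match = re.search(rules["line"], raw)
        if match:
            rows.append(
                {"desc": match.group(1).strip(), "price": float(match.group(2))}
            )
    return rows


rows = extract_lines(SAMPLE_TEXT, LINES)
print(rows)
```

Text outside the two markers, such as the footer amount, is ignored in this sketch.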

The lines package has multiple settings:

Warning: invoice2data uses a YAML templating system. The YAML templates are loaded with PyYAML, which is a pure Python implementation and therefore rather slow. As an alternative, JSON templates can be used, which Python supports natively.

The performance with YAML templates can be increased greatly (about 10x) by using libyaml. On Debian-based distributions it can be installed with: sudo apt-get install libyaml-dev
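
To check whether a PyYAML installation can use libyaml, something like the following may help. PyYAML only exposes `CSafeLoader` when it was built against libyaml, so the sketch falls back to the pure Python `SafeLoader` otherwise. The `pick_loader` helper is hypothetical, and whether invoice2data picks up the faster loader on its own is not stated here.

```python
try:
    import yaml
except ImportError:  # PyYAML is not installed at all
    yaml = None


def pick_loader():
    """Return the libyaml-backed loader if present, else the pure Python one."""
    if yaml is None:
        return None
    return getattr(yaml, "CSafeLoader", yaml.SafeLoader)


loader = pick_loader()
parsed = None
if loader is not None:
    # Parse a tiny template fragment with whichever loader was found.
    parsed = yaml.load("issuer: ACME\nkeywords:\n- ACME\n", Loader=loader)

print(loader.__name__ if loader else "PyYAML not installed")
```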

Development

If you are interested in improving this project, have a look at our developer guide to get you started quickly.

Roadmap and open tasks

Maintainers

Contributors and Credits

Contributions are very welcome. To learn more, see the Contributor Guide.

Used By

Related Projects
