<p align="center"> <img src="https://sergey-filimonov.nyc3.digitaloceanspaces.com/open-parse/open-parse-with-text-tp-logo.webp" width="350" /> </p> <br/>

Easily chunk complex documents the same way a human would.

Chunking documents is a challenging task that underpins any RAG system. High-quality results are critical to a successful AI application, yet most open-source libraries are limited in their ability to handle complex documents.

Open Parse is designed to fill this gap by providing a flexible, easy-to-use library capable of visually discerning document layouts and chunking them effectively.

<details> <summary><b>How is this different from other layout parsers?</b></summary>

✂️ Text Splitting

Text splitting converts a file to raw text and slices it up.
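A toy sketch of this approach (illustrative only, not any particular library's implementation) shows its limitation: a fixed-size window over raw text ignores layout, so a heading or table row can be cut mid-way.

```python
from typing import List


def split_text(text: str, chunk_size: int = 16, overlap: int = 4) -> List[str]:
    """Naive fixed-size splitter: slides a window over raw text,
    ignoring any document structure (headings, tables, columns)."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks


# A chunk boundary can land mid-row, separating a label from its value.
chunks = split_text("Section 1. Totals\nQ1 100\nQ2 200\n")
```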

🤖 ML Layout Parsers

There are some fantastic libraries like layout-parser.

💼 Commercial Solutions

</details>

Highlights

<br/> <p align="center"> <img src="https://sergey-filimonov.nyc3.digitaloceanspaces.com/open-parse/marked-up-doc-2.webp" width="250" /> </p>

Example

Basic Example

```python
import openparse

basic_doc_path = "./sample-docs/mobile-home-manual.pdf"
parser = openparse.DocumentParser()
parsed_basic_doc = parser.parse(basic_doc_path)

for node in parsed_basic_doc.nodes:
    print(node)
```

📓 Try the sample notebook <a href="https://colab.research.google.com/drive/1Z5B5gsnmhFKEFL-5yYIcoox7-jQao8Ep?usp=sharing" class="external-link" target="_blank">here</a>

Semantic Processing Example

Chunking documents is fundamentally about grouping similar semantic nodes together. By embedding the text of each node, we can then cluster nodes based on their similarity.

```python
from openparse import processing, DocumentParser

semantic_pipeline = processing.SemanticIngestionPipeline(
    openai_api_key=OPEN_AI_KEY,
    model="text-embedding-3-large",
    min_tokens=64,
    max_tokens=1024,
)
parser = DocumentParser(
    processing_pipeline=semantic_pipeline,
)
parsed_content = parser.parse(basic_doc_path)
```

📓 Sample notebook <a href="https://github.com/Filimoa/open-parse/blob/main/src/cookbooks/semantic_processing.ipynb" class="external-link" target="_blank">here</a>
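The clustering idea behind semantic processing can be sketched in a few lines. This is a toy illustration with hard-coded two-dimensional embeddings, not the library's actual pipeline: adjacent nodes whose embeddings are similar get merged into one chunk.

```python
from math import sqrt
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return dot / norm


def merge_similar(nodes: List[str], embeddings: List[List[float]],
                  threshold: float = 0.9) -> List[str]:
    """Greedily merge a node into the previous chunk when its embedding
    is close to the previous node's -- the core idea of semantic chunking."""
    chunks = [nodes[0]]
    for node, prev_emb, emb in zip(nodes[1:], embeddings, embeddings[1:]):
        if cosine(prev_emb, emb) >= threshold:
            chunks[-1] += " " + node
        else:
            chunks.append(node)
    return chunks


nodes = ["Intro to RAG.", "RAG retrieves context.", "Unrelated appendix."]
embs = [[1.0, 0.0], [0.95, 0.1], [0.0, 1.0]]
print(merge_similar(nodes, embs))
# → ["Intro to RAG. RAG retrieves context.", "Unrelated appendix."]
```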

Serializing Results

Open Parse uses pydantic under the hood, so you can serialize results with:

```python
# export to a dict
parsed_content.dict()

# or to a JSON string
parsed_content.json()
```
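Since `.dict()` returns a plain Python dict, you can persist it with the standard `json` module. The sketch below uses a hypothetical stand-in dict in place of `parsed_content.dict()`:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stand-in for parsed_content.dict(): document metadata
# plus a list of chunked nodes.
parsed = {"filename": "mobile-home-manual.pdf",
          "nodes": [{"text": "Chapter 1 ..."}]}

# Write the serialized document to disk, then read it back.
out_path = Path(tempfile.gettempdir()) / "parsed-doc.json"
out_path.write_text(json.dumps(parsed, indent=2))

restored = json.loads(out_path.read_text())
```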

Requirements

Python 3.8+

Dealing with PDFs:

Extracting Tables:

Installation

1. Core Library

```console
pip install openparse
```

Enabling OCR Support:

PyMuPDF already contains all the logic needed to support OCR functions, but it also needs Tesseract's language support data, so installing Tesseract-OCR is still required.

The location of the language support folder must be communicated either by storing it in the environment variable `TESSDATA_PREFIX` or by passing it as a parameter to the applicable functions.

So for working OCR functionality, make sure to complete this checklist:

  1. Install Tesseract.

  2. Locate Tesseract’s language support folder. Typically you will find it here:

    • Windows: C:/Program Files/Tesseract-OCR/tessdata

    • Unix systems: /usr/share/tesseract-ocr/5/tessdata

    • macOS (installed via Homebrew):

      • Standard installation: /opt/homebrew/share/tessdata

      • Version-specific installation: /opt/homebrew/Cellar/tesseract/<version>/share/tessdata/

  3. Set the environment variable TESSDATA_PREFIX:

    • Windows: setx TESSDATA_PREFIX "C:/Program Files/Tesseract-OCR/tessdata"

    • Unix systems: declare -x TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata

    • macOS (installed via Homebrew): export TESSDATA_PREFIX=$(brew --prefix tesseract)/share/tessdata

Note: On Windows systems, this must happen outside Python – before starting your script. Just manipulating os.environ will not work!
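A quick way to sanity-check the steps above is a small helper (hypothetical, not part of Open Parse) that confirms `TESSDATA_PREFIX` points at a folder containing Tesseract language files:

```python
import os
from pathlib import Path


def tessdata_ok(env=None) -> bool:
    """Return True if TESSDATA_PREFIX points to a folder that contains
    at least one Tesseract language file (*.traineddata)."""
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        return False
    folder = Path(prefix)
    return folder.is_dir() and any(folder.glob("*.traineddata"))


print(tessdata_ok())  # False unless Tesseract language data is installed
```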

2. ML Table Detection (Optional)

This repository provides an optional feature to parse content from tables using a variety of deep learning models.

```console
pip install "openparse[ml]"
```

Then download the model weights with:

```console
openparse-download
```

You can run the parsing with the following:

```python
parser = openparse.DocumentParser(
    table_args={
        "parsing_algorithm": "unitable",
        "min_table_confidence": 0.8,
    },
)
parsed_nodes = parser.parse(pdf_path)
```

Note that we currently use table-transformers for all table detection, and we find its performance to be subpar. This negatively affects the downstream results of unitable. If you're aware of a better model, please open an Issue - the unitable team mentioned they might add this soon too.
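To illustrate what `min_table_confidence` does, here is a toy sketch of filtering detections by confidence score (illustrative only, not the library's internals): detections below the threshold are dropped rather than parsed.

```python
from typing import Dict, List


def filter_tables(detections: List[Dict], min_table_confidence: float = 0.8) -> List[Dict]:
    """Keep only the table detections the model is confident about."""
    return [d for d in detections if d["confidence"] >= min_table_confidence]


detections = [
    {"bbox": (10, 10, 200, 120), "confidence": 0.93},   # kept
    {"bbox": (30, 300, 220, 410), "confidence": 0.41},  # dropped
]
print(filter_tables(detections))
```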

Cookbooks

https://github.com/Filimoa/open-parse/tree/main/src/cookbooks

Documentation

https://filimoa.github.io/open-parse/

Sponsors

<!-- sponsors -->

<a href="https://www.data.threesigma.ai/filings-ai" target="_blank" title="Three Sigma: AI for insurance filings."><img src="https://sergey-filimonov.nyc3.digitaloceanspaces.com/open-parse/marketing/three-sigma-wide.png" width="250"></a>

<!-- /sponsors -->

Does your use case need something special? Reach out.