<div align="center" style="margin-top: 20px"> <a href="https://yobix.ai"> <img height="28px" alt="yobix ai logo" src="https://framerusercontent.com/images/zaqayjWBWNoQmV9MIwSEKf0HBo.png?scale-down-to=512"> </a> <h1 style="margin-top: 0; padding-top: 0">Extractous</h1> </div> <div align="center">

<a href="https://github.com/yobix-ai/extractous/blob/main/LICENSE">License</a> <img src="https://img.shields.io/github/commit-activity/m/yobix-ai/extractous" alt="Commits per month">

</div> <div align="center">

Extractous offers a fast and efficient solution for extracting content and metadata from various document types such as PDF, Word, HTML, and many other formats. Our goal is to deliver a comprehensive solution in Rust with bindings for many programming languages.

</div>

Demo: Extractous 🚀 is 25x faster than the popular unstructured-io library ($65m in funding and 8.5k GitHub stars). For complete benchmarking details, please consult our benchmarking repository.

[Demo animation: unstructured vs. Extractous] <sup>* demo running at 5x recording speed</sup>

Why Extractous?

Extractous was born out of frustration with the need to rely on external services or APIs for content extraction from unstructured data. Do we really need to call external APIs or run special servers just for content extraction? Couldn't extraction be performed locally and efficiently?

In our search for solutions, unstructured-io stood out as a popular, widely used library for parsing unstructured content in-process. However, we identified several significant limitations.

In contrast, Extractous maintains a dedicated focus on text and metadata extraction. It achieves significantly faster processing speeds and lower memory utilization through native code execution.

With Extractous, the need for external services or APIs is eliminated, making data processing pipelines faster and more efficient.

🌳 Key Features

🚀 Quickstart

Extractous provides a simple and easy-to-use API for extracting content from various file formats. Below are quick examples:

Python

Extract the content of a file to a string:

from extractous import Extractor

# Create a new extractor
extractor = Extractor()
extractor = extractor.set_extract_string_max_length(1000)
# Uncomment if you need XML output
# extractor = extractor.set_xml_output(True)

# Extract text from a file
result, metadata = extractor.extract_file_to_string("README.md")
print(result)
print(metadata)

Extract a file (or a URL or bytearray) to a buffered stream:

from extractous import Extractor

extractor = Extractor()
# Uncomment if you need XML output
# extractor = extractor.set_xml_output(True)

# for file
reader, metadata = extractor.extract_file("tests/quarkus.pdf")
# for url
# reader, metadata = extractor.extract_url("https://www.google.com")
# for bytearray
# with open("tests/quarkus.pdf", "rb") as file:
#     buffer = bytearray(file.read())
# reader, metadata = extractor.extract_bytes(buffer)

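# Collect the raw bytes first and decode once at the end, so a multi-byte
# UTF-8 character split across two chunks does not break decoding
data = bytearray()
buffer = reader.read(4096)
while len(buffer) > 0:
    data += buffer
    buffer = reader.read(4096)
result = data.decode("utf-8")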

print(result)
print(metadata)
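
Decoding each 4096-byte chunk on its own can fail when a multi-byte UTF-8 character straddles a chunk boundary. One possible way to keep the read loop streaming is the standard library's incremental decoder, sketched below. The `decode_stream` helper is hypothetical and not part of Extractous; it assumes only that the reader exposes `read(n)` returning bytes, as in the example above. `io.BytesIO` stands in for the real reader so the snippet runs on its own.

```python
import codecs
import io


def decode_stream(reader, chunk_size=4096):
    """Decode a byte stream to text chunk by chunk (hypothetical helper).

    Assumes only that reader.read(n) returns bytes and an empty value at EOF.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = []
    chunk = reader.read(chunk_size)
    while len(chunk) > 0:
        # The decoder holds back an incomplete trailing byte sequence
        # until the next chunk completes it.
        parts.append(decoder.decode(bytes(chunk)))
        chunk = reader.read(chunk_size)
    # Flush; raises UnicodeDecodeError if the stream ended mid-character.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


# Stand-in for the reader returned by extract_file: "é" is two bytes in
# UTF-8, so a chunk size of 1 splits it across reads.
text = decode_stream(io.BytesIO("café".encode("utf-8")), chunk_size=1)
print(text)
```

With a real extraction, the `reader` returned by `extractor.extract_file(...)` would be passed in place of the `io.BytesIO` object.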

Extract a file with OCR. You need Tesseract installed along with the required language pack. For example, on Debian:

sudo apt install tesseract-ocr tesseract-ocr-deu

from extractous import Extractor, TesseractOcrConfig

# The OCR language should match the document; "deu" is German
extractor = Extractor().set_ocr_config(TesseractOcrConfig().set_language("deu"))
result, metadata = extractor.extract_file_to_string("../../test_files/documents/deu-ocr.pdf")

print(result)
print(metadata)
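
Since OCR depends on a system-wide Tesseract install, it can help to check for the language pack before running an extraction. The sketch below relies only on the `tesseract --list-langs` command; the helper names `parse_langs` and `tesseract_has_language` are hypothetical and not part of Extractous.

```python
import shutil
import subprocess


def parse_langs(output):
    """Parse the text printed by `tesseract --list-langs` into a set."""
    lines = [line.strip() for line in output.splitlines()]
    # The first line is a header such as 'List of available languages ...:'
    return {line for line in lines if line and not line.lower().startswith("list of")}


def tesseract_has_language(lang):
    """Return True if tesseract is on PATH and reports `lang` as installed."""
    exe = shutil.which("tesseract")
    if exe is None:
        return False
    proc = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    # Some versions print the list to stderr, so check both streams.
    return lang in parse_langs(proc.stdout + "\n" + proc.stderr)


if not tesseract_has_language("deu"):
    print("Tesseract or the 'deu' language pack is missing")
```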

Rust

Extract the content of a file to a string:

use extractous::Extractor;

fn main() {
    // Create a new extractor. Note that it uses a consuming builder pattern
    let extractor = Extractor::new().set_extract_string_max_length(1000);
    // Uncomment if you need XML output
    // let extractor = extractor.set_xml_output(true);

    // Extract text from a file
    let (text, metadata) = extractor.extract_file_to_string("README.md").unwrap();
    println!("{}", text);
    println!("{:?}", metadata);
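}

Extract a file (or a URL or bytes) to a buffered stream:

use std::io::{BufReader, Read};
// use std::fs::File; // needed only for the extract_bytes variant
use extractous::Extractor;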
use extractous::Extractor;

fn main() {
    // Get the command-line arguments
    let args: Vec<String> = std::env::args().collect();
    let file_path = &args[1];

    // Extract the provided file content to a stream
    let extractor = Extractor::new();
    // Uncomment if you need XML output
    // let extractor = extractor.set_xml_output(true);

    let (stream, metadata) = extractor.extract_file(file_path).unwrap();
    // Extract url
    // let (stream, metadata) = extractor.extract_url("https://www.google.com/").unwrap();
    // Extract bytes
    // let mut file = File::open(file_path).unwrap();
    // let mut file_bytes = Vec::new();
    // file.read_to_end(&mut file_bytes).unwrap();
    // let (stream, metadata) = extractor.extract_bytes(&file_bytes).unwrap();

    // Because stream implements std::io::Read trait we can perform buffered reading
    // For example we can use it to create a BufReader
    let mut reader = BufReader::new(stream);
    let mut buffer = Vec::new();
    reader.read_to_end(&mut buffer).unwrap();

    println!("{}", String::from_utf8(buffer).unwrap());
    println!("{:?}", metadata);
}

Extract a file with OCR. You need Tesseract installed along with the required language pack. For example, on Debian:

sudo apt install tesseract-ocr tesseract-ocr-deu

use extractous::{Extractor, PdfOcrStrategy, PdfParserConfig, TesseractOcrConfig};

fn main() {
    let file_path = "../test_files/documents/deu-ocr.pdf";

    // Configure Tesseract for German and force OCR-only processing of PDFs
    let extractor = Extractor::new()
        .set_ocr_config(TesseractOcrConfig::new().set_language("deu"))
        .set_pdf_config(PdfParserConfig::new().set_ocr_strategy(PdfOcrStrategy::OCR_ONLY));

    // Extract the file content to a string
    let (content, metadata) = extractor.extract_file_to_string(file_path).unwrap();
    println!("{}", content);
    println!("{:?}", metadata);
}

🔥 Performance

[Figure: Extractous speedup relative to unstructured]

[Figure: Extractous memory efficiency relative to unstructured]

📄 Supported file formats

| Category | Supported Formats | Notes |
|---|---|---|
| Microsoft Office | DOC, DOCX, PPT, PPTX, XLS, XLSX, RTF | Includes legacy and modern Office file formats |
| OpenOffice | ODT, ODS, ODP | OpenDocument formats |
| PDF | PDF | Can extract embedded content and supports OCR |
| Spreadsheets | CSV, TSV | Plain text spreadsheet formats |
| Web Documents | HTML, XML | Parses and extracts content from web documents |
| E-Books | EPUB | EPUB format for electronic books |
| Text Files | TXT, Markdown | Plain text formats |
| Images | PNG, JPEG, TIFF, BMP, GIF, ICO, PSD, SVG | Extracts embedded text with OCR |
| E-Mail | EML, MSG, MBOX, PST | Extracts content, headers, and attachments |
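
When walking a directory, the table above can serve as a rough pre-filter for which files to hand to the extractor. The sketch below transcribes the table into a lookup keyed by file extension. The names `SUPPORTED_FORMATS` and `format_category` are hypothetical, the `md` and `jpg` extensions are assumptions standing in for "Markdown" and "JPEG", and nothing here reflects how Extractous itself detects formats.

```python
from pathlib import Path

# Transcribed from the supported-formats table; the layout is illustrative.
# "md" and "jpg" are assumed extensions for Markdown and JPEG.
SUPPORTED_FORMATS = {
    "Microsoft Office": ["doc", "docx", "ppt", "pptx", "xls", "xlsx", "rtf"],
    "OpenOffice": ["odt", "ods", "odp"],
    "PDF": ["pdf"],
    "Spreadsheets": ["csv", "tsv"],
    "Web Documents": ["html", "xml"],
    "E-Books": ["epub"],
    "Text Files": ["txt", "md"],
    "Images": ["png", "jpeg", "jpg", "tiff", "bmp", "gif", "ico", "psd", "svg"],
    "E-Mail": ["eml", "msg", "mbox", "pst"],
}

# Reverse index: extension -> category
CATEGORY_BY_EXTENSION = {
    ext: category for category, exts in SUPPORTED_FORMATS.items() for ext in exts
}


def format_category(path):
    """Return the table category for a path, or None if the extension is not listed."""
    ext = Path(path).suffix.lower().lstrip(".")
    return CATEGORY_BY_EXTENSION.get(ext)


print(format_category("tests/quarkus.pdf"))  # PDF
print(format_category("notes.xyz"))          # None
```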

🤝 Contributing

Contributions are welcome! Please open an issue or submit a pull request if you have any improvements or new features to propose.

🕮 License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.