A Benchmark & Evaluation for Text Extraction from PDF

This project benchmarks and evaluates existing PDF extraction tools on their ability to semantically extract the body text of PDF documents, in particular scientific articles. It provides (1) a benchmark generator, (2) a ready-to-use benchmark, and (3) an extensive evaluation with meaningful evaluation criteria.

The Benchmark Generator

For more details and usage, see benchmark-generator/.

The Benchmark

For more details, see benchmark/.

The Evaluation

For more details, see evaluation/.