Marker API

[!IMPORTANT]

Marker API provides a simple endpoint for converting PDF documents to Markdown quickly and accurately. With just one click, you can deploy the Marker API endpoint and start converting PDFs seamlessly.

Features

Comparison

Original PDF | Marker-API | PyPDF

Installation and Setup

🐍 Python

To install Marker API in a Python environment, follow these steps:

  1. Clone the Marker API repository from GitHub:
git clone https://github.com/adithya-s-k/marker-api
  2. Navigate to the cloned repository directory:
cd marker-api
  3. Install the dependencies with either Poetry or pip:

poetry install
# or
pip install -e .

After installation, start the server with the marker_api command:

marker_api

or

python server.py

🛳️ Docker

To use Marker API with Docker, execute the following commands:

  1. Pull the Marker API Docker image from Docker Hub (👉🏼 Docker Image):
docker pull savatar101/marker-api:0.3
  2. Run the Docker container, exposing port 8000:
# if you are running on a GPU
docker run --gpus all -p 8000:8000 savatar101/marker-api:0.3
# else
docker run -p 8000:8000 savatar101/marker-api:0.3

Alternatively, if you prefer to build the Docker image locally, build it and then run the container as follows:

docker build -t marker-api .
# if you are running on a gpu
docker run --gpus all -p 8000:8000 marker-api
# else
docker run -p 8000:8000 marker-api
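
Once the container is started, it can be useful to wait until port 8000 accepts connections before sending requests. The following is a minimal sketch using only the Python standard library; the helper name `wait_for_port` and its defaults are illustrative and not part of Marker API. An open port only shows that the server is listening, not that the models have finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

For the commands above, a call such as `wait_for_port("localhost", 8000)` would block until the published port is reachable or the timeout expires.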

✈️ Skypilot

SkyPilot is a framework for running LLMs, AI, and batch jobs on any cloud, offering maximum cost savings, highest GPU availability, and managed execution. To deploy Marker API using Skypilot on any cloud provider, execute the following command:

pip install "skypilot-nightly[all]"

# set up SkyPilot with the cloud provider of your choice

sky launch skypilot.yaml

Please refer to the SkyPilot documentation for more information.

Usage

API Client Code: Open In Colab

Endpoint

Request

Response

Invoke Endpoint

CURL

curl -X POST \
  -F "pdf_file=@example.pdf;type=application/pdf" \
  http://localhost:8000/convert

Python

Refer to the examples (Notebook, Script) to see how to invoke the API and save the result as Markdown.

import requests
import os

url = "http://localhost:8000/convert"
pdf_file_path = "example.pdf"
with open(pdf_file_path, 'rb') as pdf_file:
    pdf_content = pdf_file.read()
files = {'pdf_file': (os.path.basename(pdf_file_path), pdf_content, 'application/pdf')}
response = requests.post(url, files=files)

print(response.json())
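
If the `requests` package is not available, the same multipart upload can be assembled with the standard library. This is a sketch under stated assumptions: the helper names `build_multipart` and `convert_pdf` are illustrative, the default timeout is arbitrary, and the filename is assumed to contain no double quotes. The endpoint URL and the `pdf_file` field name are taken from the examples above.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, content: bytes,
                    content_type: str = "application/pdf"):
    """Encode one file as a multipart/form-data body; return (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'.encode(),
        f"Content-Type: {content_type}\r\n\r\n".encode(),
        content,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return body, f"multipart/form-data; boundary={boundary}"


def convert_pdf(pdf_path: str, url: str = "http://localhost:8000/convert",
                timeout: float = 600.0):
    """POST the PDF to the convert endpoint and return the decoded JSON response."""
    path = Path(pdf_path)
    body, content_type = build_multipart("pdf_file", path.name, path.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `convert_pdf("example.pdf")` against a running server should return the same JSON that `response.json()` yields in the `requests` version.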

JavaScript

// Requires Node.js 18+, which provides fetch, FormData, and Blob as globals.
const fs = require('fs');
const path = require('path');

const url = "http://localhost:8000/convert";
const pdfFilePath = "example.pdf";

fs.readFile(pdfFilePath, (err, pdfContent) => {
    if (err) {
        console.error(err);
        return;
    }

    // Build the multipart body; the form field must be named 'pdf_file'.
    const formData = new FormData();
    formData.append('pdf_file', new Blob([pdfContent], { type: 'application/pdf' }), path.basename(pdfFilePath));

    fetch(url, {
        method: 'POST',
        body: formData
    })
    .then(response => response.json())
    .then(data => console.log(data))
    .catch(error => console.error('Error:', error));
});
<details> <summary><h3>Marker Readme</h3></summary>

Marker converts PDF to markdown quickly and accurately.

How it works

Marker is a pipeline of deep learning models:

It only uses models where necessary, which improves speed and accuracy.

Examples

| PDF | Type | Marker | Nougat |
| --- | --- | --- | --- |
| Think Python | Textbook | View | View |
| Think OS | Textbook | View | View |
| Switch Transformers | arXiv paper | View | View |
| Multi-column CNN | arXiv paper | View | View |

Performance

Benchmark overall

The above results are with marker and nougat setup so they each take ~4GB of VRAM on an A6000.

See below for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.

Commercial usage

I want marker to be as widely accessible as possible, while still funding my development/training costs. Research and personal usage is always okay, but there are some restrictions on commercial usage.

The weights for the models are licensed cc-by-nc-sa-4.0, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options here.

Community

Discord is where we discuss future development.

Limitations

PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:

Installation

You'll need Python 3.9+ and PyTorch. You may need to install the CPU version of torch first if you're not using a Mac or a GPU machine. See here for more details.

Install with:

pip install marker-pdf

Optional: OCRMyPDF

Only needed if you want to use the optional ocrmypdf as the ocr backend. Note that ocrmypdf includes Ghostscript, an AGPL dependency, but calls it via CLI, so it does not trigger the license provisions.

See the instructions here

Usage

First, some configuration:

Convert a single file

marker_single /path/to/file.pdf /path/to/output/folder --batch_multiplier 2 --max_pages 10 --langs English

Make sure the DEFAULT_LANG setting is set appropriately for your document. The list of supported languages for OCR is here. If you need more languages, you can use any language supported by Tesseract if you set OCR_ENGINE to ocrmypdf. If you don't need OCR, marker can work with any language.

Convert multiple files

marker /path/to/input/folder /path/to/output/folder --workers 10 --max 10 --metadata_file /path/to/metadata.json --min_length 10000

The file passed to --metadata_file is JSON that maps each PDF filename to its languages:

{
  "pdf1.pdf": {"languages": ["English"]},
  "pdf2.pdf": {"languages": ["Spanish", "Russian"]},
  ...
}

You can use language names or codes. The exact codes depend on the OCR engine. See here for a full list for surya codes, and here for tesseract.
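
For a large input folder, the metadata file can be generated rather than written by hand. The sketch below produces the format shown above; the function names, the `overrides` parameter, and the default of `["English"]` are illustrative assumptions, not part of marker.

```python
import json
from pathlib import Path


def build_metadata(pdf_dir, default_languages=("English",), overrides=None):
    """Map each PDF filename in pdf_dir to a {"languages": [...]} entry."""
    overrides = overrides or {}
    metadata = {}
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        # Per-file overrides win; everything else gets the default languages.
        languages = overrides.get(pdf.name, default_languages)
        metadata[pdf.name] = {"languages": list(languages)}
    return metadata


def write_metadata(pdf_dir, out_path, **kwargs):
    """Write the metadata mapping as JSON and return it."""
    metadata = build_metadata(pdf_dir, **kwargs)
    Path(out_path).write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return metadata
```

The resulting file can then be passed to `--metadata_file`. Note that the `*.pdf` glob is case-sensitive on most Linux filesystems.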

Convert multiple files on multiple GPUs

MIN_LENGTH=10000 METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 marker_chunk_convert ../pdf_in ../md_out

Note that the env variables above are specific to this script, and cannot be set in local.env.
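
When launching the multi-GPU script from Python instead of a shell, the environment variables have to be passed explicitly. A minimal sketch follows; the wrapper functions are hypothetical, and treating `MIN_LENGTH` and `METADATA_FILE` as optional is an assumption, since the command above sets all four variables.

```python
import os
import subprocess


def chunk_convert_invocation(in_dir, out_dir, num_devices, num_workers,
                             min_length=None, metadata_file=None):
    """Return (argv, env) for marker_chunk_convert, mirroring the shell line above."""
    env = dict(os.environ)  # copy so the parent environment stays untouched
    env["NUM_DEVICES"] = str(num_devices)
    env["NUM_WORKERS"] = str(num_workers)
    # Assumption: these two variables may be left unset.
    if min_length is not None:
        env["MIN_LENGTH"] = str(min_length)
    if metadata_file is not None:
        env["METADATA_FILE"] = str(metadata_file)
    return ["marker_chunk_convert", str(in_dir), str(out_dir)], env


def run_chunk_convert(*args, **kwargs):
    """Run the script and raise CalledProcessError if it exits non-zero."""
    argv, env = chunk_convert_invocation(*args, **kwargs)
    return subprocess.run(argv, env=env, check=True)
```

Separating the command construction from the `subprocess.run` call keeps the variable handling easy to inspect before anything is executed.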

Troubleshooting

There are some settings that you may find useful if things aren't working the way you expect:

In general, if output is not what you expect, trying to OCR the PDF is a good first step. Not all PDFs have good text/bboxes embedded in them.

Benchmarks

Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods. It's noisy, but at least directionally correct.
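
To illustrate what comparing a reference text to extracted text can look like, here is a generic stand-in based on `difflib` from the Python standard library. It is not necessarily the metric used by marker's benchmark; the function name `similarity` and the whitespace normalization are assumptions made for this sketch.

```python
from difflib import SequenceMatcher


def similarity(reference: str, extracted: str) -> float:
    """Score in [0, 1]: 1.0 for identical text, lower as the texts diverge."""
    # Collapse whitespace so differences in line wrapping are not penalized.
    ref = " ".join(reference.split())
    out = " ".join(extracted.split())
    return SequenceMatcher(None, ref, out, autojunk=False).ratio()
```

`SequenceMatcher` can be slow on long inputs, so a real benchmark over full books would likely compare the text in chunks.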

Benchmarks show that marker is 4x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data). We show naive text extraction (pulling text out of the pdf with no processing) for comparison.

Speed

| Method | Average Score | Time per page | Time per document |
| --- | --- | --- | --- |
| marker | 0.613721 | 0.631991 | 58.1432 |
| nougat | 0.406603 | 2.59702 | 238.926 |

Accuracy

thinkpython.pdf, thinkos.pdf, and thinkdsp.pdf are non-arXiv books; multicolcnn.pdf, switch_trans.pdf, and crowd.pdf are arXiv papers.

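| Method | multicolcnn.pdf | switch_trans.pdf | thinkpython.pdf | thinkos.pdf | thinkdsp.pdf | crowd.pdf |
| --- | --- | --- | --- | --- | --- | --- |
| marker | 0.536176 | 0.516833 | 0.70515 | 0.710657 | 0.690042 | 0.523467 |
| nougat | 0.44009 | 0.588973 | 0.322706 | 0.401342 | 0.160842 | 0.525663 |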

Peak GPU memory usage during the benchmark is 4.2GB for nougat, and 4.1GB for marker. Benchmarks were run on an A6000 Ada.
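
The headline figures can be cross-checked from the numbers reported above. This short sketch averages the per-document accuracy scores and computes the ratio of the per-page times; all values are copied from the Speed and Accuracy results, and the variable names are illustrative.

```python
# Per-document accuracy scores, in the order listed in the Accuracy results.
scores = {
    "marker": [0.536176, 0.516833, 0.70515, 0.710657, 0.690042, 0.523467],
    "nougat": [0.44009, 0.588973, 0.322706, 0.401342, 0.160842, 0.525663],
}
# Time per page from the Speed results.
time_per_page = {"marker": 0.631991, "nougat": 2.59702}

averages = {name: sum(vals) / len(vals) for name, vals in scores.items()}
speedup = time_per_page["nougat"] / time_per_page["marker"]

for name, avg in averages.items():
    print(f"{name}: average score {avg:.6f}")
print(f"nougat/marker time per page: {speedup:.2f}x")
```

The averages reproduce the Average Score column (0.613721 for marker, 0.406603 for nougat), and the time ratio comes out at roughly 4.1x, consistent with the "4x faster" claim.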

Throughput

Marker takes about 4.5GB of VRAM on average per task, so you can convert 10 documents in parallel on an A6000.
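
The same arithmetic can be applied to other GPUs. The sketch below assumes the A6000's 48 GB of VRAM (a hardware specification not stated in the text above) and the 4.5 GB per-task average; the function name is illustrative, and real headroom will depend on what else is using the GPU.

```python
import math


def max_parallel_tasks(total_vram_gb: float, vram_per_task_gb: float = 4.5) -> int:
    """Number of conversion tasks that fit in GPU memory at once."""
    if vram_per_task_gb <= 0:
        raise ValueError("vram_per_task_gb must be positive")
    return math.floor(total_vram_gb / vram_per_task_gb)


print(max_parallel_tasks(48))  # A6000 with 48 GB -> 10
print(max_parallel_tasks(24))  # a 24 GB card -> 5
```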

Benchmark results

Running your own benchmarks

You can benchmark the performance of marker on your machine. Install marker manually with:

git clone https://github.com/VikParuchuri/marker.git
cd marker
poetry install

Download the benchmark data here and unzip. Then run benchmark.py like this:

python benchmark.py data/pdfs data/references report.json --nougat

This will benchmark marker against other text extraction methods. It sets up batch sizes for nougat and marker to use a similar amount of GPU RAM for each.

Omit --nougat to exclude nougat from the benchmark. I don't recommend running nougat on CPU, since it is very slow.

Thanks

This work would not have been possible without amazing open source models and datasets, including (but not limited to):

Thank you to the authors of these models and datasets for making them available to the community!

</details>

To Do

Throughput Benchmarks

Updates on throughput benchmarks will be available soon.

Acknowledgements

This project is built on top of the remarkable marker project created by VikParuchuri. We express our gratitude for the inspiration and foundation provided by this project.

<p align="center"> <a href="https://adithyask.com"> <img src="https://api.star-history.com/svg?repos=adithya-s-k/marker-api&type=Date" alt="Star History Chart"> </a> </p>