<div align="center">

English | ็ฎ€ไฝ“ไธญๆ–‡

<h1>DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception</h1>

Official PyTorch implementation of DocLayout-YOLO.

arXiv | Online Demo | Hugging Face Spaces

</div>

Abstract

We present DocLayout-YOLO, a real-time and robust layout detection model for diverse documents, based on YOLO-v10. This model is enriched with diversified document pre-training and structural optimization tailored for layout detection. In the pre-training phase, we introduce Mesh-candidate BestFit, viewing document synthesis as a two-dimensional bin packing problem, and create a large-scale diverse synthetic document dataset, DocSynth-300K. In terms of model structural optimization, we propose a module with Global-to-Local Controllability for precise detection of document elements across varying scales.

<p align="center"> <img src="assets/comp.png" width=52%> <img src="assets/radar.png" width=44%> <br> </p>

News ๐Ÿš€๐Ÿš€๐Ÿš€

2024.10.25 ๐ŸŽ‰๐ŸŽ‰ Mesh-candidate Bestfit code is released. Mesh-candidate Bestfit is an automatic pipeline that synthesizes large-scale, high-quality, and visually appealing document layout detection datasets. A tutorial and example data are available here.

2024.10.23 ๐ŸŽ‰๐ŸŽ‰ The DocSynth300K dataset is released on ๐Ÿค—Huggingface. DocSynth300K is a large-scale and diverse document layout analysis pre-training dataset, which can significantly boost model performance.

2024.10.21 ๐ŸŽ‰๐ŸŽ‰ Online demo available on ๐Ÿค—Huggingface.

2024.10.18 ๐ŸŽ‰๐ŸŽ‰ DocLayout-YOLO is implemented in PDF-Extract-Kit for document context extraction.

2024.10.16 ๐ŸŽ‰๐ŸŽ‰ Paper is now available on arXiv.

Quick Start

The Online Demo is now available. For local development, follow the steps below:

1. Environment Setup

Follow these steps to set up your environment:

```bash
conda create -n doclayout_yolo python=3.10
conda activate doclayout_yolo
pip install -e .
```

Note: If you only need the package for inference, you can simply install it via pip:

```bash
pip install doclayout-yolo
```

2. Prediction

You can make predictions using either a script or the SDK:

We provide a model fine-tuned on DocStructBench for prediction, which is capable of handling various document types. The model can be downloaded from here, and example images can be found under assets/example.
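
As a minimal SDK sketch (the checkpoint path, example image name, and inference settings below are illustrative assumptions; the predict call follows the standard Ultralytics-style API):

```python
import cv2
from doclayout_yolo import YOLOv10

# Load the fine-tuned DocStructBench checkpoint (local path is an assumption).
model = YOLOv10("doclayout_yolo_docstructbench_imgsz1024.pt")

# Run prediction on an example image; imgsz/conf/device are illustrative settings.
det_res = model.predict(
    "assets/example/financial.jpg",  # hypothetical file under assets/example
    imgsz=1024,       # inference image size
    conf=0.2,         # confidence threshold
    device="cuda:0",  # or "cpu"
)

# Draw the detected layout elements and save the visualization.
annotated = det_res[0].plot(line_width=5, font_size=20)
cv2.imwrite("result.jpg", annotated)
```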

<p align="center"> <img src="assets/showcase.png" width=100%> <br> </p>

Note: For PDF content extraction, please refer to PDF-Extract-Kit and MinerU.

Note: Thanks to NielsRogge, DocLayout-YOLO now supports loading directly from ๐Ÿค—Huggingface. You can load the model as follows:

```python
from huggingface_hub import hf_hub_download
from doclayout_yolo import YOLOv10

filepath = hf_hub_download(
    repo_id="juliozhao/DocLayout-YOLO-DocStructBench",
    filename="doclayout_yolo_docstructbench_imgsz1024.pt",
)
model = YOLOv10(filepath)
```

or load it directly using from_pretrained:

```python
model = YOLOv10.from_pretrained("juliozhao/DocLayout-YOLO-DocStructBench")
```

More details can be found in this PR.

Note: Thanks to luciaganlulu, DocLayout-YOLO can perform batch inference and prediction. Instead of passing a single image to model.predict in demo.py, pass a list of image paths. In addition, because batch inference is not implemented before YOLOv11, you should manually change the batch_size here.
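
A hedged sketch of what the batch call might look like (the checkpoint path and image names are illustrative; this assumes model.predict accepts a list of paths, as described above):

```python
from doclayout_yolo import YOLOv10

model = YOLOv10("doclayout_yolo_docstructbench_imgsz1024.pt")  # path is an assumption

# Pass a list of image paths instead of a single path; results are returned in order.
image_paths = ["page_01.jpg", "page_02.jpg", "page_03.jpg"]
results = model.predict(image_paths, imgsz=1024, conf=0.2, device="cuda:0")

for path, res in zip(image_paths, results):
    print(f"{path}: {len(res.boxes)} layout elements detected")
```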

DocSynth300K Dataset

<p align="center"> <img src="assets/docsynth300k.png" width=100%> </p>

Data Download

Use the following commands to download the dataset (about 113 GB):

```python
from huggingface_hub import snapshot_download

# Download DocSynth300K
snapshot_download(repo_id="juliozhao/DocSynth300K", local_dir="./docsynth300k-hf", repo_type="dataset")

# If the download was interrupted and the files are incomplete, resume it:
snapshot_download(repo_id="juliozhao/DocSynth300K", local_dir="./docsynth300k-hf", repo_type="dataset", resume_download=True)
```

Data Formatting & Pre-training

If you want to perform DocSynth300K pretraining, use format_docsynth300k.py to convert the original .parquet format into YOLO format. The converted data will be stored at ./layout_data/docsynth300k.

```bash
python format_docsynth300k.py
```

To perform DocSynth300K pre-training, use this command. We use 8 GPUs for pretraining by default. To reach optimal performance, you can adjust hyper-parameters such as imgsz and lr according to your downstream fine-tuning data distribution or setting.

Note: Due to a memory leak in the original YOLO data-loading code, pretraining on a large-scale dataset may be interrupted unexpectedly. Use --pretrain last_checkpoint.pt --resume to resume the pretraining process, as in the sketch below.
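
For instance, resuming might look like this (the training script name and any other flags are assumptions; reuse whatever command started the original run and only append the two flags above):

```bash
# Hedged sketch: resume interrupted pretraining from the last checkpoint.
# train.py and any omitted arguments are assumptions; keep the flags from your original run.
python train.py --pretrain last_checkpoint.pt --resume
```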

Training and Evaluation on Public DLA Datasets

Data Preparation

  1. Specify the data root path

Find your ultralytics config file (for Linux users, at $HOME/.config/Ultralytics/settings.yaml) and change datasets_dir to the project root path, for example via the sketch below.
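
A small helper like the following could update the setting programmatically (the project path is an illustrative assumption; editing settings.yaml by hand works just as well):

```python
# Hedged sketch: point the Ultralytics datasets_dir at the project root.
import os
import yaml  # PyYAML

settings_path = os.path.expanduser("~/.config/Ultralytics/settings.yaml")
with open(settings_path) as f:
    settings = yaml.safe_load(f)

# Illustrative path; replace with your actual DocLayout-YOLO project root.
settings["datasets_dir"] = "/path/to/DocLayout-YOLO"

with open(settings_path, "w") as f:
    yaml.safe_dump(settings, f)
```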

  2. Download the prepared YOLO-format D4LA and DocLayNet data from below and put them under ./layout_data:
| Dataset | Download |
|---|---|
| D4LA | link |
| DocLayNet | link |

The file structure is as follows:

```
./layout_data
โ”œโ”€โ”€ D4LA
โ”‚   โ”œโ”€โ”€ images
โ”‚   โ”œโ”€โ”€ labels
โ”‚   โ”œโ”€โ”€ test.txt
โ”‚   โ””โ”€โ”€ train.txt
โ””โ”€โ”€ doclaynet
    โ”œโ”€โ”€ images
    โ”œโ”€โ”€ labels
    โ”œโ”€โ”€ val.txt
    โ””โ”€โ”€ train.txt
```

Training and Evaluation

Training is conducted on 8 GPUs with a global batch size of 64 (8 images per device). The detailed settings and checkpoints are as follows:

| Dataset | Model | DocSynth300K Pretrained? | imgsz | Learning rate | Finetune | Evaluation | AP50 | mAP | Checkpoint |
|---|---|---|---|---|---|---|---|---|---|
| D4LA | DocLayout-YOLO | โœ— | 1600 | 0.04 | command | command | 81.7 | 69.8 | checkpoint |
| D4LA | DocLayout-YOLO | โœ“ | 1600 | 0.04 | command | command | 82.4 | 70.3 | checkpoint |
| DocLayNet | DocLayout-YOLO | โœ— | 1120 | 0.02 | command | command | 93.0 | 77.7 | checkpoint |
| DocLayNet | DocLayout-YOLO | โœ“ | 1120 | 0.02 | command | command | 93.4 | 79.7 | checkpoint |

The DocSynth300K pretrained model can be downloaded from here. During evaluation, change checkpoint.pt to the path of the model to be evaluated.

Acknowledgement

The code base is built with ultralytics and YOLO-v10.

Thanks for their great work!

Citation

@misc{zhao2024doclayoutyoloenhancingdocumentlayout,
      title={DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception}, 
      author={Zhiyuan Zhao and Hengrui Kang and Bin Wang and Conghui He},
      year={2024},
      eprint={2410.12628},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.12628}, 
}

@article{wang2024mineru,
  title={MinerU: An Open-Source Solution for Precise Document Content Extraction},
  author={Wang, Bin and Xu, Chao and Zhao, Xiaomeng and Ouyang, Linke and Wu, Fan and Zhao, Zhiyuan and Xu, Rui and Liu, Kaiwen and Qu, Yuan and Shang, Fukai and others},
  journal={arXiv preprint arXiv:2409.18839},
  year={2024}
}