DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models

We present DocGenome, a structured document dataset constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community, using our custom auto-labeling pipeline DocParser. DocGenome features four characteristics:

Release

<div align=center> <img src="assets/motivation.png" height="95%"> </div>

File Structure

Please refer to Dataset_Details_README for a detailed explanation of the file structures in DocGenome.

DocGenome Benchmark Introduction

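| Datasets | # Discipline | # Category of Units | # Pages in Train-set | # Pages in Test-set | # Task | # Used Metric | Publication Period | Entity Relations |
|---|---|---|---|---|---|---|---|---|
| DocVQA | - | N/A | 11K | 1K | 1 | 2 | 1960-2000 | ❎ |
| DocLayNet | - | 11 | 80K | 8K | 1 | 1 | - | ❎ |
| DocBank | - | 13 | 0.45M | 50K | 3 | 1 | 2014-2018 | ❎ |
| PubLayNet | - | 5 | 0.34M | 12K | 1 | 1 | - | ❎ |
| VRDU | - | 10 | 7K | 3K | 3 | 1 | - | ❎ |
| DUDE | - | N/A | 20K | 6K | 3 | 3 | 1860-2022 | ❎ |
| D^4LA | - | 27 | 8K | 2K | 1 | 3 | - | ❎ |
| Fox Benchmark | - | 5 | N/A (No train-set) | 0.2K | 3 | 5 | - | ❎ |
| ArXivCap | 32 | N/A | 6.4M* | N/A | 4 | 3 | - | ❎ |
| DocGenome (ours) | 153 | 13 | 6.8M | 9K | 7 | 7 | 2007-2022 | ✅ |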

 

πŸ‘‡πŸ»DocGenome-train Download

We provide 8 subsets of DocGenome-train for downloading:

<details> <summary> Data Download</summary> </details>

Definition of relationships between component units

DocGenome defines six relation types between component units: four level relations and two cite relations, as shown in the following table:

| Name | Description | Example |
|---|---|---|
| Identical | Two blocks share the same source code. | Cross-column text; Cross-page text. |
| Title adjacent | The two titles are adjacent. | (`\section{introduction}`, `\section{method}`) |
| Subordinate | One block is a subclass of another block. | (`\section{introduction}`, paragraph within Introduction) |
| Non-title adjacent | The two text or equation blocks are adjacent. | (Paragraph 1, Paragraph 2) |
| Explicitly-referred | One block refers to another block via footnote, reference, etc. | (As shown in `\ref{Fig: 5}` ..., Figure 5) |
| Implicitly-referred | The caption block refers to the corresponding float environment. | (Table Caption 1, Table 1) |
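The six relation types above can be modelled as a small enumeration when working with the annotations in code. The following is a minimal sketch, not part of the DocGenome tooling: the names `RelationGroup`, `Relation`, and `relations_in` are hypothetical, and the split into level and cite relations is an assumption inferred from the relation names and the stated counts (four and two). How relations are actually stored in the annotation files is not covered here.

```python
from enum import Enum


class RelationGroup(Enum):
    """Hypothetical grouping; the level/cite split is inferred, not confirmed."""
    LEVEL = "level"
    CITE = "cite"


class Relation(Enum):
    """Relation names transcribed from the table; the grouping is an assumption."""
    IDENTICAL = ("Identical", RelationGroup.LEVEL)
    TITLE_ADJACENT = ("Title adjacent", RelationGroup.LEVEL)
    SUBORDINATE = ("Subordinate", RelationGroup.LEVEL)
    NON_TITLE_ADJACENT = ("Non-title adjacent", RelationGroup.LEVEL)
    EXPLICITLY_REFERRED = ("Explicitly-referred", RelationGroup.CITE)
    IMPLICITLY_REFERRED = ("Implicitly-referred", RelationGroup.CITE)

    def __init__(self, label: str, group: RelationGroup) -> None:
        self.label = label
        self.group = group


def relations_in(group: RelationGroup) -> list[str]:
    """Return the table labels of all relations in the given group."""
    return [r.label for r in Relation if r.group is group]


print(relations_in(RelationGroup.CITE))
# ['Explicitly-referred', 'Implicitly-referred']
```

Keeping the table label alongside each enum member makes it straightforward to map between names found in the data and the constants used in code.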

Attribute of component units

DocGenome has 13 attributes of component units, which can be categorized into two classes. Each attribute and its index are listed below:

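| Index | Category | Notes |
|---|---|---|
| 0 | Algorithm | |
| 1 | Caption | Titles of Images, Tables, and Algorithms |
| 2 | Equation | |
| 3 | Figure | |
| 4 | Footnote | |
| 5 | List | |
| 7 | Table | |
| 8 | Text | |
| 9 | Text-EQ | Text block with inline equations |
| 10 | Title | Section titles |
| 12 | PaperTitle | |
| 13 | Code | |
| 14 | Abstract | |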
Types of disciplines

Page distribution of DocGenome. 20% of documents are five pages or fewer, 50% are ten pages or fewer, and 80% are nineteen pages or fewer.

<details> <summary> Page Distribution</summary> <div align=center> <img src="assets/page_distribution.png" height="500"> </div> </details>

 

Distribution of secondary disciplines in our DocGenome. The count on the x-axis represents the number of documents, and documents from the same primary discipline are marked with the same color.

<details> <summary> Discipline Distribution</summary> <div align=center> <img src="assets/second_discipline.png" height="1000"> </div> </details>

 

DocParser: A Cutting-edge Auto-labeling Pipeline

<div align=center> <img src="assets/auto_label_pipeline.png" height="85%"> </div>

Visualizations

<details> <summary> Visual Example One of annotations in DocGenome</summary> <div align=center> <img src="assets/docgenome_label_examples_1.png" height="900"> </div> </details>
<details> <summary> Visual Example Two of annotations in DocGenome</summary> <div align=center> <img src="assets/docgenome_label_examples_2.png" height="900"> </div> </details>
<details> <summary> Visual examples of document-oriented tasks in DocGenome</summary> <div align=center> <img src="assets/docgenome_task_examples.png" height="980"> </div> </details>

Citation

If you find our work useful in your research, please consider citing DocGenome:

@article{xia2024docgenome,
  title={DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models},
  author={Xia, Renqiu and Mao, Song and Yan, Xiangchao and Zhou, Hongbin and Zhang, Bo and Peng, Haoyang and Pi, Jiahao and Fu, Daocheng and Wu, Wenjie and Ye, Hancheng and others},
  journal={arXiv preprint arXiv:2406.11633},
  year={2024}
}