DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models
We present DocGenome, a structured document dataset constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community, using our custom auto-labeling pipeline DocParser. DocGenome features four characteristics:
- Completeness: It is the first dataset to structure data from all modalities, including 13 layout attributes along with their LaTeX source code.
- Logicality: It provides 6 logical relationships between different entities within each scientific document.
- Diversity: It covers various document-oriented tasks, including document classification, visual grounding, document layout detection, document transformation, open-ended single-page QA, and multi-page QA.
- Correctness: It undergoes rigorous quality control checks conducted by a specialized team.
Release
- [2024/9/5] 🔥 Add the data quality rating for each structured document in DocGenome.
- [2024/8/27] Add the tutorials on how to use the DocGenome dataset.
- [2024/8/7] Add a detailed explanation of the different file structures in DocGenome; see Dataset_Details_README.
- [2024/7/23] TestSet downloads from Hugging Face are now supported. To evaluate your model on the TestSet, please refer to Evaluation.
- [2024/7/12] Dataset downloads from Hugging Face are now supported.
- [2024/6/15] 🔥 Our paper, "DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models", has been released on arXiv.
- [2024/6/6] 🔥 We have released the DocGenome benchmark, which includes the 8 subsets listed in the DocGenome-train Download section.
File Structure
Please refer to Dataset_Details_README for a detailed explanation of the different file structures in DocGenome.
DocGenome Benchmark Introduction
Datasets | # Discipline | # Category of Units | # Pages in Train-set | # Pages in Test-set | # Task | # Used Metric | Publication | Entity Relations |
---|---|---|---|---|---|---|---|---|
DocVQA | - | N/A | 11K | 1K | 1 | 2 | 1960-2000 | ❌ |
DocLayNet | - | 11 | 80K | 8K | 1 | 1 | - | ❌ |
DocBank | - | 13 | 0.45M | 50K | 3 | 1 | 2014-2018 | ❌ |
PubLayNet | - | 5 | 0.34M | 12K | 1 | 1 | - | ❌ |
VRDU | - | 10 | 7K | 3K | 3 | 1 | - | ❌ |
DUDE | - | N/A | 20K | 6K | 3 | 3 | 1860-2022 | ❌ |
D^4LA | - | 27 | 8K | 2K | 1 | 3 | - | ❌ |
Fox Benchmark | - | 5 | N/A (No train-set) | 0.2K | 3 | 5 | - | ❌ |
ArXivCap | 32 | N/A | 6.4M* | N/A | 4 | 3 | - | ❌ |
DocGenome (ours) | 153 | 13 | 6.8M | 9K | 7 | 7 | 2007-2022 | ✔️ |
👇🏻 DocGenome-train Download
We provide 8 subsets of DocGenome-train for downloading:
<details> <summary> Data Download</summary>

- docgenome-train-000.tar.gz
- docgenome-train-001.tar.gz
- docgenome-train-002.tar.gz
- docgenome-train-003.tar.gz
- docgenome-train-004.tar.gz
- docgenome-train-005.tar.gz
- docgenome-train-006.tar.gz
- docgenome-train-007.tar.gz

</details>
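Once some or all of the archives are downloaded, they can be unpacked with Python's standard `tarfile` module. The following is a minimal sketch, not part of the official tooling: the function name `extract_subsets` and the directory arguments are hypothetical, and only the archive names come from the list above.

```python
import tarfile
from pathlib import Path

# The eight archive names listed above; where they live locally is an assumption.
ARCHIVES = [f"docgenome-train-{i:03d}.tar.gz" for i in range(8)]


def extract_subsets(download_dir, out_dir):
    """Extract every listed archive found in download_dir; return the names extracted."""
    download_dir, out_dir = Path(download_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    for name in ARCHIVES:
        archive = download_dir / name
        if not archive.is_file():
            continue  # skip subsets that have not been downloaded yet
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(out_dir)
        extracted.append(name)
    return extracted
```

For example, `extract_subsets("downloads", "DocGenome")` would unpack whichever of the eight subsets are present in a local `downloads` folder. Only extract archives obtained from the official download links, since `extractall` trusts the paths stored in the archive.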
Definition of relationships between component units
DocGenome contains 6 relation types in total: 4 level relation types and 2 cite relation types, as shown in the following table:
Name | Description | Example |
---|---|---|
Identical | Two blocks share the same source code. | Cross-column text; Cross-page text. |
Title adjacent | The two titles are adjacent. | (\section{introduction}, \section{method}) |
Subordinate | One block is a subclass of another block. | (\section{introduction}, paragraph within Introduction) |
Non-title adjacent | The two text or equation blocks are adjacent. | (Paragraph 1, Paragraph 2) |
Explicitly-referred | One block refers to another block via footnote, reference, etc. | (As shown in \ref{Fig: 5} ..., Figure 5) |
Implicitly-referred | The caption block refers to the corresponding float environment. | (Table Caption 1, Table 1) |
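The six relations in the table can be grouped in code as follows. This is an illustrative sketch only: the string keys, the `RelationGroup` enum, and the helper are hypothetical names, not the encoding used in the dataset files, and the assignment of the first four rows to the level group and the two "referred" rows to the cite group is inferred from the table.

```python
from collections import Counter
from enum import Enum


class RelationGroup(Enum):
    LEVEL = "level"
    CITE = "cite"


# Relation names follow the table above; keys and grouping are illustrative,
# not the dataset's on-disk representation.
RELATION_GROUPS = {
    "identical": RelationGroup.LEVEL,
    "title_adjacent": RelationGroup.LEVEL,
    "subordinate": RelationGroup.LEVEL,
    "non_title_adjacent": RelationGroup.LEVEL,
    "explicitly_referred": RelationGroup.CITE,
    "implicitly_referred": RelationGroup.CITE,
}


def group_counts():
    """Count how many relation types fall into each group."""
    return Counter(RELATION_GROUPS.values())
```

Calling `group_counts()` gives four level relations and two cite relations, matching the description above.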
Attribute of component units
DocGenome has 13 attributes of component units, which can be categorized into two classes:
- 1) Fixed-form units, including Text, Title, Abstract, etc., which are characterized by sequential reading and hierarchical relationships readily discernible from the list obtained in Stage-two of the designed DocParser.
- 2) Floating-form units, including Table, Figure, etc., which establish directional references to fixed-form units through commands like \ref and \label.
Index | Category | Notes |
---|---|---|
0 | Algorithm | |
1 | Caption | Titles of Images, Tables, and Algorithms |
2 | Equation | |
3 | Figure | |
4 | Footnote | |
5 | List | |
7 | Table | |
8 | Text | |
9 | Text-EQ | Text block with inline equations |
10 | Title | Section titles |
12 | PaperTitle | |
13 | Code | |
14 | Abstract | |
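When working with annotations, it can help to have the index table above available as a lookup. The sketch below simply transcribes the table; the names `CATEGORY_BY_INDEX` and `category_name` are hypothetical, and how the index is stored in the annotation files is not assumed here.

```python
# Index-to-category mapping copied from the table above.
# Indices 6 and 11 do not appear in the table, so they are left out here.
CATEGORY_BY_INDEX = {
    0: "Algorithm",
    1: "Caption",
    2: "Equation",
    3: "Figure",
    4: "Footnote",
    5: "List",
    7: "Table",
    8: "Text",
    9: "Text-EQ",
    10: "Title",
    12: "PaperTitle",
    13: "Code",
    14: "Abstract",
}


def category_name(index):
    """Return the category for an index, or None if the table has no such index."""
    return CATEGORY_BY_INDEX.get(index)


print(category_name(9))  # Text-EQ
```

Returning `None` for missing indices keeps the gaps in the table explicit instead of guessing what they stand for.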
Types of disciplines
Page distribution of DocGenome. 20% of documents are five pages or fewer, 50% are ten pages or fewer, and 80% are nineteen pages or fewer.
<details> <summary> Page Distribution</summary> <div align=center> <img src="assets/page_distribution.png" height="500"> </div> </details>
Distribution of secondary disciplines in our DocGenome. The count on the x-axis represents the number of documents, and documents from the same primary discipline are marked with the same color.
<details> <summary> Discipline Distribution</summary> <div align=center> <img src="assets/second_discipline.png" height="1000"> </div> </details>
DocParser: A Cutting-edge Auto-labeling Pipeline
<div align=center> <img src="assets/auto_label_pipeline.png" height="85%"> </div>
Visualizations
<details> <summary> Visual Example One of annotations in DocGenome</summary> <div align=center> <img src="assets/docgenome_label_examples_1.png" height="900"> </div> </details>
<details> <summary> Visual Example Two of annotations in DocGenome</summary> <div align=center> <img src="assets/docgenome_label_examples_2.png" height="900"> </div> </details>
<details> <summary> Visual examples of document-oriented tasks in DocGenome</summary> <div align=center> <img src="assets/docgenome_task_examples.png" height="980"> </div> </details>
Citation
If you find our work useful in your research, please consider citing DocGenome:
@article{xia2024docgenome,
title={DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models},
author={Xia, Renqiu and Mao, Song and Yan, Xiangchao and Zhou, Hongbin and Zhang, Bo and Peng, Haoyang and Pi, Jiahao and Fu, Daocheng and Wu, Wenjie and Ye, Hancheng and others},
journal={arXiv preprint arXiv:2406.11633},
year={2024}
}