

ICDAR 2021 Competition on Scientific Literature Parsing

Scientific literature contains important information about cutting-edge innovations in diverse domains. Advances in natural language processing have been driving rapid progress in automated information extraction from scientific literature. However, scientific literature is mostly published in unstructured PDF format. While PDF is great for preserving the basic elements (characters, lines, shapes, images, etc.) on a canvas so that humans can read them on different operating systems or devices, it is not a format that machines can understand.

A critical challenge for automated information extraction from scientific literature is that documents often contain content that is not in natural language, such as figures and tables. Nevertheless, such content usually illustrates key results, messages, or summaries of the research. To obtain a comprehensive understanding of scientific literature, an automated system must be able to recognize the layout of the documents and parse the non-natural-language content into a machine-readable format. This competition aims to drive advances in the following two problems:

Task A: Document layout recognition

Task B: Table recognition

Terms and conditions