GROBID-Dictionaries
Purpose
GROBID-Dictionaries is a GROBID module, implemented as a Java machine learning library, for structuring digitised lexical resources and entry-based documents with encyclopedic or bibliographic content. It allows the parsing, extraction and structuring of text information in such resources.
Approach
GROBID-Dictionaries is based on cascading models. The diagram below presents the architecture that enables the processing of text information and its transfer through the models.
Dictionary Segmentation This is the first model; its goal is to segment each dictionary page into 3 main blocks: Headnote, Body and Footnote. A further block, "dictScrap", can be generated for text information that does not belong to the principal blocks.
Dictionary Body Segmentation The second model gets the Body, recognised by the first model, and processes it to recognise the boundaries of each lexical entry.
Lexical Entry The third model parses each lexical entry, recognised by the second model, to segment it into 4 main blocks: Form, Etymology, Senses, Related Entries. A "dictScrap" block is there as well for unrecognised information.
The rest of the models The same logic applies, respectively, to the blocks recognised in a lexical entry: each of them is processed by a dedicated model.
N.B: The current architecture could change at any milestone of the project, as soon as new ideas or technical constraints emerge.
Input/Output
GROBID-Dictionaries takes as input a file in PDF or ALTO format. Each model of the aforementioned components generates a TEI P5-encoded hierarchy of the different text structures recognised at that specific cascading level. The final serialised output is in line with the new version of LMF (Romary et al. 2019) and the TEI Lex-0 initiative (Romary and Tasovac 2018).
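As a rough illustration of such a hierarchy, the hand-written fragment below sketches a single lexical entry segmented into the four main blocks plus a "dictScrap" block, using standard TEI dictionary elements (entry, form, etym, sense, re, dictScrap). It is not actual tool output: the exact element inventory, attributes and nesting depend on the trained models and the TEI Lex-0 guidelines, so refer to the documentation for real examples.

```xml
<!-- Hand-written sketch, not actual GROBID-Dictionaries output -->
<entry>
  <form type="lemma">
    <orth>dictionary</orth>
  </form>
  <etym>from Medieval Latin dictionarium</etym>
  <sense n="1">a reference work listing the words of a language with their meanings</sense>
  <re>
    <form><orth>dictionary-maker</orth></form>
  </re>
  <dictScrap>leftover text that the models could not classify</dictScrap>
</entry>
```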
Demo
The most recent version of the system is available online. The models of this version are trained on samples from 5 different dictionaries, which you can download and parse with GROBID-Dictionaries. This video illustrates a use case of the different models of the system.
Docker Use
To skip installing the tool from source, follow the Docker manual to use its latest image.
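As a minimal sketch of that workflow (not a substitute for the manual), the commands below assume the image is published on Docker Hub as medkhem/grobid-dictionaries and that the web service listens on port 8080 inside the container; both details are assumptions, so check the Docker manual for the exact image name, tag, port and any volumes to mount.

```bash
# Assumed image name -- verify against the Docker manual
docker pull medkhem/grobid-dictionaries

# Start the container and expose the web service on http://localhost:8080
# (the internal port is also an assumption taken from typical setups)
docker run --rm -p 8080:8080 medkhem/grobid-dictionaries
```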
To Cite
Mohamed Khemakhem, Luca Foppiano, Laurent Romary. Automatic Extraction of TEI Structures in Digitized Lexical Resources using Conditional Random Fields. electronic lexicography, eLex 2017, Sep 2017, Leiden, Netherlands. hal-01508868v2
Mohamed Khemakhem, Axel Herold, Laurent Romary. Enhancing Usability for Automatically Structuring Digitised Dictionaries. GLOBALEX workshop at LREC 2018, May 2018, Miyazaki, Japan. hal-01708137v2
More Reading
Romary, Laurent et al. (2019). “LMF Reloaded”. In: AsiaLex 2019: Past, Present and Future. Istanbul, Turkey.
Romary, Laurent and Toma Tasovac (2018). “TEI Lex-0: A Target Format for TEI-Encoded Dictionaries and Lexical Resources”. In: TEI Conference and Members’ Meeting. Tokyo, Japan.
Hervé Bohbot, Francesca Frontini, Giancarlo Luxardo, Mohamed Khemakhem, Laurent Romary. Presenting the Nénufar Project: a Diachronic Digital Edition of the Petit Larousse Illustré. GLOBALEX 2018 - Globalex workshop at LREC2018, May 2018, Miyazaki, Japan. hal-01728328
Mohamed Khemakhem, Carmen Brando, Laurent Romary, Frédérique Mélanie-Becquet, Jean-Luc Pinol. Fueling Time Machine: Information Extraction from Retro-Digitised Address Directories. JADH2018 "Leveraging Open Data", Sep 2018, Tokyo, Japan. hal-01814189
Mohamed Khemakhem, Laurent Romary, Simon Gabay, Hervé Bohbot, Francesca Frontini, et al. Automatically Encoding Encyclopedic-like Resources in TEI. The annual TEI Conference and Members Meeting, Sep 2018, Tokyo, Japan. hal-01819505
David Lindemann, Mohamed Khemakhem, Laurent Romary. Retro-digitizing and Automatically Structuring a Large Bibliography Collection. European Association for Digital Humanities (EADH) Conference, Dec 2018, Galway, Ireland. hal-01941534
Documentation
For more advanced usage and for development, the documentation of the tool is detailed here.
Contact
Mohamed Khemakhem (mohamed.khemakhem@inria.fr), Laurent Romary (laurent.romary@inria.fr)