Project Status: Inactive – The project has reached a stable, usable state but is no longer being actively developed; support/maintenance will be provided as time allows.

Nederlab Pipeline

Introduction

This repository contains the NLP pipeline for the linguistic enrichment of historical Dutch, developed within the scope of the Nederlab project. It covers only the pipeline logic, which is powered by Nextflow, and not the individual components. It depends on the following tools:

Format

All tools in this pipeline take and produce documents in the FoLiA XML format (version 2). Provenance information of all the tools is recorded in the documents themselves. Please take note of the FoLiA Guidelines if you work with this pipeline or any documents produced by it.
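As a rough illustration of where that provenance lives, the sketch below lists the processor names found in a FoLiA v2 document. The sample document is hypothetical and heavily trimmed (it is not a valid FoLiA file), and the processor names and versions are placeholders, not the actual tools of this pipeline. The text-based matching is an assumption made for brevity; an XML-aware tool or a FoLiA library is more robust.

```shell
#!/bin/sh
# Hypothetical, heavily trimmed FoLiA v2 document used only for illustration.
# The processor names and versions are placeholders.
SAMPLE="$(mktemp)"
cat > "$SAMPLE" <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<FoLiA xmlns="http://ilk.uvt.nl/folia" version="2.0.0" xml:id="example">
  <metadata>
    <provenance>
      <processor xml:id="proc.1" name="example-converter" version="0.0.1"/>
      <processor xml:id="proc.2" name="example-tagger" version="0.0.2"/>
    </provenance>
  </metadata>
  <text xml:id="example.text"/>
</FoLiA>
EOF

# Crude listing: pick out every <processor ...> tag and keep its name attribute.
PROCESSORS="$(grep -o '<processor [^>]*>' "$SAMPLE" | sed 's/.*name="\([^"]*\)".*/\1/')"
echo "$PROCESSORS"

rm -f "$SAMPLE"
```

For compressed documents (folia.xml.gz), the same filter can be fed from a decompressing command instead of reading the file directly.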

The following linguistic enrichments can be performed. Note that different FoLiA (tag)sets can be produced, even at the same time, depending on which methodology was chosen and which time period the document covers:

In addition to the linguistic annotations, the tei2folia converter produces a wide variety of structural and markup annotations, as its objective is to retain all information from the original TEI source.

Changes from older versions

As there are documents produced with previous versions of this pipeline, it is important to be aware of the biggest changes:

This pipeline used to be part of PICCL, but was split off for maintainability and clarity.

Installation

The pipeline and all components on which it depends are shipped as part of LaMachine, which comes in various flavours (Virtual Machine, Docker container, local installation, etc.).

Usage

Inside LaMachine, you can invoke the workflow as follows:

$ nederlab.nf

or:

$ nextflow run $(which nederlab.nf)

For instructions, run nederlab.nf --help.

You can also let Nextflow manage Docker and LaMachine for you, but that is not covered here.

Fix and split pipeline

There was a problem with the DBNL collection as delivered in 2019 (described in internal issue TT-709). It was also decided that it was better to split the independent titles after all. A Nextflow script has been written to handle both.

Put the collection you want to process in some input directory, create an output directory, and run something like:

$ dbnl_fix_and_split.nf --inputdir input/ --outputdir output/ --datadir /path/to/nederlab-linguistic-enrichment

The data directory should point to where you checked out the nederlab-linguistic-enrichment repository (a private repository by INT).

Note: pass --extension folia.xml.gz if the input files are compressed. The script will compress all output files by default too.
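The steps above can be sketched as a small wrapper script. It uses only the options mentioned in this section; the directory names, the environment variables and the data directory path are placeholders, and the check for compressed input is an assumption about how you might decide when to pass --extension. The command is only executed when dbnl_fix_and_split.nf is found on the PATH; otherwise the script just prints what it would run.

```shell
#!/bin/sh
# Placeholder locations; override them through the environment if needed.
INPUTDIR="${INPUTDIR:-input}"
OUTPUTDIR="${OUTPUTDIR:-output}"
DATADIR="${DATADIR:-/path/to/nederlab-linguistic-enrichment}"

# Make sure both directories exist before the workflow starts.
mkdir -p "$INPUTDIR" "$OUTPUTDIR"

# Collect the arguments as positional parameters so paths with spaces survive.
set -- --inputdir "$INPUTDIR/" --outputdir "$OUTPUTDIR/" --datadir "$DATADIR"

# Add --extension only when compressed input files are actually present.
if ls "$INPUTDIR"/*.folia.xml.gz >/dev/null 2>&1; then
    set -- "$@" --extension folia.xml.gz
fi

CMDLINE="dbnl_fix_and_split.nf $*"
echo "$CMDLINE"

# Run the workflow only when the script is available.
if command -v dbnl_fix_and_split.nf >/dev/null 2>&1; then
    dbnl_fix_and_split.nf "$@"
fi
```

Passing the arguments through "$@" rather than a single string avoids word-splitting problems when a directory name contains spaces.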

Resources

Resources for Erik Tjong Kim Sang's modernisation method are included in this repository:

Not included is the INT Historical Lexicon, as it is copyrighted material.