Data Centric Domain Adaptation for Historical Text with OCR Errors

This repository contains the code and datasets used in our paper "Data Centric Domain Adaptation for Historical Text with OCR Errors" by Luisa März, Stefan Schweter, Nina Poerner, Benjamin Roth and Hinrich Schütze. The publicly accessible preprint can be found here.

Changelog

Datasets

The data used for our experiments can be found in the data folder of this repository.

Stats

The following table shows the number of sentences in each split of the corpus, per language:

Language | Training Sentences | Development Sentences | Test Sentences
French   | 7,936              | 992                   | 992
Dutch    | 5,777              | 722                   | 723

These stats can be calculated with the flair_stats.py script using Flair (commit: 7578403).
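For readers without Flair installed, the sentence counts can be approximated with plain Python. The following is a minimal sketch, not the flair_stats.py script itself: it assumes the data files use a CoNLL-style column format in which blank lines separate sentences, and the names `count_sentences` and `split_shares` are hypothetical helpers introduced here for illustration.

```python
from typing import Iterable, Tuple


def count_sentences(lines: Iterable[str]) -> int:
    """Count sentences, assuming blank lines separate them (CoNLL-style)."""
    count = 0
    in_sentence = False
    for line in lines:
        if not line.strip():
            # A blank line closes the current sentence, if any.
            in_sentence = False
        elif not in_sentence:
            # First token line after a boundary starts a new sentence.
            count += 1
            in_sentence = True
    return count


def split_shares(train: int, dev: int, test: int) -> Tuple[float, ...]:
    """Return each split's share of the whole corpus, in percent."""
    total = train + dev + test
    return tuple(round(100 * n / total, 1) for n in (train, dev, test))


# Tiny made-up sample: two sentences separated by blank lines.
sample = ["Paris B-LOC", "est O", "", "", "Amsterdam B-LOC", "is O", ""]
print(count_sentences(sample))

# Split shares computed from the sentence counts in the table above.
print(split_shares(7936, 992, 992))  # French
print(split_shares(5777, 722, 723))  # Dutch
```

To apply this to a real file, pass an open file handle to `count_sentences`, since iterating over a file yields its lines. Both languages come out at roughly an 80/10/10 split between training, development and test data.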

Code

Code for training our models will be released in the near future.

Usage in Flair

The latest Flair master branch adds native support for our released datasets. They can be loaded with the following lines of code:

from flair.datasets import NER_ICDAR_EUROPEANA

french_corpus = NER_ICDAR_EUROPEANA(language="fr")
dutch_corpus  = NER_ICDAR_EUROPEANA(language="nl")

License

We release the data under the CC0 1.0 Universal (CC0 1.0) license, the same license used for the Europeana NER Corpora.

Citation

You can use the following BibTeX entry for citing our paper/data:

@InProceedings{10.1007/978-3-030-86331-9_48,
    author="M{\"a}rz, Luisa
    and Schweter, Stefan
    and Poerner, Nina
    and Roth, Benjamin
    and Sch{\"u}tze, Hinrich",
    editor="Llad{\'o}s, Josep
    and Lopresti, Daniel
    and Uchida, Seiichi",
    title="Data Centric Domain Adaptation for Historical Text with OCR Errors",
    booktitle="Document Analysis and Recognition -- ICDAR 2021",
    year="2021",
    publisher="Springer International Publishing",
    address="Cham",
    pages="748--761",
    abstract="We propose new methods for in-domain and cross-domain Named Entity Recognition (NER) on historical data for Dutch and French. For the cross-domain case, we address domain shift by integrating unsupervised in-domain data via contextualized string embeddings; and OCR errors by injecting synthetic OCR errors into the source domain and address data centric domain adaptation. We propose a general approach to imitate OCR errors in arbitrary input data. Our cross-domain as well as our in-domain results outperform several strong baselines and establish state-of-the-art results. We publish preprocessed versions of the French and Dutch Europeana NER corpora.",
    isbn="978-3-030-86331-9"
}