
Language Model for Historic Dutch

In this repository we open source a language model for Historic Dutch, trained on the Delpher Corpus, which includes digitized texts from Dutch newspapers ranging from 1618 to 1879.

Changelog

Model Zoo

The following models for Historic Dutch are available on the Hugging Face Model Hub:

| Model identifier | Model Hub link |
|------------------|----------------|
| dbmdz/bert-base-historic-dutch-cased | here |

Stats

The download URLs for all archives can be found here.

We then used the awesome alto-tools by @cneud (this repository) to extract plain text. The following table shows the extracted plain text size per year range:

| Period | Extracted plain text size |
|-----------|-------|
| 1618-1699 | 170MB |
| 1700-1709 | 103MB |
| 1710-1719 | 65MB |
| 1720-1729 | 137MB |
| 1730-1739 | 144MB |
| 1740-1749 | 188MB |
| 1750-1759 | 171MB |
| 1760-1769 | 235MB |
| 1770-1779 | 271MB |
| 1780-1789 | 414MB |
| 1790-1799 | 614MB |
| 1800-1809 | 734MB |
| 1810-1819 | 807MB |
| 1820-1829 | 987MB |
| 1830-1839 | 1.7GB |
| 1840-1849 | 2.2GB |
| 1850-1854 | 1.3GB |
| 1855-1859 | 1.7GB |
| 1860-1864 | 2.0GB |
| 1865-1869 | 2.3GB |
| 1870-1874 | 1.9GB |
| 1875-1876 | 867MB |
| 1877-1879 | 1.9GB |
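The extraction step relies on alto-tools; as an illustration of what it boils down to, here is a minimal, hypothetical Python sketch that collects the `CONTENT` attributes of ALTO `<String>` elements, line by line. It is not the alto-tools implementation, just a sketch of the idea:

```python
import xml.etree.ElementTree as ET

def alto_to_text(xml_string):
    """Extract plain text from an ALTO XML document: gather the CONTENT
    attribute of every <String> element, grouped per <TextLine>.
    Namespace prefixes are ignored so any ALTO schema version works."""
    root = ET.fromstring(xml_string)
    lines = []
    for elem in root.iter():
        if elem.tag.split('}')[-1] == 'TextLine':
            words = [s.attrib['CONTENT']
                     for s in elem.iter()
                     if s.tag.split('}')[-1] == 'String' and 'CONTENT' in s.attrib]
            if words:
                lines.append(' '.join(words))
    return '\n'.join(lines)

# Tiny hand-made ALTO snippet for demonstration.
sample = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Oprechte"/><SP/><String CONTENT="Haerlemsche"/>
    </TextLine>
    <TextLine><String CONTENT="Courant"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""
print(alto_to_text(sample))  # -> Oprechte Haerlemsche\nCourant
```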

The total training corpus consists of 427,181,269 sentences and 3,509,581,683 tokens (counted via wc), resulting in a total corpus size of 21GB.
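The sentence and token counts above were obtained via wc, i.e. one sentence per line and whitespace-separated tokens; the equivalent counting logic, shown on a toy example:

```python
def corpus_stats(text):
    """Count sentences and tokens the way `wc -l` / `wc -w` do:
    sentences are lines, tokens are whitespace-separated strings."""
    lines = text.splitlines()
    return len(lines), sum(len(line.split()) for line in lines)

# Toy two-sentence corpus (the real stats were computed over the 21GB dump).
toy = "Eerste zin .\nTweede zin hier .\n"
print(corpus_stats(toy))  # -> (2, 7)
```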

The following figure shows the distribution of the number of characters per year:

Delpher Corpus Stats

Language Model Pretraining

We use the official BERT implementation with the following command to train the model:

```bash
python3 run_pretraining.py --input_file gs://delpher-bert/tfrecords/*.tfrecord \
  --output_dir gs://delpher-bert/bert-base-historic-dutch-cased \
  --bert_config_file ./config.json \
  --max_seq_length=512 \
  --max_predictions_per_seq=75 \
  --do_train=True \
  --train_batch_size=128 \
  --num_train_steps=3000000 \
  --learning_rate=1e-4 \
  --save_checkpoints_steps=100000 \
  --keep_checkpoint_max=20 \
  --use_tpu=True \
  --tpu_name=electra-2 \
  --num_tpu_cores=32
```
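As a back-of-the-envelope check, the flags above determine how many token positions the model processes over the full run; this is an upper bound on actual tokens seen, since padded positions are counted too:

```python
# Derived from the pretraining flags above.
num_train_steps = 3_000_000
train_batch_size = 128
max_seq_length = 512

token_positions = num_train_steps * train_batch_size * max_seq_length
print(f"{token_positions:,}")  # -> 196,608,000,000 (~197B token positions)
```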

We train the model for 3M steps using a total batch size of 128 on a v3-32 TPU. The pretraining loss curve can be seen in the next figure:

Delpher Pretraining Loss Curve

Evaluation

We evaluate our model on the preprocessed Europeana NER dataset for Dutch, which was presented in the paper "Data Centric Domain Adaptation for Historical Text with OCR Errors".

The data is available in their repository. We perform a hyper-parameter search and report the F1-score averaged over 5 runs with different seeds. We also include hmBERT as a baseline model.
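The reported numbers are means over five differently seeded runs; a minimal sketch of that aggregation, with hypothetical per-seed scores (the actual per-run values are not published):

```python
from statistics import mean, stdev

# Hypothetical per-seed F1 scores, for illustration only.
seed_scores = [89.1, 89.9, 89.6, 90.0, 89.3]

print(f"F1 over {len(seed_scores)} seeds: "
      f"{mean(seed_scores):.2f} +/- {stdev(seed_scores):.2f}")
```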

Results:

| Model | F1-Score (Dev / Test) |
|-------|-----------------------|
| hmBERT | (82.73) / 81.34 |
| März et al. (2021) | - / 84.2 |
| Ours | (89.73) / 87.45 |

License

All models are licensed under MIT.

Acknowledgments

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC) program, previously known as TensorFlow Research Cloud (TFRC). Many thanks for providing access to the TRC ❤️

We thank Clemens Neudecker for maintaining the amazing ALTO tools that were used for parsing the Delpher Corpus XML files.

Thanks to the generous support from the Hugging Face team, it is possible to download both cased and uncased models from their S3 storage 🤗