🤗 + ⚛️ Fine-tuned Transformers compatible BERT models for Sequence Tagging

This repository contains fine-tuned Transformers compatible BERT models for sequence tagging tasks like NER or PoS tagging.
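
The models are meant to be drop-in compatible with the Transformers library. As a quick usage sketch (the model identifier below is an assumption about how such a fine-tuned model might be published on the model hub; replace it with the model you actually want to use):

```python
from transformers import pipeline

# Example model identifier (assumption): an English CoNLL-2003 NER model
# published on the Hugging Face model hub. Replace it with the fine-tuned
# model you actually want to use.
ner = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
)

print(ner("George Washington went to Washington."))
```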

We use the (coming soon) fine-tuning NER example from the awesome Transformers repository.

Changelog

CoNLL datasets

For NER tasks, the CoNLL 2002 and 2003 datasets are used. The following languages are covered: German, English, Dutch and Spanish.

These models are trained with a max. sequence length of 128.

❔ But what happens if a sentence in the dataset is longer than 128 subtokens? We simply split longer sentences into smaller ones. We do not ✂️ sentences, which would be bad for evaluation!
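
A minimal sketch of the splitting idea (an illustration only, not the exact logic used for these models; it assumes a plain BERT tokenizer and ignores the two special tokens for simplicity):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
MAX_SUBTOKENS = 128  # max. sequence length used for training

def split_long_sentence(words, max_subtokens=MAX_SUBTOKENS):
    """Split a sentence (list of words) into chunks that stay below the
    subtoken limit, instead of truncating it."""
    chunks, current, current_len = [], [], 0
    for word in words:
        n_subtokens = len(tokenizer.tokenize(word))
        if current and current_len + n_subtokens > max_subtokens:
            chunks.append(current)
            current, current_len = [], 0
        current.append(word)
        current_len += n_subtokens
    if current:
        chunks.append(current)
    return chunks
```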

❔ What is the current state-of-the-art for fine-tuned models? Well, we mainly compare our models to the following two papers: Pires, Schlinger and Garrette (2019) and Wu and Dredze (2019).

❔ Which hyperparameters do we use? A max. sequence length of 128 and a batch size of 32, the same parameters as used by Pires, Schlinger and Garrette (2019).
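
For illustration, these settings expressed with the current Trainer API (the experiments themselves were run with the Transformers run_ner.py example script, so treat this only as a sketch of the hyperparameters, not the exact training code):

```python
from transformers import AutoTokenizer, TrainingArguments

# Sketch of the hyperparameters described above; the experiments were run
# with the run_ner.py example script, not this code.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

encoded = tokenizer(
    ["George Washington went to Washington ."],
    truncation=True,
    max_length=128,  # max. sequence length of 128
)

training_args = TrainingArguments(
    output_dir="conll-ner",
    per_device_train_batch_size=32,  # batch size of 32
    seed=1,                          # each of the 5 runs uses a different seed
)
```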

Results

We report the F1-score averaged over 5 different runs, using a different seed for each run. The official CoNLL evaluation script from here is used.
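
The averaging itself is straightforward; as a small worked example, the BERT base, cased (Dev) scores from the English table below average as follows:

```python
# F1-scores of the 5 runs (BERT base, cased, English Dev set; see table below)
run_scores = [95.13, 95.29, 95.07, 95.12, 95.53]

avg_f1 = round(sum(run_scores) / len(run_scores), 2)
print(avg_f1)  # 95.23
```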

English

| Model                    | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg.  |
| ------------------------ | ----- | ----- | ----- | ----- | ----- | ----- |
| BERT base, cased (Dev)   | 95.13 | 95.29 | 95.07 | 95.12 | 95.53 | 95.23 |
| BERT base, cased (Test)  | 90.89 | 90.76 | 90.82 | 91.09 | 91.60 | 91.03 |
| BERT large, cased (Dev)  | 95.69 | 95.47 | 95.77 | 95.86 | 95.91 | 95.74 |
| BERT large, cased (Test) | 91.73 | 91.17 | 91.77 | 91.22 | 91.46 | 91.47 |

Pires, Schlinger and Garrette (2019) report an F1-score of 91.07% using the EN-BERT model.

German

Notice: For this experiment we use the original CoNLL-2003 data. The dataset also includes a 2006 update that fixes various MISC annotations. However, it turns out that most papers do not use the updated 2006 version of the dataset. So in this experiment we use the "old" dataset in order to allow a better comparison with other papers. See here for a more detailed discussion of this dataset.

We use the following BERT models:

| Model                                 | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg.  |
| ------------------------------------- | ----- | ----- | ----- | ----- | ----- | ----- |
| mBERT base, cased (Dev)               | 86.04 | 85.64 | 86.04 | 85.10 | 83.16 | 85.20 |
| mBERT base, cased (Test)              | 83.02 | 82.80 | 82.56 | 82.21 | 82.14 | 82.55 |
| German BERT base, cased (Dev)         | 87.33 | 86.34 | 87.05 | 86.52 | 86.80 | 86.81 |
| German BERT base, cased (Test)        | 83.84 | 83.98 | 83.62 | 83.70 | 83.37 | 83.70 |
| Own German BERT base, cased (Dev)     | 86.66 | 86.75 | 87.06 | 86.61 | 87.22 | 86.86 |
| Own German BERT base, cased (Test)    | 84.32 | 84.47 | 84.76 | 84.38 | 84.68 | 84.52 |
| German DistilBERT (v0), cased (Dev)   | 86.58 | 86.43 | 86.19 | 86.60 | 86.44 | 86.45 |
| German DistilBERT (v0), cased (Test)  | 83.33 | 83.11 | 83.30 | 83.55 | 84.18 | 83.49 |

Pires, Schlinger and Garrette (2019) report an F1-score of 82.00% for their fine-tuned multilingual BERT model, whereas Wu and Dredze (2019) report an F1-score of 82.82%.

For German DistilBERT, a learning rate of 6e-5 was used.

Dutch

| Model                    | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg.  |
| ------------------------ | ----- | ----- | ----- | ----- | ----- | ----- |
| mBERT base, cased (Dev)  | 90.73 | 91.11 | 90.82 | 90.94 | 91.05 | 90.93 |
| mBERT base, cased (Test) | 90.94 | 90.29 | 90.10 | 90.32 | 90.27 | 90.38 |

Pires, Schlinger and Garrette (2019) report an F1-score of 89.86% for their fine-tuned multilingual BERT model, whereas Wu and Dredze (2019) report an F1-score of 90.94%.

Spanish

| Model                    | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg.   |
| ------------------------ | ----- | ----- | ----- | ----- | ----- | ------ |
| mBERT base, cased (Dev)  | 86.50 | 86.25 | 86.44 | 86.99 | 86.70 | 86.576 |
| mBERT base, cased (Test) | 87.80 | 87.93 | 87.92 | 87.00 | 87.53 | 87.636 |

Pires, Schlinger and Garrette (2019) report an F1-score of 87.18% for their fine-tuned multilingual BERT model, whereas Wu and Dredze (2019) report an F1-score of 87.38%.

GermEval 2014

We train the following models for the GermEval 2014 shared task: mBERT, German BERT, our own German BERT model and German DistilBERT.

Notice: the original dataset includes some strange control character sequences (all labeled with the "O" tag). We remove them in a pre-processing step (for the train/dev/test sets).
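
A minimal sketch of such a cleaning step, assuming the usual CoNLL-style layout (one token and its tag per line, blank lines between sentences); the actual pre-processing for these models may differ in detail:

```python
import unicodedata

def strip_control_tokens(input_path, output_path):
    """Remove lines whose token consists only of control characters
    (these are all labeled with the "O" tag in the original data)."""
    with open(input_path, encoding="utf-8") as f_in, \
         open(output_path, "w", encoding="utf-8") as f_out:
        for line in f_in:
            token = line.split()[0] if line.strip() else ""
            if token and all(unicodedata.category(ch) == "Cc" for ch in token):
                continue  # drop tokens made up of control characters
            f_out.write(line)
```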

Results

We report the F1-score averaged over 5 different runs (with different seeds). The official evaluation script is used for evaluation. We report the Strict, Combined Evaluation (official) metric here.

| Model                                 | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg.  |
| ------------------------------------- | ----- | ----- | ----- | ----- | ----- | ----- |
| mBERT base, cased (Dev)               | 86.97 | 87.04 | 86.66 | 87.11 | 86.53 | 86.86 |
| mBERT base, cased (Test)              | 85.90 | 86.37 | 86.47 | 86.56 | 86.00 | 86.26 |
| German BERT base, cased (Dev)         | 87.36 | 87.03 | 87.55 | 87.53 | 87.23 | 87.34 |
| German BERT base, cased (Test)        | 86.35 | 86.93 | 86.71 | 86.85 | 86.23 | 86.61 |
| Own German BERT base, cased (Dev)     | 87.74 | 87.70 | 87.77 | 87.96 | 88.52 | 87.94 |
| Own German BERT base, cased (Test)    | 86.96 | 86.85 | 87.01 | 86.89 | 86.73 | 86.89 |
| German DistilBERT (v0), cased (Dev)   | 86.15 | 86.41 | 86.00 | 86.30 | 85.99 | 86.17 |
| German DistilBERT (v0), cased (Test)  | 85.25 | 85.34 | 85.27 | 85.21 | 85.08 | 85.23 |

For German DistilBERT, a learning rate of 6e-5 was used.

Universal Dependencies

German HDT

We train three models (mBERT, German BERT and our own German BERT model) on the recently released German HDT Universal Dependencies corpus. It contains over 200K annotated sentences, making it one of the largest UD corpora. We use the latest data from the dev branch of the German HDT repository.
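
The UD data comes in CoNLL-U format, while the NER fine-tuning example expects a simple token-and-tag-per-line layout. A hedged sketch of the conversion (column positions follow the CoNLL-U specification: FORM is the second column, UPOS the fourth; the exact pre-processing used here may differ):

```python
def conllu_to_token_tag(conllu_path, output_path):
    """Convert a CoNLL-U file to a simple 'token tag' format
    (one pair per line, blank line between sentences)."""
    with open(conllu_path, encoding="utf-8") as f_in, \
         open(output_path, "w", encoding="utf-8") as f_out:
        for line in f_in:
            line = line.rstrip("\n")
            if line.startswith("#"):
                continue  # skip sentence-level comments
            if not line:
                f_out.write("\n")  # sentence boundary
                continue
            cols = line.split("\t")
            token_id, form, upos = cols[0], cols[1], cols[3]
            if "-" in token_id or "." in token_id:
                continue  # skip multiword token ranges and empty nodes
            f_out.write(f"{form} {upos}\n")
```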

Results

We report accuracy averaged over 5 different runs (with different seeds).

| Model                              | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg.  |
| ---------------------------------- | ----- | ----- | ----- | ----- | ----- | ----- |
| mBERT base, cased (Dev)            | 98.36 | 98.35 | 98.36 | 98.36 | 98.36 | 98.35 |
| mBERT base, cased (Test)           | 98.58 | 98.57 | 98.58 | 98.58 | 98.58 | 98.58 |
| German BERT base, cased (Dev)      | 98.37 | 98.37 | 98.37 | 98.39 | 98.35 | 98.37 |
| German BERT base, cased (Test)     | 98.57 | 98.55 | 98.57 | 98.54 | 98.57 | 98.56 |
| Own German BERT base, cased (Dev)  | 98.38 | 98.38 | 98.39 | 98.37 | 98.36 | 98.38 |
| Own German BERT base, cased (Test) | 98.57 | 98.56 | 98.56 | 98.57 | 98.58 | 98.57 |

Italian

We've trained BERT models from scratch for Italian (both cased and uncased).

WikiNER

The WikiNER dataset from here for Italian is pre-processed into an 80/10/10 (training, development, test) split. We train for 10 epochs using the default parameters from the run_ner.py script.
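
For illustration, such an 80/10/10 split can be produced at the sentence level as sketched below; whether the sentences are shuffled first (and with which seed) is an assumption, not something stated here:

```python
import random

def split_80_10_10(sentences, seed=42):
    """Split a list of sentences into train/dev/test (80/10/10)."""
    random.Random(seed).shuffle(sentences)
    n = len(sentences)
    n_train = int(0.8 * n)
    n_dev = int(0.1 * n)
    train = sentences[:n_train]
    dev = sentences[n_train:n_train + n_dev]
    test = sentences[n_train + n_dev:]
    return train, dev, test
```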

Results

| Model                            | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg.  |
| -------------------------------- | ----- | ----- | ----- | ----- | ----- | ----- |
| mBERT base, cased (Dev)          | 93.35 | 93.29 | 93.32 | 93.28 | 93.36 | 93.32 |
| mBERT base, cased (Test)         | 93.39 | 93.65 | 93.63 | 93.44 | 93.54 | 93.53 |
| ItBERT base, cased (Dev)         | 93.37 | 93.26 | 93.43 | 93.23 | 93.14 | 93.29 |
| ItBERT base, cased (Test)        | 93.41 | 93.46 | 93.38 | 93.53 | 93.56 | 93.47 |
| ItBERT XXL base, cased (Dev)     | 93.26 | 93.41 | 93.22 | 93.32 | 93.28 | 93.30 |
| ItBERT XXL base, cased (Test)    | 93.47 | 93.66 | 93.66 | 93.65 | 93.62 | 93.61 |

EVALITA 2009

We use the NER data provided by the EVALITA 2009 shared task. We train for 20 epochs using the default parameters from the run_ner.py script.

Results

| Model                          | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg.  |
| ------------------------------ | ----- | ----- | ----- | ----- | ----- | ----- |
| mBERT base, cased (Test)       | 84.93 | 85.52 | 85.33 | 85.16 | 84.98 | 85.18 |
| ItBERT base, cased (Test)      | 85.88 | 85.96 | 85.27 | 85.60 | 86.08 | 85.76 |
| ItBERT XXL base, cased (Test)  | 88.41 | 88.06 | 88.38 | 87.70 | 88.09 | 88.13 |

PoSTWITA

We use the dataset from the PoSTWITA shared task to report results for PoS tagging on Twitter data, taking the dataset from this repository. We report results for both cased and uncased BERT models. We train for 20 epochs using the default parameters from the run_ner.py script.

| Model                            | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg.  |
| -------------------------------- | ----- | ----- | ----- | ----- | ----- | ----- |
| mBERT base, cased (Test)         | 91.47 | 91.47 | 91.68 | 91.52 | 91.54 | 91.54 |
| ItBERT base, cased (Test)        | 93.75 | 93.38 | 93.83 | 93.86 | 93.54 | 93.67 |
| ItBERT XXL base, cased (Test)    | 93.79 | 93.82 | 93.56 | 93.78 | 93.79 | 93.75 |
| mBERT base, uncased (Test)       | 91.75 | 91.97 | 91.56 | 91.98 | 91.97 | 91.85 |
| ItBERT base, uncased (Test)      | 93.48 | 93.73 | 93.98 | 93.46 | 93.35 | 93.60 |
| ItBERT XXL base, uncased (Test)  | 93.68 | 93.51 | 93.83 | 93.81 | 93.51 | 93.67 |

ToDo