Tokenizer Evaluation

This repository contains benchmark scripts for comparing different tokenizers and sentence segmenters for German. For trouble-free testing, all tools are provided via a Dockerfile.

This work was presented at EURALEX 2022. The paper is published in open access. Please cite it as:

Diewald, N./Kupietz, M./Lüngen, H. (2022): Tokenizing on scale - Preprocessing large text corpora on the lexical and sentence level. In: Klosa-Kückelhaus, A./Engelberg, S./Möhrs, C./Storjohann, P. (eds.): Dictionaries and Society. Proceedings of the XX EURALEX International Congress. IDS-Verlag, Mannheim, Germany, pp. 208-221.
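
For BibTeX users, the reference can be rendered roughly as follows (the entry key and field layout are our own, not taken from the proceedings):

@inproceedings{Diewald2022,
  author    = {Diewald, N. and Kupietz, M. and L{\"u}ngen, H.},
  title     = {Tokenizing on scale - Preprocessing large text corpora on the lexical and sentence level},
  editor    = {Klosa-K{\"u}ckelhaus, A. and Engelberg, S. and M{\"o}hrs, C. and Storjohann, P.},
  booktitle = {Dictionaries and Society. Proceedings of the XX EURALEX International Congress},
  publisher = {IDS-Verlag},
  address   = {Mannheim, Germany},
  year      = {2022},
  pages     = {208--221}
}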

Creating the container

To build the Docker image, run

$ docker build -f Dockerfile -t korap/tokenbench .

This will create an image of approximately 12 GB.

Running the evaluation suite

To run the benchmark, call

$ docker run --rm -i \
  -v ${PWD}/benchmarks:/tokenbench/benchmarks \
  -v ${PWD}/corpus:/tokenbench/corpus \
  korap/tokenbench benchmarks/[BENCHMARK-SCRIPT]

The supported benchmark scripts are:

benchmark.pl

Performance measurements of the tools. For the benchmarking, the novel "Effi Briest" by Theodor Fontane in the Project Gutenberg version was used (with a total of 98,207 tokens according to wc -w). See the tools section for some caveats to take into account. Accepts two numerical parameters, passed after the script name as shown below.
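
Following the general invocation pattern above, a run might look like this ([PARAM-1] and [PARAM-2] are placeholders; see the script itself for their exact meaning):

$ docker run --rm -i \
  -v ${PWD}/benchmarks:/tokenbench/benchmarks \
  -v ${PWD}/corpus:/tokenbench/corpus \
  korap/tokenbench benchmarks/benchmark.pl [PARAM-1] [PARAM-2]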

benchmark_batches.pl

Performance measurements of the tools. See the tools section for some caveats to take into account. Accepts one numerical parameter, passed after the script name as shown below.

Will check batches of 1000, 2000, 4000, 8000 ... 8192000 tokens against all tools.
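
It is presumably invoked the same way as benchmark.pl, with the parameter appended after the script name ([PARAM] is a placeholder):

$ docker run --rm -i \
  -v ${PWD}/benchmarks:/tokenbench/benchmarks \
  -v ${PWD}/corpus:/tokenbench/corpus \
  korap/tokenbench benchmarks/benchmark_batches.pl [PARAM]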

empirist.pl

To run the empirist evaluation suite, you first need to download the empirist gold standard corpus and tooling, and extract them into the corpus directory.

$ wget https://sites.google.com/site/empirist2015/home/shared-task-data/empirist_gold_cmc.zip
$ unzip empirist_gold_cmc.zip -d corpus

$ wget https://sites.google.com/site/empirist2015/home/shared-task-data/empirist_gold_web.zip
$ unzip empirist_gold_web.zip -d corpus

To inspect the output, start the benchmark with the output folders mounted:

-v ${PWD}/output_cmc:/tokenbench/empirist_cmc
-v ${PWD}/output_web:/tokenbench/empirist_web
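
Putting it all together, a complete call of the empirist suite with both output folders mounted would look like this (the host-side output paths are only examples):

$ docker run --rm -i \
  -v ${PWD}/benchmarks:/tokenbench/benchmarks \
  -v ${PWD}/corpus:/tokenbench/corpus \
  -v ${PWD}/output_cmc:/tokenbench/empirist_cmc \
  -v ${PWD}/output_web:/tokenbench/empirist_web \
  korap/tokenbench benchmarks/empirist.pl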

ud_tokens.pl

To run the token evaluation suite against the Universal Dependencies corpus, first install the empirist tooling as explained above, and download the corpus.

$ wget https://github.com/UniversalDependencies/UD_German-GSD/raw/master/de_gsd-ud-train.conllu \
  -O corpus/de_gsd-ud-train.conllu
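
With the corpus in place, the suite is started like the other benchmark scripts:

$ docker run --rm -i \
  -v ${PWD}/benchmarks:/tokenbench/benchmarks \
  -v ${PWD}/corpus:/tokenbench/corpus \
  korap/tokenbench benchmarks/ud_tokens.pl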

ud_sentences.pl

To run the sentence evaluation suite, first download the corpus as explained above.
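
It follows the same invocation pattern:

$ docker run --rm -i \
  -v ${PWD}/benchmarks:/tokenbench/benchmarks \
  -v ${PWD}/corpus:/tokenbench/corpus \
  korap/tokenbench benchmarks/ud_sentences.pl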

Caveat

When running this benchmark using Docker, you may need to run all processes privileged to get meaningful results, i.e. add the --privileged flag to the docker run calls shown above:

$ docker run --privileged ...

Tools

The grouping below corresponds to the tools and models evaluated in the results table.

Our tools for token and sentence boundary detection:

- KorAP-Tokenizer
- Datok

Further tools for token and sentence boundary detection:

- BlingFire
- Cutter
- JTok
- OpenNLP
- SoMaJo
- SpaCy
- Stanford CoreNLP
- Syntok
- Waste

Tools for token boundary detection only:

- Elephant
- TreeTagger

Tools for sentence boundary detection only:

- Deep-EOS
- NNSplit

Results

Overview of all compared tools and models with their performance measures.

For the speed measurements, the native output of the tools was used, while for the accuracy measurements, further reshaping was necessary to make the output comparable to the gold standard. See the tools section for further caveats.

The measures correspond to the average over 100 runs of benchmark.pl. Since the length of a text can have an impact on performance, a tenfold concatenation of the text was also tested. The test system was an Intel Xeon CPU E5-2630 v2 @ 2.60GHz with 12 cores and 64 GB of RAM.

| Tool | V. | Model | UD-GSD (Tokens) F1 | Empirist-CMC F1 | Empirist-Web F1 | UD-GSD (Sentences) F1 | 1 x Effi (T/ms) | 10 x Effi (T/ms) |
|------|----|-------|--------------------|-----------------|-----------------|-----------------------|-----------------|------------------|
| KorAP-Tokenizer | 2.2.2 | | 99.45 | 99.06 | 99.27 | 96.87 | 72.90 | 199.28 |
| Datok | 0.1.5 | datok | 99.45 | 98.79 | 99.21 | 97.60 | 614.72 | 2304.13 |
| " | " | matok | " | " | " | " | 1041.63 | 2798.78 |
| BlingFire | 0.1.8 | wbd.bin | 99.25 | 55.85 | 95.80 | - | 431.92 | 1697.73 |
| " | " | sbd.bin | - | - | - | 95.90 | 417.10 | 1908.87 |
| Cutter | 2.5 | | 99.47 | 96.24 | 99.38 | 97.31 | 0.38* | - |
| JTok | 2.1.19 | | 99.56 | 58.44 | 98.09 | 97.92 | 31.19 | 117.22 |
| OpenNLP | 1.9.4 | Simple | 95.70 | 55.26 | 91.69 | - | 290.71 | 1330.23 |
| " | " | Tokenizer (de-ud-gsd) | 99.67 | 65.22 | 97.58 | - | 74.65 | 145.08 |
| " | " | SentenceDetector (de-ud-gsd) | - | - | - | 98.51 | 247.84 | 853.01 |
| SoMaJo | 2.2.0 | p=1 | 99.46 | 99.21 | 99.87 | 97.05 | 8.15 | 8.41 |
| " | " | p=8 | " | " | " | " | 27.32 | 39.91 |
| SpaCy | 3.2.3 | Tokenizer | 99.49 | 69.94 | 98.29 | - | 19.73 | 44.40 |
| " | " | Sentencizer | - | - | - | 96.80 | 16.94 | 40.58 |
| " | " | Statistical | - | - | - | 97.16 | 4.90 | 10.01 |
| " | " | Dependency | - | - | - | 96.93 | 2.24 | 0.48 |
| Stanford | 4.4.0 | tokenize | 99.93 | 97.71 | 98.46 | - | 75.47 | 156.24 |
| " | " | tokenize,split,mwt | " | " | " | 98.22 | 46.95 | 91.56 |
| Syntok | 1.4.3 | Tokenizer | 99.41 | 70.76 | 97.50 | - | 103.90 | 108.40 |
| " | " | Segmenter | - | - | - | 97.50 | 59.66 | 61.07 |
| Waste | 2.0.20-1 | | 99.55 | 65.90 | 98.49 | 97.46 | 141.07 | 144.95 |
| Elephant | 0.2.3 | | 99.62 | 66.96 | 97.88 | - | 8.57 | 8.68 |
| TreeTagger | 3.2.4 | | 99.52 | 95.58 | 99.27 | - | 69.92 | 72.98 |
| Deep-EOS | 0.1 | bi-lstm-de | - | - | - | 97.47 | 0.25** | 0.24** |
| " | " | cnn-de | - | - | - | 97.49 | 0.27** | 0.25** |
| " | " | lstm-de | - | - | - | 97.47 | 0.29** | 0.27** |
| NNSplit | 0.5.8 | | - | - | - | 95.55 | 0.90** | 0.90** |

* Did not finish on the test machine.

** No GPU acceleration tested.

Result chart

Tokenizer performance chart
