FROC-MSS: Old French and Old Occitan Medieval Manuscripts HTR Data and Models
This repository contains:
- training and evaluation data from allographetic transcriptions of various Old French and Old Occitan manuscripts, in various states of correctness, in Kraken training format;
- HTR models trained and tested using this data.
If you plan to use this data or the provided models in a publication, please cite the repository as:
Jean-Baptiste Camps (éd.), FROC-MSS: Old French and Old Occitan Medieval Manuscripts HTR Data and Models, Paris: École nationale des chartes (PSL), 2018, https://github.com/Jean-Baptiste-Camps/FROC-MSS.
Data format
The data is organised as follows:
- each line image is stored in a `.png` file;
- each transcription is stored in a `.gt.txt` file.
Unicode NFD normalisation has been applied on the ground-truth text.
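As an illustration of this layout, here is a minimal Python sketch that pairs each line image with its transcription and applies NFD normalisation. The function name `load_pairs` and the assumption that `line.png` is transcribed in `line.gt.txt` in the same folder are ours, not something the repository documents.

```python
import unicodedata
from pathlib import Path


def load_pairs(data_dir):
    """Return (image_path, NFD-normalised transcription) pairs found in data_dir."""
    pairs = []
    for image_path in sorted(Path(data_dir).glob("*.png")):
        # Assumed naming: "line.png" is transcribed in "line.gt.txt".
        gt_path = image_path.with_name(image_path.stem + ".gt.txt")
        if not gt_path.exists():
            continue  # skip images that have no transcription
        text = gt_path.read_text(encoding="utf-8").strip()
        pairs.append((image_path, unicodedata.normalize("NFD", text)))
    return pairs


# NFD splits a precomposed letter into a base letter plus a combining mark.
decomposed = unicodedata.normalize("NFD", "\u00e9")  # precomposed e with acute
names = [unicodedata.name(ch) for ch in decomposed]
```

Because of NFD, diacritics are separate characters in the ground truth, which is why combining marks can show up as errors in their own right in the evaluation.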
Models
Summary and C.E.R.
The root folder contains a vanilla Kraken model (`model_froc.mlmodel`), trained with default settings and without any additional data (e.g. no artificially noised data).
The data was randomly divided into 80% for training (`train.txt`), 10% for in-training validation (`val.txt`) and 10% for final testing of the model (`test.txt`).
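A random 80/10/10 split of this kind can be sketched as follows. This is not the script that produced the split in this repository: the helper names, the fixed seed and the one-path-per-line manifest layout are assumptions made for the example.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle items and return (train, val, test) lists in an 80/10/10 ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, so no item is lost
    return train, val, test


def write_manifest(path, items):
    """Write one item per line (assumed manifest layout)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(str(item) for item in items) + "\n")


# Hypothetical file names standing in for the real line images.
lines = [f"line_{i:04d}.png" for i in range(100)]
train, val, test = split_80_10_10(lines)
```

Calling `write_manifest("train.txt", train)`, and likewise for the two other lists, would then produce the three manifest files.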
The model achieved a character error rate (C.E.R.) of:
- **8.11%** on validation data (7.03% ignoring spaces);
- **7.83%** on test data (6.92% ignoring spaces).
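For readers unfamiliar with the metric, the sketch below computes a character error rate as total edit distance divided by the number of ground-truth characters, with an option to drop spaces first. It is a generic definition written for illustration; Kraken's own evaluation code may differ in detail, and treating "ignoring spaces" as removing spaces from both strings before comparison is our assumption.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # reference character missing from hypothesis
                current[j - 1] + 1,      # extra character in hypothesis
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(references, hypotheses, ignore_spaces=False):
    """Total edit distance divided by total number of reference characters."""
    errors = chars = 0
    for ref, hyp in zip(references, hypotheses):
        if ignore_spaces:
            ref, hyp = ref.replace(" ", ""), hyp.replace(" ", "")
        errors += edit_distance(ref, hyp)
        chars += len(ref)
    return errors / chars


# Made-up example: the prediction only misses the word space.
with_spaces = cer(["li reis"], ["lireis"])
without_spaces = cer(["li reis"], ["lireis"], ignore_spaces=True)
```

In this made-up example the only error is a missing space, so the rate drops to zero once spaces are ignored.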
Errors and most frequent confusions on test data
There were 13540 characters and 1061 errors on test data.
Overall, the errors break down as follows:
- 536 characters from the ground truth were not predicted by the model;
- 132 characters absent from the ground truth were wrongly predicted;
- 393 character substitutions.
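As a quick consistency check, the three error types listed above can be summed and divided by the number of test characters. The variable names are ours; the figures are the ones reported in this section.

```python
deletions = 536       # ground-truth characters not predicted by the model
insertions = 132      # predicted characters absent from the ground truth
substitutions = 393   # character substitutions
total_chars = 13540   # characters in the test ground truth

errors = deletions + insertions + substitutions
cer_percent = 100 * errors / total_chars
print(f"{errors} errors, C.E.R. = {cer_percent:.2f} %")
```

The sum matches the 1061 errors reported above, and the resulting rate is about 7.8%, in line with the test C.E.R. given in the summary.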
The most frequent confusions concerned spacing.
The 20 most frequent confusions are:
Errors Ground Truth-Prediction
70 { SPACE } - { }
54 { } - { SPACE }
48 { ı } - { }
43 { n } - { }
43 { COMBINING ACUTE ACCENT } - { }
27 { e } - { }
24 { l } - { }
24 { u } - { }
21 { . } - { }
20 { u } - { n }
18 { ſ } - { }
18 { a } - { }
17 { r } - { }
14 { t } - { }
13 { COMBINING TILDE } - { }
13 { } - { ı }
12 { o } - { e }
12 { o } - { }
12 { ı } - { m }
11 { e } - { c }
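To see how these 20 confusions are distributed, the rows can be grouped into deletions, insertions and substitutions. The sketch below copies the counts from the table and assumes that an empty `{ }` stands for "no character" on that side; the data structure and names are ours.

```python
# (count, ground truth, prediction), copied from the table; "" is an empty slot.
top_confusions = [
    (70, " ", ""), (54, "", " "), (48, "ı", ""), (43, "n", ""),
    (43, "\u0301", ""),  # COMBINING ACUTE ACCENT
    (27, "e", ""), (24, "l", ""), (24, "u", ""),
    (21, ".", ""), (20, "u", "n"), (18, "ſ", ""), (18, "a", ""),
    (17, "r", ""), (14, "t", ""),
    (13, "\u0303", ""),  # COMBINING TILDE
    (13, "", "ı"),
    (12, "o", "e"), (12, "o", ""), (12, "ı", "m"), (11, "e", "c"),
]

totals = {"deletion": 0, "insertion": 0, "substitution": 0}
for count, truth, predicted in top_confusions:
    if predicted == "":
        kind = "deletion"      # ground-truth character not predicted
    elif truth == "":
        kind = "insertion"     # predicted character absent from the ground truth
    else:
        kind = "substitution"
    totals[kind] += count

top20_errors = sum(totals.values())
```

Under that reading, most of the top confusions are characters the model failed to predict rather than substitutions.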
List of manuscripts
The data comes from partial allographetic transcriptions of the following manuscripts:
- Clermont-Ferrand, archives départementales, 1F2 (XIII 1/3, Anglo-Norman praegothica script; Chanson d'Aspremont); 52 lines.
- Paris, Bibliothèque nationale de France, fr. 854 (XIII 4/4, Venice or Venetian area; gothic textualis; Occitan chansonnier I); 1112 lines.
- Cologny-Genève, fondation Martin-Bodmer, cod. Bodm. 168 (XIII 3/3, Anglo-Norman gothic textualis; Chanson d'Otinel); 1908 lines.
- Oxford, Bodleian Library, Digby 23 (XII 1/2, Anglo-Norman praegothica; Chanson de Roland); 564 lines.
For these transcriptions, see: Jean-Baptiste Camps, La ‘Chanson d’Otinel’: édition complète du corpus manuscrit et prolégomènes à l’édition critique, PhD thesis, dir. Dominique Boutet, Paris-Sorbonne, 2016, DOI: https://doi.org/10.5281/zenodo.1116735.
<!-- TODO: to be completed with the other manuscripts: Vatican, Mende, … -->
License
<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.
Contribute
If you want to contribute training data or models, you can do so by cloning the repository and sending us a pull request, or by sending an email to jbcamps at hotmail.com.