TRANSLIT: A Large Name Transliteration Resource

TRANSLIT is a large resource of name transliterations. If you find this code useful in your research, please consider citing:

@inproceedings{benitesLREC2020,
  author    = {Fernando Benites and Gilbert François Duivesteijn and Pius von Däniken and Mark Cieliebak},
  title     = {TRANSLIT: A Large Name Transliteration Resource},
  booktitle = {Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020)},
  year      = {2020},
}

Merged together, the sources now encompass about 3 million surface forms (names) of around 1.6 million entities.

We merged four data sources:

  1. JRC named entities
  2. Amazon Wiki-Names
  3. Google En-Ar transliterations
  4. Geonames

We also searched Wikipedia's language tags for transliterations (wiki-all).

We merged the multiple names of each entity and assigned it a UUID. All gathered names/entities are saved in the file TRANSLIT.json, in the artefacts directory.
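As an illustration, the sketch below builds a small entity-to-names mapping and a reverse index from surface form to entity UUIDs. The record layout (UUID key mapping to a list of name variants) is an assumption for illustration; the actual schema of TRANSLIT.json may differ.

```python
import json
import uuid
from collections import defaultdict

# Hypothetical layout: each entity UUID maps to a list of surface forms
# (name variants). The real TRANSLIT.json schema may differ.
sample = {
    str(uuid.uuid4()): ["Tchaikovsky", "Tschaikowski", "Чайковский"],
    str(uuid.uuid4()): ["Zurich", "Zürich", "Zurigo"],
}

# Round-trip through JSON, as one would when reading the artefacts file.
data = json.loads(json.dumps(sample))

# Reverse index: surface form -> set of entity UUIDs that use it.
surface_to_entities = defaultdict(set)
for entity_id, names in data.items():
    for name in names:
        surface_to_entities[name].add(entity_id)

print(len(data))                      # number of entities
print(sorted(surface_to_entities)[:3])
```

Such a reverse index makes it easy to check whether two surface forms can refer to the same entity.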

| Dataset | # entities | # name variations | Mean name length (chars) |
|---|---:|---:|---:|
| JRC | 819'209 | 1'338'463 | 14.3 |
| Geonames | 139'549 | 758'274 | 10.6 |
| SubWikiLang | 609'420 | 1'376'446 | 10.3 |
| En-Ar | 15'858 | 31'716 | 4.4 |
| Wiki-lang-all | 122'180 | 144'588 | 17.0 |
| TRANSLIT (all) | 1'655'972 | 3'008'239 | 11.8 |
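The table's columns can be recomputed from the merged data. The toy sketch below aggregates entity counts, name-variation counts, and mean name length per source dataset; the tuple layout and dataset tags are illustrative assumptions, not the actual TRANSLIT schema.

```python
from collections import defaultdict

# Toy entities tagged with their source dataset; the grouping mirrors the
# table's columns (# entities, # name variations, mean chars per name).
entities = [
    ("JRC", ["Tchaikovsky", "Tschaikowski"]),
    ("JRC", ["Dvorak", "Dvořák"]),
    ("Geonames", ["Zurich", "Zürich", "Zurigo"]),
]

stats = defaultdict(lambda: {"entities": 0, "variants": 0, "chars": 0})
for dataset, names in entities:
    s = stats[dataset]
    s["entities"] += 1
    s["variants"] += len(names)
    s["chars"] += sum(len(name) for name in names)

for dataset, s in stats.items():
    mean_len = s["chars"] / s["variants"]
    print(dataset, s["entities"], s["variants"], round(mean_len, 1))
```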

Experiments

The experiments of the paper can be reproduced with the scripts abalation_study.py, classification_experiments.py, and cnn_classification.py in the code directory. They use the data in the artefacts directory. To recreate this data, first download the original data (17 GB zipped) with download_data.sh, then run run_preprare_data.sh.

Troubleshooting

The artefacts are quite large, so Git LFS needs to be installed first:

    $ sudo apt install git-lfs
    $ git lfs install --local
    $ git lfs fetch