german_transliterate

german_transliterate is a Python module to clean and transliterate (i.e. normalize) German text, including abbreviations, numbers, timestamps etc. It can be used to clean messy text (e.g. map peculiar Unicode characters to ASCII) or to replace common abbreviations as a preprocessing step for various text mining tasks.

However, it is particularly useful for Text-To-Speech (TTS) preprocessing (both in training and inference) and has features to support phonemic encoding of the results (e.g. with espeak-ng) as the next step in the processing pipeline.

It has been successfully applied to preprocessing with Mozilla TTS, in combination with espeak-ng phonemes as input data, for both the training and the inference pipeline.
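As an illustration of the phonemic encoding step mentioned above, the following is a minimal sketch of how already transliterated text could be handed to espeak-ng from Python. It is not part of german_transliterate: the helper names `build_espeak_command` and `phonemize` are hypothetical, and the sketch assumes that the `espeak-ng` binary and a German voice (`de`) are installed.

```python
import shutil
import subprocess


def build_espeak_command(text, voice="de"):
    """Build an espeak-ng call that prints IPA phonemes instead of playing audio."""
    # -q suppresses audio output, --ipa prints phonemes in IPA, -v selects the voice
    return ["espeak-ng", "-q", "--ipa", "-v", voice, text]


def phonemize(text, voice="de"):
    """Return the IPA string for text, or None if espeak-ng is not installed."""
    if shutil.which("espeak-ng") is None:
        return None
    result = subprocess.run(
        build_espeak_command(text, voice),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Hypothetical input: text that has already been transliterated
command = build_espeak_command("um dreizehn Uhr fünfzehn")
```

In a real pipeline, the string passed to `phonemize` would be the return value of `transliterate(...)`.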

License and Attribution

This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

To provide attribution or cite this work please use the following text snippet:

german_transliterate, Copyright 2020 by repodiac, see https://github.com/repodiac for updates and further information

Version History

Installation/Setup

It currently has only one external dependency, num2words. All dependencies are listed in requirements.txt and are also declared in setup.py.

Installation is easy using pip and its built-in support for installing packages from git, based on setup.py:

Setup:

Documentation

Example Usage

In Python code or as library:

from german_transliterate.core import GermanTransliterate

text = 'Um 13:15h kaufte Hr. Meier (Mitarbeiter der Firma ABC) 1.000 Luftballons für 250€.'
print('ORIGINAL:', text, '\n')

ops = {'acronym_phoneme', 'accent_peculiarity', 'amount_money', 'date', 'timestamp',
        'weekday', 'month', 'time_of_day', 'ordinal', 'special', 'math_symbol', 'spoken_symbol'}

# use these settings for PHONEMIC ENCODINGS as input (e.g. with TTS)
print('TRANSLITERATION with phonemic encodings:',
      GermanTransliterate(replace={';': ',', ':': ' '}, sep_abbreviation=' -- ').transliterate(text), '\n')

# for purposes other than phonemic encoding, pass no ops or your own, and leave out 'spoken_symbol' and 'acronym_phoneme'
print('TRANSLITERATION (default):',
      GermanTransliterate(transliterate_ops=list(ops-{'spoken_symbol', 'acronym_phoneme'})).transliterate(text), '\n')

From the command line (in the shell):

python core.py '1, 2, 3 - alles ist dabei'

Input Parameters

Currently there is only one method to call: transliterate('Das ist der Text.')

It has the following input parameters:

The parameters used for the config parameter transliterate_ops are as follows:

Performance

The current implementation relies mainly on manual mappings and regular expressions for substituting and expanding strings (words or terms). Performance should therefore be good enough for online inference or "realtime" use in a text processing pipeline. As further modules or ops are added over time, some of them may be slower, computation-heavy methods that are suited mainly to training or offline processing.
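To illustrate why this kind of approach is cheap at runtime, here is a minimal sketch of mapping-based expansion with a single precompiled regular expression. It is not the module's actual code: the `ABBREVIATIONS` table and the `expand_abbreviations` helper are hypothetical and only show the general technique.

```python
import re

# Hypothetical mapping for illustration only; not the module's own table
ABBREVIATIONS = {"Hr.": "Herr", "Fr.": "Frau", "z.B.": "zum Beispiel"}

# Compile once at import time; longest keys first so longer abbreviations win
_PATTERN = re.compile(
    r"(?<!\w)("
    + "|".join(re.escape(key) for key in sorted(ABBREVIATIONS, key=len, reverse=True))
    + r")(?!\w)"
)


def expand_abbreviations(text):
    """Replace every known abbreviation in text with its expansion."""
    return _PATTERN.sub(lambda match: ABBREVIATIONS[match.group(1)], text)


expanded = expand_abbreviations("Hr. Meier kommt, z.B. morgen.")
```

Because the pattern is compiled once and each call is a single pass over the input, the cost per sentence stays small, which fits the "realtime" use described above.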

Issues and Comments

Please open issues on GitHub for bugs or feature requests. You can also reach out to me via email.