Home

Awesome

Open source speech recognition recipe and corpus for building German acoustic models with Kaldi

This recipe and collection of scripts enables you to train large vocabulary German acoustic models for speaker-independent automatic speech recognition (ASR) with Kaldi. The scripts currently use three freely available German speech corpora: The Tuda-De corpus is recorded with a Microsoft Kinect and two other microphones in parallel at Technische Universität Darmstadt and has been released under a permissive license (CC-BY 4.0). This corpus compromises ~31h of training data per microphone and ~5h separated into development and test partitions. We also make use of the German subset from the Spoken Wikipedia Corpora (SWC), containing about 285h of additional data and the German subset of m-ailabs read speech data corpus (mirror) (237h). Recently we also added the German Commonvoice corpus from Mozilla (https://commonvoice.mozilla.org/de) with 370h of data. We use the test/dev sets from Tuda-De for WER evaluations.

The newest recipe (s5_r2) trains and tests on data from multiple microphones by default (all but Realtek - about 127h of audio in total). By editing run.sh you can also restrict it to a single microphone (e.g. only Kinect). It also trains on SWC data and M-ailabs by default, too, resulting in 630h of speech data in total after cleaning. See our paper for more information and WER results. More recent results are in the table in the pretrained models section.

The old s5 recipe used in our previous paper is also still available and trained only on the beamformed data of the Kinect microphone, checkout the README.md in the s5 directory if you want to reproduce the results of our old paper.

The scripts will ask you where to place larger files and can download all necessary files (speech corpus, German texts, phoneme dictionaries) to train the acoustic and language models. You can also download these resources manually, see Section "Getting data files separately" down below.

If you use our data, models or scripts in your academic work please cite our paper!

News

19 April 2023

19 April 2022

4 April 2022

6 April 2021

2 July 2020

12 June 2020

5 March 2019

21 August 2018

15 August 2018

26 July 2018

26 June 2018

31 May 2018

30 May 2018

02 May 2018

25 April 2018

Newest pretrained models

Acoustic model + FSTTraining dataTuda dev WERTuda test WER
tuda_swc_mailabs_cv8_voc900k (FST)1700h (tuda+SWC+m-ailabs+cv8)9.3010.17
+ lm_v6_voc900k const arpa rescoring140 million sentences7.237.96
+ rnn_lmv6_lstm4x_voc900k rnnlm rescoring140 million sentences6.196.93

All results above are with number reformating, e.g. drei und sechzig -> dreiundsechzig. If you need an overall faster model, you can also replace the default HCLG with this much smaller one: HCLG_s. Together with RNNLM rescoring, the WER result will only be minimally bigger.

Previous pretrained models

Acoustic model + FSTTraining dataTuda dev WER (FST)Tuda test WER (FST)
tuda_swc_voc126k / mirror375h (tuda+SWC)20.3021.43
tuda_swc_voc350k / mirror375h (tuda+SWC)15.3216.49
tuda_swc_mailabs_voc400k / mirror630h (tuda+SWC+m-ailabs)14.7815.87
tuda_swc_mailabs_cv_voc683k_smaller_fst1000h (tuda+SWC+m-ailabs+cv)12.6914.29
+ lm_v5_voc683k_smaller_fst const arpa rescoring100 million sentences10.9212.37
+ reformat numberse.g. drei und sechzig -> dreiundsechzig8.9410.26
tuda_swc_mailabs_cv3_voc683k1000h (tuda+SWC+m-ailabs+cv3)12.2613.79
+ lm_v5_voc683k const arpa rescoring100 million sentences10.4711.85
+ reformat numberse.g. drei und sechzig -> dreiundsechzig8.619.85
tuda_swc_mailabs_cv8_voc722k1700h (tuda+SWC+m-ailabs+cv8)10.9412.09
+ lm_v5_voc722k const arpa rescoring100 million sentences9.2510.17
+ reformat numberse.g. drei und sechzig -> dreiundsechzig7.518.53
+ rnn_lm_lstm2x_voc722k rnnlm rescoring100 million sentences6.517.43

New: We have now added results for rescoring as well, which improves the FST decoding results further as expected. This includes both const arpa models as well as RNN LMs, see also our paper.

We have developed a PyKaldi based solution to use the models with either a local microphone or network streaming in real time: https://github.com/uhh-lt/kaldi-model-server

For batch decoding of media files, we developped subtitle2go to automatically generate German subtitles.

Another option to use the models is the Kaldi gstreamer server project. You can either stream audio and do online (real-time) recogniton with it or send wav files via http and get a JSON result back. See also the Kaldi + Gstreamer Server Software installation guide here. There is a run_tuda_de.sh in the package that starts Kaldi gstreamer workers for tuda_de. You will need to modify the KALDI_ROOT variable in the script so that it finds your Kaldi installation properly.

Training your own models

If you want to adapt our models (add training data, augment training data, change vocabulary, ...), you will need to retrain our models. A workstation or server with more than 64GB memory might be needed, having access to a lot of CPU cores is recommended and a recent Nvidia GPU is needed to train neural models such as the TDNN-HMM.

Prerequisites

Clone the repository with the submodule:

git clone --recurse-submodules https://github.com/uhh-lt/kaldi-tuda-de

The scripts are only tested under Linux (Ubuntu 16.04 - 20.04). Install at first some mandatory packages:

sudo apt install sox libsox-fmt-all

Download and install Kaldi and follow the installation instructions. You can download a recent version using git:

 git clone https://github.com/kaldi-asr/kaldi.git kaldi-trunk --origin golden

In Kaldi trunk:

  1. go to tools/ and follow INSTALL instructions there.

  2. Install a BLAS library. This can be Intel MKL, OpenBLAS or Atlas.

If you have an Intel CPU the easist and now recommended library is to install Intel MKL. You can install it easily on Debian/Ubuntu by running extras/install_mkl.sh. You can then skip the rest of this section and go to section 3.

Cross platform solution: Download and install OpenBLAS, build a non-multithreading (important!) library with:

make USE_THREAD=0 USE_LOCKING=1 FC=gfortran

Now follow the displayed instructions to install OpenBLAS headers and libs to a new and empty directory.

Warning! It is imperative to build a single threaded OpenBLAS library, otherwise you will encounter hard to debug problems with Kaldi as Kaldis parallelization interferes with the OpenBLAS one.

  1. go to src/ and follow INSTALL instructions there. Intel MKL is found automatically, if you installed with OpenBLAS point the configure script to your OpenBLAS installation (see ./configure --help).

Our scripts are meant to be placed into its own directory in KALDIs egs/ directory. This is also where all the other recipes reside in. If you want to build DNN models, you probably want to enable CUDA in KALDI with the configure script in src/. You should have a relatively recent Nvidia GPU, at least one with the Kepler architecture.

You also need Sequitur G2P (https://www-i6.informatik.rwth-aachen.de/web/Software/g2p.html, https://github.com/sequitur-g2p/sequitur-g2p). Download the package and run make, then edit the sequitur_g2p variable in s5_r2/cmd.sh to point to the g2p.py script.

You will also need a recent version of Python 3. Package requirements are:

pip3 install beautifulsoup4 lxml spacy && python -m spacy download de_core_news_lg

Additinally, the requests package was previously used to communicate with MaryTTS to generate phonemizations, however you won't need it if you run the standard setup.

Get LM text data

See https://github.com/bmilde/german-asr-lm-tools/ for instructions on getting recent German text data normalized. Place the resulting gzipped file in ${lm_dir}/cleaned_lm_text.gz, with the defaults: data/local/lm_std_big_v6/cleaned_lm_text.gz If you forget this step, the run.sh script will complain about a missing LM text file. Warning: The default vocabulary file local/voc_800k.txt may give suboptimal WER results, if you pair it with your own crawled data, so make sure to replace it with your own vocabulary file.

Building the acoustic models

After you have installed the prerequisites, edit cmd.sh in the s5_r2/ directory of this distribution to adjust for the number of processors you have locally (change nJobs and nDecodeJobs accordingly). You could probably also uncomment the cluster configuration and run the scripts on a cluster, but this is untested and may require some tinkering to get it running.

Then, simply run ./run.sh in s5_r2/ to build the acoustic and language models. The script will ask you where to place larger files (feature vectors and KALDI models) and automatically build appropriate symlinks. Kaldi_lm is automatically downloaded and compiled if it is not found on your system and standard Kneser-Ney is used for a 4-gram LM.

Getting data files separately

You can of course also use and download our data resources separately.

Speech corpus

The corpus can be downloaded here. The license is CC-BY 4.0. The run.sh script expects to find the corpus data extracted in data/wav/ and will download it for you automatically, if it does not find the data.

Newer recipes also make use of SWC data.

German language texts

See https://github.com/bmilde/german-asr-lm-tools/ for new instructions to obtain a large amount of German LM text data.

German phoneme dictionary

The phoneme dictionary is currently not supplied with this distribution, but the scripts to generate them are. DFKIs MARY includes a nice LGPL German phoneme dictionary with ~26k entries. Other sources for phoneme dictionary entries can be found at BAS. Our parser understands the different formats of VM.German.Wordforms, RVG1_read.lex, RVG1_trl.lex and LEXICON.TBL. The final dictionary covers ~44.8k unique German words with 70k entries total (pronunciation variants). Since the licensing of the BAS dictionaries is unclear, they are not included into the phoneme dictionary by default. You can however enable them by editing the header of run.sh and setting use_BAS_dictionaries to true.

build_big_lexicon.py can import many dictionaries in the BasSAMPA format and merge them into a single dictionary. Its parser understand many variants and dialects of BasSAMPA and the adhoc dictionary formats. To support new variants you'll have to edit def guessImportFunc(filename). The output is a serialised python object.

export_lexicon.py will export such a serialised python dictionary into KALDIs lexion_p.txt format (this allows to model different phonetic realisations of the same word with probabilities). Stress markers in the phoneme set are grouped with their unstressed equivalents in KALDI using the extra_questions.txt file. It is also possible to generate a CMU Sphinx formated dictionary with the same data using the -spx option. The Sphinx format also allows pronunciation variants, but cannot model probabilities for these variants.

See also:

python3 s5_r2/local/build_big_lexicon.py --help
python3 s5_r2/local/export_lexicon.py --help

References

If you use our scripts and/or data in your academic work please cite:

@InProceedings{milde-koehn-18-german-asr,
author = {Benjamin Milde and Arne K{\"o}hn},
title = {Open Source Automatic Speech Recognition for {German}},
booktitle = {Proceedings of ITG 2018},
year = {2018},
address = {Oldenburg, Germany},
pages = {251--255}
}

and

@inproceedings{geislinger-etal-2022-improved,
    title = "Improved Open Source Automatic Subtitling for Lecture Videos",
    author = "Geislinger, Robert  and
      Milde, Benjamin  and
      Biemann, Chris",
    booktitle = "Proceedings of the 18th Conference on Natural Language Processing (KONVENS 2022)",
    month = "12--15 " # sep,
    year = "2022",
    address = "Potsdam, Germany",
    publisher = "KONVENS 2022 Organizers",
    url = "https://aclanthology.org/2022.konvens-1.11",
    pages = "98--103",
}

An open access Arxiv preprint of "Open Source Automatic Speech Recognition for German" is also available here: https://arxiv.org/abs/1807.10311 (same content as the ITG version).

We have also previously published on German open source ASR in 2015:

Open Source German Distant Speech Recognition: Corpus and Acoustic Model:

@InProceedings{Radeck-Arneth2015,
author = {Radeck-Arneth, Stephan and Milde, Benjamin and Lange, Arvid and Gouvea, Evandro and Radomski, Stefan and M{\"{u}}hlh{\"{a}}user, Max and Biemann, Chris},
booktitle = {Proceedings Text, Speech and Dialogue (TSD)},
title = {{Open Source German Distant Speech Recognition: Corpus and Acoustic Model}},
year = {2015},
address = {Pilsen, Czech Republic},
pages = {480--488}
}

If you use our training scripts or models commercially, please mention this repository in your about section, documentation or similar.