Home

Awesome

Recipes for using open-source ASR corpora

Recipes for using open-source ASR corpora with Kaldi.

This is not an official Google product.

Languages

LanguageDirectoryCorpus
JavanesejvOpen SLR 35
SundanesesuOpen SLR 36
SinhalasiOpen SLR 52

How to use

The above corpora are ready for use with Kaldi, after some simple data munging. We provide a small Kaldi recipe for training a triphone recognizer, inspired by the start of Kaldi's Resource Management recipe. The recipe is only intended for illustration and for validating the corpus and data preparation.

Prerequisites

  1. Kaldi. First download Kaldi from GitHub, compile, and install.
  2. Flac. The scripts below use the flac command line tool (assumed to be on the shell PATH) for on-the-fly decompression of the corpus.
  3. Python and Bash.

General steps

  1. IMPORTANT: You must define and export an environment variable KALDI_ROOT pointing at your Kaldi directory.
  2. Download and unpack the corpora you need.
  3. Change to a recipe directory and execute run.sh.

Example

Here is how to use the Javanese corpus:

sudo apt-get install flac wget
git clone https://github.com/kaldi-asr/kaldi
cd kaldi
export KALDI_ROOT="$(realpath .)"
cat INSTALL
# and follow the instructions there to build Kaldi
cd ..
git clone https://github.com/googlei18n/asr-recipes
cd asr-recipes
tools/download_data.sh jv
# this unpacks the Javanese corpus into asr_javanese
cd jv
./run.sh

License

Unless otherwise noted, all original files are licensed under an Apache License, Version 2.0.