Recipes for using open-source ASR corpora

Recipes for using open-source ASR corpora with Kaldi.

This is not an official Google product.

Languages

Language	Directory	Corpus
Javanese	jv	Open SLR 35
Sundanese	su	Open SLR 36
Sinhala	si	Open SLR 52

How to use

The corpora listed above are ready for use with Kaldi after some simple data munging. We provide a small Kaldi recipe for training a triphone recognizer, inspired by the start of Kaldi's Resource Management recipe. The recipe is intended only for illustration and for validating the corpus and the data preparation.

Prerequisites

Kaldi. Download Kaldi from GitHub, then compile and install it.
Flac. The scripts below use the flac command line tool (assumed to be on the shell PATH) for on-the-fly decompression of the corpus.
Python and Bash.
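
As an illustration of what on-the-fly decompression can look like, Kaldi's wav.scp format lets an entry be a command ending in a pipe character, so a FLAC file can be decoded to WAV at read time. The sketch below is not taken from this repository's scripts: the function name make_wav_scp and the choice of the file's base name as the utterance ID are assumptions made for the example. It relies only on the standard flac options -c (write to stdout), -d (decode), and -s (silent).

```shell
# Hypothetical helper, not part of this repository.
# Emits one wav.scp line per .flac file found under the given directory.
# Assumes file paths contain no whitespace, since wav.scp is space-delimited.
make_wav_scp() {
  dir="$1"
  find "$dir" -name '*.flac' | while IFS= read -r path; do
    # Use the file's base name (without .flac) as the utterance ID.
    utt="$(basename "$path" .flac)"
    # The trailing "|" tells Kaldi to run the command and read its stdout.
    printf '%s flac -c -d -s %s |\n' "$utt" "$path"
  done | LC_ALL=C sort  # Kaldi expects tables sorted in the C locale.
}
```

Calling make_wav_scp with a corpus directory and redirecting the output to a file named wav.scp would give a table that Kaldi tools can read without the audio being unpacked to disk first.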

General steps

IMPORTANT: You must define and export an environment variable KALDI_ROOT pointing at your Kaldi directory.
Download and unpack the corpora you need.
Change to a recipe directory and execute run.sh.
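
Before executing run.sh it can help to confirm that the prerequisites are in place. The following is a minimal sketch of such a check; the function name check_prereqs is hypothetical and the function is not part of this repository. It verifies that KALDI_ROOT is set and points at a directory, and that every tool named as an argument can be found on the PATH.

```shell
# Hypothetical pre-flight check, not part of this repository.
# Prints one line per problem to stderr and returns non-zero if any is found.
check_prereqs() {
  rc=0

  # KALDI_ROOT must be set and must point at an existing directory.
  if [ -z "${KALDI_ROOT:-}" ]; then
    echo "KALDI_ROOT is not set" >&2
    rc=1
  elif [ ! -d "$KALDI_ROOT" ]; then
    echo "KALDI_ROOT is not a directory: $KALDI_ROOT" >&2
    rc=1
  fi

  # Every tool named on the command line must be on the PATH.
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing tool: $tool" >&2
      rc=1
    fi
  done

  return "$rc"
}
```

For example, running check_prereqs flac bash from a recipe directory and starting run.sh only if it succeeds reports a missing flac binary or an unset KALDI_ROOT up front rather than partway through the recipe.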

Example

Here is how to use the Javanese corpus:

sudo apt-get install flac wget
git clone https://github.com/kaldi-asr/kaldi
cd kaldi
export KALDI_ROOT="$(realpath .)"
cat INSTALL
# and follow the instructions there to build Kaldi
cd ..
git clone https://github.com/googlei18n/asr-recipes
cd asr-recipes
tools/download_data.sh jv
# this unpacks the Javanese corpus into asr_javanese
cd jv
./run.sh

License

Unless otherwise noted, all original files are licensed under the Apache License, Version 2.0.