Recipes for using open-source ASR corpora
Recipes for using open-source ASR corpora with Kaldi.
This is not an official Google product.
Languages
Language | Directory | Corpus |
---|---|---|
Javanese | jv | Open SLR 35 |
Sundanese | su | Open SLR 36 |
Sinhala | si | Open SLR 52 |
How to use
The above corpora are ready for use with Kaldi, after some simple data munging. We provide a small Kaldi recipe for training a triphone recognizer, inspired by the start of Kaldi's Resource Management recipe. The recipe is only intended for illustration and for validating the corpus and data preparation.
Prerequisites
- Kaldi. First download Kaldi from GitHub, compile, and install.
- Flac. The scripts below use the `flac` command line tool (assumed to be on the shell `PATH`) for on-the-fly decompression of the corpus.
- Python and Bash.
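As an illustration of on-the-fly decompression: Kaldi's `wav.scp` format allows an entry to be a command ending in `|`, whose standard output is read as the audio. The sketch below uses a hypothetical helper, `make_wav_scp`, and assumes each utterance ID is simply the file name without the `.flac` extension. The data preparation scripts in this repository may build their lists differently.

```shell
# Hypothetical helper (not part of the recipes): print one Kaldi wav.scp
# line per FLAC file in the given directory.
# Assumption: the utterance ID is the file name without ".flac".
# The function body runs in a subshell so its variables stay local.
make_wav_scp() (
  dir="$1"
  for f in "$dir"/*.flac; do
    # Skip the unexpanded pattern when the directory has no FLAC files.
    [ -e "$f" ] || continue
    # flac -c: write to stdout, -d: decode, -s: silent (no progress output).
    printf '%s flac -c -d -s %s |\n' "$(basename "$f" .flac)" "$f"
  done
)
```

A call such as `make_wav_scp path/to/flac_dir | LC_ALL=C sort > wav.scp` would produce a sorted list, which is the order Kaldi expects. Paths containing spaces are not handled in this sketch.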
General steps
- IMPORTANT: You must define and export an environment variable `KALDI_ROOT` pointing at your Kaldi directory.
- Download and unpack the corpora you need.
- Change to a recipe directory and execute `run.sh`.
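Before running a recipe, it can help to verify the setup described above. The following is a minimal sketch of such a check. The function name `check_prereqs` is hypothetical and not part of this repository. It tests `KALDI_ROOT` from a child shell, so a variable that is set but not exported is reported as missing.

```shell
# Hypothetical preflight check (not part of the recipes).
# Usage: check_prereqs TOOL...
check_prereqs() {
  # A child shell only sees exported variables, so this fails when
  # KALDI_ROOT is unset, not exported, or not an existing directory.
  if ! sh -c 'test -d "${KALDI_ROOT:-}"'; then
    echo "KALDI_ROOT must be exported and point at your Kaldi directory" >&2
    return 1
  fi
  # Every tool named on the command line must be found on PATH.
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing required tool: $tool" >&2
      return 1
    fi
  done
}
```

For example, `check_prereqs flac bash && ./run.sh` would start a recipe only if the check passes. Add the name of your Python interpreter to the argument list as appropriate for your system.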
Example
Here is how to use the Javanese corpus:
sudo apt-get install flac wget
git clone https://github.com/kaldi-asr/kaldi
cd kaldi
export KALDI_ROOT="$(realpath .)"
cat INSTALL
# and follow the instructions there to build Kaldi
cd ..
git clone https://github.com/googlei18n/asr-recipes
cd asr-recipes
tools/download_data.sh jv
# this unpacks the Javanese corpus into asr_javanese
cd jv
./run.sh
License
Unless otherwise noted, all original files are licensed under an Apache License, Version 2.0.