ipa2kaldi
Creates a Kaldi nnet3 recipe from transcribed audio, using the International Phonetic Alphabet (IPA) for word pronunciations. Pronunciations for unknown words are predicted with phonetisaurus.
This project is inspired by Zamia Speech, and is intended to supply acoustic models built from open speech corpora to the Rhasspy project for many human languages.
Check out the pre-trained models.
Dependencies
- Python 3.7 or higher
- CUDA and cuDNN
  - See Installing CUDA below
- Kaldi compiled with support for CUDA
  - Install CUDA/cuDNN before compiling Kaldi
  - See Installing Kaldi below
  - Tested on Ubuntu 18.04 (bionic) with CUDA 10.2 and cuDNN 7.6
- gruut
  - Used to generate IPA word pronunciations
Data Sources
ipa2kaldi does not automatically download or unpack audio datasets for you. A dataset is expected to exist in a single directory with:
- A `metadata.csv` file
  - Delimiter is `|` and there is no header
  - Either `id|text` (needs the `--speaker` argument) or `id|speaker|text`
  - Corresponding WAV file must be named `<id>.wav`
- WAV files in 16 kHz 16-bit mono PCM format (see the example below)
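For example, a three-column `metadata.csv` might look like this (utterance IDs, speaker names, and transcriptions are purely illustrative):

utt0001|speaker1|turn on the living room lamp
utt0002|speaker2|what time is it

If a recording is not already 16 kHz 16-bit mono PCM, one way to convert it is with sox (assuming sox is installed):

$ sox input.wav -r 16000 -b 16 -c 1 utt0001.wav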
Installation
Install the Python package (ideally inside a virtual environment):
$ pip install ipa2kaldi
For Raspberry Pi (ARM), you will first need to manually install phonetisaurus.
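As a minimal sketch, installing inside a virtual environment looks like this (the environment name is arbitrary):

$ python3 -m venv ipa2kaldi-venv
$ source ipa2kaldi-venv/bin/activate
$ pip install ipa2kaldi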
Usage
$ python3 -m ipa2kaldi /path/to/kaldi/egs/<model_name>/s5 \
    --language <language_code> \
    --dataset /path/to/dataset1 \
    --dataset /path/to/dataset2

where:

- `<model_name>` is a name you choose
- `<language_code>` is a supported language from gruut, like `en-us`
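For example, a hypothetical U.S. English model built from two local datasets (the dataset paths are placeholders):

$ python3 -m ipa2kaldi /path/to/kaldi/egs/rhasspy_nnet3_en-us/s5 \
    --language en-us \
    --dataset /data/en-us/dataset1 \
    --dataset /data/en-us/dataset2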
If all goes well, you should now have a Kaldi recipe directory under `egs/<model_name>/s5`.
Before training, you must place a gzipped ARPA language model at `egs/<model_name>/s5/lm/lm.arpa.gz`.
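If you don't have a language model yet, one possible sketch uses KenLM's `lmplz` (assuming KenLM is installed and `corpus.txt` contains one training sentence per line):

$ lmplz -o 3 < corpus.txt > lm.arpa
$ gzip lm.arpa
$ mv lm.arpa.gz /path/to/kaldi/egs/<model_name>/s5/lm/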
After that, run:
$ cd /path/to/kaldi/egs/<model_name>/s5
$ ./run.sh
This will train a new TDNN nnet3 model in the recipe directory. It can take a day or two, depending on how powerful your computer is. If a particular training stage fails (see `run.sh`), you can resume with `./run.sh --stage N`, where `N` is the stage to start at.
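For example, to re-run everything from a hypothetical stage 4 onward:

$ ./run.sh --stage 4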
Training Workflow
The typical training workflow is described below.
- Training transcriptions are tokenized and cleaned using gruut
- Vocabulary words are looked up in IPA lexicon(s)
- Unknown words have pronunciations guessed with a phonetisaurus model trained on the IPA lexicon(s) (see the sketch after this list)
- Lexicon is created from generated/pre-built pronunciations
  - Use `<unk>` for unknown words
  - Use the SPN (spoken noise) silence phone for `<unk>`
- Kaldi recipe files are generated
  - Non-silence phones are manually grouped for `extra_questions.txt`
  - SIL, SPN, and NSN silence phones
    - SIL is optional
- Kaldi test/train files are generated
  - 10%/90% data split
  - `wav.scp`, `text`, and `utt2spk`
- Kaldi training is done with the `run.sh` script
  - Prepares dict/lang directories
  - Adapts language model for Kaldi
  - Creates MFCC features
  - Trains monophone system
  - Trains triphone system (1b)
  - Trains triphone system (2b)
  - Generates iVectors
  - Generates topology
  - Gets alignment lattices
  - Builds tree
  - Trains TDNN 250 nnet3 model
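Unknown-word pronunciations can also be inspected by hand. A minimal sketch using the phonetisaurus command-line tools (the model and word list file names are illustrative):

$ phonetisaurus-apply --model g2p.fst --word_list unknown_words.txt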
Recipe Layout
The output of this project is a Kaldi recipe that lives inside your Kaldi `egs` directory, such as `/path/to/kaldi/egs/rhasspy_nnet3_en-us/s5`. When `scripts/doit.sh` succeeds, this directory should contain the following files (example contents for the `train` files follow the listing):
- s5/
  - run.sh
  - export.sh
  - data/
    - conf/
      - mfcc.conf
      - mfcc_hires.conf
      - online_cmvn.conf
    - local/
      - dict/
        - lexicon.txt.gz
          - WORD P1 P2 ...
        - nonsilence_phones.txt
          - Actual phonemes
        - silence_phones.txt
          - SIL
          - SPN
          - NSN
        - optional_silence.txt
          - SIL
        - extra_questions.txt
          - Phones grouped by accents/elongation
    - train/
      - wav.scp
        - UTT_ID /path/to/wav
        - Sorted by UTT_ID
      - utt2spk
        - UTT_ID speaker
        - Sorted by UTT_ID, then speaker
      - text
        - UTT_ID transcription
    - test/
      - wav.scp
        - Same as train
      - utt2spk
        - Same as train
      - text
        - Same as train
  - lm/
    - lm.arpa.gz
      - ARPA language model
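As an illustration, single lines from each of the `train` files might look like this (the utterance ID, path, speaker, and transcription are hypothetical):

wav.scp:
utt0001 /path/to/dataset1/utt0001.wav

utt2spk:
utt0001 speaker1

text:
utt0001 turn on the living room lamp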
Installing CUDA
Below are summarized instructions, adapted from a Medium article, for Ubuntu 18.04 (bionic) with CUDA 10.2 and cuDNN 7.6.
First, add the CUDA repos:
$ sudo apt update
$ sudo add-apt-repository ppa:graphics-drivers
$ sudo apt-key adv --fetch-keys 'http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub'
$ sudo bash -c 'echo "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /" > /etc/apt/sources.list.d/cuda.list'
$ sudo bash -c 'echo "deb http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/ /" > /etc/apt/sources.list.d/cuda_learn.list'
Next, install CUDA and cuDNN:
$ sudo apt update
$ sudo apt install cuda-10-2
$ sudo apt install libcudnn7
If installation succeeds, add the following text to `~/.profile`:
# set PATH for cuda 10.2 installation in ~/.profile
if [ -d "/usr/local/cuda-10.2/bin/" ]; then
export PATH=/usr/local/cuda-10.2/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
fi
After rebooting, check that everything works by running `nvidia-smi` and verifying the version of CUDA it reports.
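For example (`nvcc` will only be found once the `PATH` change above is active, e.g. after logging back in or running `source ~/.profile`):

$ nvidia-smi
$ nvcc --version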
Installing Kaldi
Install dependencies:
$ sudo apt-get update
$ sudo apt-get install \
build-essential \
wget curl ca-certificates \
libatlas-base-dev libatlas3-base gfortran \
automake autoconf unzip sox libtool subversion \
python3 python \
git zlib1g-dev patchelf rsync
Download the Kaldi source code:
$ git clone https://github.com/kaldi-asr/kaldi.git
Build the dependencies (replace `-j8` with `-j4` if you have fewer CPU cores):
$ cd kaldi/tools
$ make -j8
Build Kaldi itself (again, replace `-j8` with `-j4` if you have fewer CPU cores):
$ cd ../src
$ ./configure --use-cuda --shared --mathlib=ATLAS
$ make depend -j8
$ make -j8
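To double-check that CUDA support made it into the build, one option is to look for the CUDA flag that `configure` writes into `src/kaldi.mk` (this assumes your Kaldi version emits a `CUDA = true` line when `--use-cuda` is in effect):

$ grep 'CUDA' kaldi.mk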
See the Kaldi getting started guide if you have problems.
Pre-Trained Models
The following `nnet3` models have been trained with `ipa2kaldi` using public speech data:
- Czech
- French
- Italian
- Spanish
These models are intended to be used with rhasspy-asr-kaldi from the Rhasspy voice assistant.