# Rhasspy ASR Kaldi

Automated speech recognition in the Rhasspy voice assistant with Kaldi.
## Requirements

- Python 3.7
- Kaldi
  - Expects `$KALDI_DIR` in environment
- Opengrm
  - Expects `ngram*` in `$PATH`
- Phonetisaurus
  - Expects `phonetisaurus-apply` in `$PATH`

See the pre-built apps for pre-compiled binaries.
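If you build these tools from source rather than using pre-compiled binaries, a minimal environment setup might look like the sketch below. The install locations are assumptions; adjust them to wherever Kaldi, Opengrm, and Phonetisaurus ended up on your system.

```bash
# Assumed locations; adjust to match your own builds.
export KALDI_DIR="$HOME/kaldi-master"
export PATH="$HOME/opengrm/bin:$HOME/phonetisaurus/bin:$PATH"
```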
## Installation

```bash
$ git clone https://github.com/rhasspy/rhasspy-asr-kaldi
$ cd rhasspy-asr-kaldi
$ ./configure
$ make
$ make install
```
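If the install leaves the `rhasspyasr_kaldi` package importable in your Python environment (the details depend on how `./configure` set things up), a quick smoke test is to ask the command-line interface for help:

```bash
python3 -m rhasspyasr_kaldi transcribe --help
```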
## Transcribing

Use `python3 -m rhasspyasr_kaldi transcribe <ARGS>`

```
usage: rhasspy-asr-kaldi transcribe [-h] --model-dir MODEL_DIR
                                    [--graph-dir GRAPH_DIR]
                                    [--model-type MODEL_TYPE]
                                    [--frames-in-chunk FRAMES_IN_CHUNK]
                                    [wav_file [wav_file ...]]

positional arguments:
  wav_file              WAV file(s) to transcribe

optional arguments:
  -h, --help            show this help message and exit
  --model-dir MODEL_DIR
                        Path to Kaldi model directory (with conf, data)
  --graph-dir GRAPH_DIR
                        Path to Kaldi graph directory (with HCLG.fst)
  --model-type MODEL_TYPE
                        Either nnet3 or gmm (default: nnet3)
  --frames-in-chunk FRAMES_IN_CHUNK
                        Number of frames to process at a time
```

For nnet3 models, the `online2-tcp-nnet3-decode-faster` program is used to handle streaming audio. For gmm models, audio is buffered and packaged as a WAV file before being transcribed.
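For example, to transcribe a single WAV file (the model and graph paths below are placeholders for your own Kaldi profile):

```bash
# Placeholder paths; point these at your own model and graph directories.
python3 -m rhasspyasr_kaldi transcribe \
    --model-dir /path/to/my_model \
    --graph-dir /path/to/my_model/graph \
    what_time_is_it.wav
```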
## Training

Use `python3 -m rhasspyasr_kaldi train <ARGS>`

```
usage: rhasspy-asr-kaldi train [-h] --model-dir MODEL_DIR
                               [--graph-dir GRAPH_DIR]
                               [--intent-graph INTENT_GRAPH]
                               [--dictionary DICTIONARY]
                               [--dictionary-casing {upper,lower,ignore}]
                               [--language-model LANGUAGE_MODEL]
                               --base-dictionary BASE_DICTIONARY
                               [--g2p-model G2P_MODEL]
                               [--g2p-casing {upper,lower,ignore}]

optional arguments:
  -h, --help            show this help message and exit
  --model-dir MODEL_DIR
                        Path to Kaldi model directory (with conf, data)
  --graph-dir GRAPH_DIR
                        Path to Kaldi graph directory (with HCLG.fst)
  --intent-graph INTENT_GRAPH
                        Path to intent graph JSON file (default: stdin)
  --dictionary DICTIONARY
                        Path to write custom pronunciation dictionary
  --dictionary-casing {upper,lower,ignore}
                        Case transformation for dictionary words (training,
                        default: ignore)
  --language-model LANGUAGE_MODEL
                        Path to write custom language model
  --base-dictionary BASE_DICTIONARY
                        Paths to pronunciation dictionaries
  --g2p-model G2P_MODEL
                        Path to Phonetisaurus grapheme-to-phoneme FST model
  --g2p-casing {upper,lower,ignore}
                        Case transformation for g2p words (training, default:
                        ignore)
```
This will generate a custom `HCLG.fst` from an intent graph created using rhasspy-nlu. Your Kaldi model directory should be laid out like this:

- my_model/ (`--model-dir`)
  - conf/
    - mfcc_hires.conf
  - data/
    - local/
      - dict/
        - lexicon.txt (copied from `--dictionary`)
      - lang/
        - lm.arpa.gz (copied from `--language-model`)
  - graph/ (`--graph-dir`)
    - HCLG.fst (generated)
  - model/
    - final.mdl
  - phones/
    - extra_questions.txt
    - nonsilence_phones.txt
    - optional_silence.txt
    - silence_phones.txt
  - online/ (nnet3 only)
    - extractor/ (nnet3 only)
    - conf/
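For reference, `lexicon.txt` follows the usual Kaldi convention of one word per line followed by its phonemes. The entries below are only illustrative; the actual phoneme set depends on your model:

```
what W AH1 T
time T AY1 M
is IH1 Z
it IH1 T
```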
When using the `train` command, you will need to specify the following arguments (a complete invocation is sketched after this list):

- `--intent-graph` - path to the intent graph JSON file generated using rhasspy-nlu
- `--model-type` - either nnet3 or gmm
- `--model-dir` - path to the top-level model directory (`my_model` in the example above)
- `--graph-dir` - path to the directory where `HCLG.fst` should be written (`my_model/graph` in the example above)
- `--base-dictionary` - pronunciation dictionary with all words from the intent graph (can be used multiple times)
- `--dictionary` - path to write the custom pronunciation dictionary (optional)
- `--language-model` - path to write the custom ARPA language model (optional)
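As a rough sketch, a training run might look like the following. The file names are assumptions: `intent_graph.json` is an intent graph produced by rhasspy-nlu, `base_dictionary.txt` is a pronunciation dictionary covering your vocabulary, and `g2p.fst` is an optional Phonetisaurus model for guessing unknown words.

```bash
# All file names here are placeholders.
python3 -m rhasspyasr_kaldi train \
    --model-dir my_model \
    --graph-dir my_model/graph \
    --intent-graph intent_graph.json \
    --base-dictionary base_dictionary.txt \
    --dictionary custom_words.txt \
    --language-model custom_lm.arpa \
    --g2p-model g2p.fst
```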
## Building From Source

`rhasspy-asr-kaldi` depends on the following programs that must be compiled:

- Kaldi - speech-to-text engine
- Opengrm - creates ARPA language models
- Phonetisaurus - guesses pronunciations for unknown words
### Kaldi

Make sure you have the necessary dependencies installed:

```bash
sudo apt-get install \
    build-essential \
    libatlas-base-dev libatlas3-base gfortran \
    automake autoconf unzip sox libtool subversion \
    python3 python \
    git zlib1g-dev
```
Download Kaldi and extract it:

```bash
wget -O kaldi-master.tar.gz \
    'https://github.com/kaldi-asr/kaldi/archive/master.tar.gz'
tar -xvf kaldi-master.tar.gz
```
First, build Kaldi's tools:

```bash
cd kaldi-master/tools
make
```

Use `make -j 4` if you have multiple CPU cores. This will take a long time.
Next, build Kaldi itself:

```bash
cd kaldi-master/src
./configure --shared --mathlib=ATLAS
make depend
make
```

Use `make depend -j 4` and `make -j 4` if you have multiple CPU cores. This will take a long time.
There is no installation step. The `kaldi-master` directory contains all the libraries and programs that Rhasspy will need to access.

See docker-kaldi for a Docker build script.
### Phonetisaurus

Make sure you have the necessary dependencies installed:

```bash
sudo apt-get install build-essential
```

First, download and build OpenFST 1.6.2:

```bash
wget http://www.openfst.org/twiki/pub/FST/FstDownload/openfst-1.6.2.tar.gz
tar -xvf openfst-1.6.2.tar.gz
cd openfst-1.6.2
./configure \
    "--prefix=$(pwd)/build" \
    --enable-static --enable-shared \
    --enable-far --enable-ngram-fsts
make
make install
```

Use `make -j 4` if you have multiple CPU cores. This will take a long time.
Next, download and extract Phonetisaurus:

```bash
wget -O phonetisaurus-master.tar.gz \
    'https://github.com/AdolfVonKleist/Phonetisaurus/archive/master.tar.gz'
tar -xvf phonetisaurus-master.tar.gz
```

Finally, build Phonetisaurus (where `/path/to/openfst` is the `openfst-1.6.2` directory from above):

```bash
cd Phonetisaurus-master
./configure \
    --with-openfst-includes=/path/to/openfst/build/include \
    --with-openfst-libs=/path/to/openfst/build/lib
make
make install
```

Use `make -j 4` if you have multiple CPU cores. This will take a long time.
You should now be able to run the `phonetisaurus-align` program.

See docker-phonetisaurus for a Docker build script.
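As a quick sanity check (the file names below are assumptions), `phonetisaurus-apply` can guess pronunciations for a list of words given a trained grapheme-to-phoneme model:

```bash
# g2p.fst is a placeholder for a trained Phonetisaurus g2p model;
# unknown_words.txt contains one word per line.
phonetisaurus-apply --model g2p.fst --word_list unknown_words.txt
```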