Home

Due to copyright restrictions, I can only publish the single-GPU version, which was developed before Jan. 2019. Some parts of the implementation still need improvement, e.g., the GPU memory allocation. The library can still be used as a framework for speaker verification, and multi-GPU training and other approaches could be added with little effort.

Note: When you extract speaker embeddings using extract.sh, make sure that your TensorFlow is compiled WITHOUT MKL. As far as I know, some versions of TF installed through Anaconda are compiled with MKL, which uses multiple threads when TF runs on CPUs. This is harmful if you run many processes in parallel (say 40): contention between the threads makes the extraction extremely slow. I installed TF 1.12 with pip, and that works.
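If reinstalling TensorFlow is not an option, one commonly used mitigation is to cap the thread pools of each extraction process through the standard `OMP_NUM_THREADS` and `MKL_NUM_THREADS` environment variables. The sketch below is an assumption and not part of this library: the helper name `single_thread_env` is hypothetical, and whether these variables fully remove the slowdown for a particular MKL build of TF has not been verified here.

```python
import os


def single_thread_env(base_env=None, threads=1):
    """Return a copy of an environment with OpenMP/MKL thread pools capped.

    Hypothetical helper for illustration; not part of tf-kaldi-speaker.
    """
    # Copy so the caller's environment (or os.environ) is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    # Both variables are read by the OpenMP and MKL runtimes at start-up.
    env["OMP_NUM_THREADS"] = str(threads)
    env["MKL_NUM_THREADS"] = str(threads)
    return env


if __name__ == "__main__":
    env = single_thread_env()
    print(env["OMP_NUM_THREADS"], env["MKL_NUM_THREADS"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.Popen` or `subprocess.run` when launching each extraction job, so that every job sees the same thread limit.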


Overview

tf-kaldi-speaker implements a neural-network-based speaker verification system using Kaldi and TensorFlow.

The main idea is that Kaldi handles the pre- and post-processing, while TF is a better choice for building the neural network. Compared with Kaldi nnet3, modifying the network in TF (e.g., adding attention or using a different loss function) takes less effort. Adding other features to support text-dependent speaker verification is also possible.

The purpose of the project is to make research on neural-network-based speaker verification easier. I also try to reproduce some of the results from my papers.

Requirement

Methodology

The general pipeline of our framework is:

For training:

  1. Kaldi: Data preparation --> feature extraction --> training example generation (CMVN + VAD + ...)
  2. TF: Network training (training examples + nnet config)

For testing:

  1. Kaldi: Data preparation --> feature extraction
  2. TF: Embedding extraction
  3. Kaldi: Backend classifier (Cosine/PLDA) --> performance evaluation

In our framework, speaker embeddings can be trained and extracted using different network architectures, while the backend classifier is always provided by Kaldi.
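To make the cosine backend concrete, the sketch below shows what cosine scoring computes for one enrollment/test pair of embeddings. It is only an illustration in plain Python: in this framework the scoring is done with Kaldi, and the function name `cosine_score` and the toy vectors are assumptions made for the example.

```python
import math


def cosine_score(enroll, test):
    """Cosine similarity between two speaker embeddings (illustrative only)."""
    if len(enroll) != len(test):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(enroll, test))
    norm_e = math.sqrt(sum(a * a for a in enroll))
    norm_t = math.sqrt(sum(b * b for b in test))
    if norm_e == 0.0 or norm_t == 0.0:
        raise ValueError("embeddings must be non-zero")
    # The score lies in [-1, 1]; higher means the two embeddings point
    # in more similar directions.
    return dot / (norm_e * norm_t)


if __name__ == "__main__":
    # Toy 3-dimensional embeddings; real embeddings have far more dimensions.
    enroll = [0.2, -0.1, 0.7]
    same_direction = [0.4, -0.2, 1.4]
    different = [-0.5, 0.9, 0.1]
    print(cosine_score(enroll, same_direction))
    print(cosine_score(enroll, different))
```

A trial is accepted when its score exceeds a threshold chosen on development data; PLDA replaces this similarity with a likelihood ratio but is used in the same way.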

Features

Usage

Performance & Speed

Pretrained models

Pros and cons

Other discussions

License

Apache License, Version 2.0 (Refer to LICENCE)

Acknowledgements

The computational resources were initially provided by Prof. Mark Gales at the Cambridge University Engineering Department (CUED). After my visit to Cambridge, the resources have mainly been provided by Dr. Liang He at the Tsinghua University Electronic Engineering Department (THUEE).

Last ...

Related papers

@inproceedings{liu2019speaker,
  author = {Yi Liu and Liang He and Jia Liu},
  title = {Large Margin Softmax Loss for Speaker Verification},
  booktitle = {Proc. INTERSPEECH},
  year = {2019}
}