Deep speaker embeddings in PyTorch

This repository contains code and models for training an x-vector speaker recognition model, using Kaldi for feature preparation and PyTorch for DNN training. The MFCC feature configuration and TDNN model architecture follow the Voxceleb recipe in Kaldi (commit hash 9b4dc93c9). Training procedures, including the optimizer and step counts, are similar to, but not identical to, those in Kaldi.
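
For orientation, the frame-level part of the x-vector network can be written as a stack of dilated 1-D convolutions (TDNN layers) over the MFCC frames. The sketch below is illustrative only: the layer sizes and temporal contexts are assumptions based on the Kaldi Voxceleb recipe, not the exact definition in train_utils.py.

import torch
import torch.nn as nn

class TDNNLayer(nn.Module):
    # One TDNN layer: a dilated 1-D convolution over time,
    # followed by ReLU and batch normalization.
    def __init__(self, in_dim, out_dim, context_size, dilation):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=context_size,
                              dilation=dilation)
        self.bn = nn.BatchNorm1d(out_dim)

    def forward(self, x):
        # x: (batch, feat_dim, num_frames)
        return self.bn(torch.relu(self.conv(x)))

# Frame-level stack with layer sizes assumed from the Kaldi recipe:
# 30-dim MFCCs in, 1500-dim pre-pooling features out.
frame_layers = nn.Sequential(
    TDNNLayer(30, 512, context_size=5, dilation=1),
    TDNNLayer(512, 512, context_size=3, dilation=2),
    TDNNLayer(512, 512, context_size=3, dilation=3),
    TDNNLayer(512, 512, context_size=1, dilation=1),
    TDNNLayer(512, 1500, context_size=1, dilation=1),
)

After these layers, the mean and standard deviation are pooled over all frames, and the speaker embedding is read from an affine layer that follows the pooling.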

Additionally, code for training meta-learning embeddings is available in train_proto.py and train_relation.py. An overview of these models is available at https://arxiv.org/abs/2007.16196 and in the figure below:

Figure: Overview of the meta-learning models
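
As a rough sketch of the prototypical-network objective behind train_proto.py (the actual batching, distance function, and episode construction in this repo may differ), each class prototype is the mean of its support embeddings, and query embeddings are classified by their distance to the prototypes:

import torch
import torch.nn.functional as F

def prototypical_loss(support, query, labels):
    # support: (num_classes, num_support, dim) embeddings per class
    # query:   (num_query, dim) embeddings; labels: (num_query,) class indices
    prototypes = support.mean(dim=1)             # one prototype per class
    dists = torch.cdist(query, prototypes) ** 2  # squared Euclidean distances
    return F.cross_entropy(-dists, labels)       # nearer prototype => higher logit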

Citation

If you find this toolkit useful in your research, consider citing the following:

@misc{kumar2020designing,
    title={Designing Neural Speaker Embeddings with Meta Learning},
    author={Manoj Kumar and Tae Jin-Park and Somer Bishop and Catherine Lord and Shrikanth Narayanan},
    year={2020},    
    eprint={2007.16196},
    archivePrefix={arXiv}  
}

Requirements

Python Libraries

python==3.6.10
torch==1.4.0
kaldiio==2.15.1
kaldi-python-io==1.0.4
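
A quick way to confirm the pinned versions match your environment (a convenience snippet, assuming both packages expose __version__):

import torch
import kaldiio

print(torch.__version__)    # expected: 1.4.0
print(kaldiio.__version__)  # expected: 2.15.1
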
Other Tools

Installation

Data preparation

Training data preparation

Dataset for data augmentation

The pytorch_run.sh script augments the training data using the following two datasets, as in the Kaldi Voxceleb recipe: the MUSAN corpus (noise, music, and babble) and the RIR_NOISES collection (simulated room impulse responses).

Training

CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 train_xent.py <egsDir>
usage: train_xent.py [-h] [--local_rank LOCAL_RANK] [-modelType MODELTYPE]
                     [-featDim FEATDIM] [-resumeTraining RESUMETRAINING]
                     [-resumeModelDir RESUMEMODELDIR]
                     [-numArchives NUMARCHIVES] [-numSpkrs NUMSPKRS]
                     [-logStepSize LOGSTEPSIZE] [-batchSize BATCHSIZE]
                     [-numEgsPerArk NUMEGSPERARK]
                     [-preFetchRatio PREFETCHRATIO]
                     [-optimMomentum OPTIMMOMENTUM] [-baseLR BASELR]
                     [-maxLR MAXLR] [-numEpochs NUMEPOCHS]
                     [-noiseEps NOISEEPS] [-pDropMax PDROPMAX]
                     [-stepFrac STEPFRAC]
                     egsDir

positional arguments:
  egsDir                Directory with training archives

optional arguments:
  -h, --help            show this help message and exit
  --local_rank LOCAL_RANK
  -modelType MODELTYPE  Refer train_utils.py
  -featDim FEATDIM      Frame-level feature dimension
  -resumeTraining RESUMETRAINING
                        (1) Resume training, or (0) Train from scratch
  -resumeModelDir RESUMEMODELDIR
                        Path containing training checkpoints
  -numArchives NUMARCHIVES
                        Number of egs.*.ark files
  -numSpkrs NUMSPKRS    Number of output labels
  -logStepSize LOGSTEPSIZE
                        Iterations per log
  -batchSize BATCHSIZE  Batch size
  -numEgsPerArk NUMEGSPERARK
                        Number of training examples per egs file
  -preFetchRatio PREFETCHRATIO
                        xbatchSize to fetch from dataloader
  -optimMomentum OPTIMMOMENTUM
                        Optimizer momentum
  -baseLR BASELR        Initial LR
  -maxLR MAXLR          Maximum LR
  -numEpochs NUMEPOCHS  Number of training epochs
  -noiseEps NOISEEPS    Noise strength before pooling
  -pDropMax PDROPMAX    Maximum dropout probability
  -stepFrac STEPFRAC    Training iteration when dropout = pDropMax

egsDir contains the nnet3 egs files.
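
To inspect an archive directly, kaldi_python_io (pinned in the requirements above) provides a reader for nnet3 egs. A minimal sketch; the archive path is hypothetical, and each entry follows Kaldi's NnetExample layout (an input feature chunk plus its speaker-label index):

from kaldi_python_io import Nnet3EgsReader

# Iterate over one training archive (path is illustrative).
for key, example in Nnet3EgsReader('exp/egs/egs.1.ark'):
    print(key, type(example))
    break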

Embedding extraction

usage: extract.py [-h] [-modelType MODELTYPE] [-numSpkrs NUMSPKRS]
                  modelDirectory featDir embeddingDir

positional arguments:
  modelDirectory        Directory containing the model checkpoints
  featDir               Directory containing features ready for extraction
  embeddingDir          Output directory

optional arguments:
  -h, --help            show this help message and exit
  -modelType MODELTYPE  Refer train_utils.py
  -numSpkrs NUMSPKRS    Number of output labels for model
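
Extracted embeddings are written in Kaldi ark/scp format, so they can be inspected with kaldiio. A minimal sketch, assuming the conventional xvector.scp naming (the exact path depends on your embeddingDir):

import kaldiio

# Lazy dict-like view: keys are utterance IDs, values are numpy arrays.
xvectors = kaldiio.load_scp('xvectors/xvec_preTrained/test/xvector.scp')
for utt_id in xvectors:
    print(utt_id, xvectors[utt_id].shape)  # one x-vector per utterance
    break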

The script pytorch_run.sh can be used to train embeddings end-to-end on the Voxceleb recipe.

Pretrained model

Downloading

Two ways to download the pre-trained model:

  1. Google Drive link, or
  2. Command line (reference):
    wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1gbAWDdWN_pkOim4rWVXUlfuYjfyJqUHZ' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1gbAWDdWN_pkOim4rWVXUlfuYjfyJqUHZ" -O preTrainedModel.zip && rm -rf /tmp/cookies.txt
    

Speaker Verification

To reproduce the voxceleb EER results with the pretrained model, follow the steps below. NOTE: The voxceleb features must be prepared using prepare_feats_for_egs.sh prior to evaluation.

  1. Extract models/ and xvectors/ from the pre-trained archive into the installation directory
  2. Set the following variables in pytorch_run.sh:
    modelDir=models/xvec_preTrained
    trainFeatDir=data/train_combined_no_sil
    trainXvecDir=xvectors/xvec_preTrained/train
    testFeatDir=data/voxceleb1_test_no_sil
    testXvecDir=xvectors/xvec_preTrained/test
    
  3. Extract embeddings and compute the EER and minDCF (a standalone EER check is sketched after these steps). Set stage=7 in pytorch_run.sh and execute:
    bash pytorch_run.sh
    
  4. Alternatively, a pretrained PLDA model is available inside the xvectors/train directory. Set stage=9 in pytorch_run.sh and execute:
    bash pytorch_run.sh
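
For a quick sanity check of the EER outside the Kaldi scoring pipeline, it can be computed directly from trial scores and labels. A minimal numpy sketch (not the scoring code pytorch_run.sh actually calls; the score and label arrays are assumed inputs):

import numpy as np

def compute_eer(scores, labels):
    # scores: higher means more likely same-speaker; labels: 1 = target trial.
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores)               # sweep thresholds over all scores
    sorted_labels = labels[order]
    n_target = sorted_labels.sum()
    n_nontarget = len(sorted_labels) - n_target
    # Miss rate: targets at or below threshold; false alarms: non-targets above.
    fnr = np.cumsum(sorted_labels) / n_target
    fpr = 1.0 - np.cumsum(1 - sorted_labels) / n_nontarget
    idx = np.argmin(np.abs(fnr - fpr))       # operating point where the two cross
    return (fnr[idx] + fpr[idx]) / 2.0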
    

Speaker Diarization

cd egs/

Place the audio files to diarize and their corresponding RTTM files in the demo_wav/ and demo_rttm/ directories, then execute:

bash diarize.sh
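
Conceptually, diarize.sh reduces to clustering segment-level x-vectors. As a rough illustration only (the results below use the auto-tuning spectral clustering linked in the NOTE, not this method), segments can be grouped with plain agglomerative clustering:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_segments(embeddings, num_speakers):
    # embeddings: (num_segments, dim) x-vectors; returns a speaker label per segment.
    # Length-normalize so Euclidean distance tracks cosine similarity.
    embs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return AgglomerativeClustering(n_clusters=num_speakers).fit_predict(embs)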

Results

1. Speaker Verification (%EER)

|            | Kaldi | pytorch_xvectors |
|------------|-------|------------------|
| Vox1-test  | 3.13  | 2.82             |
| VOICES-dev | 10.30 | 8.59             |

2. Speaker Diarization (%DER)

NOTE: Clustering was performed using https://github.com/tango4j/Auto-Tuning-Spectral-Clustering

|                                                  | Kaldi | pytorch_xvectors |
|--------------------------------------------------|-------|------------------|
| DIHARD2 dev (no collar, oracle #spk)             | 26.97 | 27.50            |
| DIHARD2 dev (no collar, est #spk)                | 24.49 | 24.66            |
| AMI dev+test (26 meetings, collar, oracle #spk)  | 6.39  | 6.30             |
| AMI dev+test (26 meetings, collar, est #spk)     | 7.29  | 10.14            |