Python Speaker Diarization

Spectral Clustering Python

Speaker Diarization Spectral Clustering

Auto-Tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap

<img src="./pics/adj_mat.png" width="40%" height="40%"> <img src="./pics/gp_vs_nme.png" width="40%" height="40%">

@article{park2019auto,
  title={Auto-Tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap},
  author={Park, Tae Jin and Han, Kyu J and Kumar, Manoj and Narayanan, Shrikanth},
  journal={IEEE Signal Processing Letters},
  year={2019},
  publisher={IEEE}
}

Features of Auto-tuning NME-SC method

Auto-tuning NME-SC proposed method -

Performance Table

Track 1: Oracle VAD

| System | CALLHOME | CHAES-eval | CH109 | RT03(SW) | AMI |
|---|---|---|---|---|---|
| Kaldi PLDA + AHC [1] | 8.39% | 24.27% | 9.72% | 1.73% | - |
| Spectral Clustering COS+B-SC [2] | 8.78% | 4.4% | 2.25% | 0.88% | - |
| Auto-Tuning COS+NME-SC [2] | 7.29% | 2.48% | 2.63% | 2.21% | - |
| Auto-Tuning COS+NME-SC Sparse-Search-20 [2] | 7.24% | 2.48% | 2.00% | 0.92% | 4.21% |

Track 2: System VAD

| System | CALLHOME | CHAES-eval | CH109 | RT03(SW) |
|---|---|---|---|---|
| Kaldi PLDA + AHC [1] | 6.64% <br> (12.96%) | 1.45% <br> (5.52%) | 2.6% <br> (6.89%) | 0.99% <br> (3.53%) |
| Spectral Clustering COS+B-SC [2] | 6.91% <br> (13.23%) | 1.00% <br> (5.07%) | 1.46% <br> (5.75%) | 0.56% <br> (3.1%) |
| Auto-Tuning COS+NME-SC [2] | 5.41% <br> (11.73%) | 0.97% <br> (5.04%) | 1.32% <br> (5.61%) | 0.59% <br> (3.13%) |
| Auto-Tuning COS+NME-SC Sparse-Search-20 [2] | 5.55% <br> (11.87%) | 1.00% <br> (5.06%) | 1.42% <br> (5.72%) | 0.58% <br> (3.13%) |

Datasets

CALLHOME: NIST SRE 2000 (LDC2001S97). One of the most widely used diarization datasets.
CHAES-eval: CALLHOME American English Subset (CHAES) (LDC97S42). English corpus for speaker diarization, split into train/valid/eval sets.
CH-109 (LDC97S42): The two-speaker sessions in CHAES. Usually evaluated with the number of speakers provided.
RT03(SW) (LDC2007S10): The Switchboard portion of the RT03 dataset.

Reference

[1] PLDA + AHC, Callhome Diarization Xvector Model
[2] Tae Jin Park et al., Auto-Tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap, IEEE Signal Processing Letters, 2019

Getting Started

TL;DR: One-click demo script

source run_demo_clustering.sh

Prerequisites

joblib==0.14.0
numpy==1.17.4
scikit-learn==0.22
scipy==1.3.3
kaldi_io==0.9.1

Installing

virtualenv must be installed on your machine first. Install it with the following command:

sudo pip3 install virtualenv 

Once virtualenv is installed, run the "install_venv.sh" script to create a virtual environment.

source install_venv.sh

This command will create a folder named "env_nmesc".

Usage Example

You need to prepare the following:

  1. Segmentation files in Kaldi style format:
    <segment_id> <utt_id> <start_time> <end_time>

ex) segments

iaaa-00000-00327-00000000-00000150 iaaa 0 1.5
iaaa-00000-00327-00000075-00000225 iaaa 0.75 2.25
iaaa-00000-00327-00000150-00000300 iaaa 1.5 3
...
iafq-00000-00272-00000000-00000150 iafq 0 1.5
iafq-00000-00272-00000075-00000225 iafq 0.75 2.25
iafq-00000-00272-00000150-00000272 iafq 1.5 2.72
  2. Affinity matrix files in Kaldi scp/ark format: Each affinity matrix file should contain an N by N square matrix.
  3. Speaker embedding files: If you don't have an affinity matrix, you can calculate cosine similarity ark files using ./sc_utils/score_embedding.sh. See run_demo_clustering.sh for how to calculate the cosine similarity files. (You can choose scp/ark or npy.)
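
To make the segments format from item 1 concrete, here is a minimal sketch that groups segment lines by utterance. The function name `read_segments` and the returned dictionary layout are assumptions made for this illustration; they are not taken from the repository code.

```python
from collections import defaultdict


def read_segments(lines):
    """Group Kaldi-style segment lines by utterance ID.

    Each line is expected to look like:
        <segment_id> <utt_id> <start_time> <end_time>
    """
    segments = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or placeholder lines such as "..."
        seg_id, utt_id, start, end = fields
        segments[utt_id].append((seg_id, float(start), float(end)))
    return dict(segments)


example = """\
iaaa-00000-00327-00000000-00000150 iaaa 0 1.5
iaaa-00000-00327-00000075-00000225 iaaa 0.75 2.25
iafq-00000-00272-00000000-00000150 iafq 0 1.5
"""
segments = read_segments(example.splitlines())
print(sorted(segments))  # utterance IDs found in the example
```

The number of segments collected for an utterance should match the size N of that utterance's N by N affinity matrix.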

Running the python code with arguments:

python spectral_opt.py --distance_score_file $DISTANCE_SCORE_FILE \
                       --threshold $threshold \
                       --score-metric $score_metric \
                       --max_speaker $max_speaker \
                       --spt_est_thres $spt_est_thres \
                       --segment_file_input_path $SEGMENT_FILE_INPUT_PATH \
                       --spk_labels_out_path $SPK_LABELS_OUT_PATH \
                       --reco2num_spk $reco2num_spk 

Arguments:

# If you want to use kaldi .ark score file as an affinity matrix
DISTANCE_SCORE_FILE=$PWD/sample_CH_xvector/cos_scores/scores.scp

# If you want to use .npy numpy file as an affinity matrix
DISTANCE_SCORE_FILE=$PWD/sample_CH_xvector/cos_scores/scores.txt

Two options are available:

(1) scores.scp: Kaldi-style scp file that lists, for each utterance, the absolute path to the .ark file and the byte offset within it. Each line is a space-separated <utt_id> and <path>.

ex) scores.scp

iaaa /path/sample_CH_xvector/cos_scores/scores.1.ark:5
iafq /path/sample_CH_xvector/cos_scores/scores.1.ark:23129
...

(2) scores.txt: List of <utt_id> and the absolute path to .npy files.
ex) scores.txt

iaaa /path/sample_CH_xvector/cos_scores/iaaa.npy
iafq /path/sample_CH_xvector/cos_scores/iafq.npy
...
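
The two list formats above can be read with a few lines of Python. The sketch below is illustrative only: `parse_scp_line` and `load_npy_list` are hypothetical helper names, and reading the matrix stored at the ark offset is left to a Kaldi I/O library rather than shown here.

```python
import os
import tempfile

import numpy as np


def parse_scp_line(line):
    """Split '<utt_id> <ark_path>:<offset>' into (utt_id, ark_path, offset)."""
    utt_id, location = line.split(maxsplit=1)
    ark_path, offset = location.strip().rsplit(":", 1)
    return utt_id, ark_path, int(offset)


def load_npy_list(list_path):
    """Read '<utt_id> <npy_path>' lines and load each affinity matrix."""
    matrices = {}
    with open(list_path) as f:
        for line in f:
            parts = line.split(maxsplit=1)
            if len(parts) != 2:
                continue  # skip blank or placeholder lines
            utt_id, npy_path = parts[0], parts[1].strip()
            mat = np.load(npy_path)
            if mat.ndim != 2 or mat.shape[0] != mat.shape[1]:
                raise ValueError(f"{utt_id}: affinity matrix must be square")
            matrices[utt_id] = mat
    return matrices


# Self-contained demo: write a tiny matrix and a one-line list file.
with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, "iaaa.npy")
    np.save(npy_path, np.eye(3))
    list_path = os.path.join(tmp, "scores.txt")
    with open(list_path, "w") as f:
        f.write(f"iaaa {npy_path}\n")
    loaded = load_npy_list(list_path)

scp_entry = parse_scp_line("iaaa /path/sample_CH_xvector/cos_scores/scores.1.ark:5")
```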
score_metric='cos'
max_speaker=8
threshold=0.05
# You can specify a threshold.
spt_est_thres='None'
threshold=0.05 

# Or you can use NMESC in the paper to estimate the threshold.
spt_est_thres='NMESC'
threshold='None'

# Or you can specify different threshold for each utterance.
spt_est_thres="thres_utts.txt"
threshold='None'

Each line of thres_utts.txt has the format: <utt_id> <threshold>

ex) thres_utts.txt

iaaa 0.105
iafq 0.215
...
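
For readers curious about what the 'NMESC' option estimates, the following is a rough sketch of a normalized-maximum-eigengap search, loosely based on the procedure described in [2]. It is not the code in spectral_opt.py. The function names, the use of an unnormalized Laplacian, and the treatment of the pruning value as p / N are all assumptions made for this illustration.

```python
import numpy as np


def binarize_top_p(affinity, p):
    """Keep the p largest entries of each row as 1, then symmetrize."""
    n = affinity.shape[0]
    top_idx = np.argsort(affinity, axis=1)[:, -p:]
    binary = np.zeros((n, n))
    binary[np.arange(n)[:, None], top_idx] = 1.0
    return 0.5 * (binary + binary.T)


def eigengap_stats(affinity, p, max_speaker=8, eps=1e-10):
    """Return (normalized maximum eigengap, estimated speaker count) for one p."""
    graph = binarize_top_p(affinity, p)
    laplacian = np.diag(graph.sum(axis=1)) - graph
    eigvals = np.linalg.eigvalsh(laplacian)  # eigenvalues in ascending order
    gaps = np.diff(eigvals)[:max_speaker]
    nme = gaps.max() / (eigvals[-1] + eps)
    return nme, int(np.argmax(gaps)) + 1


def nme_search(affinity, p_values, max_speaker=8, eps=1e-10):
    """Pick the p that minimizes (p / N) / NME (assumed selection rule)."""
    n = affinity.shape[0]
    best = None
    for p in p_values:
        nme, num_spk = eigengap_stats(affinity, p, max_speaker)
        ratio = (p / n) / (nme + eps)
        if best is None or ratio < best[0]:
            best = (ratio, p, num_spk)
    return best[1], best[2]


# Toy affinity: two perfectly separated groups of four segments each.
toy = np.kron(np.eye(2), np.ones((4, 4)))
best_p, est_num_spk = nme_search(toy, p_values=(4, 8))
print(best_p, est_num_spk)
```

On the toy matrix, keeping four neighbors per row preserves the two blocks, while keeping all eight merges them into a single fully connected graph, so the search prefers the smaller p.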
segment_file_input_path=$PWD/sample_CH_xvector/xvector_embeddings/segments

ex) segments

iaaa-00000-00327-00000000-00000150 iaaa 0 1.5
iaaa-00000-00327-00000075-00000225 iaaa 0.75 2.25
iaaa-00000-00327-00000150-00000300 iaaa 1.5 3
...
iafq-00000-00272-00000000-00000150 iafq 0 1.5
iafq-00000-00272-00000075-00000225 iafq 0.75 2.25
iafq-00000-00272-00000150-00000272 iafq 1.5 2.72
reco2num_spk='None'
reco2num_spk='oracle_num_of_spk.txt'

Each line of the text file must contain <utt_id> and <oracle_number_of_speakers>.
ex) oracle_num_of_spk.txt

iaaa 2
iafq 2
iabe 4
iadf 6
...
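
Both per-utterance text files described above (thres_utts.txt and oracle_num_of_spk.txt) use the same two-column layout, so one small reader can cover both. This is a sketch under that assumption; `read_utt_table` is a hypothetical helper name, not a function from the repository.

```python
def read_utt_table(lines, cast):
    """Parse '<utt_id> <value>' lines into a dict, converting values with `cast`."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or placeholder lines such as "..."
        utt_id, value = fields
        table[utt_id] = cast(value)
    return table


# Number of speakers per utterance (integers).
num_spk = read_utt_table(["iaaa 2", "iafq 2", "iabe 4", "iadf 6", "..."], int)

# Per-utterance thresholds (floats).
thresholds = read_utt_table(["iaaa 0.105", "iafq 0.215"], float)
```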

Cosine similarity calculator script

Running the python code for cosine similarity calculation:

data_dir=$PWD/sample_CH_xvector
pushd $PWD/sc_utils
text_yellow_info "Starting Script: affinity_score.py"
./score_embedding.sh --cmd "run.pl --mem 5G" \
                     --score-metric $score_metric \
                      $data_dir/xvector_embeddings \
                      $data_dir/cos_scores 
popd
score_metric='cos'
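
For reference, the cosine similarity that the 'cos' metric refers to can be written in a few lines of NumPy. This is a generic sketch of the computation, not the implementation behind score_embedding.sh; the function name `cosine_affinity` is an assumption.

```python
import numpy as np


def cosine_affinity(embeddings):
    """Return the N x N cosine similarity matrix for N row-vector embeddings."""
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)  # guard against zero vectors
    return unit @ unit.T


# Three toy embeddings: the first two point the same way, the third is orthogonal.
emb = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
affinity = cosine_affinity(emb)
print(affinity.shape)
```

The result is a symmetric N by N matrix with ones on the diagonal, which is the shape of input expected for an affinity matrix file.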

Expected output result of one-click script

$ source run_demo_clustering.sh 
=== [INFO] The python_envfolder exists: /.../Auto-Tuning-Spectral-Clustering/env_nmesc 
=== [INFO] Cosine similariy scores exist: /.../Auto-Tuning-Spectral-Clustering/sample_CH_xvector/cos_scores 
=== [INFO] Running Spectral Clustering with .npy input... 
=== [INFO] .scp file and .ark files were provided
Scanning eig_ratio of length [19] mat size [76] ...
1  score_metric: cos  affinity matrix pruning - threshold: 0.105  key: iaaa Est # spk: 2  Max # spk: 8  MAT size :  (76, 76)
Scanning eig_ratio of length [15] mat size [62] ...
2  score_metric: cos  affinity matrix pruning - threshold: 0.194  key: iafq Est # spk: 2  Max # spk: 8  MAT size :  (62, 62)
Method: Spectral Clustering has been finished 
=== [INFO] Computing RTTM 
=== [INFO] RTTM calculation was successful. 
=== [INFO] NMESC auto-tuning | Total Err. (DER) -[ 0.32 % ] Speaker Err. [ 0.32 % ] 
=== [INFO] .scp file and .ark files were provided
1  score_metric: cos  affinity matrix pruning - threshold: 0.050  key: iaaa Est # spk: 2  Max # spk: 8  MAT size :  (76, 76)
2  score_metric: cos  affinity matrix pruning - threshold: 0.050  key: iafq Est # spk: 5  Max # spk: 8  MAT size :  (62, 62)
Method: Spectral Clustering has been finished 
=== [INFO] Computing RTTM 
=== [INFO] RTTM calculation was successful. 
=== [INFO] Threshold 0.05 | Total Err. (DER) -[ 20.57 % ] Speaker Err. [ 20.57 % ] 
Loading reco2num_spk file:  reco2num_spk
=== [INFO] .scp file and .ark files were provided
1  score_metric: cos  Rank based pruning - RP threshold: 0.0500  key: iaaa  Given Number of Speakers (reco2num_spk): 2  MAT size :  (76, 76)
2  score_metric: cos  Rank based pruning - RP threshold: 0.0500  key: iafq  Given Number of Speakers (reco2num_spk): 2  MAT size :  (62, 62)
Method: Spectral Clustering has been finished 
=== [INFO] Computing RTTM 
=== [INFO] RTTM calculation was successful. 
=== [INFO] Known Num. Spk. | Total Err. (DER) -[ 0.15 % ] Speaker Err. [ 0.15 % ] 

Authors

Tae Jin Park: inctrljinee@gmail.com, tango4j@gmail.com
Kyu J. Han
Manoj Kumar
Shrikanth Narayanan