Awesome Speaker Diarization

Overview

This is a curated list of awesome Speaker Diarization papers, libraries, datasets, and other resources.

The purpose of this repo is to organize the world’s resources for speaker diarization, and make them universally accessible and useful.

To add items to this page, send a pull request (see the contributing guide).

Publications

Special topics

Review & survey papers

Large language model (LLM)

Supervised diarization

Joint diarization and ASR

Online speaker diarization

Challenges

Audio-Visual Speaker Diarization

Other

2021

2020

2019

2018

2017

2016

2015

2014

2013

2011

2009

2008

2006

Software

Framework

| Link | Language | Description |
| --- | --- | --- |
| FunASR | Python & PyTorch | FunASR is an open-source speech toolkit based on PyTorch, which aims at bridging the gap between academic research and industrial applications. |
| MiniVox | MATLAB | MiniVox is an open-source evaluation system for the online speaker diarization task. |
| SpeechBrain | Python & PyTorch | SpeechBrain is an open-source and all-in-one speech toolkit based on PyTorch. |
| SIDEKIT for diarization (s4d) | Python | An open-source package extension of SIDEKIT for speaker diarization. |
| pyAudioAnalysis | Python | Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications. |
| AaltoASR | Python & Perl | Speaker diarization scripts, based on AaltoASR. |
| LIUM SpkDiarization | Java | LIUM_SpkDiarization is software dedicated to speaker diarization (i.e. speaker segmentation and clustering). It is written in Java, and includes the most recent developments in the domain (as of 2013). |
| kaldi-asr | Bash | Example scripts for speaker diarization on a portion of CALLHOME used in the 2000 NIST speaker recognition evaluation. |
| kaldi-speaker-diarization | Bash | Icelandic speaker diarization scripts using Kaldi. |
| Alize LIA_SpkSeg | C++ | ALIZÉ is an open-source platform for speaker recognition. LIA_SpkSeg provides the tools for speaker diarization. |
| pyannote-audio | Python | Neural building blocks for speaker diarization: speech activity detection, speaker change detection, speaker embedding. |
| pyBK | Python | Speaker diarization using binary key speaker modelling. Computationally light solution that does not require external training data. |
| Speaker-Diarization | Python | Speaker diarization using uis-rnn and GhostVLAD. An easier way to support open-set speakers. |
| EEND | Python & Bash & Perl | End-to-End Neural Diarization. |
| VBx | Python | Variational Bayes HMM over x-vectors diarization. x-vector extractor recipe |
| RE-VERB | Python & JavaScript | RE: VERB is a speaker diarization system. It allows the user to send or record audio of a conversation and receive timestamps of who spoke when. |
| StreamingSpeakerDiarization | Python | Streaming speaker diarization; extends pyannote.audio to online processing. |
| simple_diarizer | Python | Simplified diarization pipeline using some pretrained models. Made to be as simple as possible to go from an input audio file to diarized segments. |
| Picovoice Falcon | C & Python | A lightweight, accurate, and fast speaker diarization engine written in C and available in Python, running on CPU with minimal overhead. |
| DiaPer | Python | PyTorch implementation for DiaPer: End-to-End Neural Diarization with Perceiver-Based Attractors, including models pre-trained on free and public data. |
| sherpa-onnx | C++ & C & C# & Dart & Go & Java & JavaScript & Kotlin & Pascal & Python & Rust & Swift | Supports speaker diarization, speech recognition, and text-to-speech on various platforms with various language bindings. |

Evaluation

| Link | Language | Description |
| --- | --- | --- |
| pyannote-metrics | Python | A toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems. |
| SimpleDER | Python | A lightweight library to compute Diarization Error Rate (DER). |
| DiarizationLM | Python | Implements Word Error Rate (WER), Word Diarization Error Rate (WDER), and concatenated minimum-permutation Word Error Rate (cpWER). |
| NIST md-eval | Perl | (1) modified md-eval.pl from Mary Tai Knox; (2) md-eval-v21.pl from jitendra; (3) md-eval-22.pl from nryant |
| dscore | Python & Perl | Diarization scoring tools. |
| Sequence Match Accuracy | Python | Computes the accuracy between two label sequences after matching their labels with the Hungarian algorithm. |
| spyder | Python & C++ | Simple Python package for fast DER computation. |
| CDER | Python | Conversational DER from the paper "The Conversational Short-phrase Speaker Diarization (CSSD) Task: Dataset, Evaluation Metric and Baselines". |
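
As an illustration of what the Sequence Match Accuracy entry above measures, here is a minimal sketch, not the code of any listed package. Diarization labels are arbitrary, so a hypothesis must be mapped onto the reference labels before counting matches. The listed tool uses the Hungarian algorithm for that mapping; this sketch swaps in an exhaustive search over label assignments, which gives the same optimum but is only practical for a handful of speakers. The function name `sequence_match_accuracy` is hypothetical.

```python
from collections import Counter
from itertools import permutations


def sequence_match_accuracy(reference, hypothesis):
    """Accuracy of `hypothesis` against `reference` under the best label mapping.

    Both arguments are equal-length sequences of per-frame (or per-segment)
    speaker labels. Each hypothesis label is mapped to at most one reference
    label, and the mapping that maximizes the number of matches is used.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    if not reference:
        return 1.0

    # overlap[(h, r)] = number of positions labelled h in the hypothesis
    # and r in the reference.
    overlap = Counter(zip(hypothesis, reference))
    ref_labels = list(dict.fromkeys(reference))
    hyp_labels = list(dict.fromkeys(hypothesis))

    # If the hypothesis has more labels than the reference, pad with a
    # sentinel so the extra hypothesis labels can stay unmatched.
    unmatched = object()
    padding = [unmatched] * max(0, len(hyp_labels) - len(ref_labels))

    best = 0
    # Exhaustive search (assumption: few speakers). The Hungarian algorithm
    # solves the same assignment problem in polynomial time.
    for assignment in permutations(ref_labels + padding, len(hyp_labels)):
        matched = sum(overlap[(h, r)] for h, r in zip(hyp_labels, assignment))
        best = max(best, matched)
    return best / len(reference)


if __name__ == "__main__":
    ref = ["A", "A", "B", "B", "C"]
    hyp = [1, 1, 0, 0, 0]
    print(sequence_match_accuracy(ref, hyp))
```

In the usage example, hypothesis label `1` maps to `A` and `0` maps to `B`, so four of the five positions match.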

Clustering

| Link | Language | Description |
| --- | --- | --- |
| uis-rnn | Python & PyTorch | Google's Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, for Fully Supervised Speaker Diarization. This clustering algorithm is supervised. |
| uis-rnn-sml | Python & PyTorch | A variant of UIS-RNN, for the paper Supervised Online Diarization with Sample Mean Loss for Multi-Domain Data. |
| DNC | Python & ESPnet | Transformer-based Discriminative Neural Clustering (DNC) for Speaker Diarisation. Like UIS-RNN, it is supervised. |
| SpectralCluster | Python | Spectral clustering with affinity matrix refinement operations, auto-tune, and speaker turn constraints. |
| sklearn.cluster | Python | scikit-learn clustering algorithms. |
| PLDA | Python | Probabilistic Linear Discriminant Analysis & classification, written in Python. |
| PLDA | C++ | Open-source implementation of simplified PLDA (Probabilistic Linear Discriminant Analysis). |
| Auto-Tuning Spectral Clustering | Python | Auto-tuning Spectral Clustering method that does not need development set or supervised tuning. |

Speaker embedding

| Link | Method | Language | Description |
| --- | --- | --- | --- |
| resemble-ai/Resemblyzer | d-vector | Python & PyTorch | PyTorch implementation of generalized end-to-end loss for speaker verification, which can be used for voice cloning and diarization. |
| Speaker_Verification | d-vector | Python & TensorFlow | TensorFlow implementation of generalized end-to-end loss for speaker verification. |
| PyTorch_Speaker_Verification | d-vector | Python & PyTorch | PyTorch implementation of "Generalized End-to-End Loss for Speaker Verification" by Wan, Li et al. With UIS-RNN integration. |
| Real-Time Voice Cloning | d-vector | Python & PyTorch | Implementation of "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis" (SV2TTS) with a vocoder that works in real-time. |
| conformer-speaker-encoder | d-vector | Python & TFLite | Massively multilingual conformer-based speaker recognition models in TFLite format. |
| deep-speaker | d-vector | Python & Keras | Third party implementation of the Baidu paper Deep Speaker: an End-to-End Neural Speaker Embedding System. |
| x-vector-kaldi-tf | x-vector | Python & TensorFlow & Perl | TensorFlow implementation of x-vector topology on top of Kaldi recipe. |
| kaldi-ivector | i-vector | C++ & Perl | Extension to Kaldi implementing the standard i-vector hyperparameter estimation and i-vector extraction procedure. |
| voxceleb-ivector | i-vector | Perl | Voxceleb1 i-vector based speaker recognition system. |
| pytorch_xvectors | x-vector | Python & PyTorch | PyTorch implementation of Voxceleb x-vectors. Additionally, includes meta-learning architectures for embedding training. Evaluated with speaker diarization and speaker verification. |
| ASVtorch | i-vector | Python & PyTorch | ASVtorch is a toolkit for automatic speaker recognition. |
| asv-subtools | i-vector & x-vector | Kaldi & PyTorch | ASV-Subtools is developed based on PyTorch and Kaldi for the task of speaker recognition, language identification, etc. The 'sub' of 'subtools' means that there are many modular tools and the parts constitute the whole. |
| WeSpeaker | x-vector & r-vector | Python & C++ & PyTorch | WeSpeaker is a research and production oriented speaker verification, recognition and diarization toolkit, which supports very strong recipes with on-the-fly data preparation, model training and evaluation, as well as runtime C++ codes. |
| ReDimNet | improved ResNet | PyTorch | Neural network architecture presented in the paper Reshape Dimensions Network for Speaker Recognition. |

Speaker change detection

| Link | Language | Description |
| --- | --- | --- |
| change_detection | Python & Keras | Code for Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks. |
| tidydiarize | Python | Diarization inside OpenAI Whisper decoder. |

Audio feature extraction

| Link | Language | Description |
| --- | --- | --- |
| LibROSA | Python | Python library for audio and music analysis. https://librosa.github.io/ |
| python_speech_features | Python | This library provides common speech features for ASR including MFCCs and filterbank energies. https://python-speech-features.readthedocs.io/en/latest/ |
| pyAudioAnalysis | Python | Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications. |

Audio data augmentation

| Link | Language | Description |
| --- | --- | --- |
| pyroomacoustics | Python | Pyroomacoustics is a package for audio signal processing for indoor applications. It was developed as a fast prototyping platform for beamforming algorithms in indoor scenarios. https://pyroomacoustics.readthedocs.io |
| gpuRIR | Python | Python library for Room Impulse Response (RIR) simulation with GPU acceleration. |
| rir_simulator_python | Python | Room impulse response simulator using Python. |
| WavAugment | Python & PyTorch | WavAugment performs data augmentation on audio data. The audio data is represented as PyTorch tensors. |
| EEND_dataprep | Bash & Python | Recipes for generating simulated conversations used to train end-to-end diarization models. |

Other software

| Link | Language | Description |
| --- | --- | --- |
| VB Diarization | Python | VB Diarization with Eigenvoice and HMM Priors. |
| DOVER-Lap | Python | Python package for combining diarization system outputs. |
| Diar-az | Python | Data formatting tool to support the ruv-di dataset. Kaldi to Gecko to Kaldi and corpus and back |

Datasets

Diarization datasets

| Audio | Diarization ground truth | Language | Pricing | Additional information |
| --- | --- | --- | --- | --- |
| 2000 NIST Speaker Recognition Evaluation | Disk-6 (Switchboard), Disk-8 (CALLHOME) | Multiple | $2400.00 | Evaluation Plan |
| 2003 NIST Rich Transcription Evaluation Data | Together with audio | en, ar, zh | $2000.00 | Telephone speech, broadcast news |
| CALLHOME American English Speech | CALLHOME American English Transcripts | en | $1500.00 + $1000.00 | CH109 whitelist |
| The ICSI Meeting Corpus | Together with audio | en | Free | License |
| The AMI Meeting Corpus | Together with audio (needs to be processed) | Multiple | Free | License |
| Fisher English Training Speech Part 1 Speech | Fisher English Training Speech Part 1 Transcripts | en | $7000.00 + $1000.00 | |
| Fisher English Training Part 2, Speech | Fisher English Training Part 2, Transcripts | en | $7000.00 + $1000.00 | |
| VoxConverse | TBD | TBD | Free | VoxConverse is an audio-visual diarisation dataset consisting of over 50 hours of multispeaker clips of human speech, extracted from YouTube videos. |
| MiniVox Benchmark | MiniVox Benchmark | en | Free | MiniVox is an automatic framework to transform any speaker-labelled dataset into a continuous speech datastream with episodically revealed label feedback. |
| The AliMeeting Corpus | Together with audio | zh | Free | |

Speaker embedding training sets

| Name | Utterances | Speakers | Language | Pricing | Additional information |
| --- | --- | --- | --- | --- | --- |
| TIMIT | 6K+ | 630 | en | $250.00 | Published in 1993, the TIMIT corpus of read speech is one of the earliest speaker recognition datasets. |
| VCTK | 43K+ | 109 | en | Free | Most sentences were selected from a newspaper, plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker's accent. |
| LibriSpeech | 292K | 2K+ | en | Free | Large-scale (1000 hours) corpus of read English speech. |
| Multilingual LibriSpeech (MLS) | ? | ? | en, de, nl, es, fr, it, pt, pl | Free | Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, Polish. |
| LibriVox | 180K | 9K+ | Multiple | Free | Free public domain audiobooks. LibriSpeech is a processed subset of LibriVox. Each original unsegmented utterance could be very long. |
| VoxCeleb 1&2 | 1M+ | 7K | Multiple | Free | VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube. |
| The Spoken Wikipedia Corpora | 5K | 879 | en, de, nl | Free | Volunteer readers reading Wikipedia articles. |
| CN-Celeb | 130K+ | 1K | zh | Free | A Free Chinese Speaker Recognition Corpus Released by CSLT@Tsinghua University. |
| BookTubeSpeech | 8K | 8K | en | Free | Audio samples extracted from BookTube videos (videos where people share their opinions on books) from YouTube. The dataset can be downloaded using BookTubeSpeech-download. |
| DeepMine | 540K | 1850 | fa, en | Unknown | A speech database in Persian and English designed to build and evaluate speaker verification, as well as Persian ASR systems. |
| NISP-Dataset | ? | 345 | hi, kn, ml, ta, te (all Indian languages) | Free | This dataset contains speech recordings along with speaker physical parameters (height, weight, ...) as well as regional information and linguistic information. |
| VoxBlink2 | 10M | 100k+ | 18 languages (en, pt, es, ru, ar, ...) | CC BY-NC-SA 4.0 | Multilingual dataset from VoxBlink2: A 100K+ Speaker Recognition Corpus and the Open-Set Speaker-Identification Benchmark. |

Augmentation noise sources

| Name | Utterances | Pricing | Additional information |
| --- | --- | --- | --- |
| AudioSet | 2M | Free | A large-scale dataset of manually annotated audio events. |
| MUSAN | N/A | Free | MUSAN is a corpus of music, speech, and noise recordings. |

Conferences

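| Conference/Workshop | Frequency | Page Limit | Organization | Blind Review |
| --- | --- | --- | --- | --- |
| ICASSP | Annual | 4 + 1 (ref) | IEEE | No |
| InterSpeech | Annual | 4 + 1 (ref) | ISCA | No |
| Speaker Odyssey | Biennial | 8 + 2 (ref) | ISCA | No |
| SLT | Biennial | 6 + 2 (ref) | IEEE | Yes |
| ASRU | Biennial | 6 + 2 (ref) | IEEE | Yes |
| WASPAA | Biennial | 4 + 1 (ref) | IEEE | No |
| IJCB | Annual | 8 | IEEE & IAPR TC-4 | Yes |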

Other learning materials

Online courses

Books

Tech blogs

Video tutorials

Products

| Company | Product |
| --- | --- |
| Google | Recorder app |
| Google | Google Cloud Speech-to-Text API |
| Amazon | Amazon Transcribe |
| IBM | Watson Speech To Text API |
| DeepAffects | Speaker Diarization API |
| Alibaba | Tingwu (听悟) |
| Microsoft | Azure Conversation Transcription API |
