tinydiarize 🐥🗣️
- Speaker diarization labels who said what in a transcript (e.g. Speaker A, Speaker B …). It is essential for conversation transcripts like meetings or podcasts.
- tinydiarize aims to be a minimal, interpretable extension of OpenAI's Whisper models that adds speaker diarization with few extra dependencies (inspired by minGPT).
- This uses a finetuned model that adds special tokens to mark speaker changes [1,2,3,4]. It can use both voice and semantic context to tell speakers apart, which is a unique benefit of this approach.
- You can refer to tdrz_dev for a detailed analysis of performance. Note that this is intended to be a prototype/proof-of-concept.
- Experimental support has also been added to whisper.cpp, so this can run on consumer hardware like MacBooks and iPhones. Only a small change to the original inference code is needed (<50 lines), which makes speaker segmentation simple and cheap compared with conventional approaches.
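To make the special-token idea concrete, here is a minimal sketch of how decoded text containing a speaker-change marker could be split into turns. The marker string `<|speakerturn|>` and the helper `split_turns` are assumptions made for illustration, not taken from this repo's inference code; check the tokenizer for the actual token text.

```python
SPEAKER_TURN = "<|speakerturn|>"  # assumed marker text, for illustration only


def split_turns(transcript: str, marker: str = SPEAKER_TURN) -> list[str]:
    """Split decoded text into speaker turns at each marker, dropping empty turns."""
    pieces = (piece.strip() for piece in transcript.split(marker))
    return [piece for piece in pieces if piece]


text = (
    "Good morning everyone. <|speakerturn|> Thanks, happy to be here. "
    "<|speakerturn|> Let's begin."
)
for i, turn in enumerate(split_turns(text)):
    # The index counts turns only: a turn boundary does not say which
    # turns belong to the same speaker.
    print(f"turn {i}: {turn}")
```

The sketch yields turn boundaries rather than speaker identities, so mapping turns to labels like Speaker A or Speaker B would need an additional clustering step.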
Demo
You can try it out on other such gems from YouTube using this notebook.
Quickstart
Install `ffmpeg` following the original repo, then run:
pip install -e .
whisper --model small.en-tdrz AUDIO
The only change is the `small.en-tdrz` model instead of `small.en`. That's it! 🎉
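If you would rather drive the same command from a script, a thin wrapper around the CLI might look like the sketch below. The helper names `build_command` and `transcribe` are hypothetical; the only thing taken from this README is the command line itself.

```python
import shutil
import subprocess


def build_command(audio_path: str, model: str = "small.en-tdrz") -> list[str]:
    """Return the Quickstart CLI invocation as an argument list."""
    return ["whisper", "--model", model, audio_path]


def transcribe(audio_path: str) -> None:
    """Run the CLI on one file; raises CalledProcessError if it fails."""
    if shutil.which("whisper") is None:
        raise RuntimeError("whisper CLI not found; run `pip install -e .` first")
    subprocess.run(build_command(audio_path), check=True)
```

Calling `transcribe("meeting.wav")` would then run `whisper --model small.en-tdrz meeting.wav`, assuming the install step above has been done.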
What's included?
- Finetuned checkpoint for the `small.en-tdrz` model (located here) and example inference code (relevant edits in [#4] [#11]). This has the same dependencies as the original whisper repo.
- Tools for comparison and analysis (under /tdrz_dev):
  - A scoring tool to measure and compare accuracy on your own data in an easy to interpret way.
  - A reference script to run and compare various diarization pipelines.
  - A Jupyter notebook to compare and understand performance in detail.
- See Roadmap for more info.
We aim to demonstrate a starting point enabling anyone (or even OpenAI themselves!) to improve performance and extend support (multilingual, speech translation etc.).
Performance
| metric (%) | small.en | small.en-tdrz |
|---|---|---|
| spk_turn_precision | - | 97.7 |
| spk_turn_recall | - | 70.8 |
| wer_overall | 11.0 | 10.3 |
| wer_speaker_switch | 15.0 | 15.5 |
On a (tiny) benchmark set of 3 earnings calls, `tdrz` gets near-perfect speaker turn precision at fairly decent recall, while WER stays close to that of the original model. Not too shabby for a tiny finetuning setup, and <10% extra inference cost!
Refer to tdrz_dev for details on performance analysis and comparisons.
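As a reading aid for the table, the sketch below shows the standard definitions of precision and recall and combines the two reported speaker-turn numbers into an F1 score. It illustrates the textbook formulas only and is not the repo's scoring tool; the function names are hypothetical.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Standard precision and recall from match counts (0.0 when undefined)."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Reported speaker-turn precision and recall from the table above, in percent.
print(f"F1: {f1_score(97.7, 70.8):.1f}")  # prints "F1: 82.1"
```

High precision with lower recall means that predicted speaker turns are almost always correct, while roughly three in ten true speaker changes go unmarked.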
More info
- Whisper `small.en` checkpoints were finetuned on ~100hrs of AMI meetings using HuggingFace Transformers and Datasets.
- With some tricks, this can be done relatively cheaply: just 30 minutes of training on 1 GPU starts to produce decent results. Tiny indeed 😊.
- We used helpful tools from pyannote (the OG open-source diarization toolkit) to prepare finetuning data, and we also analyzed its performance.
- We make use of the excellent open-source revdotcom/fstalign tool for scoring and analysis.
Gotchas
Note that this is still an early proof-of-concept and there are a few things to be aware of:
- Only the `small.en` English model has been finetuned.
- Word-error-rate (WER) is close to that of the original models, although not yet extensively tested. Ad-hoc inspection does show some differences in timestamp behavior (longer segments) or deletion errors. See the notebook under tdrz_dev for details.
- Given a pretty tiny finetuning setup, there's likely a lot of room for further accuracy improvements.
- Only local diarization (segmentation into speaker turns) is handled so far. Extension with global diarization (speaker clustering) is planned for later.
- Stuff is still hacky and subject to change, so hold your horses just yet! 🐎
Roadmap
- inference code & demo
- scoring and analysis tools
- whisper.cpp integration
- reproducible dataprep + finetuning*
- blog post explainer*
- HuggingFace integration
- better LoRA-based `small.en` checkpoint
- possibly clustering with NME-SC?
- possibly `large-v2` checkpoint?
* is a pointer to the current state of the repo. Please see https://github.com/akashmjn/tinydiarize/issues/14 for an update on plans. TLDR; things have had to be put on pause :/
References
- [1] Joint Speech Recognition and Speaker Diarization via Sequence Transduction
- [2] Serialized Output Training for End-to-End Overlapped Speech Recognition
- [3] Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection
- [4] Adapting Multi-Lingual ASR Models for Handling Multiple Talkers
For information on the underlying Whisper model, please refer to the original documentation (release: `20230308`).
License
Code and model weights are released under the MIT License. See LICENSE for further details.
Citation
If you use this in your research, you can cite this work as:
@software{mahajan2023tinydiarize,
author = {Mahajan, Akash},
month = {08},
title = {tinydiarize: Minimal extension of Whisper for speaker segmentation with special tokens},
url = {https://github.com/akashmjn/tinyDiarize},
year = {2023}
}