<p align="center"> <br> <img src="docs/images/3D-Speaker-logo.png" width="400"/> <br> <p>

<div align="center">
<a href=""><img src="https://img.shields.io/badge/OS-Linux-orange.svg"></a> <a href=""><img src="https://img.shields.io/badge/Python->=3.8-aff.svg"></a> <a href=""><img src="https://img.shields.io/badge/Pytorch->=1.10-blue"></a>
</div>

<strong>3D-Speaker</strong> is an open-source toolkit for single- and multi-modal speaker verification, speaker recognition, and speaker diarization. All pretrained models are accessible on ModelScope. We also release a large-scale speech corpus, the 3D-Speaker dataset, to facilitate research on speech representation disentanglement.
## Benchmark

EER results on the VoxCeleb, CN-Celeb, and 3D-Speaker datasets for fully supervised speaker verification (lower is better):
Model | Params | VoxCeleb1-O | CNCeleb | 3D-Speaker |
---|---|---|---|---|
Res2Net | 4.03 M | 1.56% | 7.96% | 8.03% |
ResNet34 | 6.34 M | 1.05% | 6.92% | 7.29% |
ECAPA-TDNN | 20.8 M | 0.86% | 8.01% | 8.87% |
ERes2Net-base | 6.61 M | 0.84% | 6.69% | 7.21% |
CAM++ | 7.2 M | 0.65% | 6.78% | 7.75% |
ERes2NetV2 | 17.8 M | 0.61% | 6.14% | 6.52% |
ERes2Net-large | 22.46 M | 0.52% | 6.17% | 6.34% |
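EER (equal error rate) is the operating point at which the false-acceptance rate equals the false-rejection rate. As a minimal, self-contained sketch of how EER can be computed from trial scores (the function name and the scores are illustrative, not part of the toolkit):

```python
def compute_eer(genuine_scores, impostor_scores):
    """Sweep thresholds over all scores and return the error rate at the
    point where false-acceptance and false-rejection rates are closest."""
    best_gap, eer = 1.0, None
    for t in sorted(genuine_scores + impostor_scores):
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# toy scores: higher means "same speaker"
print(compute_eer([0.9, 0.8, 0.7, 0.3], [0.1, 0.2, 0.4, 0.6]))  # → 0.25
```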
DER results on public and internal multi-speaker datasets for speaker diarization (the second column is 3D-Speaker's result; lower is better):

Test set | 3D-Speaker | pyannote.audio | DiariZen_WavLM |
---|---|---|---|
Aishell-4 | 10.30% | 12.2% | 11.7% |
Alimeeting | 19.73% | 24.4% | 17.6% |
AMI_SDM | 21.76% | 22.4% | 15.4% |
VoxConverse | 11.75% | 11.3% | 28.39% |
Meeting-CN_ZH-1 | 18.91% | 22.37% | 32.66% |
Meeting-CN_ZH-2 | 12.78% | 17.86% | 18% |
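DER (diarization error rate) sums missed speech, false-alarm speech, and speaker-confusion time over the total reference speech time. A toy illustration with made-up durations (not toolkit code):

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """All arguments are durations in seconds."""
    return (missed + false_alarm + confusion) / total_speech

# e.g. 3 s missed + 2 s false alarm + 5 s confusion over 100 s of speech
print(diarization_error_rate(3.0, 2.0, 5.0, 100.0))  # → 0.1, i.e. 10% DER
```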
## Quickstart

### Install 3D-Speaker

```sh
git clone https://github.com/modelscope/3D-Speaker.git && cd 3D-Speaker
conda create -n 3D-Speaker python=3.8
conda activate 3D-Speaker
pip install -r requirements.txt
```
### Running experiments

```sh
# Speaker verification: ERes2NetV2 on the 3D-Speaker dataset
cd egs/3dspeaker/sv-eres2netv2/
bash run.sh
# Speaker verification: CAM++ on the 3D-Speaker dataset
cd egs/3dspeaker/sv-cam++/
bash run.sh
# Speaker verification: ECAPA-TDNN on the 3D-Speaker dataset
cd egs/3dspeaker/sv-ecapa/
bash run.sh
# Self-supervised speaker verification: SDPN on the VoxCeleb dataset
cd egs/voxceleb/sv-sdpn/
bash run.sh
# Audio and multimodal speaker diarization
cd egs/3dspeaker/speaker-diarization/
bash run_audio.sh
bash run_video.sh
# Language identification
cd egs/3dspeaker/language-idenitfication
bash run.sh
```
### Inference using pretrained models from ModelScope

All pretrained models are released on ModelScope.

```sh
# Install ModelScope
pip install modelscope
# ERes2Net trained on 200k labeled speakers
model_id=iic/speech_eres2net_sv_zh-cn_16k-common
# ERes2NetV2 trained on 200k labeled speakers
model_id=iic/speech_eres2netv2_sv_zh-cn_16k-common
# CAM++ trained on 200k labeled speakers
model_id=iic/speech_campplus_sv_zh-cn_16k-common
# Run CAM++ or ERes2Net inference
python speakerlab/bin/infer_sv.py --model_id $model_id
# Run batch inference
python speakerlab/bin/infer_sv_batch.py --model_id $model_id --wavs $wav_list
# SDPN trained on VoxCeleb
model_id=iic/speech_sdpn_ecapa_tdnn_sv_en_voxceleb_16k
# Run SDPN inference
python speakerlab/bin/infer_sv_ssl.py --model_id $model_id
# Run diarization inference
python speakerlab/bin/infer_diarization.py --wav [wav_list OR wav_path] --out_dir $out_dir
# Enable overlap detection
python speakerlab/bin/infer_diarization.py --wav [wav_list OR wav_path] --out_dir $out_dir --include_overlap --hf_access_token $hf_access_token
```
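The verification scripts above compare two utterances by the similarity of their speaker embeddings, typically cosine similarity. A pure-Python illustration of the scoring step (the vectors are made up; real embeddings come from the models above):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors: 1.0 for identical
    directions, 0.0 for orthogonal directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # → 1.0 (same speaker direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # → 0.0 (unrelated)
```

A trial is then accepted as "same speaker" when the score exceeds a threshold tuned on a development set.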
## Overview of Content

- **Supervised Speaker Verification**
  - CAM++, ERes2Net, ERes2NetV2, ECAPA-TDNN, ResNet, and Res2Net training recipes on 3D-Speaker.
  - CAM++, ERes2Net, ERes2NetV2, ECAPA-TDNN, ResNet, and Res2Net training recipes on VoxCeleb.
  - CAM++, ERes2Net, ERes2NetV2, ECAPA-TDNN, ResNet, and Res2Net training recipes on CN-Celeb.
- **Self-supervised Speaker Verification**
- **Speaker Diarization**
  - Speaker diarization inference recipes comprising multiple modules: overlap detection (optional), voice activity detection, speech segmentation, speaker embedding extraction, and speaker clustering.
- **Language Identification**
  - Language identification training recipes on 3D-Speaker.
- **3D-Speaker Dataset**
  - Dataset introduction and download address: 3D-Speaker
  - Related paper address: 3D-Speaker
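The final diarization module groups per-segment embeddings by speaker. The toolkit's actual clustering is more sophisticated, but a greedy sketch conveys the idea (function names, threshold, and vectors are all illustrative):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def cluster_embeddings(embeddings, threshold=0.8):
    """Greedy clustering: attach each segment embedding to the first cluster
    whose centroid is similar enough, otherwise start a new cluster."""
    centroids, labels = [], []
    for emb in embeddings:
        for idx, c in enumerate(centroids):
            if cosine(emb, c) >= threshold:
                labels.append(idx)
                # crude centroid update: average old centroid with new member
                centroids[idx] = [(x + y) / 2 for x, y in zip(c, emb)]
                break
        else:
            centroids.append(list(emb))
            labels.append(len(centroids) - 1)
    return labels

# two segments point the same way, the third is a different speaker
print(cluster_embeddings([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]))  # → [0, 0, 1]
```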
## What's new :fire:
- [2024.12] Update diarization recipes and add results on multiple diarization benchmarks.
- [2024.8] Releasing ERes2NetV2 and ERes2NetV2_w24s4ep4 pretrained models trained on 200k-speaker datasets.
- [2024.5] Releasing SDPN model and X-vector model training and inference recipes for VoxCeleb.
- [2024.5] Releasing visual module and semantic module training recipes.
- [2024.4] Releasing ONNX Runtime and the relevant scripts for inference.
- [2024.4] Releasing the ERes2NetV2 model, with fewer parameters and faster inference, on the VoxCeleb dataset.
- [2024.2] Releasing language identification recipes that integrate phonetic information for higher recognition accuracy.
- [2024.2] Releasing multimodal diarization recipes that fuse audio and video inputs to produce more accurate results.
- [2024.1] Releasing ResNet34 and Res2Net model training and inference recipes for 3D-Speaker, VoxCeleb and CN-Celeb datasets.
- [2024.1] Releasing large-margin finetune recipes in speaker verification and adding diarization recipes.
- [2023.11] ERes2Net-base pretrained model released, trained on a Mandarin dataset of 200k labeled speakers.
- [2023.10] Releasing ECAPA model training and inference recipes for three datasets.
- [2023.9] Releasing RDINO model training and inference recipes for CN-Celeb.
- [2023.8] Releasing CAM++, ERes2Net-Base and ERes2Net-Large benchmarks in CN-Celeb.
- [2023.8] Releasing ERes2Net and CAM++ in language identification for Mandarin and English.
- [2023.7] Releasing CAM++, ERes2Net-Base, ERes2Net-Large pretrained models trained on 3D-Speaker.
- [2023.7] Releasing Dialogue Detection and Semantic Speaker Change Detection in speaker diarization.
- [2023.7] Releasing CAM++ in language identification for Mandarin and English.
- [2023.6] Releasing 3D-Speaker dataset and its corresponding benchmarks including ERes2Net, CAM++ and RDINO.
- [2023.5] ERes2Net and CAM++ pretrained model released, trained on a Mandarin dataset of 200k labeled speakers.
## Contact

If you have any comments or questions about 3D-Speaker, please contact us by
- email: {yfchen97, wanghuii}@mail.ustc.edu.cn, {dengchong.d, zsq174630, shuli.cly}@alibaba-inc.com
## License
3D-Speaker is released under the Apache License 2.0.
## Acknowledgements

3D-Speaker contains third-party components and code modified from several open-source repos, including: <br> Speechbrain, Wespeaker, D-TDNN, DINO, Vicreg, TalkNet-ASD, Ultra-Light-Fast-Generic-Face-Detector-1MB, pyannote.audio
## Citations

If you find this repository useful, please consider giving it a star :star: and a citation :t-rex::

```bibtex
@inproceedings{chen20243d,
  title={3D-Speaker-Toolkit: An Open Source Toolkit for Multi-modal Speaker Verification and Diarization},
  author={Chen, Yafeng and Zheng, Siqi and Wang, Hui and Cheng, Luyao and others},
  booktitle={ICASSP},
  year={2025}
}
```