ReDimNet

This is the official implementation of the neural network architecture presented in the paper "Reshape Dimensions Network for Speaker Recognition".

<p align="center"> <img src="src/comparison_plot.png" alt="Sample" width="1000"> <p align="center"> <em>Speaker Recognition NN architectures comparison (2024)</em> </p> </p>

Introduction

We introduce Reshape Dimensions Network (ReDimNet), a novel neural network architecture for spectrogram (audio) processing, specifically for extracting utterance-level speaker representations. ReDimNet reshapes dimensionality between 2D feature maps and 1D signal representations, enabling the integration of 1D and 2D blocks within a single model. This architecture maintains the volume of channel-timestep-frequency outputs across both 1D and 2D blocks, ensuring efficient aggregation of residual feature maps. ReDimNet scales across various model sizes, from 1 to 15 million parameters and 0.5 to 20 GMACs. Our experiments show that ReDimNet achieves state-of-the-art performance in speaker recognition while reducing computational complexity and model size compared to existing systems.
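The volume-preserving reshape described above can be illustrated with a minimal sketch. The helper names `to_1d` and `to_2d`, the axis order (batch, channels, frequency, time), and the tensor sizes are assumptions made for illustration; they are not taken from the official implementation.

```python
import torch


def to_1d(x2d: torch.Tensor) -> torch.Tensor:
    """Fold the frequency axis into channels: (B, C, F, T) -> (B, C*F, T)."""
    b, c, f, t = x2d.shape
    return x2d.reshape(b, c * f, t)


def to_2d(x1d: torch.Tensor, channels: int) -> torch.Tensor:
    """Unfold channels back into frequency: (B, C*F, T) -> (B, C, F, T)."""
    b, cf, t = x1d.shape
    if cf % channels != 0:
        raise ValueError("channel-frequency volume must be divisible by channels")
    return x1d.reshape(b, channels, cf // channels, t)


# Illustrative sizes only (assumption): 16 channels, 72 frequency bins, 100 frames.
x2d = torch.randn(2, 16, 72, 100)
x1d = to_1d(x2d)       # suitable input for 1D (time-axis) blocks
restored = to_2d(x1d, channels=16)  # suitable input for 2D blocks again

print(x2d.shape, x1d.shape, restored.shape)
```

Because both views hold the same number of elements, the outputs of 1D and 2D blocks can be brought to a common shape and summed as residuals, which is the property the paragraph above relies on.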

<p align="center"> <img src="src/redimnet_scheme.png" alt="Sample" width="1000"> <p align="center"> <em>ReDimNet architecture</em> </p> </p>

Metrics

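| Model | Params | GMACs | LM | AS-Norm | Vox1-O EER (%) | Vox1-E EER (%) | Vox1-H EER (%) |
|---|---|---|---|---|---|---|---|
| ⬦ ReDimNet-B0 | 1.0M | 0.43 | | | 1.16 | 1.25 | 2.20 |
| ⬥ ReDimNet-B0 | | | | | 1.07 | 1.18 | 2.01 |
| NeXt-TDNN-l (C=128,B=3) | 1.6M | 0.29* | | | 1.10 | 1.24 | 2.12 |
| NeXt-TDNN (C=128,B=3) | 1.9M | 0.35* | | | 1.03 | 1.17 | 1.98 |
| ⬦ ReDimNet-B1 | 2.2M | 0.54 | | | 0.85 | 0.97 | 1.73 |
| ⬥ ReDimNet-B1 | | | | | 0.73 | 0.89 | 1.57 |
| ECAPA (C=512) | 6.4M | 1.05 | | | 0.94 | 1.21 | 2.20 |
| NeXt-TDNN-l (C=256,B=3) | 6.0M | 1.13* | | | 0.81 | 1.04 | 1.86 |
| CAM++ | 7.2M | 1.15 | | | 0.71 | 0.85 | 1.66 |
| NeXt-TDNN (C=256,B=3) | 7.1M | 1.35* | | | 0.79 | 1.04 | 1.82 |
| ⬦ ReDimNet-B2 | 4.7M | 0.90 | | | 0.57 | 0.76 | 1.32 |
| ⬥ ReDimNet-B2 | | | | | 0.52 | 0.74 | 1.27 |
| ECAPA (C=1024) | 14.9M | 2.67 | | | 0.98 | 1.13 | 2.09 |
| DF-ResNet56 | 4.5M | 2.66 | | | 0.96 | 1.09 | 1.99 |
| Gemini DF-ResNet60 | 4.1M | 2.50* | | | 0.94 | 1.05 | 1.80 |
| ⬦ ReDimNet-B3 | 3.0M | 3.00 | | | 0.50 | 0.73 | 1.33 |
| ⬥ ReDimNet-B3 | | | | | 0.47 | 0.69 | 1.23 |
| ResNet34 | 6.6M | 4.55 | | | 0.82 | 0.93 | 1.68 |
| Gemini DF-ResNet114 | 6.5M | 5.00 | | | 0.69 | 0.86 | 1.49 |
| ⬦ ReDimNet-B4 | 6.3M | 4.80 | | | 0.51 | 0.68 | 1.26 |
| ⬥ ReDimNet-B4 | | | | | 0.44 | 0.64 | 1.17 |
| Gemini DF-ResNet183 | 9.2M | 8.25 | | | 0.60 | 0.81 | 1.44 |
| DF-ResNet233 | 12.3M | 11.17 | | | 0.58 | 0.76 | 1.44 |
| ⬦ ReDimNet-B5 | 9.2M | 9.87 | | | 0.43 | 0.61 | 1.08 |
| ⬥ ReDimNet-B5 | | | | | 0.39 | 0.59 | 1.05 |
| ResNet293 | 23.8M | 28.10 | | | 0.53 | 0.71 | 1.30 |
| ECAPA2 | 27.1M | 187.00* | | | 0.44 | 0.62 | 1.15 |
| ⬦ ReDimNet-B6 | 15.0M | 20.27 | | | 0.40 | 0.55 | 1.05 |
| ⬥ ReDimNet-B6 | | | | | 0.37 | 0.53 | 1.00 |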

An asterisk (*) marks values that have been estimated.
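To make the size and accuracy trade-off concrete, the following sketch computes relative reductions between two rows of the table: ECAPA (C=1024) and ⬦ ReDimNet-B2. The numbers are copied from the table; the helper `relative_reduction` and the choice of rows are illustrative assumptions, not part of the repository.

```python
def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction by which `candidate` is lower than `baseline`."""
    return (baseline - candidate) / baseline


# Values copied from the metrics table: params (M), GMACs, Vox1-O EER (%).
metrics = ("params", "GMACs", "Vox1-O EER")
ecapa_c1024 = (14.9, 2.67, 0.98)
redimnet_b2 = (4.7, 0.90, 0.57)  # the ⬦ ReDimNet-B2 row

reductions = {
    name: relative_reduction(base, cand)
    for name, base, cand in zip(metrics, ecapa_c1024, redimnet_b2)
}

for name, value in reductions.items():
    print(f"{name}: {value:.1%} lower than ECAPA (C=1024)")
```

The same helper can be applied to any other pair of rows to compare models at a similar parameter or GMAC budget.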

Usage

Requirement

PyTorch>=2.0

Examples

import torch

# Load the model pretrained on VoxCeleb2, without Large-Margin fine-tuning
model = torch.hub.load('IDRnD/ReDimNet', 'b0', pretrained=True, finetuned=False)

# Load the model pretrained on VoxCeleb2, with Large-Margin fine-tuning
model = torch.hub.load('IDRnD/ReDimNet', 'b0', pretrained=True, finetuned=True)
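Once a model is loaded, a typical next step is to compare two utterances by the cosine similarity of their embeddings. The sketch below is hedged: it assumes the model maps a batch of raw waveforms of shape (batch, samples) to embeddings of shape (batch, D), which should be checked against the repository. The helpers `embed` and `cosine_score` are hypothetical, and a small stand-in module replaces the hub model so the example runs offline.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def embed(model: torch.nn.Module, waveform: torch.Tensor) -> torch.Tensor:
    """Run the model in eval mode without tracking gradients.

    ASSUMPTION: `model` accepts (batch, samples) and returns (batch, D).
    """
    model.eval()
    return model(waveform)


def cosine_score(emb_a: torch.Tensor, emb_b: torch.Tensor) -> float:
    """Cosine similarity between two single-utterance embeddings."""
    return F.cosine_similarity(emb_a.flatten(), emb_b.flatten(), dim=0).item()


# Stand-in for the hub model (assumption, for offline illustration only).
# Replace with the model returned by torch.hub.load(...) above.
stand_in = torch.nn.Linear(16000, 192)

wav_a = torch.randn(1, 16000)  # 1 second of placeholder audio at 16 kHz
wav_b = torch.randn(1, 16000)

score_same = cosine_score(embed(stand_in, wav_a), embed(stand_in, wav_a))
score_diff = cosine_score(embed(stand_in, wav_a), embed(stand_in, wav_b))
print(f"same input: {score_same:.3f}, different input: {score_diff:.3f}")
```

In a verification setting the score would be compared against a threshold tuned on held-out trials; the threshold is not specified here.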

Citation

If you find this work or code helpful in your research, please cite:

@inproceedings{yakovlev24_interspeech,
  title     = {Reshape Dimensions Network for Speaker Recognition},
  author    = {Ivan Yakovlev and Rostislav Makarov and Andrei Balykin and Pavel Malov and Anton Okhotnikov and Nikita Torgashov},
  year      = {2024},
  booktitle = {Interspeech 2024},
  pages     = {3235--3239},
  doi       = {10.21437/Interspeech.2024-2116},
}

Acknowledgements

We trained the models with the wespeaker pipeline. Some layers were ported from transformers.