<div align="center"> <img src="https://raw.githubusercontent.com/k2-fsa/icefall/master/docs/source/_static/logo.png" width=168> </div>

Introduction
The icefall project contains speech-related recipes for various datasets using k2-fsa and lhotse.
You can use sherpa, sherpa-ncnn, or sherpa-onnx to deploy models trained in icefall. These frameworks also support models that do not come from icefall; please refer to their respective documentation for details.
You can try pre-trained models from within your browser, without downloading or installing anything, by visiting this Hugging Face space. Please refer to the documentation for more details.
Installation
Please refer to the documentation for installation instructions.
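For orientation, a minimal from-source setup looks roughly like the following. This is only a sketch; it assumes PyTorch, k2, and lhotse are already installed as described in the installation documentation, which remains the authoritative reference.

```bash
# icefall is used from its source tree rather than installed as a pip package.
git clone https://github.com/k2-fsa/icefall
cd icefall

# Install the remaining Python dependencies.
pip install -r requirements.txt

# Make the icefall package importable from the recipe scripts.
export PYTHONPATH=$PWD:$PYTHONPATH
```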
Recipes
Please refer to the documentation for more details.
ASR: Automatic Speech Recognition
Supported Datasets
More datasets will be added in the future.
Supported Models
The LibriSpeech recipe supports the most comprehensive set of models; you are welcome to try them out.
CTC
- TDNN LSTM CTC
- Conformer CTC
- Zipformer CTC
MMI
- Conformer MMI
- Zipformer MMI
Transducer
- Conformer-based Encoder
- LSTM-based Encoder
- Zipformer-based Encoder
- LSTM-based Predictor
- Stateless Predictor
Whisper
- OpenAI Whisper (we support fine-tuning on Aishell-1)
If you would like to contribute to icefall, please refer to contributing for more details.
We would like to highlight the performance of some of the recipes here.
yesno
This is the simplest ASR recipe in icefall
and can be run on CPU.
Training takes less than 30 seconds and gives you the following WER:
```
[test_set] %WER 0.42% [1 / 240, 0 ins, 1 del, 0 sub ]
```
We provide a Colab notebook for this recipe:
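If you prefer to run it locally, the steps look roughly like this (a sketch assuming the usual recipe layout under egs/yesno/ASR; see that recipe's README for the exact commands):

```bash
cd egs/yesno/ASR

# Download the yesno data and compute features.
./prepare.sh

# Train the small TDNN CTC model; CPU is sufficient for this tiny dataset.
./tdnn/train.py

# Decode the test set and report the WER.
./tdnn/decode.py
```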
LibriSpeech
Please see RESULTS.md for the latest results.
Conformer CTC
|     | test-clean | test-other |
|-----|------------|------------|
| WER | 2.42       | 5.73       |
We provide a Colab notebook to test the pre-trained model:
TDNN LSTM CTC
|     | test-clean | test-other |
|-----|------------|------------|
| WER | 6.59       | 17.69      |
We provide a Colab notebook to test the pre-trained model:
Transducer (Conformer Encoder + LSTM Predictor)
|               | test-clean | test-other |
|---------------|------------|------------|
| greedy_search | 3.07       | 7.51       |
We provide a Colab notebook to test the pre-trained model:
Transducer (Conformer Encoder + Stateless Predictor)
|                                    | test-clean | test-other |
|------------------------------------|------------|------------|
| modified_beam_search (beam_size=4) | 2.56       | 6.27       |
We provide a Colab notebook to test the pre-trained model:
Transducer (Zipformer Encoder + Stateless Predictor)
WER (modified_beam_search with beam_size=4 unless stated otherwise)
- LibriSpeech-960hr
| Encoder         | Params | test-clean | test-other | epochs | devices    |
|-----------------|--------|------------|------------|--------|------------|
| Zipformer       | 65.5M  | 2.21       | 4.79       | 50     | 4 32G-V100 |
| Zipformer-small | 23.2M  | 2.42       | 5.73       | 50     | 2 32G-V100 |
| Zipformer-large | 148.4M | 2.06       | 4.63       | 50     | 4 32G-V100 |
| Zipformer-large | 148.4M | 2.00       | 4.38       | 174    | 8 80G-A100 |
- LibriSpeech-960hr + GigaSpeech
| Encoder   | Params | test-clean | test-other |
|-----------|--------|------------|------------|
| Zipformer | 65.5M  | 1.78       | 4.08       |
- LibriSpeech-960hr + GigaSpeech + CommonVoice
| Encoder   | Params | test-clean | test-other |
|-----------|--------|------------|------------|
| Zipformer | 65.5M  | 1.90       | 3.98       |
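The decoding methods that appear in these tables (greedy_search, fast_beam_search, modified_beam_search) are selected at decode time rather than at training time. As a hedged sketch, assuming the layout of the LibriSpeech zipformer recipe (script names and flags can differ between recipes, so check the recipe's decode.py --help):

```bash
cd egs/librispeech/ASR

# Decode an already-trained model with modified beam search.
# --epoch/--avg pick which checkpoints to average; adjust them to your run.
./zipformer/decode.py \
  --exp-dir zipformer/exp \
  --epoch 30 \
  --avg 9 \
  --decoding-method modified_beam_search \
  --beam-size 4
```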
GigaSpeech
Conformer CTC
|     | Dev   | Test  |
|-----|-------|-------|
| WER | 10.47 | 10.58 |
Transducer (pruned_transducer_stateless2)
Conformer Encoder + Stateless Predictor + k2 Pruned RNN-T Loss
|                      | Dev   | Test  |
|----------------------|-------|-------|
| greedy_search        | 10.51 | 10.73 |
| fast_beam_search     | 10.50 | 10.69 |
| modified_beam_search | 10.40 | 10.51 |
Transducer (Zipformer Encoder + Stateless Predictor)
|                      | Dev   | Test  |
|----------------------|-------|-------|
| greedy_search        | 10.31 | 10.50 |
| fast_beam_search     | 10.26 | 10.48 |
| modified_beam_search | 10.25 | 10.38 |
Aishell
TDNN LSTM CTC
|     | test  |
|-----|-------|
| CER | 10.16 |
We provide a Colab notebook to test the pre-trained model:
Transducer (Conformer Encoder + Stateless Predictor)
|     | test |
|-----|------|
| CER | 4.38 |
We provide a Colab notebook to test the pre-trained model:
Transducer (Zipformer Encoder + Stateless Predictor)
WER (modified_beam_search with beam_size=4)
| Encoder         | Params | dev  | test | epochs |
|-----------------|--------|------|------|--------|
| Zipformer       | 73.4M  | 4.13 | 4.40 | 55     |
| Zipformer-small | 30.2M  | 4.40 | 4.67 | 55     |
| Zipformer-large | 157.3M | 4.03 | 4.28 | 56     |
Aishell4
Transducer (pruned_transducer_stateless5)
Trained with all subsets:
|     | test  |
|-----|-------|
| CER | 29.08 |
We provide a Colab notebook to test the pre-trained model:
TIMIT
TDNN LSTM CTC
|     | TEST   |
|-----|--------|
| PER | 19.71% |
We provide a Colab notebook to test the pre-trained model:
TDNN LiGRU CTC
|     | TEST   |
|-----|--------|
| PER | 17.66% |
We provide a Colab notebook to test the pre-trained model:
TED-LIUM3
Transducer (Conformer Encoder + Stateless Predictor)
|                                    | dev  | test |
|------------------------------------|------|------|
| modified_beam_search (beam_size=4) | 6.91 | 6.33 |
We provide a Colab notebook to test the pre-trained model:
Transducer (pruned_transducer_stateless)
|                                    | dev  | test |
|------------------------------------|------|------|
| modified_beam_search (beam_size=4) | 6.77 | 6.14 |
We provide a Colab notebook to test the pre-trained model:
Aidatatang_200zh
Transducer (pruned_transducer_stateless2)
|                      | Dev  | Test |
|----------------------|------|------|
| greedy_search        | 5.53 | 6.59 |
| fast_beam_search     | 5.30 | 6.34 |
| modified_beam_search | 5.27 | 6.33 |
We provide a Colab notebook to test the pre-trained model:
WenetSpeech
Transducer (pruned_transducer_stateless2)
|                      | Dev  | Test-Net | Test-Meeting |
|----------------------|------|----------|--------------|
| greedy_search        | 7.80 | 8.75     | 13.49        |
| fast_beam_search     | 7.94 | 8.74     | 13.80        |
| modified_beam_search | 7.76 | 8.71     | 13.41        |
We provide a Colab notebook to test the pre-trained model:
Transducer Streaming (pruned_transducer_stateless5)
|                      | Dev  | Test-Net | Test-Meeting |
|----------------------|------|----------|--------------|
| greedy_search        | 8.78 | 10.12    | 16.16        |
| fast_beam_search     | 9.01 | 10.47    | 16.28        |
| modified_beam_search | 8.53 | 9.95     | 15.81        |
Alimeeting
Transducer (pruned_transducer_stateless2)
|                      | Eval  | Test-Net |
|----------------------|-------|----------|
| greedy_search        | 31.77 | 34.66    |
| fast_beam_search     | 31.39 | 33.02    |
| modified_beam_search | 30.38 | 34.25    |
We provide a Colab notebook to test the pre-trained model:
TAL_CSASR
Transducer (pruned_transducer_stateless5)
The best results, reported as Chinese CER (%) and English WER (%) (zh: Chinese, en: English):
| decoding-method      | dev  | dev_zh | dev_en | test | test_zh | test_en |
|----------------------|------|--------|--------|------|---------|---------|
| greedy_search        | 7.30 | 6.48   | 19.19  | 7.39 | 6.66    | 19.13   |
| fast_beam_search     | 7.18 | 6.39   | 18.90  | 7.27 | 6.55    | 18.77   |
| modified_beam_search | 7.15 | 6.35   | 18.95  | 7.22 | 6.50    | 18.70   |
We provide a Colab notebook to test the pre-trained model:
TTS: Text-to-Speech
Supported Datasets
Supported Models
Deployment with C++
Once you have trained a model in icefall, you may want to deploy it with C++ without Python dependencies.
Please refer to
- https://k2-fsa.github.io/icefall/model-export/export-with-torch-jit-script.html
- https://k2-fsa.github.io/icefall/model-export/export-onnx.html
- https://k2-fsa.github.io/icefall/model-export/export-ncnn.html
for how to do this.
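As a rough illustration of the export step (a sketch based on the LibriSpeech zipformer recipe; script names, flags, and paths vary by recipe and icefall version, so consult the documents above and each script's --help):

```bash
cd egs/librispeech/ASR

# Export to TorchScript (torch.jit.script) for C++ deployment with sherpa.
./zipformer/export.py \
  --exp-dir zipformer/exp \
  --tokens data/lang_bpe_500/tokens.txt \
  --epoch 30 --avg 9 \
  --jit 1

# Export to ONNX for deployment with sherpa-onnx.
./zipformer/export-onnx.py \
  --exp-dir zipformer/exp \
  --tokens data/lang_bpe_500/tokens.txt \
  --epoch 30 --avg 9
```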
We also provide a Colab notebook showing how to run a torch-scripted model in k2 with C++. Please see: