🇺🇦 Speech Recognition & Synthesis for Ukrainian

Overview

This repository collects links to models, datasets, and tools for Ukrainian Speech-to-Text and Text-to-Speech projects.

Community

🎤 Speech-to-Text

📦 Implementations

<details><summary>wav2vec2-bert</summary><p></p></details>

<details><summary>wav2vec2</summary><p>

You can check out demos here: https://github.com/egorsmkv/wav2vec2-uk-demo

</p></details>

<details><summary>data2vec</summary><p></p></details>

<details><summary>Citrinet</summary><p></p></details>

<details><summary>ContextNet</summary><p></p></details>

<details><summary>FastConformer</summary><p></p></details>

<details><summary>Squeezeformer</summary><p></p></details>

<details><summary>Conformer-CTC</summary><p></p></details>

<details><summary>VOSK</summary><p>

Note: VOSK models are licensed under Apache License 2.0.

</p></details>

<details><summary>DeepSpeech</summary><p></p></details>

<details><summary>M-CTC-T</summary><p></p></details>

<details><summary>whisper</summary><p></p></details>

<details><summary>Flashlight</summary><p></p></details>
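
Many of the checkpoints benchmarked below are published on the Hugging Face Hub, so a minimal inference sketch with the transformers pipeline might look like this. The checkpoint name and audio path are only examples; decoding with a bundled n-gram LM additionally requires pyctcdecode and kenlm.

```python
# A minimal inference sketch using Hugging Face transformers (not part of this repository).
# The checkpoint name is taken from the benchmark tables below; audio.wav is a placeholder
# for a 16 kHz mono recording in Ukrainian.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="Yehor/wav2vec2-xls-r-300m-uk-with-news-lm",
)

result = asr("audio.wav")
print(result["text"])
```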

📊 Benchmarks

These benchmarks use the Common Voice 10 test split. WER and CER are word and character error rates; Accuracy is computed as (1 - WER) × 100%.
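
As a rough illustration (the exact evaluation scripts are not listed here), scores of this kind can be computed with the jiwer library:

```python
# Hedged sketch: computing WER, CER, and the derived accuracy with jiwer (pip install jiwer).
# The reference/hypothesis strings are toy examples, not data from the benchmark below.
import jiwer

references = ["тестове речення українською"]   # ground-truth transcripts
hypotheses = ["тестове речення українське"]    # model outputs

wer = jiwer.wer(references, hypotheses)
cer = jiwer.cer(references, hypotheses)

print(f"WER: {wer:.4f}, CER: {cer:.4f}, Accuracy: {(1 - wer) * 100:.2f}%")
```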

wav2vec2-bert

| Model | WER | CER | Accuracy, % | WER<sup>+LM</sup> | CER<sup>+LM</sup> | Accuracy<sup>+LM</sup>, % |
|---|---|---|---|---|---|---|
| Yehor/w2v-bert-2.0-uk | 0.0727 | 0.0151 | 92.73% | 0.0655 | 0.0139 | 93.45% |

wav2vec2

| Model | WER | CER | Accuracy, % | WER<sup>+LM</sup> | CER<sup>+LM</sup> | Accuracy<sup>+LM</sup>, % |
|---|---|---|---|---|---|---|
| Yehor/wav2vec2-xls-r-1b-uk-with-lm | 0.1807 | 0.0317 | 81.93% | 0.1193 | 0.0218 | 88.07% |
| Yehor/wav2vec2-xls-r-1b-uk-with-binary-news-lm | 0.1807 | 0.0317 | 81.93% | 0.0997 | 0.0191 | 90.03% |
| Yehor/wav2vec2-xls-r-300m-uk-with-lm | 0.2906 | 0.0548 | 70.94% | 0.172 | 0.0355 | 82.8% |
| Yehor/wav2vec2-xls-r-300m-uk-with-news-lm | 0.2027 | 0.0365 | 79.73% | 0.0929 | 0.019 | 90.71% |
| Yehor/wav2vec2-xls-r-300m-uk-with-wiki-lm | 0.2027 | 0.0365 | 79.73% | 0.1045 | 0.0208 | 89.55% |
| Yehor/wav2vec2-xls-r-base-uk-with-small-lm | 0.4441 | 0.0975 | 55.59% | 0.2878 | 0.0711 | 71.22% |
| robinhad/wav2vec2-xls-r-300m-uk | 0.2736 | 0.0537 | 72.64% | - | - | - |
| arampacha/wav2vec2-xls-r-1b-uk | 0.1652 | 0.0293 | 83.48% | 0.0945 | 0.0175 | 90.55% |

Citrinet

The lm-4gram-500k model is used as the LM.

| Model | WER | CER | Accuracy, % | WER<sup>+LM</sup> | CER<sup>+LM</sup> | Accuracy<sup>+LM</sup>, % |
|---|---|---|---|---|---|---|
| nvidia/stt_uk_citrinet_1024_gamma_0_25 | 0.0432 | 0.0094 | 95.68% | 0.0352 | 0.0079 | 96.48% |
| neongeckocom/stt_uk_citrinet_512_gamma_0_25 | 0.0746 | 0.016 | 92.54% | 0.0563 | 0.0128 | 94.37% |
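
The Citrinet checkpoints are NeMo models; as a rough sketch (assuming the NeMo toolkit is installed and the model name from the table above resolves), inference could look like this:

```python
# Hedged sketch of Citrinet inference with NVIDIA NeMo (pip install "nemo_toolkit[asr]").
# The checkpoint name comes from the table above; audio.wav is a placeholder for a
# 16 kHz mono WAV file.
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(
    model_name="nvidia/stt_uk_citrinet_1024_gamma_0_25"
)

transcriptions = model.transcribe(["audio.wav"])
print(transcriptions[0])
```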

ContextNet

| Model | WER | CER | Accuracy, % |
|---|---|---|---|
| theodotus/stt_uk_contextnet_512 | 0.0669 | 0.0145 | 93.31% |

FastConformer P&C

This model supports punctuation and capitalization in its output.

| Model | WER | CER | Accuracy, % | WER<sup>+P&C</sup> | CER<sup>+P&C</sup> | Accuracy<sup>+P&C</sup>, % |
|---|---|---|---|---|---|---|
| theodotus/stt_ua_fastconformer_hybrid_large_pc | 0.0400 | 0.0102 | 96.00% | 0.0710 | 0.0167 | 92.90% |

Squeezeformer

The lm-4gram-500k model is used as the LM.

| Model | WER | CER | Accuracy, % | WER<sup>+LM</sup> | CER<sup>+LM</sup> | Accuracy<sup>+LM</sup>, % |
|---|---|---|---|---|---|---|
| theodotus/stt_uk_squeezeformer_ctc_xs | 0.1078 | 0.0229 | 89.22% | 0.0777 | 0.0174 | 92.23% |
| theodotus/stt_uk_squeezeformer_ctc_sm | 0.082 | 0.0175 | 91.8% | 0.0605 | 0.0142 | 93.95% |
| theodotus/stt_uk_squeezeformer_ctc_ml | 0.0591 | 0.0126 | 94.09% | 0.0451 | 0.0105 | 95.49% |

Flashlight

The lm-4gram-500k model is used as the LM.

| Model | WER | CER | Accuracy, % | WER<sup>+LM</sup> | CER<sup>+LM</sup> | Accuracy<sup>+LM</sup>, % |
|---|---|---|---|---|---|---|
| Flashlight Conformer | 0.1915 | 0.0244 | 80.85% | 0.0907 | 0.0198 | 90.93% |

data2vec

| Model | WER | CER | Accuracy, % |
|---|---|---|---|
| robinhad/data2vec-large-uk | 0.3117 | 0.0731 | 68.83% |

VOSK

| Model | WER | CER | Accuracy, % |
|---|---|---|---|
| v3 | 0.5325 | 0.3878 | 46.75% |
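
For reference, a minimal offline decoding sketch with the vosk Python package; the model directory path is an assumption and should point to a downloaded Ukrainian VOSK model:

```python
# Hedged sketch of offline decoding with vosk (pip install vosk).
# "vosk-model-uk-v3" is a placeholder path to the downloaded Ukrainian model directory;
# audio.wav should be a 16 kHz mono PCM WAV file.
import json
import wave

from vosk import Model, KaldiRecognizer

model = Model("vosk-model-uk-v3")
wf = wave.open("audio.wav", "rb")
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)

print(json.loads(rec.FinalResult())["text"])
```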

m-ctc-t

| Model | WER | CER | Accuracy, % |
|---|---|---|---|
| speechbrain/m-ctc-t-large | 0.57 | 0.1094 | 43% |

whisper

| Model | WER | CER | Accuracy, % |
|---|---|---|---|
| tiny | 0.6308 | 0.1859 | 36.92% |
| base | 0.521 | 0.1408 | 47.9% |
| small | 0.3057 | 0.0764 | 69.43% |
| medium | 0.1873 | 0.044 | 81.27% |
| large (v1) | 0.1642 | 0.0393 | 83.58% |
| large (v2) | 0.1372 | 0.0318 | 86.28% |
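
For the stock Whisper checkpoints above, a minimal transcription sketch with the openai-whisper package (the model size and audio path are only examples) looks like this:

```python
# Minimal sketch using the openai-whisper package (pip install openai-whisper).
# "large-v2" matches the best-scoring row in the table above; audio.wav is a placeholder.
import whisper

model = whisper.load_model("large-v2")
result = model.transcribe("audio.wav", language="uk")
print(result["text"])
```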

Fine-tuned versions for Ukrainian:

| Model | WER | CER | Accuracy, % |
|---|---|---|---|
| small | 0.2704 | 0.0565 | 72.96% |
| large | 0.2482 | 0.055 | 75.18% |

If you want to fine-tune a Whisper model on your own data, use this repository: https://github.com/egorsmkv/whisper-ukrainian

DeepSpeech

| Model | WER | CER | Accuracy, % |
|---|---|---|---|
| v0.5 | 0.7025 | 0.2009 | 29.75% |

📖 Development

📚 Datasets

Compiled dataset from different open sources + companies + community = 188.31 GB / ~1200 hours 💪

Voice of America (398 hours)

FLEURS

YODAS2

Companies

Ukrainian podcasts

Cleaned Common Voice 10 (test set)

Noised Common Voice 10

Community

Other

⭐ Related works

Language models

Inverse Text Normalization:

Text Enhancement

📢 Text-to-Speech

Test sentence with stresses:

К+ам'ян+ець-Под+ільський - м+істо в Хмельн+ицькій +області Укра+їни, ц+ентр Кам'ян+ець-Под+ільської міськ+ої об'+єднаної територі+альної гром+ади +і Кам'ян+ець-Под+ільського рай+ону.

Without stresses:

Кам'янець-Подільський - місто в Хмельницькій області України, центр Кам'янець-Подільської міської об'єднаної територіальної громади і Кам'янець-Подільського району.
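
The stressed variant simply marks each stressed vowel with a leading plus sign; a tiny sketch for recovering the plain text from the stressed one (assuming that convention) is:

```python
# Minimal sketch: the stressed test sentence marks each stressed vowel with a leading "+".
# Stripping the markers recovers the plain-text variant shown above.
stressed = (
    "К+ам'ян+ець-Под+ільський - м+істо в Хмельн+ицькій +області Укра+їни, "
    "ц+ентр Кам'ян+ець-Под+ільської міськ+ої об'+єднаної територі+альної "
    "гром+ади +і Кам'ян+ець-Под+ільського рай+ону."
)

plain = stressed.replace("+", "")
print(plain)
```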

📦 Implementations

<details><summary>P-Flow TTS</summary><p>

https://github.com/egorsmkv/speech-recognition-uk/assets/7875085/18cfc074-f8a1-4842-90b6-9503d0bb7250

</p></details>

<details><summary>RAD-TTS</summary><p>

https://user-images.githubusercontent.com/7875085/206881140-bf8c09e7-5553-43d9-8807-065c36b2904b.mp4

</p></details>

<details><summary>Coqui TTS</summary><p>

https://user-images.githubusercontent.com/5759207/167480982-275d8ca0-571f-4d21-b8d7-3776b3091956.mp4

</p></details>

<details><summary>Neon TTS</summary><p>

https://user-images.githubusercontent.com/96498856/170762023-d4b3f6d7-d756-4cb7-89de-dc50e9049b96.mp4

</p></details>

<details><summary>FastPitch</summary><p></p></details>

<details><summary>Balacoon TTS</summary><p>

https://github.com/clementruhm/speech-recognition-uk/assets/87281103/a13493ce-a5e5-4880-8b72-42b02feeee50

</p></details>

📚 Datasets

⭐ Related works

Accentors

Misc