Vakyansh Open Source Models
- Pretrained ASR Models
- Finetuned ASR Models
- Language Models
- Punctuation Models
- TTS Models
- Gender Classification Model
- Language Identification Models
- Interspeech 2021 ASR Models
<a name="pam"></a>
Pretrained ASR Models
Pretrained Model | Description | Architecture | Hours |
---|---|---|---|
Vakyansh-Conformer-SSL | Pre-trained with the NeMo toolkit on 34,000 hours of unlabeled audio in 39 Indian languages. This includes 15,000 hours of news recordings available on the internet and 10,000 hours of YouTube audio and other audio data. In addition, 9,000 hours of Indian English audio were taken from NPTEL lectures open sourced by AI4Bharat. <br> This model was trained in collaboration with NVIDIA (NVIDIA Graphics Pvt Ltd). We thank NVIDIA for providing the compute resources to train this model. | Conformer-Large | 34,000 |
CLSRIL-23 | Cross Lingual Speech Representations for Indic Languages. Trained on 10,000 hours of data from 23 Indic languages. <br> Citation: https://arxiv.org/abs/2107.07402 | wav2vec2-Base | 10,000 |
hindi_pretrained_4kh | Trained on 4,200 hours of Hindi data | wav2vec2-Base | 4,200 |
kannada_pretrained_1400h | Trained on 1,400 hours of Kannada data | wav2vec2-XLSR | 1,400 |
<br><br>
<a name="fam"></a>
Finetuned ASR Models
Conformer based models
Language | Pretrained Model | Finetuned Model | Finetuned Hours | Arch |
---|---|---|---|---|
Hindi | Vakyansh Conformer SSL | hindi_large_ssl_2500 | 2,500 h | Large |
Indian English | Vakyansh Conformer SSL | indian_en_large_ssl_700 | 700 h | Large |
Kannada | Vakyansh Conformer SSL | kannada_large_ssl_1000 | 1,000 h | Large |
Punjabi | Vakyansh Conformer SSL | punjabi_large_ssl_500 | 500 h | Large |
Tamil | Vakyansh Conformer SSL | tamil_large_ssl_900 | 900 h | Large |
<br><hr>
wav2vec2 based models
Citation: https://arxiv.org/abs/2203.16512
<br><br>
<a name="lm"></a>
Language Models
These language models are meant to be used together with the finetuned ASR models.
Dataset Credits: We thank AI4Bharat for open sourcing the Indic-Corp dataset. Link. We modified the original data by tokenizing and removing duplicates.
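
The corpus preparation mentioned in the credits (tokenizing and removing duplicates) can be sketched roughly as follows. This is only an illustration, not the actual Vakyansh pipeline: the tokenizer used for the released corpora is not stated here, so the sketch assumes plain whitespace tokenization and exact-match deduplication. The function names `normalize` and `dedupe_lines` are hypothetical.

```python
from typing import Iterable, Iterator, Set


def normalize(line: str) -> str:
    """Whitespace-tokenize a line and re-join the tokens with single spaces.

    Assumption: simple whitespace tokenization stands in for whatever
    tokenizer was actually used on the corpus.
    """
    return " ".join(line.split())


def dedupe_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct non-empty normalized line once, in first-seen order."""
    seen: Set[str] = set()
    for line in lines:
        norm = normalize(line)
        if not norm or norm in seen:
            continue  # skip blank lines and exact duplicates
        seen.add(norm)
        yield norm


if __name__ == "__main__":
    sample = ["a  b", "a b", "", "c"]
    for cleaned in dedupe_lines(sample):
        print(cleaned)
```

Keeping the first occurrence and preserving order is a design choice for the sketch; holding every distinct line in an in-memory set may not scale to a very large corpus without sharding or hashing.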
Domain Specific Language Models
Language | Type | Domain | Lexicon | LM | Text Corpus |
---|---|---|---|---|---|
English | kenlm 5-gram | Biomedical | bio_lexicon | bio_lm | bio_lm_eng_text |
<br><br>
<a name="pm"></a>
Punctuation Models
Language | Model | Data |
---|---|---|
Hindi | hi.zip | hindi_data |
Assamese | as.zip | assamese_data |
Bengali | bn.zip | bengali_data |
Gujarati | gu.zip | gujarati_data |
Kannada | kn.zip | kannada_data |
Malayalam | ml.zip | malayalam_data |
Marathi | mr.zip | marathi_data |
Odia | or.zip | odia_data |
Punjabi | pa.zip | punjabi_data |
Tamil | ta.zip | tamil_data |
Telugu | te.zip | telugu_data |
Dataset Credits: We thank AI4Bharat for open sourcing the Indic-Corp dataset. Link. We modified the original data by tokenizing and removing duplicates.
<br><br>
<a name="tts"></a>
TTS Models
The models below were trained using a combination of Glow TTS and HiFi-GAN.
Language | Language Code | Gender | glow ckpt | hifi-gan ckpt |
---|---|---|---|---|
Hindi | hi | Female | hi_0_glow | hi_0_hifi |
Hindi | hi | Male | hi_1_glow | hi_1_hifi |
Kannada | kn | Female | kn_0_glow | kn_0_1_hifi |
Kannada | kn | Male | kn_1_glow | kn_0_1_hifi |
Tamil | ta | Female | ta_0_glow | ta_0_1_hifi |
Tamil | ta | Male | ta_1_glow | ta_0_1_hifi |
Telugu | te | Female | te_0_glow | te_0_1_hifi |
Telugu | te | Male | te_1_glow | te_0_1_hifi |
Odia | or | Female | or_0_glow | or_0_1_hifi |
Odia | or | Male | or_1_glow | or_0_1_hifi |
Malayalam | ml | Female | ml_0_glow | ml_0_hifi |
Malayalam | ml | Male | ml_1_glow | ml_1_hifi |
Marathi | mr | Female | mr_0_glow | mr_1_hifi |
Gujarati | gu | Male | gu_0_glow | gu_0_hifi |
Bengali | bn | Female | bn_0_glow | bn_0_1_hifi |
Bengali | bn | Male | bn_1_glow | bn_0_1_hifi |
English | en | Female | en_0_glow | en_0_hifi |
English | en | Male | en_1_glow | en_1_hifi |
Dataset Credits: We thank IITM for open sourcing the Indic-TTS dataset. Link
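
Each voice in the table pairs one Glow TTS checkpoint with one HiFi-GAN checkpoint, and several languages share a single HiFi-GAN checkpoint across both genders. The lookup below is a minimal sketch of how that pairing could be encoded. The names `TTSCheckpoints`, `TTS_CHECKPOINTS`, and `get_tts_checkpoints` are hypothetical, and the values are only the labels shown in the table, not file paths or download URLs.

```python
from typing import Dict, NamedTuple, Tuple


class TTSCheckpoints(NamedTuple):
    glow: str  # Glow TTS checkpoint label
    hifi: str  # HiFi-GAN checkpoint label


# Labels copied from the TTS table; keys are (language code, gender).
TTS_CHECKPOINTS: Dict[Tuple[str, str], TTSCheckpoints] = {
    ("hi", "female"): TTSCheckpoints("hi_0_glow", "hi_0_hifi"),
    ("hi", "male"): TTSCheckpoints("hi_1_glow", "hi_1_hifi"),
    ("kn", "female"): TTSCheckpoints("kn_0_glow", "kn_0_1_hifi"),
    ("kn", "male"): TTSCheckpoints("kn_1_glow", "kn_0_1_hifi"),
    ("ta", "female"): TTSCheckpoints("ta_0_glow", "ta_0_1_hifi"),
    ("ta", "male"): TTSCheckpoints("ta_1_glow", "ta_0_1_hifi"),
    ("te", "female"): TTSCheckpoints("te_0_glow", "te_0_1_hifi"),
    ("te", "male"): TTSCheckpoints("te_1_glow", "te_0_1_hifi"),
    ("or", "female"): TTSCheckpoints("or_0_glow", "or_0_1_hifi"),
    ("or", "male"): TTSCheckpoints("or_1_glow", "or_0_1_hifi"),
    ("ml", "female"): TTSCheckpoints("ml_0_glow", "ml_0_hifi"),
    ("ml", "male"): TTSCheckpoints("ml_1_glow", "ml_1_hifi"),
    ("mr", "female"): TTSCheckpoints("mr_0_glow", "mr_1_hifi"),
    ("gu", "male"): TTSCheckpoints("gu_0_glow", "gu_0_hifi"),
    ("bn", "female"): TTSCheckpoints("bn_0_glow", "bn_0_1_hifi"),
    ("bn", "male"): TTSCheckpoints("bn_1_glow", "bn_0_1_hifi"),
    ("en", "female"): TTSCheckpoints("en_0_glow", "en_0_hifi"),
    ("en", "male"): TTSCheckpoints("en_1_glow", "en_1_hifi"),
}


def get_tts_checkpoints(lang_code: str, gender: str) -> TTSCheckpoints:
    """Return the (glow, hifi) labels for a voice, or raise KeyError if not listed."""
    key = (lang_code.lower(), gender.lower())
    if key not in TTS_CHECKPOINTS:
        raise KeyError(f"No TTS voice listed for {key}")
    return TTS_CHECKPOINTS[key]


if __name__ == "__main__":
    print(get_tts_checkpoints("kn", "Male"))
```

An explicit table is used instead of deriving names from a pattern because the labels are not fully regular: for example, the only Gujarati voice is male but uses the `0` index, and Marathi has only a female voice.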
<br><br>
<a name="gcm"></a>
Gender Classification Model
Type | Model Type | Model |
---|---|---|
Gender Classification | SVC | Model |
<br><br>
<a name="lim"></a>
Language Identification Models
Type | Model |
---|---|
Hindi_vs_Others | Model |
Tamil_vs_Others | Model |
<br><br>
<a name="iam"></a>
Interspeech 2021 ASR Models
Language | Pretrained Model | Finetuned Model | Dictionary | Single Model for Inference |
---|---|---|---|---|
Telugu | CLSRIL-23 | te_40h_interspeech | dict | telugu_infer_interspeech |
Tamil | CLSRIL-23 | ta_40h_interspeech | dict | tamil_infer_interspeech |
Gujarati | CLSRIL-23 | gu_40h_interspeech | dict | gujarati_infer_interspeech |
Hinglish | CLSRIL-23 | hinglish_interspeech | dict | hinglish_infer_interspeech |
<br><br>
Citation
If you use any of our resources, please cite the following article:
@misc{chadha2022vakyansh,
title={Vakyansh: ASR Toolkit for Low Resource Indic languages},
author={Harveen Singh Chadha and Anirudh Gupta and Priyanshi Shah and Neeraj Chhimwal and Ankur Dhuriya and Rishabh Gaur and Vivek Raghavan},
year={2022},
eprint={2203.16512},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
If you use the pretrained model (CLSRIL-23) please cite the following article:
@misc{gupta2021clsril23,
title={CLSRIL-23: Cross Lingual Speech Representations for Indic Languages},
author={Anirudh Gupta and Harveen Singh Chadha and Priyanshi Shah and Neeraj Chimmwal and Ankur Dhuriya and Rishabh Gaur and Vivek Raghavan},
year={2021},
eprint={2107.07402},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
<hr>