Vāksañcayaḥ, a Sanskrit speech corpus, contains more than 78 hours of audio: recordings of 45,953 sentences at a sampling rate of 22 kHz. The content consists mainly of readings of texts spanning many Śāstras of Saṃskṛt literature, and also includes contemporary stories, radio programs, extempore discourse, etc. The summary datasheet associated with this corpus can be accessed here - Link. Please download the corpus from https://www.cse.iitb.ac.in/~asr/.
Environments
- python version: 3.7.3
- Model files
- The lists of speakers used in the train, validation, test, and out-of-domain test splits are given in the README file of the corpus.
- SRILM LM link
- Results for different models
- In-domain test data WER: 21.94 for the best-performing model (SLP1 as the script and BPE splits as the LM unit).
- Out-of-domain test data WER for different speakers is reported in the paper.
Recipe
This Kaldi recipe covers the subword-based models: vowel split and Byte Pair Encoding (BPE). For the word-based models, we used the Wall Street Journal recipe.
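To make the BPE unit concrete, here is a minimal Python sketch of the standard byte pair encoding merge procedure on a toy word list. It is an illustration only: the function names (`learn_bpe`, `merge_pair`, `apply_bpe`), the `</w>` end-of-word marker, and the tie-breaking rule are assumptions of this sketch, and the tool and settings actually used to produce the BPE splits in this recipe are not stated here.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a single word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    toy_words = ["rAmaH", "rAmasya", "rAmeRa"]  # SLP1-encoded toy examples
    rules = learn_bpe(toy_words, 5)
    for w in toy_words:
        print(w, apply_bpe(w, rules))
```

Frequent character sequences end up as single units, while rare words fall back to smaller pieces, which is what lets a subword language model cover unseen word forms.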
Training
Download the vowel splitter (This requires the text to be in SLP1 format)
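As a rough illustration of what a vowel-based split over SLP1 text could look like, the sketch below cuts a word after each vowel, keeps a following anusvara (`M`) or visarga (`H`) with that vowel, and attaches any trailing consonants to the last segment. These rules and the function name `vowel_split` are assumptions made for this example; the downloadable vowel splitter may segment differently and should be treated as the reference.

```python
# SLP1 vowel letters (a A i I u U f F x X e E o O).
SLP1_VOWELS = set("aAiIuUfFxXeEoO")
# Anusvara and visarga, kept attached to the preceding vowel in this sketch.
SLP1_VOWEL_MARKS = set("MH")


def vowel_split(word):
    """Split an SLP1 word into segments that each end at a vowel boundary."""
    segments = []
    current = ""
    i = 0
    while i < len(word):
        ch = word[i]
        current += ch
        if ch in SLP1_VOWELS:
            # Keep a following anusvara/visarga in the same segment.
            if i + 1 < len(word) and word[i + 1] in SLP1_VOWEL_MARKS:
                current += word[i + 1]
                i += 1
            segments.append(current)
            current = ""
        i += 1
    if current:
        # Trailing consonants have no vowel of their own: attach them
        # to the last segment, or keep them alone if there is none.
        if segments:
            segments[-1] += current
        else:
            segments.append(current)
    return segments


if __name__ == "__main__":
    print(vowel_split("rAmaH"))      # ['rA', 'maH']
    print(vowel_split("saMskftam"))  # ['saM', 'skf', 'tam']
```

Because the split relies on SLP1's one-character-per-sound encoding, Devanagari text would need to be transliterated to SLP1 first.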
Download the pre-trained model
- Model (SLP word based)
- Model (SLP BPE based)
- Model (SLP vowel split based)
- Model (Devanagari word based)
- Model (Devanagari BPE based)
- Model (Devanagari vowel split based)
Download the processed dataset
- Before testing, convert the test audio files from .mp3 to .wav using the script provided with the corpus.
- We used our best-performing model (SLP1 as the script and BPE splits as the LM unit) for testing on out-of-domain data.
- In-domain test data link (test.zip)
- Out-of-domain test data link (truetest.zip)
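The conversion script shipped with the corpus is the one to use. If it is unavailable, the .mp3 to .wav step could be approximated with a sketch like the following, which assumes `ffmpeg` is installed and on the `PATH`. The helper names, the mono downmix (`-ac 1`), and the optional resampling rate are assumptions of this example, not settings taken from the corpus script.

```python
import subprocess
from pathlib import Path


def ffmpeg_command(mp3_path, wav_path, sample_rate=None):
    """Build an ffmpeg command that converts one .mp3 file to a mono .wav file."""
    cmd = ["ffmpeg", "-y", "-i", str(mp3_path), "-ac", "1"]  # -ac 1: mono (assumed)
    if sample_rate is not None:
        cmd += ["-ar", str(sample_rate)]  # resample only when a rate is requested
    cmd.append(str(wav_path))
    return cmd


def convert_directory(src_dir, dst_dir, sample_rate=None, dry_run=False):
    """Convert every .mp3 under src_dir, mirroring the folder layout in dst_dir.

    Returns the list of commands; with dry_run=True nothing is executed.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    commands = []
    for mp3 in sorted(src_dir.rglob("*.mp3")):
        wav = dst_dir / mp3.relative_to(src_dir).with_suffix(".wav")
        cmd = ffmpeg_command(mp3, wav, sample_rate)
        commands.append(cmd)
        if not dry_run:
            wav.parent.mkdir(parents=True, exist_ok=True)
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # "test" and "test_wav" are placeholder directory names.
    for command in convert_directory("test", "test_wav", dry_run=True):
        print(" ".join(command))
```

Running with `dry_run=True` first prints the commands without touching any files, which makes it easy to check paths before converting the whole test set.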
Evaluate
From pre-trained model (SLP vowel split)
./decode.sh test
# | WER : 18.12
./decode.sh truetest
# | WER : 34.88
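For readers unfamiliar with the metric, word error rate is the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words and expressed as a percentage. The sketch below computes it over a set of utterances with a plain edit-distance routine. It is not the scoring code invoked by `decode.sh`; the function names are hypothetical, and any text normalization applied before scoring in the recipe is not reproduced here.

```python
def edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists (substitute, delete, insert)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        curr = [i]
        for j, h in enumerate(hyp_words, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """WER in percent over (reference, hypothesis) string pairs.

    Errors and reference lengths are summed over all utterances before
    dividing, so long utterances weigh more than short ones.
    """
    total_errors = 0
    total_ref_words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        total_errors += edit_distance(ref_words, hypothesis.split())
        total_ref_words += len(ref_words)
    if total_ref_words == 0:
        raise ValueError("references contain no words")
    return 100.0 * total_errors / total_ref_words


if __name__ == "__main__":
    toy_pairs = [
        ("a b c d", "a x c"),  # one substitution and one deletion
        ("e f", "e f"),        # exact match
    ]
    print(round(corpus_wer(toy_pairs), 2))  # 33.33
```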
Publications
Devaraja Adiga, Rishabh Kumar, Amrith Krishna, Preethi Jyothi, Ganesh Ramakrishnan, and Pawan Goyal. Automatic Speech Recognition in Sanskrit: A New Speech Corpus and Modelling Insights. In ACL 2021.