
Russian Open Speech To Text (STT/ASR) Dataset

Arguably the largest public Russian STT dataset to date:

Prove us wrong! Open issues, collaborate, submit a PR, contribute, share your datasets! Let's make STT in Russian (and more) as open and available as CV models.

Important: assume that ё has been replaced with е everywhere.
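If your own text processing needs to match this convention, a one-line normalization is enough (a minimal sketch; `text` is a placeholder for any transcript string):

# replace ё with е to match the dataset's annotation convention
text = text.replace('ё', 'е').replace('Ё', 'Е')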

Planned releases:

Dataset composition

| Dataset | Utterances | Hours | GB | Secs / chars | Comment | Annotation | Quality/noise |
|---|---|---|---|---|---|---|---|
| radio_v4 | 7,603,192 | 10,430 | 1,195 | 5s / 68 | Radio | Align (*) | 95% / crisp |
| public_speech | 1,700,060 | 2,709 | 301 | 6s / 79 | Public speech | Align (*) | 95% / crisp |
| audiobook_2 | 1,149,404 | 1,511 | 162 | 5s / 56 | Books | Align (*) | 95% / crisp |
| radio_2 | 651,645 | 1,439 | 154 | 8s / 110 | Radio | Align (*) | 95% / crisp |
| public_youtube1120 | 1,410,979 | 1,104 | 237 | 3s / 34 | Youtube | Subtitles | 95% / ~crisp |
| public_youtube700 | 759,483 | 701 | 75 | 3s / 43 | Youtube | Subtitles | 95% / ~crisp |
| tts_russian_addresses | 1,741,838 | 754 | 81 | 2s / 20 | Addresses | TTS 4 voices | 100% / crisp |
| asr_public_phone_calls_2 | 603,797 | 601 | 66 | 4s / 37 | Phone calls | ASR | 70% / noisy |
| public_youtube1120_hq | 369,245 | 291 | 31 | 3s / 37 | YouTube HQ | Subtitles | 95% / ~crisp |
| asr_public_phone_calls_1 | 233,868 | 211 | 23 | 3s / 29 | Phone calls | ASR | 70% / noisy |
| radio_v4_add | 92,679 | 157 | 18 | 6s / 80 | Radio | Align (*) | 95% / crisp |
| asr_public_stories_2 | 78,186 | 78 | 9 | 4s / 43 | Books | ASR | 80% / crisp |
| asr_public_stories_1 | 46,142 | 38 | 4 | 3s / 30 | Books | ASR | 80% / crisp |
| public_series_1 | 20,243 | 17 | 2 | 3s / 38 | Youtube | Subtitles | 95% / ~crisp |
| asr_calls_2_val | 12,950 | 7.7 | 2 | 2s / 34 | Phone calls | Manual annotation | 99% / crisp |
| public_lecture_1 | 6,803 | 6 | 1 | 3s / 47 | Lectures | Subtitles | 95% / crisp |
| buriy_audiobooks_2_val | 7,850 | 4.9 | 1 | 2s / 31 | Books | Manual annotation | 99% / crisp |
| public_youtube700_val | 7,311 | 4.5 | 1 | 2s / 35 | Youtube | Manual annotation | 99% / crisp |
| **Total** | 16,513,202 | 20,108 | 2,369 | | | | |

Updates

Update 2021-06-04

Added Zenodo direct link mirrors as well.

Update 2020-09-23

Now also hosting the torrent ourselves. Please use aria2c to download it.

Update 2020-06-13

Now featured via Azure Datasets:

Update 2020-05-09

Legacy links and torrents deprecated

Update 2020-05-04

Opus direct links

Update 2020-05-04

Migration to OPUS

Update 2020-02-07

Temporarily Deprecated Direct MP3 Links:

Update 2019-11-04

New train datasets added:

<details> <summary>Click to expand</summary>

Update 2019-06-28

New train datasets added:

- 1,439 hours radio_2;
- 1,104 hours public_youtube1120;
- 291 hours public_youtube1120_hq;

New validation datasets added:

- 8 hours asr_calls_2_val;
- 5 hours buriy_audiobooks_2_val;
- 5 hours public_youtube700_val;

Update 2019-05-19

Also shared a wav version via torrent.

Update 2019-05-13

Added the forgotten txt files to mp3 archives. Updating the torrent.

Update 2019-05-12

Torrent created and uploaded to academictorrents.

Update 2019-05-10

Quickly converted the dataset to MP3 thanks to the community! Waiting for our academictorrents account to be approved. v0.4 will boast MP3 download links.

Update 2019-05-07 Help needed!

If you want to support the project, you can:

We are converting the dataset to MP3 now. Please reach out via the contacts below if you would like to help.

</details>

Downloads

Via torrent

You can download separate files via torrent.

It looks like most conventional torrent clients just fail silently due to the large chunk size. No problem (re-calculating the torrent would take a lot of time, and some people have already downloaded it) - use aria2c:

apt update
apt install aria2
# list the torrent files
aria2c --show-files ru_open_stt_wav_v10.torrent
# download only one file
aria2c --select-file=4 ru_open_stt_wav_v10.torrent
# for more options visit
# https://aria2.github.io/manual/en/html/aria2c.html#basic-options
# https://aria2.github.io/manual/en/html/aria2c.html#bittorrent-metalink-options
# https://aria2.github.io/manual/en/html/aria2c.html#bittorrent-specific-options

If you are using Windows, you can run these commands via the Windows Subsystem for Linux.

Links

| Dataset | GB, wav | GB, archive | Archive | Source | Manifest |
|---|---|---|---|---|---|
| **Train** | | | | | |
| radio_v4 | 1,059 | 176 | opus, txt | Radio | manifest |
| public_speech | 257 | 47.4 | opus, txt | Internet + alignment | manifest |
| radio_v4_add | 15.7 | 2.8 | opus, txt | Radio | manifest |
| 5% of radio_v4 + public_speech | - | 11.4 | opus+txt mirror | - | manifest |
| audiobook_2 | 162 | 25.8 | opus+txt mirror | Internet + alignment | manifest |
| radio_2 | 154 | 24.6 | opus+txt mirror | Radio | manifest |
| public_youtube1120 | 237 | 19.0 | opus+txt mirror | YouTube videos | manifest |
| asr_public_phone_calls_2 | 66 | 9.4 | opus+txt mirror | Internet + ASR | manifest |
| public_youtube1120_hq | 31 | 4.9 | opus+txt mirror | YouTube videos | manifest |
| asr_public_stories_2 | 9 | 1.4 | opus+txt mirror | Internet + alignment | manifest |
| tts_russian_addresses_rhvoice_4voices | 80.9 | 12.9 | opus+txt mirror | TTS | manifest |
| public_youtube700 | 75.0 | 12.2 | opus+txt mirror | YouTube videos | manifest |
| asr_public_phone_calls_1 | 22.7 | 3.2 | opus+txt mirror | Internet + ASR | manifest |
| asr_public_stories_1 | 4.1 | 0.7 | opus+txt mirror | Public stories | manifest |
| public_series_1 | 1.9 | 0.3 | opus+txt mirror | Public series | manifest |
| public_lecture_1 | 0.7 | 0.1 | opus+txt mirror | Internet + manual | manifest |
| **Val** | | | | | |
| asr_calls_2_val | 2 | 0.8 | wav+txt mirror | Internet | manifest |
| buriy_audiobooks_2_val | 1 | 0.5 | wav+txt mirror | Books + manual | manifest |
| public_youtube700_val | 2 | 0.13 | wav+txt mirror | YouTube videos + manual | manifest |
| **Total** | 2,186 | 354 | | | |

Download instructions

End to end

download.sh

or

download.py with this config file. Please check the config first.

Manually

  1. Download each dataset separately:

Via wget

wget https://ru-open-stt.ams3.digitaloceanspaces.com/some_file

For multi-threaded downloads, use aria2 with the -x flag, e.g.:

aria2c -c -x5 https://ru-open-stt.ams3.digitaloceanspaces.com/some_file

If necessary, merge chunks like this:

cat ru_open_stt_v01.tar.gz_* > ru_open_stt_v01.tar.gz
  2. Download the metadata and manifests for each dataset:
  3. Merge the files (where applicable), unpack, and enjoy!

Manually (using AzCopy) (2022-03-10)

When downloading large files from Azure, wget downloads may restart so often that it becomes impossible to fetch the largest archive, archives/radio_v4_manifest.tar.gz (176 GB).

In that case, you can use the AzCopy utility.

Instructions for downloading files with it are here. To download the large file mentioned above into the folder where azcopy[.exe] is located, run:

azcopy[.exe] copy https://azureopendatastorage.blob.core.windows.net/openstt/ru_open_stt_opus/archives/radio_v4_manifest.tar.gz radio_v4_manifest.tar.gz


Annotation methodology

The dataset is compiled from open-domain sources. Some audio types are annotated automatically and verified statistically / using heuristics.

Audio normalization

All files are normalized for easier / faster runtime augmentations and processing, as follows:

- converted to mono, if necessary;
- converted to a 16 kHz sampling rate, if necessary;
- stored as 16-bit integers.
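A minimal sketch of this normalization, using librosa and scipy as in the decoding example further below (source_path and target_path are placeholders):

import librosa
import numpy as np
from scipy.io import wavfile

# load as mono, 16 kHz, float32 in [-1, 1]
wav, sr = librosa.load(source_path, mono=True, sr=16000)
# store as 16-bit integers
wav = (wav * 32767).astype(np.int16)
wavfile.write(target_path, rate=sr, data=wav)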

On disk DB methodology

Each audio file is hashed. The hash is used to create a folder hierarchy for more optimal filesystem operations.

import hashlib
from pathlib import Path

# `wav` is an int16 numpy array; `root_folder` is the DB root folder
target_format = 'wav'
wavb = wav.tobytes()

f_hash = hashlib.sha1(wavb).hexdigest()

store_path = Path(root_folder,
                  f_hash[0],
                  f_hash[1:3],
                  f_hash[3:15] + '.' + target_format)

Helper functions

Use helper functions from here for easier work with manifest files.

Read manifests

<details><summary>See example</summary> <p>
from utils.open_stt_utils import read_manifest

manifest_df = read_manifest('path/to/manifest.csv')
</p> </details>

Merge, check and save manifests

<details><summary>See example</summary> <p>
from utils.open_stt_utils import (plain_merge_manifests,
                                  check_files,
                                  save_manifest)
train_manifests = [
 'path/to/manifest1.csv',
 'path/to/manifest2.csv',
]
train_manifest = plain_merge_manifests(train_manifests,
                                        MIN_DURATION=0.1,
                                        MAX_DURATION=100)
check_files(train_manifest)
save_manifest(train_manifest,
             'my_manifest.csv')
</p> </details>

How to open opus

The most efficient way we know of to read opus files in Python without incurring significant overhead (i.e. launching subprocesses or daisy-chaining libraries with sox, FFmpeg, etc.) is to use pysoundfile (a Python CFFI wrapper around libsndfile).

When this solution was being researched, the community had been waiting for a major libsndfile release for years. Opus support was implemented upstream some time ago, but it had not been properly released. Therefore we opted for a custom build + monkey patching.

By the time you read / use this, there will probably be decent / proper builds of libsndfile.

Building libsndfile

apt-get update
apt-get install cmake autoconf autogen automake build-essential libasound2-dev \
libflac-dev libogg-dev libtool libvorbis-dev libopus-dev pkg-config -y

cd /usr/local/lib
git clone https://github.com/erikd/libsndfile.git
cd libsndfile
git reset --hard 49b7d61
mkdir -p build && cd build

cmake .. -DBUILD_SHARED_LIBS=ON
make && make install

Patched pysoundfile wrapper

Install pysoundfile: pip install soundfile

import utils.soundfile_opus as sf

path = 'path/to/file.opus'
audio, sr = sf.read(path, dtype='int16')

Known issues

There is an upstream bug in libsndfile that prevents writing large files (90-120 s) with opus / vorbis. It will most likely be fixed in an upcoming major libsndfile release.
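Until then, one possible workaround is to split long recordings before writing. This is a hedged sketch only: write_chunks and the 60-second cap are our hypothetical choices, and sf is the patched wrapper from above.

import utils.soundfile_opus as sf

def write_chunks(wav, sr, path_prefix, max_seconds=60):
    # split a long int16 waveform into pieces short enough to
    # avoid the upstream long-file opus / vorbis write bug
    step = max_seconds * sr
    for i in range(0, len(wav), step):
        sf.write('{}_{:03d}.opus'.format(path_prefix, i // step),
                 wav[i:i + step], sr,
                 format='OGG', subtype='OPUS')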

Contacts

Please contact us here or just create a GitHub issue!

Authors (in alphabetical order):

Acknowledgements

This repo would not be possible without these people:

Kudos!

FAQ

0. Why not MP3? MP3 encoding / decoding - DEPRECATED

Encoding

Mostly we used pydub (via ffmpeg) or sox (a much, much faster way) to convert to MP3. We omitted blank files (mostly from YouTube). We used the following parameters:

Usually 128-192 kbps is enough for music with a 44 kHz sample rate, and 64-96 kbps is enough for speech. But here we have mono, 16 kHz audio, usually with only one speaker, so 32 kbps was a good choice. We did not use other formats like .ogg because .mp3 is much more popular.

<details><summary>See example `pydub`</summary> <p>
from pydub import AudioSegment

sound = AudioSegment.from_file(temp_path,
                               format="wav")

# mono, 16 kHz, 32 kbps MP3
file_handle = sound.export(store_mp3_path,
                           format="mp3",
                           parameters=["-ar", "16000", "-ac", "1"],
                           bitrate="32k")
</p> </details> <details><summary>See example `sox`</summary> <p>
import subprocess

# -C 32.01: ~32 kbps MP3 with high-quality encoding; -c 1: mono
cmd = 'sox "{}" -C 32.01 -c 1 "{}"'.format(
            wav_path,
            store_mp3_path)

res = subprocess.call(cmd, shell=True)

if res != 0:
    print('Problems with {}'.format(wav_path))
</p> </details>

Decoding

It is up to you, but to save space and spare CPU during training, we would suggest the following pipeline for extracting the files:

<details><summary>See example</summary> <p>
# you can also use pydub, torchaudio, sox or whatever
# we ended up using scipy for speed
# this example also includes hashing step which is not necessary
import librosa
import hashlib
import numpy as np
from pathlib import Path
from scipy.io import wavfile

def save_wav_diskdb(wav,
                    root_folder='../data/ru_open_stt/',
                    target_sr=16000):
    assert type(wav) == np.ndarray
    assert wav.dtype == np.dtype('int16')
    assert len(wav.shape)==1

    target_format = 'wav'
    wavb = wav.tobytes()

    # f_path = Path(audio_path)
    f_hash = hashlib.sha1(wavb).hexdigest()

    store_path = Path(root_folder,
                      f_hash[0],
                      f_hash[1:3],
                      f_hash[3:15]+'.'+target_format)

    store_path.parent.mkdir(parents=True,
                            exist_ok=True)

    wavfile.write(filename=str(store_path),
                  rate=target_sr,
                  data=wav)

    return str(store_path)

root_folder = '../data/'
# save to int16, mono, 16 kHz to save space
target_dtype = np.dtype('int16')
target_sr = 16000
# librosa reads mp3
wav, sr = librosa.load(source_mp3_path,
                       mono=True,
                       sr=target_sr)

# librosa converts to float32 by default
wav = (wav * 32767).astype(target_dtype) # cast to int

wav_path = save_wav_diskdb(wav,
                           root_folder=root_folder,
                           target_sr=target_sr)
</p> </details>

Why not OGG / Opus? - DEPRECATED

Even though OGG / Opus is considered better for speech and offers higher compression, we opted for a more conventional, well-known format.

The LPCNet codec also boasts ultra-low-bitrate speech compression. But we decided to opt for a more familiar format to avoid worrying about actually losing signal in compression.

1. Issues with reading files

Maybe try this approach:

<details><summary>See example</summary> <p>
from scipy.io import wavfile
import numpy as np

sample_rate, sound = wavfile.read(path)

# peak-normalize to float32 in [-1, 1]
abs_max = np.abs(sound).max()
sound = sound.astype('float32')
if abs_max > 0:
    sound *= 1 / abs_max
</p> </details>

2. Why share such dataset?

We are not altruists; life just is not a zero-sum game.

Consider the progress in computer vision that was made possible by:

STT does not enjoy the same attention from the ML community, because it is data-hungry and public datasets are lacking, especially for languages other than English. Ultimately this leads to a worse situation for the general community.

3. Known issues with the dataset to be fixed

4. Why migrate to OPUS?

After extensive testing, both during training and validation, we confirmed that converting 16 kHz int16 data to OPUS at the very least does not degrade quality.

OPUS was designed for speech; even at default compression rates it takes less space than MP3 and does not introduce artefacts.

Some people even reported quality improvements when training using OPUS.
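For reference, a minimal wav-to-opus conversion sketch using the patched wrapper from the section above (file paths are placeholders):

import utils.soundfile_opus as sf

# read a normalized 16 kHz int16 wav and re-encode it as opus
wav, sr = sf.read('path/to/file.wav', dtype='int16')
sf.write('path/to/file.opus', wav, sr, format='OGG', subtype='OPUS')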

License


CC BY-NC; commercial usage is available after an agreement with the dataset authors.

Donations

Donate (each coffee pays for several full downloads), give via open_collective, or just use our DO referral link to help.

Commercial inquiries

Further reading

English

Chinese

Russian