GigaSpeech

This is the official repository of the GigaSpeech dataset. For details on how we created the dataset, please refer to our Interspeech paper: "GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio". A preprint is available on arXiv.

GigaSpeech version: 1.0.0 (07/05/2021)

Download

  1. Step 1: Please fill out the Google Form here.
  2. Step 2:
    • Option A: Follow the instructions in the reply email from SpeechColab to get the raw release of GigaSpeech.
    • Option B: Refer to GigaSpeech on HuggingFace to get a pre-processed version of GigaSpeech via HuggingFace.

Leaderboard

| Contributor | Toolkit | Train Recipe | Train Data | Inference | Dev / Test WER |
|---|---|---|---|---|---|
| *Baseline* | Athena | Transformer-AED + RNNLM | GigaSpeech v1.0.0 XL | model example | 13.60 / 12.70 |
| *Baseline* | Espnet | Conformer/Transformer-AED | GigaSpeech v1.0.0 XL | model example | 10.90 / 10.80 |
| *Baseline* | Kaldi | Chain + RNNLM | GigaSpeech v1.0.0 XL | model example | 14.78 / 14.84 |
| *Baseline* | Pika | RNN-T | GigaSpeech v1.0.0 XL | model example | 12.30 / 12.30 |
| Johns Hopkins University | Icefall | Transducer: Zipformer encoder + Embedding decoder | GigaSpeech v1.0.0 XL | model example | 10.25 / 10.38 |
| Johns Hopkins University | Icefall | Pruned Stateless RNN-T | GigaSpeech v1.0.0 XL | model example | 10.40 / 10.51 |
| Johns Hopkins University | Icefall | Conformer CTC + ngram & attention rescoring | GigaSpeech v1.0.0 XL | model example | 10.47 / 10.58 |
| Mobvoi | Wenet | Joint CTC/AED (U2++) | GigaSpeech v1.0.0 XL | model example | 10.70 / 10.60 |
| ByteDance AI Lab | NeurST | Transformer-AED | GigaSpeech v1.0.0 XL | model example | 11.89 / 11.60 |

Dataset

Audio Source

| Audio Source | Transcribed Hours | Total Hours | Acoustic Condition |
|---|---|---|---|
| Audiobook | 2,655 | 11,982 | Reading; various ages and accents |
| Podcast | 3,498 | 9,254 | Clean or background music; indoor; near-field; spontaneous; various ages and accents |
| YouTube | 3,845 | 11,768 | Clean and noisy; indoor and outdoor; near- and far-field; reading and spontaneous; various ages and accents |
| Total | 10,000 | 33,005 | |

Transcribed Training Subsets

| Subset | Hours | Remarks |
|---|---|---|
| XS | 10 | System building and debugging |
| S | 250 | Quick research experiments |
| M | 1,000 | Large-scale research experiments |
| L | 2,500 | Medium-scale industrial experiments |
| XL | 10,000 | Large-scale industrial experiments |

Larger subsets are supersets of smaller subsets, e.g., subset L contains all the data from subset M.

Transcribed Evaluation Subsets

| Subset | Hours | Remarks |
|---|---|---|
| Dev | 12 | Randomly selected from the crawled Podcast and YouTube data |
| Test | 40 | Part of the subset was randomly selected from the crawled Podcast and YouTube data; part of it was manually collected through other channels to have better coverage |

Evaluation subsets are annotated by professional human annotators.

Data Preparation Guidelines

We maintain data preparation scripts for different speech recognition toolkits in this repository, so that when we update the dataset (note: this is an evolving dataset) we don't have to update the scripts in the downstream toolkits. The scripts live in the toolkits/ folder, e.g., toolkits/kaldi for the Kaldi speech recognition toolkit.

Preparation Scripts

To use the data preparation scripts, do the following in your toolkit (here we use Kaldi as an example):

```bash
# Clone the GigaSpeech repository.
git clone https://github.com/SpeechColab/GigaSpeech.git
cd GigaSpeech

# Download the audio and metadata to a local disk.
utils/download_gigaspeech.sh /disk1/audio_data/gigaspeech

# Prepare Kaldi data directories for the XL training subset.
toolkits/kaldi/gigaspeech_data_prep.sh --train-subset XL /disk1/audio_data/gigaspeech ../data
cd ..
```

Metadata Walkthrough

We save all the metadata to a single JSON file named GigaSpeech.json. Below is a snippet of this file:

```json
{
  "dataset": "GigaSpeech",
  "language": "EN",
  "version": "v1.0.0",
  ... ...
  "audios": [
    {
      "title": "The Architect of Hollywood",
      "url": "https://99percentinvisible.org/episode/the-architect-of-hollywood/download",
      "path": "audio/podcast/P0001/POD0000000025.opus",
      ... ...
      "segments": [
        {
          "sid": "POD0000000025_S0000103",
          "speaker": "N/A",
          "begin_time": 780.31,
          "end_time": 783.13,
          "text_tn": "FOUR O'CLOCK TOMORROW AFTERNOON <COMMA> SAID WILLIAMS <PERIOD>",
          "subsets": [
            "{XL}",
            "{L}"
          ]
        },
        ... ...
      ],
      ... ...
    },
    ... ...
  ]
}
```

To use the corpus, users are expected to extract the relevant information from GigaSpeech.json. For example, for the speech recognition task, one should first follow the "audios" entry to work out a list of audio files. One can then follow the "url" entry to download the original audio files, or the "path" entry if preprocessed audio files have already been downloaded to disk. After that, for each audio file, one can follow the "segments" entry to work out the trainable audio segments and their corresponding transcripts. There are also various supplementary entries, such as "subsets" and "md5", which may be helpful for your task.
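As a minimal sketch of this extraction (assuming jq is installed, GigaSpeech.json is in the current directory, and the field layout matches the snippet above), the following prints one tab-separated line per XL-subset segment, with the audio path, segment ID, time boundaries, and normalized transcript:

```bash
# List audio path, segment ID, begin/end times, and transcript
# for every segment that belongs to the XL subset.
jq -r '
  .audios[]
  | .path as $path
  | .segments[]
  | select(.subsets | index("{XL}"))
  | [$path, .sid, .begin_time, .end_time, .text_tn]
  | @tsv
' GigaSpeech.json
```

Dropping the select line lists every segment regardless of subset.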

The metadata file GigaSpeech.json is version controlled and is expected to be updated over time. In future releases, we plan to add speaker information to the metadata file, so that it will be suitable for speaker identification/verification tasks. We also plan to add more data from different sources to increase diversity.

We also provide some convenient command-line tools based on jq, e.g., utils/ls_audio.sh, utils/show_segment_info.sh, utils/ls_md5.sh.

Audio Processing

As the "path" entries in the metadata snippet above show, GigaSpeech audio is distributed in Opus format; the data preparation scripts convert it as needed for each toolkit.

Text Pre-Processing

Transcripts are released in normalized form (the "text_tn" field): text is upper-cased, and punctuation is kept as explicit tags such as <COMMA> and <PERIOD>.

Text Post-Processing (before scoring)

Before scoring, these tags should be removed from both hypotheses and references so that they do not count as errors.
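As a minimal sketch of that cleanup (the pattern assumed here covers uppercase tags like <COMMA> and <PERIOD> seen in the snippet above; hyp.txt is a hypothetical hypothesis file, one utterance per line):

```bash
# Remove uppercase tags such as <COMMA> and <PERIOD>, then squeeze
# and trim the whitespace they leave behind, before computing WER.
sed -e 's/<[A-Z]*>//g' -e 's/  */ /g' -e 's/^ *//' -e 's/ *$//' hyp.txt > hyp.clean.txt
```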

Add Support for a New Toolkit

To add data preparation support for a new toolkit, please follow toolkits/kaldi/gigaspeech_data_prep.sh and add similar scripts for your own toolkit. For example, for ESPnet2, you would add toolkits/espnet2/gigaspeech_data_prep.sh to prepare the dataset, and all other related scripts should be maintained under toolkits/espnet2.
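For example, a hypothetical skeleton for the ESPnet2 version might look like this (the interface mirrors the Kaldi script above; the body is left to the toolkit author):

```bash
#!/usr/bin/env bash
# toolkits/espnet2/gigaspeech_data_prep.sh -- hypothetical skeleton.
# Usage: gigaspeech_data_prep.sh [--train-subset XL] <gigaspeech-root> <output-dir>

set -euo pipefail

train_subset=XL
if [ "${1:-}" = "--train-subset" ]; then
  train_subset=$2
  shift 2
fi
gigaspeech_root=$1
output_dir=$2

# 1. Parse GigaSpeech.json under $gigaspeech_root.
# 2. Keep segments whose "subsets" entry contains "{$train_subset}".
# 3. Write segment lists and transcripts under $output_dir in the
#    format your toolkit expects.
```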

Collaboration

We are a group of volunteers trying to make speech technologies easier to use, and we welcome contributions of any kind. We are currently exploring a number of directions; if you are interested in one of them and think you will be able to help, please contact gigaspeech@speechcolab.org.

Institutional Contributors

| Institution | Contribution |
|---|---|
| IEIT, Tsinghua University | Computing power; data host; researchers |
| Magic Data | Data host mirror |
| speechocean | Data host mirror; evaluation data annotation |
| Xiaomi Corporation | Computing power; researchers |

Citation

Please cite our paper if you find this work useful:

```bibtex
@inproceedings{GigaSpeech2021,
  title={GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio},
  booktitle={Proc. Interspeech 2021},
  year={2021},
  author={Guoguo Chen and Shuzhou Chai and Guanbo Wang and Jiayu Du and Wei-Qiang Zhang and Chao Weng and Dan Su and Daniel Povey and Jan Trmal and Junbo Zhang and Mingjie Jin and Sanjeev Khudanpur and Shinji Watanabe and Shuaijiang Zhao and Wei Zou and Xiangang Li and Xuchen Yao and Yongqing Wang and Yujun Wang and Zhao You and Zhiyong Yan}
}
```

Contact

If you have any concerns, please contact gigaspeech@speechcolab.org.

Metadata Changelog