WenetSpeech

Official website | Paper

A 10000+ Hours Multi-domain Chinese Corpus for Speech Recognition

Download

Please visit the official website, read the license, and follow the instructions there to apply for the PASSWORD needed to download the data. Once you have it, save it to the password file:

echo 'PASSWORD' > SAFEBOX/password
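
Before starting a long download, it can help to confirm that the password file is in place. The following is a minimal sketch, not part of the official tooling: the helper name `password_ready` is hypothetical, and it relies only on what the command above shows, namely that the password is stored as text in `SAFEBOX/password`.

```python
from pathlib import Path


def password_ready(safebox_dir="SAFEBOX"):
    """Return True if <safebox_dir>/password exists and holds a real password.

    Hypothetical pre-flight check; the download script itself may
    perform different or additional validation.
    """
    password_file = Path(safebox_dir) / "password"
    if not password_file.is_file():
        return False
    content = password_file.read_text(encoding="utf-8").strip()
    # Reject an empty file and the literal placeholder from the README.
    return content not in ("", "PASSWORD")


if __name__ == "__main__":
    if password_ready():
        print("Password file found; ready to run the download script.")
    else:
        print("Write your password to SAFEBOX/password first.")
```

Run it from the repository root so that the relative `SAFEBOX` path resolves to the same directory the `echo` command wrote to.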

From Tencent Meeting (default)

Download WenetSpeech:

bash utils/download_wenetspeech.sh DOWNLOAD_DIR UNTAR_DIR

From ModelScope

Install modelscope (depends on torch) before downloading:

conda create -n modelscope python=3.7
conda activate modelscope
pip install torch
pip install modelscope -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

Download WenetSpeech from modelscope:

sed -i 's/modelscope=false/modelscope=true/g' utils/download_wenetspeech.sh
bash utils/download_wenetspeech.sh DOWNLOAD_DIR UNTAR_DIR

Discussion & Communication

Please scan the QR code on the left to follow the official WeNet account. We have also created a WeChat group for better discussion and quicker responses. To join, scan the personal QR code on the right; that contact will invite you to the chat group.

<img src="https://github.com/robin1001/qr/blob/master/wenet.jpeg" width="250px"><img src="https://github.com/wenet-e2e/wenet-contributors/blob/main/wenetspeech/lvhang.jpg" width="250px">

Benchmark

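| Toolkit | Dev | Test_Net | Test_Meeting | AIShell-1 |
|---------|-----|----------|--------------|-----------|
| Kaldi | 9.07 | 12.83 | 24.72 | 5.41 |
| ESPNet | 9.70 | 8.90 | 15.90 | 3.90 |
| WeNet | 8.88 | 9.70 | 15.59 | 4.61 |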

Description

Creation

All the data are collected from YouTube and Podcast. YouTube recordings are labeled with optical character recognition (OCR), and Podcast recordings are labeled with automatic speech recognition (ASR). To improve the quality of the corpus, we use a novel end-to-end label error detection method to further validate and filter the data.

Categories

In summary, WenetSpeech groups all data into 3 categories, as the following table shows:

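| Set | Hours | Confidence | Usage |
|-----|-------|------------|-------|
| High Label | 10005 | >=0.95 | Supervised Training |
| Weak Label | 2478 | [0.6, 0.95] | Semi-supervised or noise training |
| Unlabel | 9952 | / | Unsupervised training or Pre-training |
| In Total | 22435 | / | All above |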
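
The confidence thresholds in the table above can be turned into a small helper. This is only an illustrative sketch: the function name `label_category` is hypothetical, and two boundary choices are assumptions rather than something the table states, namely that a confidence of exactly 0.95 counts as High Label and that a missing or below-0.6 confidence is treated as Unlabel.

```python
def label_category(confidence):
    """Map a segment confidence to a category name from the table above.

    Assumptions (not stated in the table): a confidence of exactly 0.95
    counts as High Label, and a missing (None) or below-0.6 confidence
    is treated as Unlabel.
    """
    if confidence is None or confidence < 0.6:
        return "Unlabel"
    if confidence >= 0.95:
        return "High Label"
    return "Weak Label"


# Hypothetical confidence values, for illustration only.
for conf in (1.0, 0.95, 0.8, 0.6, 0.3, None):
    print(conf, "->", label_category(conf))
```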

High Label Data

We classify the High Label data into 10 groups according to domain, speaking style, and scenario. The table below lists the hours per group and source.

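| Domain | YouTube | Podcast | Total |
|--------|---------|---------|-------|
| audiobook | 0 | 250.9 | 250.9 |
| commentary | 112.6 | 135.7 | 248.3 |
| documentary | 386.7 | 90.5 | 477.2 |
| drama | 4338.2 | 0 | 4338.2 |
| interview | 324.2 | 614 | 938.2 |
| news | 0 | 868 | 868 |
| reading | 0 | 1110.2 | 1110.2 |
| talk | 204 | 90.7 | 294.7 |
| variety | 603.3 | 224.5 | 827.8 |
| others | 144 | 507.5 | 651.5 |
| Total | 6113 | 3892 | 10005 |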
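
As a quick sanity check on the table above, the per-domain hours can be summed and compared against the Total row, and each domain's share of the High Label data can be derived from them. The sketch below copies the numbers from the table; the names `HOURS` and `totals` are illustrative only.

```python
# Hours per domain as (YouTube, Podcast), copied from the table above.
HOURS = {
    "audiobook": (0.0, 250.9),
    "commentary": (112.6, 135.7),
    "documentary": (386.7, 90.5),
    "drama": (4338.2, 0.0),
    "interview": (324.2, 614.0),
    "news": (0.0, 868.0),
    "reading": (0.0, 1110.2),
    "talk": (204.0, 90.7),
    "variety": (603.3, 224.5),
    "others": (144.0, 507.5),
}


def totals(hours):
    """Return (YouTube, Podcast, overall) hours, rounded to one decimal."""
    youtube = sum(y for y, _ in hours.values())
    podcast = sum(p for _, p in hours.values())
    return round(youtube, 1), round(podcast, 1), round(youtube + podcast, 1)


youtube_total, podcast_total, grand_total = totals(HOURS)
print("YouTube:", youtube_total, "Podcast:", podcast_total, "Total:", grand_total)

# Share of the High Label data contributed by each domain, largest first.
for domain, (y, p) in sorted(HOURS.items(), key=lambda kv: -sum(kv[1])):
    print(f"{domain:12s} {100 * (y + p) / grand_total:5.1f}%")
```

The computed column sums match the Total row (6113, 3892, and 10005 hours).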

As shown in the following table, we provide 3 training subsets, namely S, M and L for building ASR systems on different data scales.

| Training Subsets | Confidence | Hours |
|------------------|------------|-------|
| L | [0.95, 1.0] | 10005 |
| M | 1.0 | 1000 |
| S | 1.0 | 100 |

Evaluation Sets

| Evaluation Sets | Hours | Source | Description |
|-----------------|-------|--------|-------------|
| DEV | 20 | Internet | Designed for speech tools that require a cross-validation set during training |
| TEST_NET | 23 | Internet | Matched test |
| TEST_MEETING | 15 | Real meeting | Mismatched test: a far-field, conversational, spontaneous meeting dataset |

Contributors

<a href="http://lxie.npu-aslp.org" target="_blank"><img src="https://raw.githubusercontent.com/wenet-e2e/wenet-contributors/main/colleges/nwpu.png" width="250px"></a><a href="https://www.chumenwenwen.com" target="_blank"><img src="https://raw.githubusercontent.com/wenet-e2e/wenet-contributors/main/companies/chumenwenwen.png" width="250px"></a><a href="http://www.aishelltech.com" target="_blank"><img src="https://raw.githubusercontent.com/wenet-e2e/wenet-contributors/main/companies/aishelltech.png" width="250px"></a>
<a href="" target="_blank"><img src="https://raw.githubusercontent.com/wenet-e2e/WenetSpeech/gh-pages/assets/img/tencent.png" width="250px"></a><a href="" target="_blank"><img src="https://raw.githubusercontent.com/wenet-e2e/WenetSpeech/gh-pages/assets/img/MindSpore.png" width="250px"></a><a href="" target="_blank"><img src="https://raw.githubusercontent.com/wenet-e2e/WenetSpeech/gh-pages/assets/img/xian.png" width="250px"></a>

ACKNOWLEDGEMENTS