# wer_are_we

WER are we? An attempt at tracking states of the art(s) and recent results on speech recognition. Feel free to correct! (Inspired by Are we there yet?)
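All numbers below are word error rates: the Levenshtein (edit) distance between the reference transcript and the system's hypothesis, counted in words and normalized by the reference length. PER (further down) is the same metric over phone sequences. A minimal sketch of the computation (illustrative only; the papers below typically score with NIST's sclite tool, which also handles text normalization):

```python
# Word-level Levenshtein distance normalized by reference length.
# The same computation over phone sequences gives PER.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(1, len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution,      # substitute (or match)
                          d[i - 1][j] + 1,   # delete a reference word
                          d[i][j - 1] + 1)   # insert a hypothesis word
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion over six reference words
```

Note that WER can exceed 100% when the hypothesis contains many insertions.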

## WER

### LibriSpeech

(Possibly trained on more data than LibriSpeech.)

| WER test-clean | WER test-other | Paper | Published | Notes |
| --- | --- | --- | --- | --- |
| 5.83% | 12.69% | Humans (Deep Speech 2: End-to-End Speech Recognition in English and Mandarin) | December 2015 | Humans |
| 1.8% | 2.9% | HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units | June 2021 | CNN-Transformer + Transformer LM (self-supervised, Libri-Light 60k hours unlabeled data) |
| 1.9% | 3.9% | Conformer: Convolution-augmented Transformer for Speech Recognition | May 2020 | Convolution-augmented Transformer (Conformer) + 3-layer LSTM LM (data augmentation: SpecAugment) |
| 1.9% | 4.1% | ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context | May 2020 | CNN-RNN-Transducer (ContextNet) + 3-layer LSTM LM (data augmentation: SpecAugment) |
| 2.0% | 4.1% | End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures | November 2019 | Conv+Transformer AM (10k word pieces) with ConvLM decoding and Transformer rescoring + 60k hours unlabeled |
| 2.1% | 4.1% | Efficient Training of Neural Transducer for Speech Recognition | May 2022 | Conformer Transducer (efficient 3-stage progressive training: 35 epochs in total) + Transformer LM |
| 2.3% | 4.9% | Transformer-based Acoustic Modeling for Hybrid Speech Recognition | October 2019 | Transformer AM (chenones) + 4-gram LM + neural LM rescoring (data augmentation: speed perturbation and SpecAugment) |
| 2.3% | 5.0% | RWTH ASR Systems for LibriSpeech: Hybrid vs Attention | September 2019, Interspeech | HMM-DNN + lattice-based sMBR + LSTM LM + Transformer LM rescoring (no data augmentation) |
| 2.3% | 5.2% | End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures | November 2019 | Conv+Transformer AM (10k word pieces) with ConvLM decoding and Transformer rescoring |
| 2.2% | 5.8% | State-of-the-Art Speech Recognition Using Multi-Stream Self-Attention With Dilated 1D Convolutions | October 2019 | Multi-stream self-attention in hybrid ASR + 4-gram LM + neural LM rescoring (no data augmentation) |
| 2.5% | 5.8% | SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition | April 2019 | Listen Attend Spell |
| 3.2% | 7.6% | From Senones to Chenones: Tied Context-Dependent Graphemes for Hybrid Speech Recognition | October 2019 | LC-BLSTM AM (chenones) + 4-gram LM (data augmentation: speed perturbation and SpecAugment) |
| 3.19% | 7.64% | The CAPIO 2017 Conversational Speech Recognition System | April 2018 | TDNN + TDNN-LSTM + CNN-bLSTM + Dense TDNN-LSTM across two kinds of trees + N-gram LM + neural LM rescoring |
| 2.44% | 8.29% | Improved Vocal Tract Length Perturbation for a State-of-the-Art End-to-End Speech Recognition System | September 2019, Interspeech | encoder-attention-decoder + Transformer LM |
| 3.80% | 8.76% | Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks | September 2018, Interspeech | Kaldi recipe, 17-layer TDNN-F + iVectors |
| 2.8% | 9.3% | RWTH ASR Systems for LibriSpeech: Hybrid vs Attention | September 2019, Interspeech | encoder-attention-decoder + BPE + Transformer LM (no data augmentation) |
| 3.26% | 10.47% | Fully Convolutional Speech Recognition | December 2018 | End-to-end CNN on the waveform + conv LM |
| 3.82% | 12.76% | Improved training of end-to-end attention models for speech recognition | September 2018, Interspeech | encoder-attention-decoder end-to-end model |
| 4.28% | | Purely sequence-trained neural networks for ASR based on lattice-free MMI | September 2016 | HMM-TDNN trained with MMI + data augmentation (speed) + iVectors + 3 regularizations |
| 4.83% | | A time delay neural network architecture for efficient modeling of long temporal contexts | 2015 | HMM-TDNN + iVectors |
| 5.15% | 12.73% | Deep Speech 2: End-to-End Speech Recognition in English and Mandarin | December 2015 | 9-layer model w/ 2 layers of 2D-invariant convolution & 7 recurrent layers, w/ 100M parameters, trained on 11,940h |
| 5.51% | 13.97% | LibriSpeech: an ASR Corpus Based on Public Domain Audio Books | 2015 | HMM-DNN + pNorm* |
| 4.8% | 14.5% | Letter-Based Speech Recognition with Gated ConvNets | December 2017 | (Gated) ConvNet for AM going to letters + 4-gram LM |
| 8.01% | 22.49% | LibriSpeech: an ASR Corpus Based on Public Domain Audio Books (Kaldi recipe) | 2015 | HMM-(SAT)GMM |
| 12.51% | | Audio Augmentation for Speech Recognition | 2015 | TDNN + pNorm + speed up/down speech |

### WSJ

(Possibly trained on more data than WSJ.)

| WER eval'92 | WER eval'93 | Paper | Published | Notes |
| --- | --- | --- | --- | --- |
| 5.03% | 8.08% | Humans (Deep Speech 2: End-to-End Speech Recognition in English and Mandarin) | December 2015 | Humans |
| 2.9% | | End-to-end Speech Recognition Using Lattice-Free MMI | September 2018 | HMM-DNN LF-MMI trained (biphone) |
| 3.10% | | Deep Speech 2: End-to-End Speech Recognition in English and Mandarin | December 2015 | 9-layer model w/ 2 layers of 2D-invariant convolution & 7 recurrent layers, w/ 100M parameters |
| 3.47% | | Deep Recurrent Neural Networks for Acoustic Modelling | April 2015 | TC-DNN-BLSTM-DNN |
| 3.5% | 6.8% | Fully Convolutional Speech Recognition | December 2018 | End-to-end CNN on the waveform + conv LM |
| 3.63% | 5.66% | LibriSpeech: an ASR Corpus Based on Public Domain Audio Books | 2015 | test set on open vocabulary (i.e. harder); model = HMM-DNN + pNorm* |
| 4.1% | | End-to-end Speech Recognition Using Lattice-Free MMI | September 2018 | HMM-DNN E2E LF-MMI trained (word n-gram) |
| 5.6% | | Convolutional Neural Networks-based Continuous Speech Recognition using Raw Speech Signal | 2014 | CNN over raw speech (wav) |
| 5.7% | 8.7% | End-to-end Speech Recognition from the Raw Waveform | June 2018 | End-to-end CNN on the waveform |

### Hub5'00 Evaluation (Switchboard / CallHome)

(Possibly trained on more data than SWB, but test set = full Hub5'00.)

| WER (SWB) | WER (CH) | Paper | Published | Notes |
| --- | --- | --- | --- | --- |
| 4.9% | 9.5% | An investigation of phone-based subword units for end-to-end speech recognition | April 2020 | 2 CNN + 24-layer Transformer encoder and 12-layer Transformer decoder model, with char BPE and phoneme BPE units |
| 5.0% | 9.1% | The CAPIO 2017 Conversational Speech Recognition System | December 2017 | 2 Dense LSTMs + 3 CNN-bLSTMs across 3 phonesets from previous Capio paper & AM adaptation using parameter averaging (5.6% SWB / 10.5% CH single systems) |
| 5.1% | 9.9% | Language Modeling with Highway LSTM | September 2017 | HW-LSTM LM trained with Switchboard+Fisher+Gigaword+Broadcast News+Conversations, AM from previous IBM paper |
| 5.1% | | The Microsoft 2017 Conversational Speech Recognition System | August 2017 | ~2016 system + character-based dialog session aware (turns of speech) LSTM LM |
| 5.3% | 10.1% | Deep Learning-based Telephony Speech Recognition in the Wild | August 2017 | Ensemble of 3 CNN-bLSTM (5.7% SWB / 11.3% CH single systems) |
| 5.5% | 10.3% | English Conversational Telephone Speech Recognition by Humans and Machines | March 2017 | ResNet + BiLSTMs acoustic model with 40d FMLLR + i-Vector inputs, trained on SWB+Fisher+CH; n-gram + model-M + LSTM + strided (à trous) convs-based LM trained on Switchboard+Fisher+Gigaword+Broadcast |
| 6.3% | 11.9% | The Microsoft 2016 Conversational Speech Recognition System | September 2016 | VGG/ResNet/LACE/BiLSTM acoustic model trained on SWB+Fisher+CH, N-gram + RNNLM language model trained on Switchboard+Fisher+Gigaword+Broadcast |
| 6.3% | 13.3% | An investigation of phone-based subword units for end-to-end speech recognition | April 2020 | 2 CNN + 24-layer Transformer encoder and 12-layer Transformer decoder model, with char BPE and phoneme BPE units; trained only on SWBD 300 hours |
| 6.6% | 12.2% | The IBM 2016 English Conversational Telephone Speech Recognition System | June 2016 | RNN + VGG + LSTM acoustic model trained on SWB+Fisher+CH, N-gram + "model M" + NNLM language model |
| 6.8% | 14.1% | SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition | April 2019 | Listen Attend Spell |
| 8.5% | 13% | Purely sequence-trained neural networks for ASR based on lattice-free MMI | September 2016 | HMM-BLSTM trained with MMI + data augmentation (speed) + iVectors + 3 regularizations + Fisher |
| 9.2% | 13.3% | Purely sequence-trained neural networks for ASR based on lattice-free MMI | September 2016 | HMM-TDNN trained with MMI + data augmentation (speed) + iVectors + 3 regularizations + Fisher (10% / 15.1% respectively trained on SWBD only) |
| 12.6% | 16% | Deep Speech: Scaling up end-to-end speech recognition | December 2014 | CNN + Bi-RNN + CTC (speech to letters), 25.9% WER if trained only on SWB |
| 11% | 17.1% | A time delay neural network architecture for efficient modeling of long temporal contexts | 2015 | HMM-TDNN + iVectors |
| 12.6% | 18.4% | Sequence-discriminative training of deep neural networks | 2013 | HMM-DNN + sMBR |
| 12.9% | 19.3% | Audio Augmentation for Speech Recognition | 2015 | HMM-TDNN + pNorm + speed up/down speech |
| 15% | 19.1% | Building DNN Acoustic Models for Large Vocabulary Speech Recognition | June 2014 | DNN + Dropout |
| 10.4% | | Joint Training of Convolutional and Non-Convolutional Neural Networks | 2014 | CNN on MFSC/fbanks + 1 non-conv layer for FMLLR/I-Vectors concatenated in a DNN |
| 11.5% | | Deep Convolutional Neural Networks for LVCSR | 2013 | CNN |
| 12.2% | | Very Deep Multilingual Convolutional Neural Networks for LVCSR | September 2015 | Deep CNN (10 conv, 4 FC layers), multi-scale feature maps |
| 11.8% | 25.7% | Improved training of end-to-end attention models for speech recognition | September 2018, Interspeech | encoder-attention-decoder end-to-end model, trained on 300h SWB |

### Rich Transcriptions

| WER RT-02 | WER RT-03 | WER RT-04 | Paper | Published | Notes |
| --- | --- | --- | --- | --- | --- |
| 8.1% | 8.0% | | The CAPIO 2017 Conversational Speech Recognition System | April 2018 | 2 Dense LSTMs + 3 CNN-bLSTMs across 3 phonesets from previous Capio paper & AM adaptation using parameter averaging |
| 8.2% | 8.1% | 7.7% | Language Modeling with Highway LSTM | September 2017 | HW-LSTM LM trained with Switchboard+Fisher+Gigaword+Broadcast News+Conversations, AM from previous IBM paper |
| 8.3% | 8.0% | 7.7% | English Conversational Telephone Speech Recognition by Humans and Machines | March 2017 | ResNet + BiLSTMs acoustic model with 40d FMLLR + i-Vector inputs, trained on SWB+Fisher+CH; n-gram + model-M + LSTM + strided (à trous) convs-based LM trained on Switchboard+Fisher+Gigaword+Broadcast |

### Fisher (RT03S FSH)

| WER | Paper | Published | Notes |
| --- | --- | --- | --- |
| 9.6% | Purely sequence-trained neural networks for ASR based on lattice-free MMI | September 2016 | HMM-BLSTM trained with MMI + data augmentation (speed) + iVectors + 3 regularizations + SWBD |
| 9.8% | Purely sequence-trained neural networks for ASR based on lattice-free MMI | September 2016 | HMM-TDNN trained with MMI + data augmentation (speed) + iVectors + 3 regularizations + SWBD |

### TED-LIUM

| WER Test | Paper | Published | Notes |
| --- | --- | --- | --- |
| 5.6% | The RWTH ASR System for TED-LIUM release 2: Improving Hybrid HMM with SpecAugment | April 2020 | HMM-BLSTM + iVectors + SpecAugment + sMBR + Transformer LM |
| 6.5% | The CAPIO 2017 Conversational Speech Recognition System | April 2018 | TDNN + TDNN-LSTM + CNN-bLSTM + Dense TDNN-LSTM across two kinds of trees |
| 11.2% | Purely sequence-trained neural networks for ASR based on lattice-free MMI | September 2016 | HMM-TDNN trained with LF-MMI + data augmentation (speed perturbation) + iVectors + 3 regularizations |
| 15.3% | TED-LIUM: an Automatic Speech Recognition dedicated corpus | May 2014 | Multi-layer perceptron (MLP) with bottleneck feature extraction |

### CHiME6 (multiarray noisy speech)

| WER (fixed LM) | WER (unlimited LM) | Paper | Published | Notes |
| --- | --- | --- | --- | --- |
| 31.0% | 30.5% | The USTC-NELSLIP Systems for CHiME-6 Challenge | May 2020 | WPE + SSA + GSS + data augmentation (speed, volume) + SpecAugment + 8 AMs fusion (2 single-feature AMs + 6 multi-feature AMs) |
| 35.1% | 34.5% | The IOA Systems for CHiME-6 Challenge | May 2020 | WPE + multi-stage GSS + SpecAugment + data augmentation (noise, reverberation, speed) + 3 AMs fusion (CNN-TDNNF / CNN-TDNN-BLSTM / CNN-BLSTM) |
| 35.8% | 33.9% | The STC System for the CHiME-6 Challenge | May 2020 | WPE + GSS + SpecAugment + 3 AMs fusion (2 TDNN-F / CNN-TDNNF + stats + SpecAugment + self-attention + sMBR) + MBR decoding |
| 51.3% | 51.3% | CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings | May 2020 | WPE + GSS + data augmentation (noise, reverberation, speed) + TDNNF |

### CHiME (noisy speech)

| clean | real | sim | Paper | Published | Notes |
| --- | --- | --- | --- | --- | --- |
| 3.34% | 21.79% | 45.05% | Deep Speech 2: End-to-End Speech Recognition in English and Mandarin | December 2015 | 9-layer model w/ 2 layers of 2D-invariant convolution & 7 recurrent layers, w/ 68M parameters |
| 6.30% | 67.94% | 80.27% | Deep Speech: Scaling up end-to-end speech recognition | December 2014 | CNN + Bi-RNN + CTC (speech to letters) |

TODO

## PER

### TIMIT

(So far, all results trained on TIMIT and tested on the core test set.)

| PER | Paper | Published | Notes |
| --- | --- | --- | --- |
| 12.9% | Instantaneous Frequency Filter-Bank Features for Low Resource Speech Recognition Using Deep Recurrent Architectures | September 2021 | Li-GRU with FMLLR + IFFB + FBANK + IFFB-FMLLR features |
| 13.8% | The PyTorch-Kaldi Speech Recognition Toolkit | February 2019 | MLP + Li-GRU + MLP on MFCC + FBANK + fMLLR; silence phones are removed from reference and hypothesis transcripts! |
| 14.9% | Light Gated Recurrent Units for Speech Recognition | March 2018 | GRU with the reset gate removed, ReLU activation instead of tanh, and batch normalization |
| 16.5% | Phone recognition with hierarchical convolutional deep maxout networks | September 2015 | Hierarchical maxout CNN + dropout |
| 16.5% | A Regularization Post Layer: An Additional Way how to Make Deep Neural Networks Robust | 2017 | DBN with last-layer regularization |
| 16.7% | Combining Time- and Frequency-Domain Convolution in Convolutional Neural Network-Based Phone Recognition | 2014 | CNN in time and frequency + dropout; 17.6% w/o dropout |
| 16.8% | An investigation into instantaneous frequency estimation methods for improved speech recognition features | November 2017 | DNN-HMM with MFCC + IFCC features |
| 17.3% | Segmental Recurrent Neural Networks for End-to-end Speech Recognition | March 2016 | RNN-CRF on 24(x3) MFSC |
| 17.6% | Attention-Based Models for Speech Recognition | June 2015 | Bi-RNN + attention |
| 17.7% | Speech Recognition with Deep Recurrent Neural Networks | March 2013 | Bi-LSTM + skip connections w/ RNN transducer (18.4% with CTC only) |
| 18.0% | Learning Filterbanks from Raw Speech for Phone Recognition | October 2017 | Complex ConvNets on raw speech w/ mel-fbank init |
| 18.8% | WaveNet: A Generative Model for Raw Audio | September 2016 | WaveNet architecture with mean-pooling layer after residual block + a few non-causal conv layers |
| 23% | Deep Belief Networks for Phone Recognition | 2009 | (first modern) HMM-DBN |

## LM

TODO

## Noise-robust ASR

TODO

## BigCorp™®-specific dataset

TODO?

## Lexicon