
awesome-speech-recognition-speech-synthesis-papers

Paper List

Text-to-Audio
Automatic Speech Recognition(ASR)
Speaker Verification
Voice Conversion(VC)
Speech Synthesis(TTS)
Language Modelling
Confidence Estimates
Music Modelling
Interesting papers

Text to Audio

AudioLM: a Language Modeling Approach to Audio Generation(2022), Zalán Borsos et al. [pdf]
AudioLDM: Text-to-Audio Generation with Latent Diffusion Models(2023), Haohe Liu et al. [pdf]
MusicLM: Generating Music From Text(2023), Andrea Agostinelli et al. [pdf]
Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion(2023), Flavio Schneider et al. [pdf]
Noise2Music: Text-conditioned Music Generation with Diffusion Models(2023), Qingqing Huang et al. [pdf]

Automatic Speech Recognition

An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition(1982), S. E. Levinson et al. [pdf]
A Maximum Likelihood Approach to Continuous Speech Recognition(1983), Lalit R. Bahl et al. [pdf]
Heterogeneous Acoustic Measurements and Multiple Classifiers for Speech Recognition(1998), Andrew K. Halberstadt. [pdf]
Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition(1986), Lalit R. Bahl et al. [pdf]
A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition(1989), Lawrence R Rabiner. [pdf]
Phoneme recognition using time-delay neural networks(1989), Alexander H. Waibel et al. [pdf]
Speaker-independent phone recognition using hidden Markov models(1989), Kai-Fu Lee et al. [pdf]
Hidden Markov Models for Speech Recognition(1991), B. H. Juang et al. [pdf]
Review of TDNN (Time Delay Neural Network) Architectures for Speech Recognition(1991), Masahide Sugiyama et al. [pdf]
Connectionist Speech Recognition: A Hybrid Approach(1994), Herve Bourlard et al. [pdf]
A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER)(1997), J.G. Fiscus. [pdf]
Speech recognition with weighted finite-state transducers(2001), M Mohri et al. [pdf]
Framewise phoneme classification with bidirectional LSTM and other neural network architectures(2005), Alex Graves et al. [pdf]
Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks(2006), Alex Graves et al. [pdf]
The kaldi speech recognition toolkit(2011), Daniel Povey et al. [pdf]
Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition(2012), Ossama Abdel-Hamid et al. [pdf]
Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition(2012), George E. Dahl et al. [pdf]
Deep Neural Networks for Acoustic Modeling in Speech Recognition(2012), Geoffrey Hinton et al. [pdf]
Sequence Transduction with Recurrent Neural Networks(2012), Alex Graves et al. [pdf]
Deep convolutional neural networks for LVCSR(2013), Tara N. Sainath et al. [pdf]
Improving deep neural networks for LVCSR using rectified linear units and dropout(2013), George E. Dahl et al. [pdf]
Improving low-resource CD-DNN-HMM using dropout and multilingual DNN training(2013), Yajie Miao et al. [pdf]
Improvements to deep convolutional neural networks for LVCSR(2013), Tara N. Sainath et al. [pdf]
Machine Learning Paradigms for Speech Recognition: An Overview(2013), Li Deng et al. [pdf]
Recent advances in deep learning for speech research at Microsoft(2013), Li Deng et al. [pdf]
Speech recognition with deep recurrent neural networks(2013), Alex Graves et al. [pdf]
Convolutional deep maxout networks for phone recognition(2014), László Tóth et al. [pdf]
Convolutional Neural Networks for Speech Recognition(2014), Ossama Abdel-Hamid et al. [pdf]
Combining time- and frequency-domain convolution in convolutional neural network-based phone recognition(2014), László Tóth. [pdf]
Deep Speech: Scaling up end-to-end speech recognition(2014), Awni Y. Hannun et al. [pdf]
End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results(2014), Jan Chorowski et al. [pdf]
First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs(2014), Andrew L. Maas et al. [pdf]
Long short-term memory recurrent neural network architectures for large scale acoustic modeling(2014), Hasim Sak et al. [pdf]
Robust CNN-based speech recognition with Gabor filter kernels(2014), Shuo-Yiin Chang et al. [pdf]
Stochastic pooling maxout networks for low-resource speech recognition(2014), Meng Cai et al. [pdf]
Towards End-to-End Speech Recognition with Recurrent Neural Networks(2014), Alex Graves et al. [pdf]
A neural transducer(2015), N Jaitly et al. [pdf]
Attention-Based Models for Speech Recognition(2015), Jan Chorowski et al. [pdf]
Analysis of CNN-based speech recognition system using raw speech as input(2015), Dimitri Palaz et al. [pdf]
Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks(2015), Tara N. Sainath et al. [pdf]
Deep convolutional neural networks for acoustic modeling in low resource languages(2015), William Chan et al. [pdf]
Deep Neural Networks for Single-Channel Multi-Talker Speech Recognition(2015), Chao Weng et al. [pdf]
EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding(2015), Y Miao et al. [pdf]
Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition(2015), Hasim Sak et al. [pdf]
Lexicon-Free Conversational Speech Recognition with Neural Networks(2015), Andrew L. Maas et al. [pdf]
Online Sequence Training of Recurrent Neural Networks with Connectionist Temporal Classification(2015), Kyuyeon Hwang et al. [pdf]
Advances in All-Neural Speech Recognition(2016), Geoffrey Zweig et al. [pdf]
Advances in Very Deep Convolutional Neural Networks for LVCSR(2016), Tom Sercu et al. [pdf]
End-to-end attention-based large vocabulary speech recognition(2016), Dzmitry Bahdanau et al. [pdf]
Deep Convolutional Neural Networks with Layer-Wise Context Expansion and Attention(2016), Dong Yu et al. [pdf]
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin(2016), Dario Amodei et al. [pdf]
End-to-end attention-based distant speech recognition with Highway LSTM(2016), Hassan Taherian. [pdf]
Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning(2016), Suyoun Kim et al. [pdf]
Listen, attend and spell: A neural network for large vocabulary conversational speech recognition(2016), William Chan et al. [pdf]
Latent Sequence Decompositions(2016), William Chan et al. [pdf]
Modeling Time-Frequency Patterns with LSTM vs. Convolutional Architectures for LVCSR Tasks(2016), Tara N. Sainath et al. [pdf]
Recurrent Models for Auditory Attention in Multi-Microphone Distance Speech Recognition(2016), Suyoun Kim et al. [pdf]
Segmental Recurrent Neural Networks for End-to-End Speech Recognition(2016), Liang Lu et al. [pdf]
Towards better decoding and language model integration in sequence to sequence models(2016), Jan Chorowski et al. [pdf]
Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition(2016), Yanmin Qian et al. [pdf]
Very Deep Convolutional Networks for End-to-End Speech Recognition(2016), Yu Zhang et al. [pdf]
Very deep multilingual convolutional neural networks for LVCSR(2016), Tom Sercu et al. [pdf]
Wav2Letter: an End-to-End ConvNet-based Speech Recognition System(2016), Ronan Collobert et al. [pdf]
Attentive Convolutional Neural Network based Speech Emotion Recognition: A Study on the Impact of Input Features, Signal Length, and Acted Speech(2017), Michael Neumann et al. [pdf]
An enhanced automatic speech recognition system for Arabic(2017), Mohamed Amine Menacer et al. [pdf]
Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM(2017), Takaaki Hori et al. [pdf]
A network of deep neural networks for distant speech recognition(2017), Mirco Ravanelli et al. [pdf]
An online sequence-to-sequence model for noisy speech recognition(2017), Chung-Cheng Chiu et al. [pdf]
An Unsupervised Speaker Clustering Technique based on SOM and I-vectors for Speech Recognition Systems(2017), Hany Ahmed et al. [pdf]
Attention-Based End-to-End Speech Recognition in Mandarin(2017), C Shan et al. [pdf]
Building DNN acoustic models for large vocabulary speech recognition(2017), Andrew L. Maas et al. [pdf]
Direct Acoustics-to-Word Models for English Conversational Speech Recognition(2017), Kartik Audhkhasi et al. [pdf]
Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments(2017), Zixing Zhang et al. [pdf]
English Conversational Telephone Speech Recognition by Humans and Machines(2017), George Saon et al. [pdf]
ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA(2017), Song Han et al. [pdf]
Exploring Speech Enhancement with Generative Adversarial Networks for Robust Speech Recognition(2017), Chris Donahue et al. [pdf]
Deep LSTM for Large Vocabulary Continuous Speech Recognition(2017), Xu Tian et al. [pdf]
Dynamic Layer Normalization for Adaptive Neural Acoustic Modeling in Speech Recognition(2017), Taesup Kim et al. [pdf]
Gram-CTC: Automatic Unit Selection and Target Decomposition for Sequence Labelling(2017), Hairong Liu et al. [pdf]
Improving the Performance of Online Neural Transducer Models(2017), Tara N. Sainath et al. [pdf]
Learning Filterbanks from Raw Speech for Phone Recognition(2017), Neil Zeghidour et al. [pdf]
Multichannel End-to-end Speech Recognition(2017), Tsubasa Ochiai et al. [pdf]
Multi-task Learning with CTC and Segmental CRF for Speech Recognition(2017), Liang Lu et al. [pdf]
Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition(2017), Tara N. Sainath et al. [pdf]
Multilingual Speech Recognition With A Single End-To-End Model(2017), Shubham Toshniwal et al. [pdf]
Optimizing expected word error rate via sampling for speech recognition(2017), Matt Shannon. [pdf]
Residual Convolutional CTC Networks for Automatic Speech Recognition(2017), Yisen Wang et al. [pdf]
Residual LSTM: Design of a Deep Recurrent Architecture for Distant Speech Recognition(2017), Jaeyoung Kim et al. [pdf]
Reducing Bias in Production Speech Models(2017), Eric Battenberg et al. [pdf]
Robust Speech Recognition Using Generative Adversarial Networks(2017), Anuroop Sriram et al. [pdf]
State-of-the-art Speech Recognition With Sequence-to-Sequence Models(2017), Chung-Cheng Chiu et al. [pdf]
Towards Language-Universal End-to-End Speech Recognition(2017), Suyoun Kim et al. [pdf]
Accelerating recurrent neural network language model based online speech recognition system(2018), K Lee et al. [pdf]
An improved hybrid CTC-Attention model for speech recognition(2018), Zhe Yuan et al. [pdf]
Hybrid CTC-Attention based End-to-End Speech Recognition using Subword Units(2018), Zhangyu Xiao et al. [pdf]
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition(2019), Daniel S. Park et al. [pdf]
vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations(2019), Alexei Baevski et al. [pdf]
Effectiveness of self-supervised pre-training for speech recognition(2020), Alexei Baevski et al. [pdf]
Improved Noisy Student Training for Automatic Speech Recognition(2020), Daniel S. Park et al. [pdf]
ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context(2020), Wei Han et al. [pdf]
Conformer: Convolution-augmented Transformer for Speech Recognition(2020), Anmol Gulati et al. [pdf]
On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition(2020), Jinyu Li et al. [pdf]
Augmented Contrastive Self-Supervised Learning for Audio Invariant Representations(2021), Melikasadat Emami et al. [pdf]
Efficient Training of Audio Transformers with Patchout(2021), Khaled Koutini et al. [pdf]
MixSpeech: Data Augmentation for Low-resource Automatic Speech Recognition(2021), Linghui Meng et al. [pdf]
Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition(2021), Timo Lohrenz et al. [pdf]
SpecAugment++: A Hidden Space Data Augmentation Method for Acoustic Scene Classification(2021), Helin Wang et al. [pdf]
SpecMix: A Mixed Sample Data Augmentation method for Training with Time-Frequency Domain Features(2021), Gwantae Kim et al. [pdf]
The History of Speech Recognition to the Year 2030(2021), Awni Hannun et al. [pdf]
Voice Conversion Can Improve ASR in Very Low-Resource Settings(2021), Matthew Baas et al. [pdf]
Why does CTC result in peaky behavior?(2021), Albert Zeyer et al. [pdf]
E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR(2022), W. Ronny Huang et al. [pdf]
Music Source Separation with Generative Flow(2022), Ge Zhu et al. [pdf]
Improving Self-Supervised Speech Representations by Disentangling Speakers(2022), Kaizhi Qian et al. [pdf]
Robust Speech Recognition via Large-Scale Weak Supervision(2022), Alec Radford et al. [pdf]
On decoder-only architecture for speech-to-text and large language model integration(2023), Jian Wu et al. [pdf]

Speaker Verification

Speaker Verification Using Adapted Gaussian Mixture Models(2000), Douglas A. Reynolds et al. [pdf]
A tutorial on text-independent speaker verification(2004), Frédéric Bimbot et al. [pdf]
Deep neural networks for small footprint text-dependent speaker verification(2014), E Variani et al. [pdf]
Deep Speaker Vectors for Semi Text-independent Speaker Verification(2015), Lantian Li et al. [pdf]
Deep Speaker: an End-to-End Neural Speaker Embedding System(2017), Chao Li et al. [pdf]
Deep Speaker Feature Learning for Text-independent Speaker Verification(2017), Lantian Li et al. [pdf]
Deep Speaker Verification: Do We Need End to End?(2017), Dong Wang et al. [pdf]
Speaker Diarization with LSTM(2017), Quan Wang et al. [pdf]
Text-Independent Speaker Verification Using 3D Convolutional Neural Networks(2017), Amirsina Torfi et al. [pdf]
End-to-End Text-Independent Speaker Verification with Triplet Loss on Short Utterances(2017), Chunlei Zhang et al. [pdf]
Deep Neural Network Embeddings for Text-Independent Speaker Verification(2017), David Snyder et al. [pdf]
Deep Discriminative Embeddings for Duration Robust Speaker Verification(2018), Na Li et al. [pdf]
Learning Discriminative Features for Speaker Identification and Verification(2018), Sarthak Yadav et al. [pdf]
Large Margin Softmax Loss for Speaker Verification(2019), Yi Liu et al. [pdf]
Unsupervised feature enhancement for speaker verification(2019), Phani Sankar Nidadavolu et al. [pdf]
Feature enhancement with deep feature losses for speaker verification(2019), Saurabh Kataria et al. [pdf]
Generalized end2end loss for speaker verification(2019), Li Wan et al. [pdf]
Spatial Pyramid Encoding with Convex Length Normalization for Text-Independent Speaker Verification(2019), Youngmoon Jung et al. [pdf]
VoxSRC 2019: The first VoxCeleb Speaker Recognition Challenge(2019), Joon Son Chung et al. [pdf]
BUT System Description to VoxCeleb Speaker Recognition Challenge 2019(2019), Hossein Zeinali et al. [pdf]
The ID R&D System Description for Short-duration Speaker Verification Challenge 2021(2021), Alenin et al. [pdf]

Voice Conversion

Voice conversion using deep Bidirectional Long Short-Term Memory based Recurrent Neural Networks(2015), Lifa Sun et al. [pdf]
Phonetic posteriorgrams for many-to-one voice conversion without parallel data training(2016), Lifa Sun et al. [pdf]
StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks(2018), Hirokazu Kameoka et al. [pdf]
AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss(2019), Kaizhi Qian et al. [pdf]
StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion(2019), Takuhiro Kaneko et al. [pdf]
Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion(2019), Andy T. Liu et al. [pdf]
Attention-Based Speaker Embeddings for One-Shot Voice Conversion(2020), Tatsuma Ishihara et al. [pdf]
F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder(2020), Kaizhi Qian et al. [pdf]
Recognition-Synthesis Based Non-Parallel Voice Conversion with Adversarial Learning(2020), Jing-Xuan Zhang et al. [pdf]
An Improved StarGAN for Emotional Voice Conversion: Enhancing Voice Quality and Data Augmentation(2021), Xiangheng He et al. [pdf]
crank: An Open-Source Software for Nonparallel Voice Conversion Based on Vector-Quantized Variational Autoencoder(2021), Kazuhiro Kobayashi et al. [pdf]
CVC: Contrastive Learning for Non-parallel Voice Conversion(2021), Tingle Li et al. [pdf]
NoiseVC: Towards High Quality Zero-Shot Voice Conversion(2021), Shijun Wang et al. [pdf]
On Prosody Modeling for ASR+TTS based Voice Conversion(2021), Wen-Chin Huang et al. [pdf]
StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion(2021), Yinghao Aaron Li et al. [pdf]
Zero-shot Voice Conversion via Self-supervised Prosody Representation Learning(2021), Shijun Wang et al. [pdf]

Speech Synthesis

Signal estimation from modified short-time Fourier transform(1993), Daniel W. Griffin et al. [pdf]
Text-to-speech synthesis(2009), Paul Taylor et al. [pdf]
A fast Griffin-Lim algorithm(2013), Nathanael Perraudin et al. [pdf]
TTS synthesis with bidirectional LSTM based recurrent neural networks(2014), Yuchen Fan et al. [pdf]
First Step Towards End-to-End Parametric TTS Synthesis: Generating Spectral Parameters with Neural Attention(2016), Wenfu Wang et al. [pdf]
Recent Advances in Google Real-Time HMM-Driven Unit Selection Synthesizer(2016), Xavi Gonzalvo et al. [pdf]
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model(2016), Soroush Mehri et al. [pdf]
WaveNet: A Generative Model for Raw Audio(2016), Aäron van den Oord et al. [pdf]
Char2Wav: End-to-end speech synthesis(2017), J Sotelo et al. [pdf]
Deep Voice: Real-time Neural Text-to-Speech(2017), Sercan O. Arik et al. [pdf]
Deep Voice 2: Multi-Speaker Neural Text-to-Speech(2017), Sercan Arik et al. [pdf]
Deep Voice 3: 2000-Speaker Neural Text-to-speech(2017), Wei Ping et al. [pdf]
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions(2017), Jonathan Shen et al. [pdf]
Parallel WaveNet: Fast High-Fidelity Speech Synthesis(2017), Aaron van den Oord et al. [pdf]
Statistical Parametric Speech Synthesis Using Generative Adversarial Networks Under A Multi-task Learning Framework(2017), S Yang et al. [pdf]
Tacotron: Towards End-to-End Speech Synthesis(2017), Yuxuan Wang et al. [pdf]
Uncovering Latent Style Factors for Expressive Speech Synthesis(2017), Yuxuan Wang et al. [pdf]
VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop(2017), Yaniv Taigman et al. [pdf]
ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech(2018), Wei Ping et al. [pdf]
Deep Feed-forward Sequential Memory Networks for Speech Synthesis(2018), Mengxiao Bi et al. [pdf]
LPCNet: Improving Neural Speech Synthesis Through Linear Prediction(2018), Jean-Marc Valin et al. [pdf]
Learning latent representations for style control and transfer in end-to-end speech synthesis(2018), Ya-Jie Zhang et al. [pdf]
Neural Voice Cloning with a Few Samples(2018), Sercan O. Arık et al. [pdf]
Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis(2018), Daisy Stanton et al. [pdf]
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis(2018), Y Wang et al. [pdf]
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron(2018), RJ Skerry-Ryan et al. [pdf]
DurIAN: Duration Informed Attention Network For Multimodal Synthesis(2019), Chengzhu Yu et al. [pdf]
Fast spectrogram inversion using multi-head convolutional neural networks(2019), SÖ Arık et al. [pdf]
FastSpeech: Fast, Robust and Controllable Text to Speech(2019), Yi Ren et al. [pdf]
Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning(2019), Yu Zhang et al. [pdf]
MelNet: A Generative Model for Audio in the Frequency Domain(2019), Sean Vasquez et al. [pdf]
Multi-Speaker End-to-End Speech Synthesis(2019), Jihyun Park et al. [pdf]
MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis(2019), Kundan Kumar et al. [pdf]
Neural Speech Synthesis with Transformer Network(2019), Naihan Li et al. [pdf]
Parallel Neural Text-to-Speech(2019), Kainan Peng et al. [pdf]
Pre-trained Text Representations for Improving Front-End Text Processing in Mandarin Text-to-Speech Synthesis(2019), Bing Yang et al. [pdf]
Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram(2019), Ryuichi Yamamoto et al. [pdf] It came out at the same time as MelGAN, yet neither paper refers to the other. Besides, I think the Gaussian noise input is unnecessary, since the mel-spectrogram already carries very strong information.
Problem-Agnostic Speech Embeddings for Multi-Speaker Text-to-Speech with SampleRNN(2019), David Alvarez et al. [pdf]
Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS(2019), Mutian He et al. [pdf]
Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models(2019), Wei Fang et al. [pdf]
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis(2019), Ye Jia et al. [pdf]
WaveFlow: A Compact Flow-based Model for Raw Audio(2019), Wei Ping et al. [pdf]
Waveglow: A flow-based generative network for speech synthesis(2019), R Prenger et al. [pdf]
AlignTTS: Efficient Feed-Forward Text-to-Speech System without Explicit Alignment(2020), Zhen Zeng et al. [pdf]
BOFFIN TTS: Few-Shot Speaker Adaptation by Bayesian Optimization(2020), Henry B. Moss et al. [pdf]
Bunched LPCNet: Vocoder for Low-cost Neural Text-To-Speech Systems(2020), Ravichander Vipperla et al. [pdf]
CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech(2020), Sri Karlapati et al. [pdf]
EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture(2020), Chenfeng Miao et al. [pdf]
End-to-End Adversarial Text-to-Speech(2020), Jeff Donahue et al. [pdf]
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech(2020), Yi Ren et al. [pdf]
Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis(2020), Rafael Valle et al. [pdf]
Flow-TTS: A Non-Autoregressive Network for Text to Speech Based on Flow(2020), Chenfeng Miao et al. [pdf]
Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis(2020), Guangzhi Sun et al. [pdf]
Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior(2020), Guangzhi Sun et al. [pdf]
Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search(2020), Jaehyeon Kim et al. [pdf]
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis(2020), Jungil Kong et al. [pdf]
Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis(2020), Eric Battenberg et al. [pdf]
MultiSpeech: Multi-Speaker Text to Speech with Transformer(2020), Mingjian Chen et al. [pdf]
Parallel Tacotron: Non-Autoregressive and Controllable TTS(2020), Isaac Elias et al. [pdf]
RobuTrans: A Robust Transformer-Based Text-to-Speech Model(2020), Naihan Li et al. [pdf]
Text-Independent Speaker Verification with Dual Attention Network(2020), Jingyu Li et al. [pdf]
WaveGrad: Estimating Gradients for Waveform Generation(2020), Nanxin Chen et al. [pdf]
AdaSpeech: Adaptive Text to Speech for Custom Voice(2021), Mingjian Chen et al. [pdf]
A Survey on Neural Speech Synthesis(2021), Xu Tan et al. [pdf]
A Streamwise GAN Vocoder for Wideband Speech Coding at Very Low Bit Rate(2021), Ahmed Mustafa et al. [pdf]
Controllable cross-speaker emotion transfer for end-to-end speech synthesis(2021), Tao Li et al. [pdf]
Cloning one’s voice using very limited data in the wild(2021), Dongyang Dai et al. [pdf]
Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech(2021), Jaehyeon Kim et al. [pdf]
DiffWave: A Versatile Diffusion Model for Audio Synthesis(2021), Zhifeng Kong et al. [pdf]
Diff-TTS: A Denoising Diffusion Model for Text-to-Speech(2021), Myeonghun Jeong et al. [pdf]
DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021(2021), Yanqing Liu et al. [pdf]
Fre-GAN: Adversarial Frequency-consistent Audio Synthesis(2021), Ji-Hoon Kim et al. [pdf]
Full-band LPCNet: A real-time neural vocoder for 48 kHz audio with a CPU(2021), Keisuke Matsubara et al. [pdf]
Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech(2021), Vadim Popov et al. [pdf]
Glow-WaveGAN: Learning Speech Representations from GAN-based Variational Auto-Encoder For High Fidelity Flow-based Speech Synthesis(2021), Jian Cong et al. [pdf]
High-fidelity and low-latency universal neural vocoder based on multiband WaveRNN with data-driven linear prediction for discrete waveform modeling(2021), Patrick Lumban Tobing et al. [pdf]
Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis(2021), Chung-Ming Chien et al. [pdf]
ItôTTS and ItôWave: Linear Stochastic Differential Equation Is All You Need For Audio Generation(2021), Shoule Wu et al. [pdf]
JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech(2021), Dan Lim et al. [pdf]
meta-voice: fast few-shot style transfer for expressive voice cloning using meta learning(2021), Songxiang Liu et al. [pdf]
Neural HMMs are all you need (for high-quality attention-free TTS)(2021), Shivam Mehta et al. [pdf]
Neural Pitch-Shifting and Time-Stretching with Controllable LPCNet(2021), Max Morrison et al. [pdf]
One TTS Alignment To Rule Them All(2021), Rohan Badlani et al. [pdf]
KaraTuner: Towards end to end natural pitch correction for singing voice in karaoke(2021), Xiaobin Zhuang et al. [pdf]
PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS(2021), Ye Jia et al. [pdf]
Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling(2021), Isaac Elias et al. [pdf]
PortaSpeech: Portable and High-Quality Generative Text-to-Speech(2021), Yi Ren et al. [pdf]
Transformer-based Acoustic Modeling for Streaming Speech Synthesis(2021), Chunyang Wu et al. [pdf]
Triple M: A Practical Neural Text-to-speech System With Multi-guidance Attention And Multi-band Multi-time Lpcnet(2021), Shilun Lin et al. [pdf]
TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction(2021), Stanislav Beliaev et al. [pdf] TalkNet 2 differs only slightly from TalkNet, so I don't include TalkNet here.
Towards Multi-Scale Style Control for Expressive Speech Synthesis(2021), Xiang Li et al. [pdf]
Unified Source-Filter GAN: Unified Source-filter Network Based On Factorization of Quasi-Periodic Parallel WaveGAN(2021), Reo Yoneyama et al. [pdf]
YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone(2021), Edresson Casanova et al. [pdf]
Avocodo: Generative Adversarial Network for Artifact-free Vocoder(2022), Taejun Bak et al. [pdf]
Adversarial Speaker-Consistency Learning Using Untranscribed Speech Data for Zero-Shot Multi-Speaker Text-to-Speech(2022), Byoung Jin Choi et al. [pdf]
Bunched LPCNet2: Efficient Neural Vocoders Covering Devices from Cloud to Edge(2022), Sangjun Park et al. [pdf]
Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation(2022), Ryo Terashima et al. [pdf]
FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis(2022), Rongjie Huang et al. [pdf]
Fast Grad-TTS: Towards Efficient Diffusion-Based Speech Generation on CPU(2022), Ivan Vovk et al. [pdf]
Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion(2022), Yi Lei et al. [pdf]
HiFi++: a Unified Framework for Neural Vocoding, Bandwidth Extension and Speech Enhancement(2022), Pavel Andreev et al. [pdf]
IQDUBBING: Prosody modeling based on discrete self-supervised speech representation for expressive voice conversion(2022), Wendong Gan et al. [pdf]
iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform(2022), Takuhiro Kaneko et al. [pdf]
Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform(2022), Masaya Kawamura et al. [pdf]
Neural Speech Synthesis on a Shoestring: Improving the Efficiency of LPCNet(2022), Jean-Marc Valin et al. [pdf]
NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis(2022), Hyeong-Seok Choi et al. [pdf]
PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior(2022), Sang-gil Lee et al. [pdf]
PromptTTS: Controllable Text-to-Speech with Text Descriptions(2022), Zhifang Guo et al. [pdf]
SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech(2022), Hyunjae Cho et al. [pdf]
STFT-Domain Neural Speech Enhancement with Very Low Algorithmic Latency(2022), Zhong-Qiu Wang et al. [pdf]
Simple and Effective Unsupervised Speech Synthesis(2022), Alexander H. Liu et al. [pdf]
SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping(2022), Yuma Koizumi et al. [pdf]
Source-Filter HiFi-GAN: Fast and Pitch Controllable High-Fidelity Neural Vocoder(2022), Reo Yoneyama et al. [pdf]
TriniTTS: Pitch-controllable End-to-end TTS without External Aligner(2022), Yoon-Cheol Ju et al. [pdf]
Zero-Shot Cross-Lingual Transfer Using Multi-Stream Encoder and Efficient Speaker Representation(2022), Yibin Zheng et al. [pdf]
InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt(2023), Dongchao Yang et al. [pdf]
Matcha-TTS: A fast TTS architecture with conditional flow matching(2023), Shivam Mehta et al. [pdf]
Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias(2023), Ziyue Jiang et al. [pdf]
Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts(2023), Ziyue Jiang et al. [pdf]

Language Modelling

Class-Based n-gram Models of Natural Language(1992), Peter F. Brown et al. [pdf]
An empirical study of smoothing techniques for language modeling(1996), Stanley F. Chen et al. [pdf]
A Neural Probabilistic Language Model(2000), Yoshua Bengio et al. [pdf]
A new statistical approach to Chinese Pinyin input(2000), Zheng Chen et al. [pdf]
Discriminative n-gram language modeling(2007), Brian Roark et al. [pdf]
Neural Network Language Model for Chinese Pinyin Input Method Engine(2015), S Chen et al. [pdf]
Efficient Training and Evaluation of Recurrent Neural Network Language Models for Automatic Speech Recognition(2016), Xie Chen et al. [pdf]
Exploring the limits of language modeling(2016), R Jozefowicz et al. [pdf]
On the State of the Art of Evaluation in Neural Language Models(2016), G Melis et al. [pdf]
Pay Less Attention with Lightweight and Dynamic Convolutions(2019), Felix Wu et al. [pdf]

Confidence Estimates

Estimating Confidence using Word Lattices(1997), T. Kemp et al. [pdf]
Large vocabulary decoding and confidence estimation using word posterior probabilities(2000), G. Evermann et al. [pdf]
Combining Information Sources for Confidence Estimation with CRF Models(2011), M. S. Seigel et al. [pdf]
Speaker-Adapted Confidence Measures for ASR using Deep Bidirectional Recurrent Neural Networks(2018), M. Á. Del-Agua et al. [pdf]
Bi-Directional Lattice Recurrent Neural Networks for Confidence Estimation(2018), Q. Li et al. [pdf]
Confidence Estimation for Black Box Automatic Speech Recognition Systems Using Lattice Recurrent Neural Networks(2020), A. Kastanos et al. [pdf]
Confidence Estimation for Attention-based Sequence-to-sequence Models for Speech Recognition(2020), Qiujia Li et al. [pdf]
Residual Energy-Based Models for End-to-End Speech Recognition(2021), Qiujia Li et al. [pdf]
Multi-Task Learning for End-to-End ASR Word and Utterance Confidence with Deletion Prediction(2021), David Qiu et al. [pdf]

Music Modelling

Onsets and Frames: Dual-Objective Piano Transcription(2017), Curtis Hawthorne et al. [pdf]
Unsupervised Singing Voice Conversion(2019), Eliya Nachmani et al. [pdf]
ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders(2020), Yu Gu et al. [pdf]
DurIAN-SC: Duration Informed Attention Network based Singing Voice Conversion System(2020), Liqiang Zhang et al. [pdf]
HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis(2020), Jiawei Chen et al. [pdf]
Jukebox: A Generative Model for Music(2020), Prafulla Dhariwal et al. [pdf]
DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism(2021), Jinglin Liu et al. [pdf]
MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis(2021), Jaesung Tae et al. [pdf]
Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large-Scale Corpus(2021), Rongjie Huang et al. [pdf]
MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training(2021), Mingliang Zeng et al. [pdf]
N-Singer: A Non-Autoregressive Korean Singing Voice Synthesis System for Pronunciation Enhancement(2021), Gyeong-Hoon Lee et al. [pdf]
Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech(2021), Raahil Shah et al. [pdf]
PeriodNet: A non-autoregressive waveform generation model with a structure separating periodic and aperiodic components(2021), Yukiya Hono et al. [pdf]
Sequence-to-Sequence Piano Transcription with Transformers(2021), Curtis Hawthorne et al. [pdf]
M4Singer: a Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus(2022), Lichao Zhang et al. [pdf]
Opencpop: A High-Quality Open Source Chinese Popular Song Corpus for Singing Voice Synthesis(2022), Yu Wang et al. [pdf]
WeSinger: Data-augmented Singing Voice Synthesis with Auxiliary Losses(2022), Zewang Zhang et al. [pdf]
WeSinger 2: Fully Parallel Singing Voice Synthesis via Multi-Singer Conditional Adversarial Training(2022), Zewang Zhang et al. [pdf]

Interesting papers

The Reversible Residual Network: Backpropagation Without Storing Activations(2017), Aidan N. Gomez et al. [pdf]
Soft-DTW: a Differentiable Loss Function for Time-Series(2018), Marco Cuturi et al. [pdf]
FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow(2019), Xuezhe Ma et al. [pdf]
Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks(2019), Santiago Pascual et al. [pdf]
Self-supervised audio representation learning for mobile devices(2019), Marco Tagliasacchi et al. [pdf]
SinGAN: Learning a Generative Model from a Single Natural Image(2019), Tamar Rott Shaham et al. [pdf]
Audio2Face: Generating Speech/Face Animation from Single Audio with Attention-Based Bidirectional LSTM Networks(2019), Guanzhong Tian et al. [pdf]
Attention is Not Only a Weight: Analyzing Transformers with Vector Norms(2020), Goro Kobayashi et al. [pdf]