Awesome End-to-End Speech Translation Progress
Tutorial
- EACL 2021 tutorial: Speech Translation
- Blog: Getting Started with End-to-End Speech Translation
- ACL 2020 Theme paper: Speech Translation and the End-to-End Promise: Taking Stock of Where We Are
- INTERSPEECH 2019 survey talk: Spoken Language Translation
Data
Corpus | Direction | Target Modality | Duration | License |
---|---|---|---|---|
CoVoST 2 | {Fr, De, Es, Ca, It, Ru, Zh, Pt, Fa, Et, Mn, Nl, Tr, Ar, Sv, Lv, Sl, Ta, Ja, Id, Cy} -> En and En -> {De, Ca, Zh, Fa, Et, Mn, Tr, Ar, Sv, Lv, Sl, Ta, Ja, Id, Cy} | Text | 2880h | CC0 |
CVSS | {Fr, De, Es, Ca, It, Ru, Zh, Pt, Fa, Et, Mn, Nl, Tr, Ar, Sv, Lv, Sl, Ta, Ja, Id, Cy} -> En | Text & Speech | 1900h | CC BY 4.0 |
mTEDx | {Es, Fr, Pt, It, Ru, El} -> En, {Fr, Pt, It} -> Es, Es -> {Fr, It}, {Es, Fr} -> Pt | Text | 765h | CC BY-NC-ND 4.0 |
CoVoST | {Fr, De, Nl, Ru, Es, It, Tr, Fa, Sv, Mn, Zh} -> En | Text | 700h | CC0 |
MuST-C & MuST-Cinema | En -> {De, Es, Fr, It, Nl, Pt, Ro, Ru, Ar, Cs, Fa, Tr, Vi, Zh} | Text | 504h | CC BY-NC-ND 4.0 |
How2 | En -> Pt | Text | 300h | YouTube & CC BY-SA 4.0 |
Europarl-ST | {En, Fr, De, Es, It, Pt, Pl, Ro, Nl} -> {En, Fr, De, Es, It, Pt, Pl, Ro, Nl} | Text | 280h | CC BY-NC 4.0 |
Augmented LibriSpeech | En -> Fr | Text | 236h | CC BY 4.0 |
Kosp2e | Ko -> En | Text | 198h | Mixed CC |
Fisher + Callhome | Es -> En | Text | 160h+20h | LDC |
MaSS | parallel among En, Es, Eu, Fi, Fr, Hu, Ro and Ru | Text & Speech | 172h | Bible.is |
LibriVoxDeEn | De -> En | Text | 110h | CC BY-NC-SA 4.0 |
Prabhupadavani | parallel among En, Fr, De, Gu, Hi, Hu, Id, It, Lv, Lt, Ne, Fa, Pl, Pt, Ru, Sl, Sk, Es, Se, Ta, Te, Tr, Bg, Hr, Da and Nl | Text | 94h | |
BSTC | Zh -> En | Text | 68h | |
LibriS2S | De <-> En | Text & Speech | 52h/57h | CC BY-NC-SA 4.0 |
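Several of the corpora above are also available programmatically. Below is a minimal sketch of loading a CoVoST 2 split via the Hugging Face `datasets` library; the `facebook/covost2` loader name, the `fr_en` config, the field names, and the requirement of a locally extracted Common Voice archive passed as `data_dir` are assumptions based on that loader's documentation (script-based loaders also require an older `datasets` release), not part of this list.

```python
# Minimal sketch (not from this list): loading a CoVoST 2 split with the
# Hugging Face `datasets` library. Assumes the `facebook/covost2` script
# loader, a `datasets` release that still supports dataset scripts, and a
# locally extracted Common Voice archive for the source language.
from datasets import load_dataset

covost2 = load_dataset(
    "facebook/covost2",
    "fr_en",                              # French speech -> English text
    data_dir="/path/to/common_voice/fr",  # hypothetical local path
)

sample = covost2["train"][0]
print(sample["sentence"])     # source-language (French) transcript
print(sample["translation"])  # target-language (English) translation
audio = sample["audio"]       # decoded waveform plus sampling rate
print(audio["sampling_rate"], len(audio["array"]))
```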
Toolkit
- ESPnet-ST: All-in-One Speech Translation Toolkit (see the ACL 2020 demo paper below)
- fairseq S2T: Fast Speech-to-Text Modeling with fairseq (see the AACL 2020 demo paper below)
- NeurST: Neural Speech Translation Toolkit (see the ACL 2021 demo paper below)
Paper
2023
- [arXiv] Tuning Large language model for End-to-end Speech Translation
- [arXiv] Improving Speech Translation by Cross-Modal Multi-Grained Contrastive Learning
- [arXiv] Multilingual Speech-to-Speech Translation into Multiple Target Languages
- [ICCV] MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition
- [INTERSPEECH] MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation
- [INTERSPEECH] Modular Speech-to-Text Translation for Zero-Shot Cross-Modal Transfer
- [INTERSPEECH] Joint Speech Translation and Named Entity Recognition
- [INTERSPEECH] StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation
- [INTERSPEECH] Knowledge Distillation on Joint Task End-to-End Speech Translation
- [INTERSPEECH] GigaST: A 10,000-hour Pseudo Speech Translation Corpus
- [INTERSPEECH] Inter-connection: Effective Connection between Pre-trained Encoder and Decoder for Speech Translation
- [INTERSPEECH] AlignAtt: Using Attention-based Audio-Translation Alignments as a Guide for Simultaneous Speech Translation
- [INTERSPEECH] Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff
- [INTERSPEECH] HK-LegiCoST: Leveraging Non-Verbatim Transcripts for Speech Translation
- [INTERSPEECH] Learning When to Speak: Latency and Quality Trade-offs for Simultaneous Speech-to-Speech Translation with Offline Models
- [ICML] Pre-training for Speech Translation: CTC Meets Optimal Transport
- [ACL] UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units
- [ACL] Simple and Effective Unsupervised Speech Translation
- [ACL] BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric
- [ACL] SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations
- [ACL] Understanding and Bridging the Modality Gap for Speech Translation
- [ACL] Back Translation for Speech-to-text Translation Without Transcripts
- [ACL] AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation
- [ACL] WACO: Word-Aligned Contrastive Learning for Speech Translation
- [ACL] Attention as a guide for Simultaneous Speech Translation
- [ACL Findings] Speech-to-Speech Translation for a Real-world Unwritten Language
- [ACL Findings] CKDST: Comprehensively and Effectively Distill Knowledge from Machine Translation to End-to-End Speech Translation
- [ACL Findings] Duplex Diffusion Models Improve Speech-to-Speech Translation
- [ACL Findings] DUB: Discrete Unit Back-translation for Speech Translation
- [ACL Findings] End-to-End Simultaneous Speech Translation with Differentiable Segmentation
- [ACL Findings] Implicit Memory Transformer for Computationally Efficient Simultaneous Speech Translation
- [ACL Findings] Joint Speech Transcription and Translation: Pseudo-Labeling with Out-of-Distribution Data
- [ICASSP] Textless Direct Speech-to-Speech Translation with Discrete Speech Representation
- [ICASSP] M3ST: Mix at Three Levels for Speech Translation
- [EACL Findings] Generating Synthetic Speech from SpokenVocab for Speech Translation
- [AAAI] Improving End-to-end Speech Translation by Leveraging Auxiliary Speech and Text Data
2022
- [arXiv] AdaTranS: Adapting with Boundary-based Shrinking for End-to-End Speech Translation
- [arXiv] Direct Speech-to-speech Translation without Textual Annotation using Bottleneck Features
- [arXiv] ArzEn-ST: A Three-way Speech Translation Corpus for Code-Switched Egyptian Arabic - English
- [arXiv] Multilingual Simultaneous Speech Translation
- [arXiv] Prabhupadavani: A Code-mixed Speech Translation Data for 25 Languages
- [EMNLP Findings] Does Simultaneous Speech Translation need Simultaneous Models?
- [EMNLP Findings] RedApt: An Adaptor for wav2vec 2 Encoding Faster and Smaller Speech Translation without Quality Compromise
- [INTERSPEECH] Blockwise Streaming Transformer for Spoken Language Understanding and Simultaneous Speech Translation
- [INTERSPEECH] Combining Spectral and Self-Supervised Features for Low Resource Speech Recognition and Translation
- [INTERSPEECH] Large-Scale Streaming End-to-End Speech Translation with Neural Transducers
- [INTERSPEECH] Speech Segmentation Optimization using Segmented Bilingual Speech Corpus for End-to-end Speech Translation
- [INTERSPEECH] Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation
- [INTERSPEECH] SHAS: Approaching optimal Segmentation for End-to-End Speech Translation
- [INTERSPEECH] Exploring Continuous Integrate-and-Fire for Adaptive Simultaneous Speech Translation
- [INTERSPEECH] M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation
- [NAACL] Textless Speech-to-Speech Translation on Real Data
- [ICML] Revisiting End-to-End Speech-to-Text Translation From Scratch
- [ICML] Translatotron 2: Robust direct speech-to-speech translation
- [ACL] Learning When to Translate for Streaming Speech
- [ACL] Sample, Translate, Recombine: Leveraging Audio Alignments for Data Augmentation in End-to-end Speech Translation
- [ACL] UniST: Unified End-to-end Model for Streaming and Non-streaming Speech Translation
- [ACL] Direct speech-to-speech translation with discrete units
- [ACL] STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation
- [ACL Findings] End-to-End Speech Translation for Code Switched Speech
- [LREC] CVSS Corpus and Massively Multilingual Speech-to-Speech Translation
- [LREC] LibriS2S: A German-English Speech-to-Speech Translation Corpus
- [ICASSP] Tackling data scarcity in speech translation using zero-shot multilingual machine translation techniques
- [Neural Networks] Improving data augmentation for low resource speech-to-text translation with diverse paraphrasing
- [AAAI] Regularizing End-to-End Speech Translation with Triangular Decomposition Agreement
2021
- [arXiv] Efficient Transformer for Direct Speech Translation
- [arXiv] Zero-shot Speech Translation
- [arXiv] Direct Simultaneous Speech-to-Speech Translation with Variational Monotonic Multihead Attention
- [ASRU] Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates
- [ASRU] Assessing Evaluation Metrics for Speech-to-Speech Translation
- [ASRU] Enabling Zero-shot Multilingual Spoken Language Translation with Language-Specific Encoders and Decoders
- [ICNLSP] Beyond Voice Activity Detection: Hybrid Audio Segmentation for Direct Speech Translation
- [INTERSPEECH] Impact of Encoding and Segmentation Strategies on End-to-End Simultaneous Speech Translation
- [EMNLP] Speechformer: Reducing Information Loss in Direct Speech Translation
- [EMNLP] Is "moby dick" a Whale or a Bird? Named Entities and Terminology in Speech Translation
- [EMNLP] Mutual-Learning Improves End-to-End Speech Translation
- [INTERSPEECH] End-to-end Speech Translation via Cross-modal Progressive Training
- [INTERSPEECH] CoVoST 2 and Massively Multilingual Speech-to-Text Translation
- [INTERSPEECH] The Multilingual TEDx Corpus for Speech Recognition and Translation
- [INTERSPEECH] Large-Scale Self- and Semi-Supervised Learning for Speech Translation
- [INTERSPEECH] Kosp2e: Korean Speech to English Translation Corpus
- [INTERSPEECH] AlloST: Low-resource Speech Translation without Source Transcription
- [INTERSPEECH] SpecRec: An Alternative Solution for Improving End-to-End Speech-to-Text Translation via Spectrogram Reconstruction
- [INTERSPEECH] Optimally Encoding Inductive Biases into the Transformer Improves End-to-End Speech Translation
- [INTERSPEECH] ASR Posterior-based Loss for Multi-task End-to-end Speech Translation
- [AMTA] Simultaneous Speech Translation for Live Subtitling: from Delay to Display
- [ACL] Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders
- [ACL] Multilingual Speech Translation with Efficient Finetuning of Pretrained Models
- [ACL] Lightweight Adapter Tuning for Multilingual Speech Translation
- [ACL] Cascade versus Direct Speech Translation: Do the Differences Still Make a Difference?
- [ACL] Improving Speech Translation by Understanding and Learning from the Auxiliary Text Translation Task
- [ACL] Beyond Sentence-Level End-to-End Speech Translation: Context Helps
- [ACL Findings] Direct Simultaneous Speech-to-Text Translation Assisted by Synchronized Streaming ASR
- [ACL Findings] AdaST: Dynamically Adapting Encoder States in the Decoder for End-to-End Speech-to-Text Translation
- [ACL Findings] RealTranS: End-to-End Simultaneous Speech Translation with Convolutional Weighted-Shrinking Transformer
- [ACL Findings] Learning Shared Semantic Space for Speech-to-Text Translation
- [ACL Findings] Investigating the Reordering Capability in CTC-based Non-Autoregressive End-to-End Speech Translation
- [ACL Findings] How to Split: the Effect of Word Segmentation on Gender Bias in Speech Translation
- [ACL Demo] NeurST: Neural Speech Translation Toolkit
- [ICML] Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation
- [NAACL] Source and Target Bidirectional Knowledge Distillation for End-to-end Speech Translation
- [NAACL] Searchable Hidden Intermediates for End-to-End Models of Decomposable Sequence Tasks
- [NAACL AutoSimTrans] BSTC: A Large-Scale Chinese-English Speech Translation Dataset
- [AmericasNLP] Highland Puebla Nahuatl–Spanish Speech Translation Corpus for Endangered Language Documentation
- [ICASSP] Task Aware Multi-Task Learning for Speech to Text Tasks
- [ICASSP] A General Multi-Task Learning Framework to Leverage Text Data for Speech to Text Tasks
- [ICASSP] An Empirical Study of End-to-end Simultaneous Speech Translation Decoding Strategies
- [ICASSP] Streaming Simultaneous Speech Translation with Augmented Memory Transformer
- [ICASSP] Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder
- [ICASSP] Cascaded Models With Cyclic Feedback For Direct Speech Translation
- [ICASSP] Jointly Trained Transformers models for Spoken Language Translation
- [ICASSP] Efficient Use of End-to-end Data in Spoken Language Processing
- [EACL] CTC-based Compression for Direct Speech Translation
- [EACL] Streaming Models for Joint Speech Recognition and Translation
- [IberSPEECH] mintzai-ST: Corpus and Baselines for Basque-Spanish Speech Translation
- [AAAI] Consecutive Decoding for Speech-to-text Translation
- [AAAI] UWSpeech: Speech to Speech Translation for Unwritten Languages
- [AAAI] "Listen, Understand and Translate": Triple Supervision Decouples End-to-end Speech-to-text Translation
- [SLT] Tight Integrated End-to-End Training for Cascaded Speech Translation
- [SLT] Transformer-based Direct Speech-to-speech Translation with Transcoder
2020
- [arXiv] Bridging the Modality Gap for Speech-to-Text Translation
- [arXiv] CSTNet: Contrastive Speech Translation Network for Self-Supervised Speech Representation Learning
- [CLiC-IT] On Knowledge Distillation for Direct Speech Translation
- [COLING] Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation
- [COLING] Breeding Gender-aware Direct Speech Translation Systems
- [AACL] SimulMT to SimulST: Adapting Simultaneous Text Translation to End-to-End Simultaneous Speech Translation
- [AACL Demo] fairseq S2T: Fast Speech-to-Text Modeling with fairseq
- [EMNLP] Effectively pretraining a speech translation decoder with Machine Translation data
- [EMNLP Findings] Adaptive Feature Selection for End-to-End Speech Translation
- [AMTA] On Target Segmentation for Direct Speech Translation
- [INTERSPEECH] Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection
- [INTERSPEECH] Relative Positional Encoding for Speech Recognition and Direct Translation
- [INTERSPEECH] Contextualized Translation of Automatically Segmented Speech
- [INTERSPEECH] Self-Training for End-to-End Speech Translation
- [INTERSPEECH] Improving Cross-Lingual Transfer Learning for End-to-End Speech Recognition with Speech Translation
- [INTERSPEECH] Self-Supervised Representations Improve End-to-End Speech Translation
- [INTERSPEECH] Investigating Self-Supervised Pre-Training for End-to-End Speech Translation
- [TACL] Consistent Transcription and Translation of Speech
- [ACL] Worse WER, but Better BLEU? Leveraging Word Embedding as Intermediate in Multitask End-to-End Speech Translation
- [ACL] Phone Features Improve Speech Translation
- [ACL] Curriculum Pre-training for End-to-End Speech Translation
- [ACL] SimulSpeech: End-to-End Simultaneous Speech to Text Translation
- [ACL] Gender in Danger? Evaluating Speech Translation Technology on the MuST-SHE Corpus
- [ACL Theme] Speech Translation and the End-to-End Promise: Taking Stock of Where We Are
- [ACL Demo] ESPnet-ST: All-in-One Speech Translation Toolkit
- [LREC] CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus
- [LREC] MuST-Cinema: a Speech-to-Subtitles corpus
- [LREC] MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible
- [LREC] LibriVoxDeEn: A Corpus for German-to-English Speech Translation and Speech Recognition
- [ICASSP] Europarl-ST: A Multilingual Corpus For Speech Translation Of Parliamentary Debates
- [ICASSP] Instance-Based Model Adaptation For Direct Speech Translation
- [ICASSP] Data Efficient Direct Speech-to-Text Translation with Modality Agnostic Meta-Learning
- [ICASSP] Analyzing ASR pretraining for low-resource speech-to-text translation
- [ICASSP] End-to-End Speech Translation with Self-Contained Vocabulary Manipulation
- [AAAI] Bridging the Gap between Pre-Training and Fine-Tuning for End-to-End Speech Translation
- [AAAI] Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding
2019
- [ASRU] One-To-Many Multilingual End-to-end Speech Translation
- [ASRU] Multilingual End-to-End Speech Translation
- [ASRU] Speech-to-speech Translation between Untranscribed Unknown Languages
- [ASRU] A Comparative Study on End-to-end Speech to Text Translation
- [IWSLT] Harnessing Indirect Training Data for End-to-End Automatic Speech Translation: Tricks of the Trade
- [IWSLT] On Using SpecAugment for End-to-End Speech Translation
- [INTERSPEECH] End-to-End Speech Translation with Knowledge Distillation
- [INTERSPEECH] Adapting Transformer to End-to-end Spoken Language Translation
- [INTERSPEECH] Direct speech-to-speech translation with a sequence-to-sequence model
- [ACL] Exploring Phoneme-Level Speech Representations for End-to-End Speech Translation
- [ACL] Attention-Passing Models for Robust and Data-Efficient End-to-End Speech Translation
- [NAACL] Pre-training on High-Resource Speech Recognition Improves Low-Resource Speech-to-Text Translation
- [NAACL] MuST-C: a Multilingual Speech Translation Corpus
- [NAACL] Fluent Translations from Disfluent Speech in End-to-End Speech Translation
- [ICASSP] Leveraging Weakly Supervised Data to Improve End-to-End Speech-to-Text Translation
- [ICASSP] Towards unsupervised speech-to-text translation
- [ICASSP] Towards End-to-end Speech-to-text Translation with Two-pass Decoding
2018
- [NeurIPS] How2: A Large-scale Dataset for Multimodal Language Understanding
- [IberSPEECH] End-to-End Speech Translation with the Transformer
- [INTERSPEECH] Low-Resource Speech-to-Text Translation
- [LREC] Augmenting Librispeech with French Translations: A Multimodal Corpus for Direct Speech Translation Evaluation
- [NAACL] Tied multitask learning for neural speech translation
- [ICASSP] End-to-End Automatic Speech Translation of Audiobooks
2017
- [INTERSPEECH] Sequence-to-Sequence Models Can Directly Translate Foreign Speech
- [INTERSPEECH] Structured-based Curriculum Learning for End-to-end English-Japanese Speech Translation
- [EACL] Towards speech-to-text translation without speech recognition
2016
- [NIPS Workshop] Listen and translate: A proof of concept for end-to-end speech-to-text translation
- [NAACL] An Attentional Model for Speech Translation Without Transcription
Contact
Changhan Wang (wangchanghan@gmail.com)