14.11 | Mustango: Toward Controllable Text-to-Music Generation | arXiv | GitHub | Hugging Face |
13.11 | Music ControlNet: Multiple Time-varying Controls for Music Generation | arXiv | - | - |
02.11 | E3 TTS: Easy End-to-End Diffusion-based Text to Speech | arXiv | - | - |
01.10 | UniAudio: An Audio Foundation Model Toward Universal Audio Generation | arXiv | GitHub | - |
24.09 | VoiceLDM: Text-to-Speech with Environmental Context | arXiv | GitHub | - |
05.09 | PromptTTS 2: Describing and Generating Voices with Text Prompt | arXiv | - | - |
14.08 | SpeechX: Neural Codec Language Model as a Versatile Speech Transformer | arXiv | - | - |
10.08 | AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining | arXiv | GitHub | Hugging Face |
09.08 | JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models | arXiv | - | - |
03.08 | MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies | arXiv | GitHub | - |
14.07 | Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts | arXiv | - | - |
10.07 | VampNet: Music Generation via Masked Acoustic Token Modeling | arXiv | GitHub | - |
22.06 | AudioPaLM: A Large Language Model That Can Speak and Listen | arXiv | - | - |
19.06 | Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale | PDF | GitHub | - |
08.06 | MusicGen: Simple and Controllable Music Generation | arXiv | GitHub | Hugging Face, Colab |
06.06 | Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias | arXiv | - | - |
01.06 | Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis | arXiv | GitHub | - |
29.05 | Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation | arXiv | - | - |
25.05 | MeLoDy: Efficient Neural Music Generation | arXiv | - | - |
18.05 | CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training | arXiv | - | - |
18.05 | SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities | arXiv | GitHub | - |
16.05 | SoundStorm: Efficient Parallel Audio Generation | arXiv | GitHub (unofficial) | - |
03.05 | Diverse and Vivid Sound Generation from Text Descriptions | arXiv | - | - |
02.05 | Long-Term Rhythmic Video Soundtracker | arXiv | GitHub | - |
24.04 | TANGO: Text-to-Audio generation using instruction tuned LLM and Latent Diffusion Model | PDF | GitHub | Hugging Face |
18.04 | NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers | arXiv | GitHub (unofficial) | - |
10.04 | Bark: Text-Prompted Generative Audio Model | - | GitHub | Hugging Face, Colab |
03.04 | AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models | arXiv | - | - |
08.03 | VALL-E X: Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling | arXiv | - | - |
27.02 | I Hear Your True Colors: Image Guided Audio Generation | arXiv | GitHub | - |
08.02 | Noise2Music: Text-conditioned Music Generation with Diffusion Models | arXiv | - | - |
04.02 | Multi-Source Diffusion Models for Simultaneous Music Generation and Separation | arXiv | GitHub | - |
30.01 | SingSong: Generating musical accompaniments from singing | arXiv | - | - |
30.01 | AudioLDM: Text-to-Audio Generation with Latent Diffusion Models | arXiv | GitHub | Hugging Face |
30.01 | Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion | arXiv | GitHub | - |
29.01 | Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models | PDF | - | - |
28.01 | Noise2Music | - | - | - |
27.01 | RAVE2 [Samples RAVE1] | arXiv | GitHub | - |
26.01 | MusicLM: Generating Music From Text | arXiv | GitHub (unofficial) | - |
18.01 | Msanii: High Fidelity Music Synthesis on a Shoestring Budget | arXiv | GitHub | Hugging Face, Colab |
16.01 | ArchiSound: Audio Generation with Diffusion | arXiv | GitHub | - |
05.01 | VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers | arXiv | GitHub (unofficial) (demo) | - |