# Audio AI Timeline

Here we keep track of the latest AI models for waveform-based audio generation, starting in 2023!

## 2023

| Date  | Release [Samples] | Paper | Code | Trained Model |
|-------|-------------------|-------|------|---------------|
| 14.11 | Mustango: Toward Controllable Text-to-Music Generation | arXiv | GitHub | Hugging Face |
| 13.11 | Music ControlNet: Multiple Time-varying Controls for Music Generation | arXiv | - | - |
| 02.11 | E3 TTS: Easy End-to-End Diffusion-based Text to Speech | arXiv | - | - |
| 01.10 | UniAudio: An Audio Foundation Model Toward Universal Audio Generation | arXiv | GitHub | - |
| 24.09 | VoiceLDM: Text-to-Speech with Environmental Context | arXiv | GitHub | - |
| 05.09 | PromptTTS 2: Describing and Generating Voices with Text Prompt | arXiv | - | - |
| 14.08 | SpeechX: Neural Codec Language Model as a Versatile Speech Transformer | arXiv | - | - |
| 10.08 | AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining | arXiv | GitHub | Hugging Face |
| 09.08 | JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models | arXiv | - | - |
| 03.08 | MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies | arXiv | GitHub | - |
| 14.07 | Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts | arXiv | - | - |
| 10.07 | VampNet: Music Generation via Masked Acoustic Token Modeling | arXiv | GitHub | - |
| 22.06 | AudioPaLM: A Large Language Model That Can Speak and Listen | arXiv | - | - |
| 19.06 | Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale | PDF | GitHub | - |
| 08.06 | MusicGen: Simple and Controllable Music Generation | arXiv | GitHub | Hugging Face, Colab |
| 06.06 | Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias | arXiv | - | - |
| 01.06 | Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis | arXiv | GitHub | - |
| 29.05 | Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation | arXiv | - | - |
| 25.05 | MeLoDy: Efficient Neural Music Generation | arXiv | - | - |
| 18.05 | CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training | arXiv | - | - |
| 18.05 | SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities | arXiv | GitHub | - |
| 16.05 | SoundStorm: Efficient Parallel Audio Generation | arXiv | GitHub (unofficial) | - |
| 03.05 | Diverse and Vivid Sound Generation from Text Descriptions | arXiv | - | - |
| 02.05 | Long-Term Rhythmic Video Soundtracker | arXiv | GitHub | - |
| 24.04 | TANGO: Text-to-Audio generation using instruction tuned LLM and Latent Diffusion Model | PDF | GitHub | Hugging Face |
| 18.04 | NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers | arXiv | GitHub (unofficial) | - |
| 10.04 | Bark: Text-Prompted Generative Audio Model | - | GitHub | Hugging Face, Colab |
| 03.04 | AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models | arXiv | - | - |
| 08.03 | VALL-E X: Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling | arXiv | - | - |
| 27.02 | I Hear Your True Colors: Image Guided Audio Generation | arXiv | GitHub | - |
| 08.02 | Noise2Music: Text-conditioned Music Generation with Diffusion Models | arXiv | - | - |
| 04.02 | Multi-Source Diffusion Models for Simultaneous Music Generation and Separation | arXiv | GitHub | - |
| 30.01 | SingSong: Generating musical accompaniments from singing | arXiv | - | - |
| 30.01 | AudioLDM: Text-to-Audio Generation with Latent Diffusion Models | arXiv | GitHub | Hugging Face |
| 30.01 | Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion | arXiv | GitHub | - |
| 29.01 | Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models | PDF | - | - |
| 28.01 | Noise2Music | - | - | - |
| 27.01 | RAVE2 [Samples RAVE1] | arXiv | GitHub | - |
| 26.01 | MusicLM: Generating Music From Text | arXiv | GitHub (unofficial) | - |
| 18.01 | Msanii: High Fidelity Music Synthesis on a Shoestring Budget | arXiv | GitHub | Hugging Face, Colab |
| 16.01 | ArchiSound: Audio Generation with Diffusion | arXiv | GitHub | - |
| 05.01 | VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers | arXiv | GitHub (unofficial) (demo) | - |