[ACM-MM 2024] AudioLCM: Text-to-Audio Generation with Latent Consistency Models

Huadai Liu, Rongjie Huang, Yang Liu, Hengyuan Cao, Jialei Wang, Xize Cheng, Siqi Zheng, Zhou Zhao

PyTorch implementation of AudioLCM (ACM-MM'24): efficient, high-quality text-to-audio generation with a latent consistency model.

We provide our implementation and pretrained models as open-source in this repository.

Visit our demo page for audio samples.

You can also try AudioLCM directly in the Hugging Face Space.

News

Quick Start

We provide an example of how you can generate high-fidelity samples quickly using AudioLCM.

Download the AudioLCM model and generate audio from a text prompt:

from pythonscripts.InferAPI import AudioLCMInfer

prompt="Constant rattling noise and sharp vibrations"
config_path="./audiolcm.yaml"
model_path="./audiolcm.ckpt"
vocoder_path="./model/vocoder"
audio_path = AudioLCMInfer(prompt, config_path=config_path, model_path=model_path, vocoder_path=vocoder_path)

Use the AudioLCMBatchInfer function to generate multiple audio samples for a batch of text prompts:

from pythonscripts.InferAPI import AudioLCMBatchInfer

prompts = [
    "Constant rattling noise and sharp vibrations",
    "A rocket flies by followed by a loud explosion and fire crackling as a truck engine runs idle",
    "Humming and vibrating with a man and children speaking and laughing",
]
config_path = "./audiolcm.yaml"
model_path = "./audiolcm.ckpt"
vocoder_path = "./model/vocoder"
audio_path = AudioLCMBatchInfer(prompts, config_path=config_path, model_path=model_path, vocoder_path=vocoder_path)

To try it on your own dataset, clone this repo to a local machine with an NVIDIA GPU (CUDA + cuDNN) and follow the instructions below.
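Before running anything heavier, a quick environment check can save time. This snippet is just a convenience (not part of the repository) and only assumes PyTorch is installed:

```python
# Sanity-check the local environment before running AudioLCM
# (convenience snippet, not part of the repository).
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU:             {torch.cuda.get_device_name(0)}")
```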

Pretrained Models

Download the pretrained weights from Hugging Face.

<!-- Download bert-base-uncased weights from [Hugging Face](https://huggingface.co/google-bert/bert-base-uncased). Down load t5-v1_1-large weights from [Hugging Face](https://huggingface.co/google/t5-v1_1-large). Download CLAP weights from [Hugging Face](https://huggingface.co/microsoft/msclap/blob/main/CLAP_weights_2022.pth). -->
Download:
- audiolcm.ckpt and put it into ./ckpts
- the BigVGAN vocoder and put it into ./vocoder/logs/bigvnat16k93.5w
- t5-v1_1-large and put it into ./ldm/modules/encoders/CLAP
- bert-base-uncased and put it into ./ldm/modules/encoders/CLAP
- CLAP_weights_2022.pth and put it into ./wav_evaluation/useful_ckpts/CLAP
<!-- The directory structure should be: ``` useful_ckpts/ ├── bigvgan │ ├── args.yml │ └── best_netG.pt ├── CLAP │ ├── config.yml │ └── CLAP_weights_2022.pth └── maa1_full.ckpt ``` -->
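If you prefer scripting the download, a sketch with huggingface_hub might look like the following. The repo id below is a placeholder (use the actual AudioLCM weights repository on Hugging Face), and only the audiolcm.ckpt step is shown; the other files follow the same pattern with their target directories from the list above:

```python
# Hypothetical sketch: fetch a checkpoint with huggingface_hub and place it
# under the directory layout listed above.
import os
import shutil

from huggingface_hub import hf_hub_download

repo_id = "<huggingface-user>/<audiolcm-repo>"  # placeholder, not a real repo id
local_ckpt = hf_hub_download(repo_id=repo_id, filename="audiolcm.ckpt")

os.makedirs("./ckpts", exist_ok=True)
shutil.copy(local_ckpt, "./ckpts/audiolcm.ckpt")
```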

Dependencies

See requirements in requirement.txt and install them, e.g. with pip install -r requirement.txt.

Inference with a pre-trained model

python scripts/txt2audio_for_lcm.py --ddim_steps 2 -b configs/audiolcm.yaml --sample_rate 16000 --vocoder-ckpt vocoder/logs/bigvnat16k93.5w --outdir results --test-dataset audiocaps -r ckpt/audiolcm.ckpt
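If you would rather launch inference from Python (for example inside a larger pipeline), the same command can be wrapped with subprocess; the flags below are copied verbatim from the command above:

```python
# Thin Python wrapper around the inference CLI above
# (same flags, just assembled as a list for subprocess).
import subprocess

cmd = [
    "python", "scripts/txt2audio_for_lcm.py",
    "--ddim_steps", "2",
    "-b", "configs/audiolcm.yaml",
    "--sample_rate", "16000",
    "--vocoder-ckpt", "vocoder/logs/bigvnat16k93.5w",
    "--outdir", "results",
    "--test-dataset", "audiocaps",
    "-r", "ckpt/audiolcm.ckpt",
]
subprocess.run(cmd, check=True)
```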

Dataset preparation

Generate the mel-spectrogram files of the audio

Assume you already have a tsv file linking each caption to its audio_path, i.e. the tsv file has "name", "audio_path", "dataset", and "caption" columns. To compute the mel-spectrogram of each audio clip, run the following command, which saves the mels in ./processed:

python ldm/data/preprocess/mel_spec.py --tsv_path tmp.tsv
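For reference, a tsv with the expected columns can be assembled with pandas. The audio path below is a placeholder, and the caption is reused from the quick-start example:

```python
# Illustrative only: build a tsv with the columns the preprocessing scripts
# expect ("name", "audio_path", "dataset", "caption"). The path is a placeholder.
import pandas as pd

rows = [
    {
        "name": "sample_0001",
        "audio_path": "/data/audiocaps/sample_0001.wav",  # placeholder path
        "dataset": "audiocaps",
        "caption": "Constant rattling noise and sharp vibrations",
    },
]
pd.DataFrame(rows).to_csv("tmp.tsv", sep="\t", index=False)
```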

Add the duration of each clip to the tsv file

python ldm/data/preprocess/add_duration.py
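As a rough illustration of what this step produces, the clip durations can also be computed directly with soundfile and written into the tsv; the column name "duration" is an assumption about what the training code expects, so treat this as a sketch rather than a replacement for the script above:

```python
# Hedged illustration of the "add duration" step: compute each clip's length
# in seconds with soundfile and store it in a "duration" column (assumed name).
import pandas as pd
import soundfile as sf

df = pd.read_csv("tmp.tsv", sep="\t")
df["duration"] = [sf.info(path).duration for path in df["audio_path"]]
df.to_csv("tmp.tsv", sep="\t", index=False)
```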

Train variational autoencoder

Assume we have processed several datasets and saved the .tsv files as data/*.tsv. In the config file, set data.params.spec_dir_path to that data directory (the directory containing the tsvs). Then train the VAE with the following command. If you don't have 8 GPUs on your machine, adjust the --gpus list accordingly (e.g. --gpus 0,1 for two GPUs).

python main.py --base configs/train/vae.yaml -t --gpus 0,1,2,3,4,5,6,7

The training result will be saved in ./logs/
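If you prefer to patch the config programmatically instead of editing it by hand, a small PyYAML script can set the key mentioned above; the key path comes from the text, and data/ is a placeholder for your tsv directory:

```python
# Optional helper: point data.params.spec_dir_path at the directory holding
# your .tsv files, then write the config back out. "data/" is a placeholder.
import yaml

with open("configs/train/vae.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["data"]["params"]["spec_dir_path"] = "data/"  # directory containing the tsvs

with open("configs/train/vae.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```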

Train latent diffusion

After training the VAE, set model.params.first_stage_config.params.ckpt_path in the config file to the path of your trained VAE checkpoint. Then run the following command to train the diffusion model:

python main.py --base configs/autoencoder1d.yaml -t --gpus 0,1,2,3,4,5,6,7

The training result will be saved in ./logs/
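Before pointing the diffusion config at the trained VAE, it can be worth verifying that the checkpoint loads. The path below is a placeholder for the checkpoint written under ./logs/, and the state_dict key assumes a standard PyTorch Lightning checkpoint:

```python
# Optional sanity check: make sure the trained VAE checkpoint loads before
# referencing it in the diffusion config. The path is a placeholder.
import torch

vae_ckpt = "logs/<vae-run>/checkpoints/last.ckpt"  # placeholder path under ./logs/
state = torch.load(vae_ckpt, map_location="cpu")
# "state_dict" assumes a Lightning-style checkpoint dict.
print(f"Loaded VAE checkpoint with {len(state['state_dict'])} tensors")
```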

Evaluation

Please refer to Make-An-Audio

Acknowledgements

This implementation uses parts of the code from the following GitHub repos: Make-An-Audio, CLAP, and Stable Diffusion, as described in our code.

Citations

If you find this code useful in your research, please consider citing:

@misc{liu2024audiolcm,
      title={AudioLCM: Text-to-Audio Generation with Latent Consistency Models}, 
      author={Huadai Liu and Rongjie Huang and Yang Liu and Hengyuan Cao and Jialei Wang and Xize Cheng and Siqi Zheng and Zhou Zhao},
      year={2024},
      eprint={2406.00356},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

Disclaimer

Any organization or individual is prohibited from using any technology mentioned in this paper to generate anyone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. Failure to comply with this provision may put you in violation of copyright laws.