Awesome
Make-An-Audio 3: Transforming Text into Audio via Flow-based Large Diffusion Transformers
PyTorch Implementation of Lumina-t2x, Lumina-Next
We will provide our implementation and pre-trained models as open-source in this repository recently.
News
- Oct, 2024: FlashAudio released.
- Sept, 2024: Lumina-Next accepted by NeurIPS'24.
- July, 2024: AudioLCM accepted by ACM-MM'24.
- June, 2024: Make-An-Audio 3 (Lumina-Next) released in Github and HuggingFace.
- May, 2024: AudioLCM released in Github and HuggingFace.
Install dependencies
Note: You may want to adjust the CUDA version according to your driver version.
conda create -n Make_An_Audio_3 -y
conda activate Make_An_Audio_3
conda install python=3.11 pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
Install [nvidia apex](https://github.com/nvidia/apex) (optional)
Quick Started
Pretrained Models
Simply download the 500M weights from
Model | Config | Pretraining Data | Path |
---|---|---|---|
M (160M) | txt2audio-cfm-cfg | AudioCaption | TBD |
L (520M) | / | AudioCaption | [TBD] |
XL (750M) | txt2audio-cfm-cfg-XL | AudioCaption | Here |
XXL | txt2audio-cfm-cfg-XXL | AudioCaption | Here |
M (160M) | txt2music-cfm-cfg | Music | Here |
L (520M) | / | Music | [TBD] |
XL (750M) | / | Music | [TBD] |
3B | / | Music | [TBD] |
M (160M) | video2audio-cfm-cfg-moe | VGGSound | Here |
Generate audio/music from text
python3 scripts/txt2audio_for_2cap_flow.py --prompt {TEXT}
--outdir output_dir -r checkpoints_last.ckpt -b configs/txt2audio-cfm-cfg.yaml --scale 3.0
--vocoder-ckpt useful_ckpts/bigvnat
Add --test-dataset structure
for text-to-audio generation
Generate audio/music from audiocaps or musiccaps test dataset
- remember to alter
config["test_dataset"]
python3 scripts/txt2audio_for_2cap_flow.py
--outdir output_dir -r checkpoints_last.ckpt -b configs/txt2audio-cfm-cfg.yaml --scale 3.0
--vocoder-ckpt useful_ckpts/bigvnat --test-dataset testset
Generate audio from video
python3 scripts/video2audio_flow.py
--outdir output_dir -r checkpoints_last.ckpt -b configs/video2audio-cfm-cfg-moe.yaml --scale 3.0
--vocoder-ckpt useful_ckpts/bigvnat --test-dataset vggsound
Train flow-matching DiT
After trainning VAE, replace model.params.first_stage_config.params.ckpt_path with your trained VAE checkpoint path in the config file. Run the following command to train Diffusion model
python main.py --base configs/txt2audio-cfm-cfg.yaml -t --gpus 0,1,2,3,4,5,6,7
Others
For Data preparation, Training variational autoencoder, Evaluation, Please refer to Make-An-Audio.
Acknowledgements
This implementation uses parts of the code from the following Github repos: Make-An-Audio, AudioLCM, CLAP, as described in our code.
Citations
If you find this code useful in your research, please consider citing:
@article{gao2024lumina-next,
title={Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT},
author={Zhuo, Le and Du, Ruoyi and Han, Xiao and Li, Yangguang and Liu, Dongyang and Huang, Rongjie and Liu, Wenze and others},
journal={arXiv preprint arXiv:2406.18583},
year={2024}
}
@article{huang2023make,
title={Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models},
author={Huang, Rongjie and Huang, Jiawei and Yang, Dongchao and Ren, Yi and Liu, Luping and Li, Mingze and Ye, Zhenhui and Liu, Jinglin and Yin, Xiang and Zhao, Zhou},
journal={arXiv preprint arXiv:2301.12661},
year={2023}
}
@article{gao2024lumin-t2x,
title={Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers},
author={Gao, Peng and Zhuo, Le and Liu, Chris and and Du, Ruoyi and Luo, Xu and Qiu, Longtian and Zhang, Yuhang and others},
journal={arXiv preprint arXiv:2405.05945},
year={2024}
}
Disclaimer
Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.