VoiceLDM
This is the repository for the paper *VoiceLDM: Text-to-Speech with Environmental Context*, ICASSP 2024.

<p align="center"> <img src="main_figure.png"/> </p>

VoiceLDM extends text-to-audio models so that they are also capable of generating linguistically intelligible speech.
[2024/05 Update] I have now added the code for training VoiceLDM! Refer to the Training section below for more details.
<a href='https://voiceldm.github.io/'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2309.13664'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a>
🔧 Installation
Install directly from GitHub:

```bash
pip install git+https://github.com/glory20h/VoiceLDM.git
```

OR

Install from source:

```bash
git clone https://github.com/glory20h/VoiceLDM.git
cd VoiceLDM
pip install -e .
```
📖 Usage
- Generate audio with a description prompt and a content prompt:

  ```bash
  python generate.py --desc_prompt "She is talking in a park." --cont_prompt "Good morning! How are you feeling today?"
  ```

- Generate audio with an audio prompt and a content prompt:

  ```bash
  python generate.py --audio_prompt "whispering.wav" --cont_prompt "Good morning! How are you feeling today?"
  ```

- Text-to-Speech example:

  ```bash
  python generate.py --desc_prompt "clean speech" --cont_prompt "Good morning! How are you feeling today?" --desc_guidance_scale 1 --cont_guidance_scale 9
  ```

- Text-to-Audio example:

  ```bash
  python generate.py --desc_prompt "trumpet" --cont_prompt "_" --desc_guidance_scale 9 --cont_guidance_scale 1
  ```

Generated audio files are saved to the default output folder `./outputs`.
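To generate several clips in one run, you can loop over prompt pairs with a small wrapper around `generate.py`. This is just an illustrative sketch using the flags shown above; the prompt pairs are made up, and each invocation writes its output to `./outputs`.

```python
import subprocess

# Hypothetical (description, content) prompt pairs.
prompts = [
    ("She is talking in a park.", "Good morning! How are you feeling today?"),
    ("He is whispering in a library.", "Please keep your voice down."),
]

for desc, cont in prompts:
    # Invoke the CLI once per prompt pair; outputs land in ./outputs by default.
    subprocess.run(
        ["python", "generate.py", "--desc_prompt", desc, "--cont_prompt", cont],
        check=True,
    )
```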
💡 Tips for Better Audio Generation
Dual Classifier-Free Guidance Matters!
It's crucial to appropriately adjust the weights for dual classifier-free guidance; we find that this adjustment greatly influences the likelihood of obtaining satisfactory results. Here are some key tips, followed by a rough sketch of how the two scales combine:

- Some weight settings work better for some prompts than others. Experiment with the weights to find the combination that suits your specific use case.
- Setting both `desc_guidance_scale` and `cont_guidance_scale` to 7 is a good starting point.
- If the generated audio doesn't align well with the provided content prompt, try decreasing `desc_guidance_scale` and increasing `cont_guidance_scale`.
- If the generated audio doesn't align well with the provided description prompt, try decreasing `cont_guidance_scale` and increasing `desc_guidance_scale`.
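For intuition, here is a minimal sketch of how two guidance scales can be combined at each denoising step, assuming a common dual classifier-free guidance formulation built from three noise predictions (unconditional, description-only, and fully conditioned). The function below is illustrative and not code from this repository; the exact combination used inside VoiceLDM may differ.

```python
import torch

def dual_cfg(eps_uncond: torch.Tensor,
             eps_desc: torch.Tensor,
             eps_full: torch.Tensor,
             desc_scale: float = 7.0,
             cont_scale: float = 7.0) -> torch.Tensor:
    """Combine three noise predictions using two guidance scales.

    eps_uncond: prediction with both prompts dropped
    eps_desc:   prediction conditioned on the description prompt only
    eps_full:   prediction conditioned on both description and content prompts
    """
    # Move toward the description condition with desc_scale, then further
    # toward the content condition with cont_scale.
    return (eps_uncond
            + desc_scale * (eps_desc - eps_uncond)
            + cont_scale * (eps_full - eps_desc))
```

In this view, a larger `cont_scale` pulls the sample harder toward the content prompt, while a larger `desc_scale` emphasizes the description prompt, which matches the tuning advice above.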
⚙️ Full List of Options
View the full list of options with the following command:
```bash
python generate.py -h
```
💾 Data
The CSV files for the processed dataset used to train VoiceLDM can be found here. These files include the transcriptions generated using the Whisper model.

Speech Segments

- `as_speech_en.csv` (English speech segments from AudioSet)
- `cv1.csv`, `cv2.csv` (English speech segments from CommonVoice 13.0 en; split into two files to meet GitHub's file size limit)
- `voxceleb.csv` (English speech segments from VoxCeleb1)

Non-Speech Segments

- `as_noise.csv` (non-speech segments from AudioSet)
- `noise_demand.csv` (non-speech segments from DEMAND)
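Before pointing the training config at these files, a short script like the one below can confirm that the `file_path` entries in a CSV resolve to real files under your dataset root. This is an illustrative sketch rather than part of the repository; `DATA_ROOT` is a placeholder, and only the `file_path` column mentioned in the Training section is assumed to exist.

```python
import os
import pandas as pd

DATA_ROOT = "/path/to/your/dataset"   # placeholder: root directory of your audio files
CSV_FILE = "as_speech_en.csv"         # any of the CSVs listed above

df = pd.read_csv(CSV_FILE)
# Collect rows whose file_path does not point at an existing file.
missing = [p for p in df["file_path"]
           if not os.path.exists(os.path.join(DATA_ROOT, p))]

print(f"{len(df)} rows, {len(missing)} missing files")
for p in missing[:10]:                # show a few examples, if any
    print("missing:", p)
```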
🧠 Training
If you wish to train the model yourself, follow these steps:

1. Configuration Setup (the trickiest part):
   - Navigate to the `configs` folder to find the necessary configuration files. For example, `VoiceLDM-M.yaml` is used for training the VoiceLDM-M model in the paper.
   - Prepare the CSV files used for training. You can download them here.
   - Examine the YAML file and adjust `"paths"` and `"noise_paths"` to the root path of your dataset. Also, take a look at the CSV files and ensure that the `file_path` entries in these CSV files match the actual file paths in your dataset.
   - Update the paths for `cv_csv_path1`, `cv_csv_path2`, `as_speech_en_csv_path`, `voxceleb_csv_path`, `as_noise_csv_path`, and `noise_demand_csv_path` in the YAML file. You may leave an entry blank if you do not wish to use the corresponding CSV file and training data. (A small sketch for double-checking these paths appears after this list.)
   - You may also adjust other parameters, such as the batch size, according to your system's capabilities.

2. Configure Hugging Face Accelerate:
   - Set up Accelerate by running:

     ```bash
     accelerate config
     ```

     This enables CPU, single-GPU, and multi-GPU training. Follow the on-screen instructions to configure your hardware settings.

3. Start Training:
   - Launch the training process with the following example command:

     ```bash
     accelerate launch train.py --config configs/VoiceLDM-M.yaml
     ```

   - Training checkpoints will be automatically saved in the `results` folder.

4. Running Inference:
   - Once training is complete, you can run inference with the trained model by specifying the checkpoint path. For example:

     ```bash
     python generate.py --ckpt_path results/VoiceLDM-M/checkpoints/checkpoint_49/pytorch_model.bin --desc_prompt "She is talking in a park." --cont_prompt "Good morning! How are you feeling today?"
     ```
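As a quick sanity check before launching training, you can print the dataset-related entries from the YAML config and verify they point at existing files. This is an illustrative snippet, not part of the repository, and it assumes the keys listed in the configuration step sit at the top level of the YAML file; adjust the access if your config nests them differently.

```python
import os
import yaml

# Placeholder config path; use whichever YAML you are training with.
with open("configs/VoiceLDM-M.yaml") as f:
    cfg = yaml.safe_load(f)

# Key names taken from the configuration step above.
keys = ["cv_csv_path1", "cv_csv_path2", "as_speech_en_csv_path",
        "voxceleb_csv_path", "as_noise_csv_path", "noise_demand_csv_path"]

for key in keys:
    value = cfg.get(key)
    exists = bool(value) and os.path.exists(value)
    print(f"{key}: {value!r} (exists: {exists})")
```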
🙏 Acknowledgements
This work would not have been possible without the following repositories: