Creative Text-to-Audio Generation via Synthesizer Programming (CTAG)

Code for the ICML 2024 paper Creative Text-to-Audio Generation via Synthesizer Programming. CTAG is a method for generating sounds from text prompts by using a virtual modular synthesizer. CTAG depends on SynthAX, a fast modular synthesizer in JAX.

You can hear many examples on the website. The code to reproduce the results from the paper will be released in a separate repository (coming soon).

Installation

You can create the environment as follows:

conda create -n ctag python=3.9
conda activate ctag
pip install -r requirements.txt

By default, we install JAX for CPU. You can find more details in the JAX documentation on using JAX with your accelerators.
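After installing, it can be useful to confirm which devices JAX actually sees. The following is a minimal sketch, not part of ctag: `list_jax_devices` is a hypothetical helper name, and it only relies on `jax.devices()` and the `platform` and `id` attributes of the returned devices. It returns an empty list if JAX is not importable.

```python
def list_jax_devices():
    """Return labels such as 'cpu:0' for the devices JAX can see.

    Hypothetical helper for illustration; returns [] if JAX is not installed.
    """
    try:
        import jax
    except ImportError:
        return []
    # Each device exposes its backend name (platform) and an integer id.
    return [f"{device.platform}:{device.id}" for device in jax.devices()]


print(list_jax_devices())
```

With the default CPU install you should only see CPU entries; accelerator entries appear once JAX is installed for that accelerator.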

CLAP Checkpoints

You also have to download the checkpoints for LAION-CLAP as follows:

mkdir -p ctag/checkpoints && wget -i checkpoints.txt -P ctag/checkpoints

`ctag/`

Generating sounds is very simple! By default, ctag runs on CPU with a lower population size, but you can change that with config values:

cd ctag
python text2synth.py system.device=cuda general.popsize=100

Running this creates the directories logs, results, and experiments. The final version of each sound is saved in experiments, while results contains every iteration.

By default, this uses the prompts in ctag/data/esc50-sounds.txt. To change this, point the general.prompts field to a different file or set it to a string of semicolon-separated prompts. You can also override it from the command line:

# From a prompts.txt file
python text2synth.py general.prompts=/path/to/prompts.txt

# From strings
python text2synth.py general.prompts='"a bird tweeting;walking on leaves"'

Note that currently, you must supply at least two prompts! This is due to an issue in the version of CLAP on PyPI.
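To illustrate the two accepted forms and the two-prompt minimum, here is a minimal sketch of how such a value could be resolved into a list of prompts. It is not the code ctag uses: `resolve_prompts` is a hypothetical helper, and the one-prompt-per-line file layout is an assumption.

```python
from pathlib import Path


def resolve_prompts(value):
    """Hypothetical helper: turn a prompts setting into a list of prompts.

    `value` is either a path to a text file (assumed: one prompt per line)
    or a string of semicolon-separated prompts.
    """
    try:
        is_file = Path(value).is_file()
    except OSError:
        # Very long prompt strings are not valid paths; treat them as text.
        is_file = False

    if is_file:
        raw = Path(value).read_text(encoding="utf-8").splitlines()
    else:
        raw = value.split(";")

    # Drop surrounding whitespace and empty entries.
    prompts = [p.strip() for p in raw if p.strip()]
    if len(prompts) < 2:
        raise ValueError("at least two prompts are required")
    return prompts


print(resolve_prompts("a bird tweeting;walking on leaves"))
# → ['a bird tweeting', 'walking on leaves']
```

Checking the count up front gives a clear error message instead of a failure later in the run.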

Configuration

We use Hydra to configure ctag. The configuration can be found in ctag/conf/config.yaml, with specific sub-configs in sub-directories of ctag/conf/.

The configs define all the parameters (e.g. strategy algorithm, synthesizer, iterations, prompts). By default, these are the ones used for the paper. You can choose the model according to the downloaded CLAP checkpoints, an evosax strategy available in the configuration, a synth architecture, and a synthconfig. This is also where you choose the prompts, the duration of the sounds, the number of iterations, the popsize (population size), the number of independent runs per prompt n_runs (not to be confused with the iterations), and the initial random seed.
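Since Hydra accepts `key=value` overrides on the command line, several runs with different settings can be scripted. The sketch below only builds the command; `build_command` is a hypothetical helper, and the two keys shown are the ones used in the usage examples earlier in this README.

```python
import shlex


def build_command(overrides, script="text2synth.py"):
    """Hypothetical helper: build an argument list with Hydra-style overrides."""
    args = ["python", script]
    for key, value in overrides.items():
        # Hydra overrides take the form key=value.
        args.append(f"{key}={value}")
    return args


command = build_command({"system.device": "cuda", "general.popsize": 100})
print(shlex.join(command))
# → python text2synth.py system.device=cuda general.popsize=100
```

The resulting list can be passed to `subprocess.run(command, cwd="ctag")` to launch the run from a script.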

Hyperparameters

We use Ax to sweep the hyperparameters of an algorithm through a single config field. First, you need to install the following version of ax-platform because of dependency issues with other packages:

pip install ax-platform==0.2.8

Then you can run the sweep as follows:

python text2synth.py --multirun

Acknowledgements & Citing

If you use ctag in your research, please cite the following paper:

@inproceedings{cherep2024creative,
  title={Creative Text-to-Audio Generation via Synthesizer Programming},
  author={Cherep, Manuel and Singh, Nikhil and Shand, Jessica},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024}
}

For the synthesizer component itself, please cite SynthAX:

@conference{cherep2023synthax,
  title = {SynthAX: A Fast Modular Synthesizer in JAX},
  author = {Cherep*, Manuel and Singh*, Nikhil},
  booktitle = {Audio Engineering Society Convention 155},
  month = {May},
  year = {2023},
  url = {http://www.aes.org/e-lib/browse.cfm?elib=22261}
}

We acknowledge partial financial support by Fulbright Spain. We also acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center for providing HPC resources that have contributed to the research results reported within these papers.