# vschaos2: vintage neural audio synthesis

`vschaos2`, based on the original 2019 `vschaos` package [1][2], is a vintage-flavoured neural audio synthesis package. It is based on unsupervised / (semi-)supervised training of spectral information using variational auto-encoders [3], allowing direct manipulation in Max/MSP or PureData environments using nn_tilde. `vschaos2` may be trained on small batches of data, is very light, usable with small architectures, and can also be used for auxiliary predictive tasks.
See an example of vschaos in live action here.
## Quick Start

- Download the current repository.
- If you don't have PureData or Max, download PureData for free or Max for evaluation.
- Install nn~ following the instructions here.
- Download the `ordinario_8192.ts` pre-trained model from the pre-trained models Dropbox folder into the `patches` subfolder if using Max, or into the `externals` subfolder or your Pd library path if using PureData.
- Open `pd_example.pd` or `max_example.maxpat`, depending on the environment you chose above.
- Play!
## Full install

### Installing vschaos2

Use your favorite environment manager to create a dedicated Python environment for `vschaos2` (we advise using the miniconda package manager). Using miniconda:
```sh
cd your_target_folder
git clone http://www.github.com/acids_ircam/vschaos2
cd vschaos2
conda create -n vschaos python=3.8 setuptools==59.5.0
conda activate vschaos
pip install -r requirements.txt
```
You may need to manually install PyTorch before running `pip install` if you have specific CUDA requirements; find more information here.
## Data management

The only requirement for a dataset is to put all the target audio files in a `data/` subfolder, and the corresponding metadata files in a `metadata/` subfolder. Hence, the directory tree must look like
```
my_dataset/
    data/
        [...files]
    metadata/
        task_1/
            metadata.txt
            classes.txt [optional]
        task_2/
        ...
```
### Metadata

Metadata can be specified in two different ways. If every file belongs to a single class, the metadata may consist only of the `metadata.txt` file with the following information:

```
data/..../file1.wav metadata
data/..../file2.wav metadata
...
```
where key and field are separated by a `\t` character. If metadata changes over the course of an audio file, an additional `tracks` folder must be added, such that `metadata.txt` contains

```
data/..../file1.wav tracks/metadata_file1.txt
data/..../file2.wav tracks/metadata_file2.txt
...
```
and the `tracks` folder contains the metadata for each audio file in a separate text file, containing

```
0.0,1.0 metadata1
1.0,1.025 metadata2
```

where onsets and offsets are indicated in seconds (separated by a comma) and are separated from the metadata by a `\t` character. Metadata can be `float`, `int`, or `none`, hence allowing semi-supervised training.
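If you prefer to script this step, the sketch below (not part of vschaos2; the dataset path and the `label_for` rule are hypothetical) writes a single-class `metadata.txt` following the layout described above:

```python
# Hypothetical helper: write metadata/task_1/metadata.txt for a dataset where
# every audio file belongs to a single class, using tab-separated fields.
from pathlib import Path

dataset_root = Path("my_dataset")                      # hypothetical dataset root
task_dir = dataset_root / "metadata" / "task_1"
task_dir.mkdir(parents=True, exist_ok=True)

def label_for(audio_path: Path) -> str:
    # Hypothetical labelling rule: use the parent folder name as the class.
    return audio_path.parent.name

with open(task_dir / "metadata.txt", "w") as f:
    for wav in sorted((dataset_root / "data").rglob("*.wav")):
        rel = wav.relative_to(dataset_root)            # path relative to the dataset root
        f.write(f"{rel}\t{label_for(wav)}\n")
```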
## Training a model

### Unsupervised model

Model configurations are based on the hydra configuration manager, using `.yaml` files to specify the VAE's architecture. Commented configurations are given in the `configs/` folder, such that you only need to specify the data path and the training root / name:
```sh
python3 train_model.py --config-name dgt_mid name=my_training rundir=my_path +data.dataset.root=my_data
```
where `--config-name` is followed by the name of a configuration file in the `configs/` subfolder, `my_training` is a customized training name, `my_path` is the location of the training, and `my_data` is the path of a formatted dataset (see the data management section).
### Supervised model

Conditioning variational auto-encoders may be achieved in a semi-supervised fashion, allowing the generation to be constrained by high-level external information [4].

**Specifying metadata callbacks.** By default, metadata is imported as `int` labels. If you want precise control over metadata parsing, additional arguments may be given to the dataset in the `.yaml` file:
```yaml
data:
  dataset:
    metadata_callbacks:
      task_1:
        name: read_single_metadata
        type: int
      task_2:
        name: read_single_metadata
        type: float
      ...
```
where you would use `read_single_metadata` if each audio file belongs to a single class, and `read_multi_metadata` if metadata is scattered across multiple text files in a `tracks/` sub-folder.
**Specifying conditioning tasks.** To condition a model, you must add an additional `conditioning` entry to the `model` configuration:
```yaml
model:
  conditioning:
    task_1:
      dim: task_dim_1
      target: ['encoder', 'decoder']
      embedding: OneHot
    task_2:
      dim: task_dim_2
      target: ['decoder']
      embedding: OneHot
    ...
```
where the `dim`, `target`, and `embedding` keywords respectively define the dimensionality, target module, and embedding type for each task. Conditioning the decoder will encourage the model to disentangle the class information from the latent space, allowing you to control the generation by providing the corresponding label. Conditioning the encoder will instead provide a different latent space per label, hence enforcing class separation (at the cost of always having to provide the label for the encoding process).
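As a rough illustration (an assumption about what a OneHot embedding amounts to, not the actual vschaos2 code), conditioning the decoder boils down to appending a one-hot encoded label to the latent code before decoding:

```python
# Conceptual sketch of decoder conditioning with a one-hot label embedding.
import torch
import torch.nn.functional as F

latent_dim, task_dim = 16, 12                      # hypothetical sizes
z = torch.randn(4, latent_dim)                     # latent codes for a batch of 4
labels = torch.tensor([0, 3, 7, 11])               # class labels for task_1
cond = F.one_hot(labels, num_classes=task_dim).float()
decoder_input = torch.cat([z, cond], dim=-1)       # (4, latent_dim + task_dim), fed to the decoder
```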
**Predicting labels.** Additional predictors can be added to the model; they will be available as an additional output of the `nn~` module and can be used to condition the decoder even when label information is unavailable. To add predictors to the model, you must add an additional `prediction` entry to the `model` configuration:
```yaml
model:
  prediction:
    task_1:
      dim: task_dim_1
      type: MLPEncoder
      args:
        nlayers: 2
        hidden_dims: 200
      target_dist: Categorical
    ...
```
Prediction modules are trained alongside the VAE (with a stop gradient before the classifier), such that the encoder is not influenced by the classification results. If label information is available, the classifier is trained to predict the corresponding label; if it is not, it is trained on random values. Be careful not to predict a label that is conditioning the encoder!
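The stop gradient mentioned above can be pictured with the following sketch (stand-in modules, not the vschaos2 internals): the classifier sees the latent code through `detach()`, so its loss never updates the encoder.

```python
# Minimal stop-gradient illustration with stand-in modules.
import torch
import torch.nn as nn

encoder = nn.Linear(128, 16)            # stand-in for the VAE encoder
classifier = nn.Linear(16, 12)          # stand-in for an MLP predictor

x = torch.randn(8, 128)
labels = torch.randint(0, 12, (8,))

z = encoder(x)                          # latent codes, used by the VAE losses
logits = classifier(z.detach())         # detach() blocks gradients into the encoder
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()                         # updates the classifier only; encoder grads stay None
```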
**Example with pitch & octave.** A script is provided to generate pitch and octave metadata using the torchcrepe package. After moving the audio files into the `data/` subfolder, generate the metadata files by launching

```sh
python3 extract_pitch.py my_data
```
such that `pitch` and `octave` metadata will be generated in the `metadata/` sub-folder. Then, you can use the `dgt_pitch_cond` config file to train a parametrized VAE on `pitch` and `octave`:
```sh
python3 train.py --config-name dgt_pitch_prediction name=my_training rundir=my_path data.dataset.root=my_data
```
such that the model's decoder will be conditioned on both pitch and octave, which can be provided by the predictor if the corresponding label is missing.
## Tricks

- You can add the `+check=1` option to `train.py` to add a breakpoint before launching the training loop, allowing you to check the architecture of your model and your dataset.
- You can also add `+data.dataset.n_files=X` to only load `X` files from your dataset, allowing you to train on small data subsets to verify that your training setup works.
## Generating with a trained model

### Command-line generation

The script `generate.py` allows generation from a model using the command line. There are two generation modes so far: `reconstruct`, to reconstruct a given bunch of files through the model, and `trajectories`, leveraging the `trajectories` library to generate latent space explorations.
**Reconstruction.** To reconstruct a file through a model, you can execute

```sh
python3 generate.py reconstruct path_to_rundir --files path_to_audio.wav path_to_audio_folder/ --sr 44100 --out path_to_out/
```

to export the generations into the `path_to_out/reconstructions` folder.
**Trajectories.** To generate directly from the latent space using random trajectories, you can execute

```sh
python3 generate.py trajectories path_to_rundir --trajectories sin random line --sr 44100 -b 3 -s 512 --out path_to_out/
```

to generate 3 batches of 512-step sin, random, and line trajectories, which will be saved in `path_to_out/trajectories`.
## Exporting for nn~

To export a model, just use `script.py` with the training folder:

```sh
python3 script.py path_to_model/ -o output_dir/ -v version_number -n name
```

where `path_to_model/` is the training `rundir`, `version_number` is the version number of the training (if not indicated, the last training is taken), and `name` is the model name (default: `last`). The `.ts` file will be exported at `path_to_model/` and can be imported by `nn~`.
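Before loading the exported file in nn~, you may want to sanity-check it in Python; the snippet below (the path is a placeholder) simply loads the TorchScript module and lists its exported methods and attributes:

```python
# Load the exported TorchScript model to check that it scripted correctly.
import torch

model = torch.jit.load("path_to_model/last.ts")             # placeholder path to the exported .ts
print([m for m in dir(model) if not m.startswith("_")])     # inspect exported methods / attributes
```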
## References

[1] Chemla--Romeu-Santos, A. (2020). Manifold representations of musical signals and generative spaces. PhD thesis.

[2] Chemla--Romeu-Santos, A., Esling, P., & Ntalampiras, S. (2020, October). Représentations variationnelles inversibles : une nouvelle approche pour la synthèse sonore. In Journées d'Informatique Musicale 2020 (JIM 2020).

[3] Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[4] Kingma, D. P., Mohamed, S., Jimenez Rezende, D., & Welling, M. (2014). Semi-supervised learning with deep generative models. Advances in Neural Information Processing Systems, 27.