SCHmUBERT
An implementation of the absorbing state diffusion model from https://github.com/samb-t/unleashing-transformers, applied to symbolic music (MIDI).
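At a high level, the absorbing state forward process replaces tokens with a dedicated mask token with increasing probability, and the model learns to reverse this corruption. Below is a minimal sketch of that forward corruption, assuming a linear schedule and reusing the mask id 90 mentioned later in this README; it is illustrative only, not the repository's code:

```python
import torch

MASK_ID = 90  # mask token id, as used in this project's visualizations

def corrupt(x0: torch.Tensor, t: int, T: int) -> torch.Tensor:
    """Absorbing forward process sketch: mask each token independently
    with probability t / T (linear schedule assumed for illustration)."""
    masked = torch.rand(x0.shape) < t / T
    return torch.where(masked, torch.full_like(x0, MASK_ID), x0)
```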
Samples
Samples in MIDI format can be found in the samples folder. You can also explore them in your browser (open the link in a new tab if the page is not found).
Installation
I run my experiments in Python 3.10, with all dependencies managed by Conda.
```bash
conda env create -f env.yml
```
Note that for all experiments, a soundfont file called 'soundfont.sf2' (not included) must be located in the root directory of the project.
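The soundfont is presumably consumed when rendering MIDI to audio for listening. As a rough sketch of how such a file is typically used, assuming pretty_midi with fluidsynth bindings (the MIDI path is hypothetical, and this is not necessarily the code path this repo takes):

```python
import pretty_midi

# Render a MIDI file to a waveform using the soundfont from the project root.
# 'samples/example.mid' is a hypothetical path for illustration.
midi = pretty_midi.PrettyMIDI("samples/example.mid")
audio = midi.fluidsynth(fs=44100, sf2_path="soundfont.sf2")
```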
Prepare Dataset
I use the Lakh MIDI Dataset to train the models. For loading, preprocessing, and extracting melodies and trios from the MIDI files, I adapted the pipelines Magenta implemented for their MusicVAE. To prepare the dataset, run:
```bash
python prepare_data.py --root_dir=/path/to/lmd_full --target data/lakh_trio.npy --mode trio --bars 64
```
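The result is a single NumPy array on disk. A quick way to sanity-check it after preparation (the array layout stated in the comment is an assumption, not documented behavior):

```python
import numpy as np

data = np.load("data/lakh_trio.npy")
# Assumed layout: (num_sequences, time_steps, tracks), with 3 tracks for trio.
print(data.shape, data.dtype)
```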
Train
I use visdom to log the training progress and periodically show samples.
To train the model, start the visdom server and run, for example:
```bash
python train.py --dataset data/lakh_trio.npy --bars 64 --batch_size 64 --tracks trio --model conv_transformer
```
So far, I got the best results with the conv_transformer model using a single 1D convolutional layer with a kernel width of 4.
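To make that concrete, here is an illustrative PyTorch sketch of such a front end: token embeddings followed by a single Conv1d of kernel width 4. Class name, vocabulary size, and dimensions are assumptions, not the repository's actual code:

```python
import torch
import torch.nn as nn

class ConvEmbedding(nn.Module):
    """Illustrative sketch: embed tokens, then apply one width-4 Conv1d."""

    def __init__(self, vocab_size: int = 91, dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # Kernel width 4; 'same' padding preserves the sequence length.
        self.conv = nn.Conv1d(dim, dim, kernel_size=4, padding="same")

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens).transpose(1, 2)  # (batch, dim, seq) for Conv1d
        return self.conv(x).transpose(1, 2)     # back to (batch, seq, dim)
```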
Pay attention to the steps_per_eval parameter, which defaults to 10000. The evaluation step is more computationally expensive than training for 10000 steps, so you may want to increase this value if you do not need that many evaluations.
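In pseudocode, the cadence controlled by steps_per_eval looks roughly like this (placeholder functions, not the actual train.py loop):

```python
steps_per_eval = 10_000  # default; increase to evaluate less often
total_steps = 140_000    # illustrative

def train_step() -> None: ...        # placeholder: one optimizer update
def evaluate_and_log() -> None: ...  # placeholder: sampling + metrics, expensive

for step in range(1, total_steps + 1):
    train_step()
    if step % steps_per_eval == 0:
        evaluate_and_log()
```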
Evaluate
To evaluate the framewise self-similarity metric on the samples generated by a model, run:
```bash
python evaluate.py --mode unconditional|infilling|self
```
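As rough intuition for the metric, framewise self-similarity compares the statistics of adjacent frames of a piece. The simplified stand-in below uses pitch-histogram cosine similarity between consecutive frames; the actual statistics computed by evaluate.py may differ:

```python
import numpy as np

def framewise_self_similarity(tokens: np.ndarray, frame: int = 64) -> float:
    """Simplified stand-in: mean cosine similarity between pitch histograms
    of consecutive, non-overlapping frames of a token sequence.
    Assumes integer tokens in [0, 90]."""
    frames = [tokens[i:i + frame] for i in range(0, len(tokens) - frame + 1, frame)]
    hists = [np.bincount(f, minlength=91).astype(float) for f in frames]
    sims = []
    for a, b in zip(hists, hists[1:]):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        sims.append(a @ b / denom if denom else 0.0)
    return float(np.mean(sims)) if sims else 0.0
```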
Sample
For sampling, I hacked together a rudimentary GUI using nicegui:
```bash
python sample.py --load_step 140000 --bars 64 --tracks trio --model conv_transformer
```
The GUI supports:
- visualizing samples (melody = red, bass = blue, drums = black); the y position indicates pitch, with special pitch values: 0 = pause, 1 = note off, 90 = mask
- adjusting the number of sample steps (slider in the Upload expansion area)
- diffusing from left to right ('=>') or vice versa ('<=')
- copying from left to right ('>') or vice versa; only masked values are overwritten
- sampling unconditionally (select 'A' in the central toggle to diffuse All, i.e. a batch of 8, instead of the Selected sample)
- uploading MIDI or MusicXML pieces for conditioning
- masking whole tracks (LM = Left Melody, RD = Right Drums, ...)
- masking an area selected with the mouse (mask button at the bottom; see the infilling sketch after this list)
- playback with a cursor indicating the exact position in the left and right visualizations
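Masking and regeneration in the GUI map naturally onto conditional infilling with the absorbing model: masked positions are set to the mask token, and the reverse diffusion fills them in. A minimal sketch, where the `denoise` callable is a hypothetical stand-in for the model's reverse process:

```python
import torch

MASK_ID = 90  # mask token, as shown in the visualization

def infill(sample: torch.Tensor, region: slice, denoise) -> torch.Tensor:
    """Sketch of conditional infilling: mask a region of an existing sample,
    then let the reverse diffusion regenerate only the masked tokens."""
    x = sample.clone()
    x[region] = MASK_ID
    return denoise(x)  # hypothetical: runs reverse diffusion on masked positions
```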
Model Weights
Model weights for the conv_transformer EMA model trained on the Lakh MIDI Dataset can be obtained here.
Extract the 'logs' folder to the project root, and set load_step, model, ... accordingly (250000, conv_transformer, ...).
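Combining the flags from the Sample section with those values, sampling from the released weights would look like:

```bash
python sample.py --load_step 250000 --bars 64 --tracks trio --model conv_transformer
```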