MERT

This is the official implementation of the paper "MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training".

Evaluation, Benchmarking and Baselines:

Training

The MERT training is implemented with fairseq. You need to clone the fairseq repo inside our repo at ./src/fairseq and place the MERT implementation code as a fairseq example project.

Environment Setup

The training of MERT requires:

You can use the script ./scripts/environment_setup.sh to set up the Python environment from scratch; it can also be easily adapted into a Dockerfile. All the relevant folders will be placed under the customized MERT repo folder path $MAP_PROJ_DIR.

Data Preparation

Generally, there are 2 things you need to prepare:

The two options of acoustic-teacher pseudo labels for MERT training can be constructed by:

Scripts for preparing the training data:

# First prepare the manifest file indexing the audios.
# If needed, the audio files will be converted to 24 kHz.
python scripts/prepare_manifest.py --root-dir /absolute/path/to/original/custom_audio_dataset \
      --target-rate 24000 --converted-root-dir /absolute/path/to/converted/custom_audio_dataset \
      --out-dir data/custom_audio_dataset_manifest --extension wav
      
# Prepare the codecs for audios in the manifest
python scripts/prepare_codecs_from_manifest.py  \
      --manifest_path data/custom_audio_dataset_manifest --manifest_file_name train.tsv \
      --out_root data/encodec_labels/custom_audio_dataset --codebook_size 1024 --n_codebook 8
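For orientation, the codec labels produced by the command above are, per audio clip, n_codebook parallel streams of integer codes, each in [0, codebook_size). The shapes below are inferred from the --codebook_size 1024 --n_codebook 8 flags; the generator is a toy stand-in, not the actual EnCodec output:

```python
import random

def fake_codec_labels(num_frames, n_codebook=8, codebook_size=1024, seed=0):
    """Toy stand-in for RVQ codec labels: one integer code per codebook
    per frame, each in [0, codebook_size). Real labels come from
    scripts/prepare_codecs_from_manifest.py."""
    rng = random.Random(seed)
    return [[rng.randrange(codebook_size) for _ in range(num_frames)]
            for _ in range(n_codebook)]

# e.g. 75 frames corresponds to one second of audio at a 75 Hz frame rate
labels = fake_codec_labels(num_frames=75)
```

Each of the 8 rows is the label sequence for one residual codebook, aligned frame-by-frame with the audio.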

Refer to HuBERT for more details on the data preparation and format.
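For reference, a HuBERT-style manifest (as produced by prepare_manifest.py above) is a TSV file whose first line is the audio root directory and whose remaining lines are relative_path<TAB>num_samples. A minimal parsing sketch, with hypothetical file names:

```python
import tempfile
from pathlib import Path

def read_manifest(path):
    """Parse a HuBERT-style TSV manifest into (root, [(rel_path, n_samples), ...])."""
    lines = Path(path).read_text().strip().split("\n")
    root = lines[0]
    entries = [(p, int(n)) for p, n in (line.split("\t") for line in lines[1:])]
    return root, entries

# Hypothetical content mirroring data/custom_audio_dataset_manifest/train.tsv
tsv = ("/absolute/path/to/converted/custom_audio_dataset\n"
       "song_000.wav\t240000\n"   # 10 s at 24 kHz
       "song_001.wav\t480000\n")  # 20 s at 24 kHz
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as f:
    f.write(tsv)
root, entries = read_manifest(f.name)
```

The sample counts let the dataloader know each clip's length without reopening the audio files.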

Start Training

Note that we follow the fairseq development protocol and place our code as an example project. When running the fairseq program, you can specify the MERT customized code with common.user_dir=${MAP_PROJ_DIR}/mert_fairseq.

After the environment is set up, you can launch training with the following scripts:

# for MERT 95M
bash scripts/run_training.sh 0 dummy MERT_RVQ-VAE_CQT_95M

# for MERT 330M
bash scripts/run_training.sh 0 dummy MERT_RVQ-VAE_CQT_330M

Inference

We use the huggingface models for inference and evaluation. Taking the RVQ-VAE 95M MERT as an example, the following code shows how to load the model and extract representations with MERT.

python MERT/scripts/MERT_demo_inference.py
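The demo script above covers the full pipeline; as a minimal sketch of the loading step, the snippet below assumes the m-a-p/MERT-v1-95M checkpoint id from the m-a-p Hugging Face page and its 24 kHz input rate. Verify the details against MERT/scripts/MERT_demo_inference.py:

```python
def extract_hidden_states(waveform, model_id="m-a-p/MERT-v1-95M"):
    """Return per-layer hidden states for a mono waveform (assumed 24 kHz).

    model_id and the trust_remote_code flag follow the m-a-p model cards;
    imports are deferred so the helper below stays dependency-free.
    """
    import torch
    from transformers import AutoModel, Wav2Vec2FeatureExtractor

    model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
    processor = Wav2Vec2FeatureExtractor.from_pretrained(model_id, trust_remote_code=True)
    inputs = processor(waveform, sampling_rate=processor.sampling_rate,
                       return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # One (batch, time, dim) tensor per transformer layer, plus the embedding layer.
    return out.hidden_states

def time_average(hidden_state):
    """Average a (time, dim) list-of-lists over time (pure-Python utility),
    a common way to pool frame-level features into a clip-level vector."""
    n = len(hidden_state)
    dim = len(hidden_state[0])
    return [sum(frame[d] for frame in hidden_state) / n for d in range(dim)]
```

Downstream probes typically pool each layer over time like this and then combine layers, rather than using only the last layer.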

Checkpoints

Huggingface Checkpoint

Our Huggingface Transformers checkpoints for convenient inference are uploaded to the m-a-p project page.

To convert your self-trained models, check the scripts:

bash scripts/convert_HF_script.sh default mert config_mert_base [/absolute/path/to/a/fairseq/checkpoint.pt]

Fairseq Checkpoint

We also provide the corresponding fairseq checkpoints for continual training or further modification, hosted at the corresponding HF repos:

Citation

@misc{li2023mert,
      title={MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training}, 
      author={Yizhi Li and Ruibin Yuan and Ge Zhang and Yinghao Ma and Xingran Chen and Hanzhi Yin and Chenghua Lin and Anton Ragni and Emmanouil Benetos and Norbert Gyenge and Roger Dannenberg and Ruibo Liu and Wenhu Chen and Gus Xia and Yemin Shi and Wenhao Huang and Yike Guo and Jie Fu},
      year={2023},
      eprint={2306.00107},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}