


This is the official implementation of the paper "Audio Prompt Adapter: Unleashing Music Editing Abilities for Text-to-Music with Lightweight Finetuning", in Proc. Int. Society for Music Information Retrieval Conf. (ISMIR), 2024.
arXiv | Demo website | Demo video

*This folder contains the evaluation dataset used in the paper.


SDEdit-AudioLDM2 is the code used for AudioLDM2 SDEdit in the evaluation. Please give this project a star if you find it useful or inspiring~


We provide a step-by-step series of examples that explain how to set up a development environment.

git clone https://github.com/fundwotsai2001/AP-adapter.git
conda create -n <env_name> python=3.11
conda activate <env_name>
cd AP-adapter
pip install -r requirements.txt

Downloading checkpoint

The AudioMAE checkpoint can be downloaded from pretrain.
The AP-adapter checkpoint can be downloaded from AP-adapter.

gdown https://drive.google.com/uc?id=1ni_DV4dRf7GxM8k-Eirx71WP9Gg89wwu
gdown https://drive.google.com/uc?id=1LS3KeczUwfMzk8Cf5oTvkjCkqK3OVnkZ
# If the command doesn't work, consider upgrading gdown, e.g. pip install --upgrade gdown
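The same downloads can also be scripted from Python. The following is a minimal sketch, assuming the `gdown` package is installed; the helper names `drive_url` and `download_all` are hypothetical and not part of this repository. It reuses the two file IDs from the commands above and lets Google Drive decide the output file names.

```python
# Sketch: fetch the two checkpoints through gdown's Python API instead of the CLI.
# The helper names here are illustrative, not part of the AP-adapter code base.

# File IDs copied from the gdown commands in this README.
CHECKPOINT_IDS = [
    "1ni_DV4dRf7GxM8k-Eirx71WP9Gg89wwu",
    "1LS3KeczUwfMzk8Cf5oTvkjCkqK3OVnkZ",
]


def drive_url(file_id):
    """Build the same direct-download URL that the gdown commands use."""
    return f"https://drive.google.com/uc?id={file_id}"


def download_all(file_ids=CHECKPOINT_IDS):
    """Download every checkpoint into the current directory and return the paths."""
    import gdown  # imported lazily so drive_url works even without gdown installed

    paths = []
    for file_id in file_ids:
        # output=None keeps the file name reported by Google Drive.
        path = gdown.download(drive_url(file_id), output=None, quiet=False)
        if path is None:
            raise RuntimeError(f"Download failed for file id {file_id}")
        paths.append(path)
    return paths
```

Calling `download_all()` from inside the AP-adapter directory should leave both files next to the code, just as the two CLI commands do.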

Parameters in inference.py

We have standard parameter sets for the three tasks. You can see the detailed settings in the demo, or directly use the templates in config.py, where you can also change the prompt settings. The effect of each hyper-parameter is discussed in the paper; in general, a larger "ap_scale" strengthens the audio control, while larger "time_pooling" and "frequency_pooling" weaken it. You can adjust these parameters to fit your requirements, or just use the default settings.
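To build intuition for these three parameters, here is a toy sketch in plain Python. It is not the repository's implementation: it assumes that pooling averages non-overlapping windows of a time-frequency feature grid, and that "ap_scale" weights an audio branch added to the text branch (the additive form used by IP-adapter-style decoupled cross-attention). The functions `average_pool` and `blend` are hypothetical names.

```python
# Toy illustration only; the real model operates on AudioMAE features inside AudioLDM2.

def average_pool(grid, time_pooling, frequency_pooling):
    """Average non-overlapping windows of a 2D grid (rows = time, columns = frequency)."""
    n_time, n_freq = len(grid), len(grid[0])
    pooled = []
    for t in range(0, n_time - time_pooling + 1, time_pooling):
        row = []
        for f in range(0, n_freq - frequency_pooling + 1, frequency_pooling):
            window = [
                grid[t + i][f + j]
                for i in range(time_pooling)
                for j in range(frequency_pooling)
            ]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


def blend(text_out, audio_out, ap_scale):
    """Assumed additive mix: ap_scale = 0 ignores the audio prompt entirely."""
    return [t + ap_scale * a for t, a in zip(text_out, audio_out)]


# An 8x8 toy feature grid: larger pooling leaves fewer, coarser audio tokens.
grid = [[float(t * 8 + f) for f in range(8)] for t in range(8)]
for pooling in (1, 2, 4, 8):
    pooled = average_pool(grid, pooling, pooling)
    print(f"pooling={pooling}: {len(pooled) * len(pooled[0])} audio tokens")
```

With fewer tokens, less fine-grained detail of the input audio reaches the generator, which is one way to read why larger pooling values loosen the audio control while a larger "ap_scale" tightens it.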

python inference.py --task timbre_transfer
python inference.py --task style_transfer
python inference.py --task accompaniment_generation
## If you want to try something cool, use test and change the template in config.py
python inference.py --task test
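If you prefer to launch the tasks from a Python script (for example, to run several in a row), a small wrapper could look like the sketch below. The helper names `build_command` and `run_task` are hypothetical; the only things taken from this README are the script name inference.py, the `--task` flag, and the four task names.

```python
import subprocess
import sys

# Task names as listed in this README.
TASKS = ("timbre_transfer", "style_transfer", "accompaniment_generation", "test")


def build_command(task, script="inference.py"):
    """Return the argument list for one inference run, rejecting unknown tasks."""
    if task not in TASKS:
        raise ValueError(f"Unknown task {task!r}; expected one of {TASKS}")
    # sys.executable reuses the interpreter of the active conda environment.
    return [sys.executable, script, "--task", task]


def run_task(task):
    """Run one task and raise CalledProcessError if inference exits with an error."""
    return subprocess.run(build_command(task), check=True)
```

For instance, `for task in TASKS[:3]: run_task(task)` would run the three standard tasks one after another from the AP-adapter directory.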

Unfortunately, accompaniment generation does not yet perform well enough with the previous training; we are still working on it.

Train from scratch

Training from scratch is also recommended if you have powerful computational resources: the checkpoint we provide was trained for only 35,000 steps with an effective batch size of 32, and we used only 200k audio-text pairs from AudioSet due to memory capacity. To use our training code, download the dataset with https://github.com/dlrudco/Fast-Audioset-Download and place "Fast-Audioset-Download" next to "AP-adapter", as shown below.

├── AP-adapter/
└── Fast-Audioset-Download/
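Before starting a long training run, it can help to verify that the two directories really sit side by side. The following is a small sketch using only the standard library; `missing_dirs` is a hypothetical helper, not part of the training code.

```python
import tempfile
from pathlib import Path

# Directory names taken from the layout shown in this README.
EXPECTED_DIRS = ("AP-adapter", "Fast-Audioset-Download")


def missing_dirs(parent):
    """Return the expected sibling directories that do not exist under `parent`."""
    parent = Path(parent)
    return [name for name in EXPECTED_DIRS if not (parent / name).is_dir()]


# Demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "AP-adapter").mkdir()
    print(missing_dirs(tmp))  # ['Fast-Audioset-Download']
    (Path(tmp) / "Fast-Audioset-Download").mkdir()
    print(missing_dirs(tmp))  # []
```

From inside the AP-adapter directory, `missing_dirs("..")` should return an empty list once the dataset repository is in place.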

After setting up the dataset structure, you can train the adapter from scratch:

## Change DATA_DIR and OUTPUT_DIR in train.sh, then run it

Or you can start from the previously downloaded AP-adapter checkpoint.

## Change DATA_DIR and OUTPUT_DIR in finetune.sh, then run it

Note that the settings in train.sh and finetune.sh are intended for GPUs with more than 24GB of VRAM. If you have less VRAM, you can uncomment these two arguments:

--use_8bit_adam \
--mixed_precision "bf16" \
## or you can also use "fp16"

Or simply use a smaller batch size:

## choose a moderate size
--train_batch_size=1
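Lowering the per-device batch size also lowers the effective batch size unless gradient accumulation compensates for it. The sketch below works out how many accumulation steps would keep the effective batch size at 32, the value used for the provided checkpoint. It assumes the usual relation effective = per-device batch × number of devices × accumulation steps; whether train.sh exposes a gradient accumulation argument is an assumption to check in the script itself, and `accumulation_steps` is a hypothetical helper.

```python
def accumulation_steps(effective_batch_size, train_batch_size, num_devices=1):
    """Accumulation steps needed so that
    train_batch_size * num_devices * steps == effective_batch_size."""
    per_step = train_batch_size * num_devices
    if per_step <= 0:
        raise ValueError("train_batch_size and num_devices must be positive")
    if effective_batch_size % per_step != 0:
        raise ValueError(
            f"{effective_batch_size} is not a multiple of {per_step} samples per step"
        )
    return effective_batch_size // per_step


# One GPU with batch size 1 needs 32 accumulation steps to match the checkpoint.
print(accumulation_steps(32, train_batch_size=1))
# Two GPUs with batch size 4 each need 4 accumulation steps.
print(accumulation_steps(32, train_batch_size=4, num_devices=2))
```

Accumulating gradients trades training speed for memory, so it combines naturally with the 8-bit Adam and mixed-precision options above.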


This project is dual-licensed under the Apache-2.0 License and the CC-BY-NC-SA-4.0 License.

CC-BY-NC-SA-4.0 License

The AudioLDM2 checkpoint and the weights in the /copied_cross_attention folder are under the CC-BY-NC-SA-4.0 License.

Apache-2.0 License

The code in this repository is licensed under the Apache License 2.0.

Cite this work

      title={Audio Prompt Adapter: Unleashing Music Editing Abilities for Text-to-Music with Lightweight Finetuning}, 
      author={Fang-Duo Tsai and Shih-Lun Wu and Haven Kim and Bo-Yu Chen and Hao-Chung Cheng and Yi-Hsuan Yang},


This code is heavily based on AudioLDM2, Diffusers, and IP-adapter.