
<img src="compose_gar_key_visual.png" alt="key visual" width="600"/>

Composing General Audio Representation by Fusing Multilayer Features of a Pre-trained Model


This repository offers an implementation of our paper "Composing General Audio Representation by Fusing Multilayer Features of a Pre-trained Model."

The VGGish-Fusion and CNN14-Fusion classes are available not only for reproducing our results but also for use in your application studies. The command lines for reproducing all the results in the paper are also provided.

If you find our study useful, please consider citing our paper. The following are BibTeX entries for citation.

@inproceedings{niizumi2022composing,
    title       = {Composing General Audio Representation by Fusing Multilayer Features of a Pre-trained Model},
    author      = {Daisuke Niizumi and Daiki Takeuchi and Yasunori Ohishi and Noboru Harada and Kunio Kashino},
    booktitle   = {2022 30th European Signal Processing Conference (EUSIPCO)}, 
    year        = {2022},
    pages       = {200--204},
    url         = {https://arxiv.org/abs/2205.08138}
}
@techreport{niizumi2022composing-ja,
    author      = {Daisuke Niizumi and Daiki Takeuchi and Yasunori Ohishi and Noboru Harada and Kunio Kashino},
    title       = {General-Purpose Audio Representation by Fusing Multilayer Features of a Pre-trained Model (in Japanese)},
    institution = {IEICE Technical Report, EA2022-9},
    volume      = {122},
    number      = {20},
    pages       = {41--45},
    month       = {May},
    year        = {2022},
    url         = {https://www.ieice.org/ken/paper/20220513BCK3/}
}

The following table shows that fusing multilayer features (*-Fusion) significantly improves the original pre-trained models (VGGish, Cnn14, and AST) without any additional training.

[Table 3: evaluation results of the original models vs. their *-Fusion versions]

1. Quick Example

The following example shows how to use the improved VGGish (VGGish-Fusion) in your application.

import torch
from gp_vggish import GeneralPurposeVGGish

model = GeneralPurposeVGGish()

# Load pre-trained weights here.
# model.load_state_dict(torch.load(your_weight_file))

# Suppose x is a batch (size=32) of log mel-spectrograms of 1-second audio clips.
# Here we use a random sample.
x = torch.rand(32, 1, 96, 64)

# Sample-level embeddings of shape [32, 8192]; the embedding dimension is 8192.
sample_level_embeddings = model(x)
# Frame-level embeddings of shape [32, 8192, 6]; the number of frames is 6.
frame_level_embeddings = model.encode_frames(x)
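
The sample-level embeddings can feed a downstream classifier. The probe below is only a hypothetical illustration; the results in the paper come from EVAR's linear-evaluation pipeline, not from this code.

import torch
import torch.nn as nn

# Hypothetical linear probe on the 8192-d fused embeddings.
embeddings = torch.rand(32, 8192)  # stand-in for sample_level_embeddings above
probe = nn.Linear(8192, 10)        # 10 = hypothetical number of classes
logits = probe(embeddings)         # [32, 10]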

2. Setup

This repository relies on external code, especially our evaluation package nttcslab/eval-audio-repr (EVAR).

2-1. Setup repository contents

Run the steps in Setup-commands.txt. It downloads the external files and applies the required modifications under a new folder, evar.

2-2. Setup data

You need local copies of the datasets for evaluating the models.

See the EVAR documents, evar/README.md and evar/Preparing-datasets.md, for more information.

3. Fusion Model Implementation

The following subsections describe the implementation of the three fusion models reported in the paper.
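
As a rough mental model, fusion runs the pre-trained backbone once, collects the outputs of the selected layers, pools each of them, and concatenates the results into a single embedding. The sketch below is a conceptual illustration only, not the repository's code; plain global average pooling is used here for brevity.

import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    """Conceptual sketch: pool and concatenate selected layer outputs."""
    def __init__(self, body: nn.Sequential, layers):
        super().__init__()
        self.body = body
        self.layers = set(layers)

    def forward(self, x):
        feats = []
        for i, module in enumerate(self.body):
            x = module(x)
            if i in self.layers:
                feats.append(x.mean(dim=(-2, -1)))  # global average pool each map
        return torch.cat(feats, dim=-1)             # fused embedding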

3-1. VGGish-Fusion

VGGish-Fusion is implemented as the class GeneralPurposeVGGish in gp_vggish.py. It has an extra parameter, layers, to specify which layers to fuse. layers takes a list of layer indices. In our paper, we used the ReLU layer outputs at [1, 4, 7, 9, 12, 14, 17, 19, 21].
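
For example, a minimal sketch that fuses the layers used in the paper (layers is the parameter described above):

from gp_vggish import GeneralPurposeVGGish

# Fuse the ReLU outputs at the indices used in the paper.
model = GeneralPurposeVGGish(layers=[1, 4, 7, 9, 12, 14, 17, 19, 21])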

The pre-trained weights are loaded in the class AR_VGGish (evar/evar/ar_vggish.py), via the fusion wrapper class AR_VGGish_Fusion (to_evar/evar/ar_vggish_ext.py).

For converting raw audio to the log-mel spectrogram that VGGish expects, see the implementation of the to_audio_features member function of the AR_VGGish class.
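
If you want a quick stand-alone preprocessing sketch, the snippet below builds a VGGish-style [96, 64] log-mel patch with torchaudio. The parameters (16 kHz input, 400-sample window, 160-sample hop, 64 mel bins) follow the common VGGish recipe and are assumptions here; to_audio_features in AR_VGGish is the reference implementation.

import torch
import torchaudio

wav = torch.rand(1, 16000)  # 1 second of 16 kHz audio (random placeholder)
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=64)(wav)
logmel = torch.log(mel + 1e-6)          # [1, 64, 101] in log scale
x = logmel[..., :96].transpose(1, 2)    # crop to 96 frames -> [1, 96, 64]
x = x.unsqueeze(1)                      # add channel dim -> [1, 1, 96, 64]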

3-2. CNN14-Fusion

CNN14-Fusion is implemented as the class GeneralPurposeCnn14 in gp_cnn14.py. It has an extra parameter, layers, to specify which network blocks to fuse. layers takes a list of block indices; available indices are 1 to 6. In our paper, we used [3, 6].
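
For example, a minimal sketch that fuses the blocks used in the paper (layers is the parameter described above):

from gp_cnn14 import GeneralPurposeCnn14

# Fuse the outputs of blocks 3 and 6 as in the paper.
model = GeneralPurposeCnn14(layers=[3, 6])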

The pre-trained weights (https://zenodo.org/record/3987831/files/Cnn14_16k_mAP%3D0.438.pth) are downloaded and loaded in the class AR_Cnn14 (evar/evar/ar_cnn14.py), via the fusion wrapper class AR_Cnn14_Fusion (to_evar/evar/ar_cnn14_ext.py).

For converting raw audio to the log-mel spectrogram that Cnn14 expects, see the implementation of the feature_extractor member of the AR_Cnn14 class.

3-3. AST-Fusion

AST-Fusion is implemented in the class AR_AST_Fusion in ar_ast_ext.py. You can specify the layers to fuse via ast_layers in the configuration file ast_fusion.yaml. In our paper, we used [4, 11].
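
A hypothetical excerpt of config/ast_fusion.yaml setting the layers used in the paper (the surrounding keys of the actual file are omitted here):

# fuse the outputs of transformer layers 4 and 11, as in the paper
ast_layers: [4, 11]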

The pre-trained weights are downloaded and loaded in the same fashion as in the original AST implementation. We use the wrapper class AR_AST_Fusion (to_evar/evar/ar_ast_ext.py).

4. Reproducing the Paper's Results

To reproduce the results in the paper, use the command lines in CommandLines.md.

You will need to set up all the models and datasets in your copy of EVAR. See the EVAR documents, evar/README.md, evar/Preparing-models.md, and evar/Preparing-datasets.md, for more information.

Once everything is prepared, the following example, which tests CNN14-Fusion on CREMA-D, should result in an accuracy of about 58-59%.

cd evar
python 2pass_lineareval.py config/cnn14_fusion.yaml cremad +name=AR_Cnn14_Fusion36

License

See LICENSE for details.

Acknowledgements / References