<div align="center">

Audio-Mamba (AuM)

Bidirectional State Space Model for Audio Representation Learning

ArXiv Preprint: https://arxiv.org/abs/2406.03344

</div>

Overview

This repository contains the implementation of Audio-Mamba (AuM), a generic, self-attention-free model for audio classification built purely on state space models. It provides the code needed to train and evaluate the model on several audio classification benchmarks. AuM builds on AST and ViM, and uses Hugging Face's Accelerate library for efficient multi-GPU training.

<div align="center"> <img src="AuM.png" alt="Pipeline" style="width: 50%;"/> </div>

Setting Up the Repository

Please run the following commands to set up the repository:

Create a Conda Environment

conda create -n aum python=3.10.13
conda activate aum

Setting Up CUDA and cuDNN

conda install nvidia/label/cuda-11.8.0::cuda-nvcc
conda install nvidia/label/cuda-11.8.0::cuda

Then install cuDNN, first trying the anaconda channel:
conda install anaconda::cudnn
If that fails, fall back to conda-forge:
conda install -c conda-forge cudnn

Installing PyTorch and Other Dependencies

pip install torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt

Installing Mamba Related Packages

pip install causal_conv1d==1.1.3.post1 mamba_ssm==1.1.3.post1
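
Before moving on, it can help to confirm that the pinned versions actually ended up in the environment. The following is a minimal sketch and not part of the repository: the helper name `find_version_mismatches` and the `EXPECTED_VERSIONS` table are made up for illustration, with the pins copied from the install commands above. It relies only on the standard-library `importlib.metadata` module.

```python
from importlib import metadata

# Pins copied from the install commands above (illustrative table, not a repo file).
EXPECTED_VERSIONS = {
    "torch": "2.1.1",
    "torchvision": "0.16.1",
    "torchaudio": "2.1.1",
    "causal_conv1d": "1.1.3.post1",
    "mamba_ssm": "1.1.3.post1",
}


def find_version_mismatches(expected):
    """Return {package: (expected, installed)} for every pin that is not met.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            mismatches[name] = (wanted, None)
            continue
        # Wheels from the cu118 index report versions such as "2.1.1+cu118";
        # drop the local "+..." suffix before comparing.
        if installed.split("+", 1)[0] != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in find_version_mismatches(EXPECTED_VERSIONS).items():
        print(f"{name}: expected {wanted}, found {installed or 'nothing installed'}")
```

If the script prints nothing, every pin matched. This only checks package metadata; it does not verify that the CUDA extensions load.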

Enabling Bidirectional SSM Processing

To enable bidirectional processing, copy the modified mamba_ssm folder, taken directly from the ViM repository, into the site-packages directory of the Conda environment's Python installation. This overwrites the files of the mamba_ssm package installed in the previous step.

cp -rf vim-mamba_ssm/mamba_ssm $CONDA_PREFIX/lib/python3.10/site-packages
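
The command above hard-codes `python3.10` in the destination path. If the environment layout differs, the destination can be resolved from the interpreter itself. This is a hedged sketch of the same copy step in Python; the function names `active_site_packages` and `overlay_package` are hypothetical, and it assumes it is run from the repository root with the `aum` environment active.

```python
import shutil
import sysconfig
from pathlib import Path


def active_site_packages():
    """Return the site-packages directory of the interpreter running this code."""
    return Path(sysconfig.get_paths()["purelib"])


def overlay_package(source, site_packages):
    """Copy the package folder `source` into `site_packages`.

    Existing files are overwritten, mirroring `cp -rf source site_packages`.
    Returns the path of the copied package.
    """
    source = Path(source)
    target = Path(site_packages) / source.name
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target


# Example call (commented out so that importing this file has no side effects):
# overlay_package("vim-mamba_ssm/mamba_ssm", active_site_packages())
```

`dirs_exist_ok=True` requires Python 3.8 or later, which the environment created above satisfies.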

Inference

Example Inference

An example notebook for inference is provided in the examples/inference directory. The notebook demonstrates a minimal example of how to load a trained model and perform inference on a sample audio file.

Evaluation Scripts

Each dataset folder within the exps/ directory includes an example evaluation script for AuM (aum_eval.sh).
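
To see which datasets ship an evaluation script, the exps/ directory can be scanned for aum_eval.sh. The snippet below is a small illustrative sketch; `find_eval_scripts` is a hypothetical helper, and it assumes the dataset folders sit directly under exps/ as described above.

```python
from pathlib import Path


def find_eval_scripts(exps_dir="exps"):
    """Map each dataset folder name under `exps_dir` to its aum_eval.sh path."""
    return {
        script.parent.name: script
        for script in sorted(Path(exps_dir).glob("*/aum_eval.sh"))
    }


if __name__ == "__main__":
    # Run from the repository root.
    for dataset, script in find_eval_scripts().items():
        print(f"{dataset}: bash {script}")
```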

Training

Overview

Each dataset's training scripts and relevant files are located within their respective folders under the exps/ directory. These folders include:

Executing Training Scripts

To execute the training scripts:

  1. Navigate to the dataset's directory (e.g., exps/vggsound/).
  2. Run the corresponding script (e.g., bash aum-base_scratch-vggsound.sh).

Note: The scripts are ready to run, but you must first edit the paths in them (such as experiment directories) to match your setup.

Multiple GPU Training

For training on multiple GPUs:

  1. Set GPU IDs: List the GPU IDs in the CUDA_VISIBLE_DEVICES environment variable (e.g., CUDA_VISIBLE_DEVICES=0,1,2,...).
  2. Adjust Batch Size: Set the batch_size argument in the script to the desired batch size per GPU.

Note: To keep the same effective batch size as in single-GPU training, divide the single-GPU batch size by the number of GPUs and use the result as the per-GPU batch_size.
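
The batch size arithmetic from the note can be sketched as follows. Both helpers (`visible_gpu_count` and `per_gpu_batch_size`) are hypothetical names introduced for illustration and are not part of the repository; the example value 48 is likewise arbitrary.

```python
import os


def visible_gpu_count(env=None):
    """Count the GPU IDs listed in CUDA_VISIBLE_DEVICES.

    Returns 0 when the variable is unset or empty. In that case CUDA exposes
    every GPU, so the count has to be determined some other way.
    """
    env = os.environ if env is None else env
    ids = [i for i in env.get("CUDA_VISIBLE_DEVICES", "").split(",") if i.strip()]
    return len(ids)


def per_gpu_batch_size(effective_batch_size, num_gpus):
    """Split an effective (single-GPU) batch size evenly across `num_gpus`."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if effective_batch_size % num_gpus:
        raise ValueError(
            f"{effective_batch_size} is not divisible by {num_gpus}; "
            "pick a batch size that splits evenly"
        )
    return effective_batch_size // num_gpus


if __name__ == "__main__":
    gpus = visible_gpu_count({"CUDA_VISIBLE_DEVICES": "0,1,2,3"})
    print(per_gpu_batch_size(48, gpus))  # 48 split over 4 GPUs gives 12 per GPU
```

The sketch rejects batch sizes that do not divide evenly, since rounding would silently change the effective batch size.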

EPIC-SOUNDS Dataset

The EPIC-SOUNDS dataset has a distinct training structure:

For full details on this dataset, see the EPIC-SOUNDS repository.

Model Checkpoints

The model checkpoints are available for the following experiments:

Base Scratch

These are the checkpoints for the base models with the variant Fo-Bi (b), trained from scratch.

| Dataset | #Params | Performance | Checkpoint |
|---|---|---|---|
| Audioset (mAP) | 92.1M | 32.74 | Link |
| AS-20K (mAP) | 92.1M | 14.05 | Link |
| VGGSound (Acc) | 91.9M | 42.97 | Link |
| VoxCeleb (Acc) | 92.7M | 33.12 | Link |
| Speech Commands V2 (Acc) | 91.4M | 94.44 | Link |
| Epic Sounds (Acc) | 91.7M | 44.92 | Link |

Small ImageNet

These are the checkpoints for the small models with the variant Bi-Bi (c), initialized with ImageNet pretrained weights.

| Dataset | #Params | Performance | Checkpoint |
|---|---|---|---|
| Audioset (mAP) | 25.5M | 39.74 | Link |
| AS-20K (mAP) | 25.5M | 29.17 | Link |
| VGGSound (Acc) | 25.5M | 49.61 | Link |
| VoxCeleb (Acc) | 25.8M | 41.78 | Link |
| Speech Commands V2 (Acc) | 25.2M | 97.61 | Link |
| Epic Sounds (Acc) | 25.4M | 53.45 | Link |

Base AudioSet

These are the checkpoints for the base models with the variant Fo-Bi (b), initialized with AudioSet pretrained weights.

| Dataset | #Params | Performance | Checkpoint |
|---|---|---|---|
| VGGSound (Acc) | 91.9M | 46.78 | Link |
| VoxCeleb (Acc) | 92.7M | 41.82 | Link |
| Speech Commands V2 (Acc) | 91.4M | 94.82 | Link |
| Epic Sounds (Acc) | 91.7M | 48.31 | Link |

Citation

If you find this work useful, please consider citing us:

@article{erol2024audio,
  title={Audio Mamba: Bidirectional State Space Model for Audio Representation Learning},
  author={Erol, Mehmet Hamza and Senocak, Arda and Feng, Jiu and Chung, Joon Son},
  journal={arXiv preprint arXiv:2406.03344},
  year={2024}
}