<div align="center"> <h1> SLAM-LLM </h1> <p> <b>SLAM-LLM</b> is a deep learning toolkit that allows researchers and developers to train custom multimodal large language models (MLLMs), focusing on <b>S</b>peech, <b>L</b>anguage, <b>A</b>udio, and <b>M</b>usic processing. We provide detailed recipes for training and high-performance checkpoints for inference. <br> </p> <p> <img src="docs/logo.jpg" alt="SLAM-LLM Logo" style="width: 200px; height: 200px;"> </p> <p> </p> <a href="https://github.com/ddlBoJack/SLAM-LLM"><img src="https://img.shields.io/badge/Platform-linux-lightgrey" alt="version"></a> <a href="https://github.com/ddlBoJack/SLAM-LLM"><img src="https://img.shields.io/badge/Cuda-11.8+-orange" alt="version"></a> <a href="https://github.com/ddlBoJack/SLAM-LLM"><img src="https://img.shields.io/badge/PyTorch-2.01+-brightgreen" alt="python"></a> <a href="https://github.com/ddlBoJack/SLAM-LLM"><img src="https://img.shields.io/badge/License-MIT-red.svg" alt="mit"></a> </div>

Table of Contents

  1. News
  2. Installation
  3. Usage
  4. Features
  5. Acknowledge
  6. Citation

News

Installation

git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout tags/v4.35.2
pip install -e .
cd ..
git clone https://github.com/huggingface/peft.git
cd peft
git checkout tags/v0.6.0
pip install -e .
cd ..
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
git clone https://github.com/ddlBoJack/SLAM-LLM.git
cd SLAM-LLM
pip install -e .
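
After running the commands above, it can be useful to confirm that the environment actually contains the pinned versions. The following is a minimal sketch using only the Python standard library; the helper names `installed_version` and `check_pins` are illustrative and not part of SLAM-LLM, and the sketch assumes that editable installs from the tagged checkouts report the same version strings as the tags.

```python
from importlib import metadata

# Version pins copied from the installation commands above.
EXPECTED = {
    "transformers": "4.35.2",
    "peft": "0.6.0",
    "torch": "2.0.1",
    "torchvision": "0.15.2",
    "torchaudio": "2.0.2",
}


def installed_version(package):
    """Return the installed version without a local suffix such as '+cu118', or None."""
    try:
        return metadata.version(package).split("+")[0]
    except metadata.PackageNotFoundError:
        return None


def check_pins(expected=EXPECTED, lookup=installed_version):
    """Map each package whose version differs from its pin to (found, wanted)."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (found, wanted)
    return mismatches


if __name__ == "__main__":
    # Prints one line per mismatch; prints nothing when every pin matches.
    for package, (found, wanted) in check_pins().items():
        print(f"{package}: found {found}, expected {wanted}")
```

A package that is not installed is reported with `None` as the found version.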

Some examples also require fairseq, which can be installed as follows:

# Install fairseq before installing SLAM-LLM
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./

We also provide a Docker image for convenience:

# build docker image
docker build -t slam-llm:latest .

# run docker image with gpu
docker run -it --gpus all --name slam --shm-size=256g slam-llm:latest /bin/bash

Usage

List of Recipes

We provide reference implementations of various LLM-based speech, audio, and music tasks:

Configuration Priority

Configuration values are resolved hierarchically. When the same option is set in more than one place, the source with higher priority wins, in the following order:

command-line (shell file) > Hydra configuration (yaml file) > dataclass configuration (Python file)
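
The priority order above can be illustrated with a small standard-library sketch. This is not how SLAM-LLM or Hydra implements the merge; `TrainConfig`, its fields, and `resolve_config` are hypothetical names used only to show dataclass defaults being overridden by YAML values, which are in turn overridden by command-line arguments.

```python
from dataclasses import dataclass, asdict


@dataclass
class TrainConfig:
    # Hypothetical options; these defaults have the lowest priority.
    model_name: str = "base-model"
    lr: float = 1e-4
    batch_size: int = 4


def resolve_config(yaml_values, cli_overrides):
    """Merge the three sources so that command line > YAML > dataclass defaults.

    yaml_values:   dict standing in for values loaded from a YAML file.
    cli_overrides: list of "key=value" strings, as passed in a shell script.
    """
    config = asdict(TrainConfig())   # 1. start from the dataclass defaults
    config.update(yaml_values)       # 2. YAML values override the defaults
    for item in cli_overrides:       # 3. command-line values override both
        key, _, raw = item.partition("=")
        if key not in config:
            raise KeyError(f"unknown option: {key}")
        # Cast the string to the type of the current value (str, int, or float here).
        config[key] = type(config[key])(raw)
    return config


if __name__ == "__main__":
    # batch_size is set in all three places; the command-line value is kept.
    print(resolve_config({"lr": 5e-5, "batch_size": 8}, ["batch_size=16"]))
```

In this sketch `model_name` keeps its dataclass default, `lr` takes the YAML value, and `batch_size` takes the command-line value.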

Features

Acknowledge

Citation

Speech Task

SLAM-ASR:

@article{ma2024embarrassingly,
  title={An Embarrassingly Simple Approach for LLM with Strong ASR Capacity},
  author={Ma, Ziyang and Yang, Guanrou and Yang, Yifan and Gao, Zhifu and Wang, Jiaming and Du, Zhihao and Yu, Fan and Chen, Qian and Zheng, Siqi and Zhang, Shiliang and others},
  journal={arXiv preprint arXiv:2402.08846},
  year={2024}
}

MaLa-ASR:

@article{yang2024mala,
  title={MaLa-ASR: Multimedia-Assisted LLM-Based ASR},
  author={Yang, Guanrou and Ma, Ziyang and Yu, Fan and Gao, Zhifu and Zhang, Shiliang and Chen, Xie},
  journal={Proc. INTERSPEECH},
  year={2024}
}

LLM-Based Contextual ASR:

@article{yang2024ctc,
  title={CTC-Assisted LLM-Based Contextual ASR},
  author={Yang, Guanrou and Ma, Ziyang and Gao, Zhifu and Zhang, Shiliang and Chen, Xie},
  journal={Proc. SLT},
  year={2024}
}

CoT-ST:

@article{du2024cot,
  title={CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought},
  author={Du, Yexing and Ma, Ziyang and Yang, Yifan and Deng, Keqi and Chen, Xie and Yang, Bo and Xiang, Yang and Liu, Ming and Qin, Bing},
  journal={arXiv preprint arXiv:2409.19510},
  year={2024}
}

Audio Task

SLAM-AAC:

@article{chen2024slam,
  title={SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs},
  author={Chen, Wenxi and Ma, Ziyang and Li, Xiquan and Xu, Xuenan and Liang, Yuzhe and Zheng, Zhisheng and Yu, Kai and Chen, Xie},
  journal={arXiv preprint arXiv:2410.09503},
  year={2024}
}

DRCap:

@article{li2024drcap,
  title={DRCap: Decoding CLAP Latents with Retrieval-augmented Generation for Zero-shot Audio Captioning},
  author={Li, Xiquan and Chen, Wenxi and Ma, Ziyang and Xu, Xuenan and Liang, Yuzhe and Zheng, Zhisheng and Kong, Qiuqiang and Chen, Xie},
  journal={arXiv preprint arXiv:2410.09472},
  year={2024}
}

BAT:

@article{zheng2024bat,
  title={BAT: Learning to Reason about Spatial Sounds with Large Language Models},
  author={Zheng, Zhisheng and Peng, Puyuan and Ma, Ziyang and Chen, Xie and Choi, Eunsol and Harwath, David},
  journal={Proc. ICML},
  year={2024}
}