<div align="center">
    <h1>SLAM-LLM</h1>
    <p>
    <b>SLAM-LLM</b> is a deep learning toolkit that allows researchers and developers to train custom multimodal large language models (MLLMs), focusing on <b>S</b>peech, <b>L</b>anguage, <b>A</b>udio, and <b>M</b>usic processing. We provide detailed recipes for training and high-performance checkpoints for inference. <br>
    </p>
    <p>
    <img src="docs/logo.jpg" alt="SLAM-LLM Logo" style="width: 200px; height: 200px;">
    </p>
    <a href="https://github.com/ddlBoJack/SLAM-LLM"><img src="https://img.shields.io/badge/Platform-linux-lightgrey" alt="platform"></a>
    <a href="https://github.com/ddlBoJack/SLAM-LLM"><img src="https://img.shields.io/badge/Cuda-11.8+-orange" alt="cuda"></a>
    <a href="https://github.com/ddlBoJack/SLAM-LLM"><img src="https://img.shields.io/badge/PyTorch-2.01+-brightgreen" alt="pytorch"></a>
    <a href="https://github.com/ddlBoJack/SLAM-LLM"><img src="https://img.shields.io/badge/License-MIT-red.svg" alt="mit"></a>
</div>
# News
- [Update Oct. 12, 2024] Recipes for SLAM-AAC have been supported.
- [Update Sep. 28, 2024] Recipes for CoT-ST have been supported.
- [Update Sep. 25, 2024] Recipes for DRCap have been supported.
- [Update Jun. 12, 2024] Recipes for MaLa-ASR have been supported.
- [CALL FOR EXAMPLES] We sincerely invite developers and researchers to develop new applications and conduct academic research based on SLAM-LLM, and to submit pull requests with your examples! We also welcome engineering PRs (such as improving and speeding up multi-node training).
- [Update May. 22, 2024] Please join the Slack or WeChat group. We will sync our updates and answer questions there.
- [Update May. 21, 2024] Recipes for Spatial Audio Understanding have been supported.
- [Update May. 20, 2024] Recipes for music caption (MC) have been supported.
- [Update May. 8, 2024] Recipes for visual speech recognition (VSR) have been supported.
- [Update May. 4, 2024] Recipes for zero-shot text-to-speech (TTS) have been supported.
- [Update Apr. 28, 2024] Recipes for automated audio captioning (AAC) have been supported.
- [Update Mar. 31, 2024] Recipes for automatic speech recognition (ASR) have been supported.
# Installation

```bash
git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout tags/v4.35.2
pip install -e .
cd ..
git clone https://github.com/huggingface/peft.git
cd peft
git checkout tags/v0.6.0
pip install -e .
cd ..
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
git clone https://github.com/ddlBoJack/SLAM-LLM.git
cd SLAM-LLM
pip install -e .
```
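After installation, you can run a quick sanity check. This is a minimal sketch; the package name `slam_llm` is an assumption about how the editable install registers itself, so adjust it if your environment differs:

```bash
# Check that the CUDA-enabled PyTorch build is active
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# Check that SLAM-LLM is importable (package name slam_llm is an assumption)
python -c "import slam_llm"
```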
For some examples, you may need to use fairseq; install it as follows:
```bash
# you need to install fairseq before SLAM-LLM
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
```
We also provide a docker image for convenience:
```bash
# build docker image
docker build -t slam-llm:latest .

# run docker image with gpu
docker run -it --gpus all --name slam --shm-size=256g slam-llm:latest /bin/bash
```
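If your datasets and checkpoints live on the host, you will typically also want to mount them into the container. The host paths below are placeholders for illustration:

```bash
# Mount host data and checkpoint directories into the container (paths are placeholders)
docker run -it --gpus all --name slam --shm-size=256g \
    -v /path/to/data:/workspace/data \
    -v /path/to/checkpoints:/workspace/checkpoints \
    slam-llm:latest /bin/bash
```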
# Usage
## List of Recipes
We provide reference implementations of various LLM-based speech, audio, and music tasks (a usage sketch follows the list):
- Speech Task
    - Automatic Speech Recognition (ASR)
    - Contextual Automatic Speech Recognition (CASR)
    - Visual Speech Recognition (VSR)
    - Speech-to-Text Translation (S2TT)
    - Text-to-Speech (TTS)
- Audio Task
    - Automated Audio Captioning (AAC)
    - Spatial Audio Understanding
- Music Task
    - Music Caption (MC)
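Each recipe lives under its own directory with training and inference scripts plus a README. The directory and script names below are hypothetical placeholders for illustration; check the README of the recipe you want to run for the exact paths:

```bash
# Hypothetical recipe layout; real directory and script names differ per example
cd examples/asr_librispeech       # hypothetical recipe directory
bash scripts/finetune_asr.sh      # hypothetical training script
bash scripts/decode_asr.sh        # hypothetical inference script
```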
## Configuration Priority
We provide hierarchical configuration, with the following priority (highest first):
command-line (shell file) > Hydra configuration (yaml file) > dataclass configuration (Python file)
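For example, with Hydra a value set on the command line wins over the same key in the yaml file, which in turn wins over the dataclass default. A minimal sketch; the entry script and field names (`train_config.batch_size`, `model_config.llm_name`) are hypothetical, as each recipe defines its own config groups:

```bash
# Hypothetical script and field names, for illustration only
python finetune.py \
    --config-path conf \
    --config-name config.yaml \
    ++train_config.batch_size=4 \
    ++model_config.llm_name=vicuna-7b
# The two ++ overrides above take precedence over config.yaml,
# which in turn overrides the dataclass defaults defined in Python.
```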
# Features
- Easily extend to new models and tasks.
- Detailed recipes for training and high-performance checkpoints for inference.
- Mixed-precision training, which trains faster with less GPU memory on NVIDIA Tensor Cores.
- Multi-GPU training with data and model parallelism, supporting DDP, FSDP, and DeepSpeed (still being improved); see the launch sketch after this list.
- Flexible configuration based on Hydra and dataclasses, allowing a combination of code, command-line, and file-based configuration.
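As a rough sketch of a single-node, multi-GPU launch with `torchrun` (the entry script and flag names are assumptions; the recipes' shell scripts show the exact commands):

```bash
# Hypothetical entry point and flag names; real recipes wrap this in their own scripts
torchrun --nnodes 1 --nproc_per_node 4 \
    finetune.py \
    ++train_config.enable_fsdp=true \
    ++train_config.use_fp16=true
```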
# Acknowledgement
- We borrow code from Llama-Recipes for the training process.
- We borrow code from Fairseq for the DeepSpeed configuration.
- We thank the contributors for providing diverse recipes.
# Citation
SLAM-ASR:
```bibtex
@article{ma2024embarrassingly,
  title={An Embarrassingly Simple Approach for LLM with Strong ASR Capacity},
  author={Ma, Ziyang and Yang, Guanrou and Yang, Yifan and Gao, Zhifu and Wang, Jiaming and Du, Zhihao and Yu, Fan and Chen, Qian and Zheng, Siqi and Zhang, Shiliang and others},
  journal={arXiv preprint arXiv:2402.08846},
  year={2024}
}
```
SLAM-AAC:
```bibtex
@article{chen2024slam,
  title={SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs},
  author={Chen, Wenxi and Ma, Ziyang and Li, Xiquan and Xu, Xuenan and Liang, Yuzhe and Zheng, Zhisheng and Yu, Kai and Chen, Xie},
  journal={arXiv preprint arXiv:2410.09503},
  year={2024}
}
```