<p align="center"> <a href="https://trendshift.io/repositories/3839" target="_blank"><img src="https://trendshift.io/api/badge/repositories/3839" alt="alibaba-damo-academy%2FFunASR | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a> </p>

<strong>FunASR</strong> aims to bridge academic research and industrial applications in speech recognition. By supporting training and fine-tuning of industrial-grade speech recognition models, it lets researchers and developers study and deploy such models more conveniently, and promotes the growth of the speech recognition ecosystem. ASR for Fun!

Highlights | News | Installation | Quick Start | Tutorial | Runtime | Model Zoo | Contact

<a name="highlights"></a>

Highlights

<a name="whats-new"></a>

What's new:

<details><summary>Full Changelog</summary> </details>

<a name="Installation"></a>

Installation

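Requirements:

python>=3.8
torch>=1.13
torchaudio

Install from PyPI:

pip3 install -U funasr

Or install from source:

git clone https://github.com/alibaba/FunASR.git && cd FunASR
pip3 install -e ./

Optionally, install modelscope or huggingface_hub to download the pretrained models:

pip3 install -U modelscope huggingface_hub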
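
One way to confirm that an environment meets the minimum versions above is to compare version strings. The following is a minimal sketch rather than an official check: `meets_minimum` is a hypothetical helper (not part of FunASR), and it only looks at the leading numeric components of a version.

```python
import re
import sys
from importlib.metadata import PackageNotFoundError, version


def meets_minimum(installed: str, minimum: tuple) -> bool:
    """Return True if the leading numeric parts of `installed` are >= `minimum`.

    Local build tags such as "+cu117" are ignored.
    """
    parts = re.findall(r"\d+", installed.split("+")[0])
    numeric = tuple(int(p) for p in parts[: len(minimum)])
    # Pad short versions such as "2" to (2, 0) before comparing.
    numeric += (0,) * (len(minimum) - len(numeric))
    return numeric >= minimum


print("python ok:", sys.version_info[:2] >= (3, 8))
try:
    # Read the installed version from package metadata without importing torch.
    print("torch ok:", meets_minimum(version("torch"), (1, 13)))
except PackageNotFoundError:
    print("torch is not installed")
```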

Model Zoo

FunASR has open-sourced a large number of models pre-trained on industrial data. You are free to use, copy, modify, and share FunASR models under the Model License Agreement. Some representative models are listed below; for more, see the Model Zoo.

(Note: ⭐ represents the ModelScope model zoo, 🤗 represents the Huggingface model zoo, 🍀 represents the OpenAI model zoo)

| Model Name | Task Details | Training Data | Parameters |
|---|---|---|---|
| SenseVoiceSmall <br> ( 🤗 ) | multiple speech understanding capabilities, including ASR, ITN, LID, SER, and AED; supports languages such as zh, yue, en, ja, ko | 300000 hours | 234M |
| paraformer-zh <br> ( 🤗 ) | speech recognition, with timestamps, non-streaming | 60000 hours, Mandarin | 220M |
| <nobr>paraformer-zh-streaming <br> ( 🤗 )</nobr> | speech recognition, streaming | 60000 hours, Mandarin | 220M |
| paraformer-en <br> ( 🤗 ) | speech recognition, without timestamps, non-streaming | 50000 hours, English | 220M |
| conformer-en <br> ( 🤗 ) | speech recognition, non-streaming | 50000 hours, English | 220M |
| ct-punc <br> ( 🤗 ) | punctuation restoration | 100M, Mandarin and English | 290M |
| fsmn-vad <br> ( 🤗 ) | voice activity detection | 5000 hours, Mandarin and English | 0.4M |
| fsmn-kws <br> ( ) | keyword spotting, streaming | 5000 hours, Mandarin | 0.7M |
| fa-zh <br> ( 🤗 ) | timestamp prediction | 5000 hours, Mandarin | 38M |
| cam++ <br> ( 🤗 ) | speaker verification/diarization | 5000 hours | 7.2M |
| Whisper-large-v3 <br> ( 🍀 ) | speech recognition, with timestamps, non-streaming | multilingual | 1550M |
| Whisper-large-v3-turbo <br> ( 🍀 ) | speech recognition, with timestamps, non-streaming | multilingual | 809M |
| Qwen-Audio <br> ( 🤗 ) | audio-text multimodal models (pretraining) | multilingual | 8B |
| Qwen-Audio-Chat <br> ( 🤗 ) | audio-text multimodal models (chat) | multilingual | 8B |
| emotion2vec+large <br> ( 🤗 ) | speech emotion recognition | 40000 hours | 300M |

<a name="quick-start"></a>

Quick Start

Below is a quick start tutorial. Test audio files (Mandarin, English).

Command-line usage

funasr ++model=paraformer-zh ++vad_model="fsmn-vad" ++punc_model="ct-punc" ++input=asr_example_zh.wav

Note: the input can be a single audio file or a file list in Kaldi-style wav.scp format, where each line is: wav_id wav_path
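
As an illustration of the list format, the sketch below writes and then reads a small wav.scp file. It assumes the usual Kaldi convention of one entry per line, with the ID separated from the path by whitespace; the file names and the `read_wav_scp` helper are hypothetical and not part of FunASR.

```python
import os
import tempfile


def read_wav_scp(path):
    """Parse a Kaldi-style wav.scp file into a {wav_id: wav_path} dict."""
    entries = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Split on the first run of whitespace only, so paths may contain spaces.
            wav_id, wav_path = line.split(maxsplit=1)
            entries[wav_id] = wav_path
    return entries


# Write a two-entry example list (hypothetical file names).
scp_path = os.path.join(tempfile.mkdtemp(), "wav.scp")
with open(scp_path, "w", encoding="utf-8") as f:
    f.write("utt_001 /data/audio/asr_example_zh.wav\n")
    f.write("utt_002 /data/audio/asr_example_en.wav\n")

entries = read_wav_scp(scp_path)
print(entries)
```

A list written this way could then be passed wherever a single audio file is accepted, for example as the value of ++input in the command above.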

Speech Recognition (Non-streaming)

SenseVoice

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"

model = AutoModel(
    model=model_dir,
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
)

# en
res = model.generate(
    input=f"{model.model_path}/example/en.mp3",
    cache={},
    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=True,
    batch_size_s=60,
    merge_vad=True,  # merge short VAD segments, up to merge_length_s seconds each
    merge_length_s=15,
)
text = rich_transcription_postprocess(res[0]["text"])
print(text)

Parameter Description:

Paraformer

from funasr import AutoModel
# paraformer-zh is a multi-functional asr model
# use vad, punc, spk or not as you need
model = AutoModel(model="paraformer-zh",  vad_model="fsmn-vad",  punc_model="ct-punc", 
                  # spk_model="cam++", 
                  )
res = model.generate(input=f"{model.model_path}/example/asr_example.wav", 
                     batch_size_s=300, 
                     hotword='魔搭')
print(res)

Note: the hub parameter selects the model repository: ms downloads from ModelScope, and hf downloads from Hugging Face.

Speech Recognition (Streaming)

from funasr import AutoModel

chunk_size = [0, 10, 5] #[0, 10, 5] 600ms, [0, 8, 4] 480ms
encoder_chunk_look_back = 4 #number of chunks to lookback for encoder self-attention
decoder_chunk_look_back = 1 #number of encoder chunks to lookback for decoder cross-attention

model = AutoModel(model="paraformer-zh-streaming")

import soundfile
import os

wav_file = os.path.join(model.model_path, "example/asr_example.wav")
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = chunk_size[1] * 960 # 600ms

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i*chunk_stride:(i+1)*chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size, encoder_chunk_look_back=encoder_chunk_look_back, decoder_chunk_look_back=decoder_chunk_look_back)
    print(res)

Note: chunk_size configures the streaming latency. [0,10,5] means that text is emitted at a granularity of 10*60=600ms, with 5*60=300ms of lookahead. Each inference call takes 600ms of audio (16000*0.6=9600 sample points) and returns the corresponding text. For the last speech segment, set is_final=True to force output of the final word.
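
The latency arithmetic can be checked with a few lines of Python. This is a minimal sketch assuming 16 kHz audio and 60 ms per chunk_size unit, as the note implies; `chunk_layout` is a hypothetical helper, not a FunASR function.

```python
SAMPLE_RATE = 16000  # Hz, assumed input sample rate
FRAME_MS = 60        # duration of one chunk_size unit (assumption taken from the note)


def chunk_layout(chunk_size, num_samples):
    """Return (chunk_ms, lookahead_ms, chunk_stride, total_chunks) for a streaming config."""
    _, chunk_frames, lookahead_frames = chunk_size
    chunk_ms = chunk_frames * FRAME_MS
    lookahead_ms = lookahead_frames * FRAME_MS
    # Samples consumed per inference call.
    chunk_stride = chunk_frames * FRAME_MS * SAMPLE_RATE // 1000
    # Ceiling division: the last chunk may be shorter than chunk_stride.
    total_chunks = (num_samples - 1) // chunk_stride + 1
    return chunk_ms, lookahead_ms, chunk_stride, total_chunks


# 5 seconds of audio with the two configurations mentioned in the streaming example.
print(chunk_layout([0, 10, 5], 5 * SAMPLE_RATE))  # (600, 300, 9600, 9)
print(chunk_layout([0, 8, 4], 5 * SAMPLE_RATE))   # (480, 240, 7680, 11)
```

With [0, 10, 5] the stride is 9600 samples, which matches chunk_size[1] * 960 in the streaming example.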

<details><summary>More Examples</summary>

Voice Activity Detection (Non-Streaming)

from funasr import AutoModel

model = AutoModel(model="fsmn-vad")
wav_file = f"{model.model_path}/example/vad_example.wav"
res = model.generate(input=wav_file)
print(res)

Note: The output format of the VAD model is: [[beg1, end1], [beg2, end2], ..., [begN, endN]], where begN/endN indicates the starting/ending point of the N-th valid audio segment, measured in milliseconds.
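
Because the boundaries are given in milliseconds, they need to be converted to sample indices before the waveform can be sliced. The sketch below shows one way to do that; the segment values and the 16 kHz sample rate are made-up illustration data, and `segments_to_slices` is a hypothetical helper rather than a FunASR function.

```python
def segments_to_slices(segments, sample_rate):
    """Convert [[beg_ms, end_ms], ...] into a list of (start_sample, end_sample) pairs."""
    return [
        (beg_ms * sample_rate // 1000, end_ms * sample_rate // 1000)
        for beg_ms, end_ms in segments
    ]


# Illustrative VAD output (not real model output).
segments = [[70, 2340], [2620, 6200]]
slices = segments_to_slices(segments, sample_rate=16000)
print(slices)  # [(1120, 37440), (41920, 99200)]

# Total detected speech, in seconds.
speech_seconds = sum(end - beg for beg, end in segments) / 1000
print(speech_seconds)  # 5.85
```

Each pair can then be used to slice the waveform array, for example speech[start:end].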

Voice Activity Detection (Streaming)

from funasr import AutoModel

chunk_size = 200 # ms
model = AutoModel(model="fsmn-vad")

import soundfile

wav_file = f"{model.model_path}/example/vad_example.wav"
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = int(chunk_size * sample_rate / 1000)

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i*chunk_stride:(i+1)*chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size)
    if len(res[0]["value"]):
        print(res)

Note: The output format for the streaming VAD model can be one of four scenarios:

The output is measured in milliseconds and represents the absolute time from the starting point.
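
The four cases are not spelled out here. The sketch below assumes a convention in which -1 marks a boundary that has not been detected yet, so a chunk may return complete [beg, end] pairs, [[beg, -1]], [[-1, end]], or []. That convention is an assumption and should be checked against the docs. Under it, per-chunk outputs could be stitched into complete segments as follows; `merge_streaming_segments` is a hypothetical helper, not a FunASR function.

```python
def merge_streaming_segments(chunk_outputs):
    """Stitch per-chunk VAD outputs into complete [beg_ms, end_ms] segments.

    Assumption: -1 stands for "boundary not detected in this chunk".
    """
    segments = []
    open_beg = None  # start time of a segment whose end has not arrived yet
    for output in chunk_outputs:
        for beg, end in output:
            if beg != -1 and end != -1:
                segments.append([beg, end])       # complete segment
            elif beg != -1:
                open_beg = beg                    # only a start point so far
            elif end != -1 and open_beg is not None:
                segments.append([open_beg, end])  # close the pending segment
                open_beg = None
    return segments


# Illustrative per-chunk outputs (not real model output).
chunks = [[], [[70, -1]], [], [[-1, 2340]], [[2620, 6200]]]
print(merge_streaming_segments(chunks))  # [[70, 2340], [2620, 6200]]
```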

Punctuation Restoration

from funasr import AutoModel

model = AutoModel(model="ct-punc")
res = model.generate(input="那今天的会就到这里吧 happy new year 明年见")
print(res)

Timestamp Prediction

from funasr import AutoModel

model = AutoModel(model="fa-zh")
wav_file = f"{model.model_path}/example/asr_example.wav"
text_file = f"{model.model_path}/example/text.txt"
res = model.generate(input=(wav_file, text_file), data_type=("sound", "text"))
print(res)

Speech Emotion Recognition

from funasr import AutoModel

model = AutoModel(model="emotion2vec_plus_large")

wav_file = f"{model.model_path}/example/test.wav"

res = model.generate(wav_file, output_dir="./outputs", granularity="utterance", extract_embedding=False)
print(res)

For more usage details, see the docs; for more examples, see the demo.

</details>

Export ONNX

Command-line usage

funasr-export ++model=paraformer ++quantize=false ++device=cpu

Python

from funasr import AutoModel

model = AutoModel(model="paraformer", device="cpu")

res = model.export(quantize=False)

Test ONNX

# pip3 install -U funasr-onnx
import os

from funasr_onnx import Paraformer
model_dir = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
model = Paraformer(model_dir, batch_size=1, quantize=True)

# "~" is not expanded automatically in Python paths, so resolve it to the home directory first
wav_path = [os.path.expanduser('~/.cache/modelscope/hub/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/example/asr_example.wav')]

result = model(wav_path)
print(result)

For more examples, see the demo.

Deployment Service

FunASR supports deploying pre-trained or further fine-tuned models for service. Currently, it supports the following types of service deployment:

For more detailed information, please refer to the service deployment documentation.

<a name="contact"></a>

Community Communication

If you run into problems, you can open an issue on the GitHub page.

You can also scan the DingTalk QR code below to join the community group for discussion.

DingTalk group
<div align="left"><img src="docs/images/dingding.png" width="250"/></div>

Contributors

<div align="left"><img src="docs/images/alibaba.png" width="260"/> <img src="docs/images/nwpu.png" width="260"/> <img src="docs/images/China_Telecom.png" width="200"/> <img src="docs/images/RapidAI.png" width="200"/> <img src="docs/images/aihealthx.png" width="200"/> <img src="docs/images/XVERSE.png" width="250"/> </div>

The full list of contributors can be found in the contributors list.

License

This project is licensed under the MIT License. FunASR also contains various third-party components, as well as code modified from other repositories, under other open source licenses. Use of the pretrained models is subject to the model license.

Citations

@inproceedings{gao2023funasr,
  author={Zhifu Gao and Zerui Li and Jiaming Wang and Haoneng Luo and Xian Shi and Mengzhe Chen and Yabin Li and Lingyun Zuo and Zhihao Du and Zhangyu Xiao and Shiliang Zhang},
  title={FunASR: A Fundamental End-to-End Speech Recognition Toolkit},
  year={2023},
  booktitle={INTERSPEECH},
}
@inproceedings{An2023bat,
  author={Keyu An and Xian Shi and Shiliang Zhang},
  title={BAT: Boundary aware transducer for memory-efficient and low-latency ASR},
  year={2023},
  booktitle={INTERSPEECH},
}
@inproceedings{gao22b_interspeech,
  author={Zhifu Gao and ShiLiang Zhang and Ian McLoughlin and Zhijie Yan},
  title={Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition},
  year={2022},
  booktitle={Proc. Interspeech 2022},
  pages={2063--2067},
  doi={10.21437/Interspeech.2022-9996}
}
@inproceedings{shi2023seaco,
  author={Xian Shi and Yexin Yang and Zerui Li and Yanni Chen and Zhifu Gao and Shiliang Zhang},
  title={SeACo-Paraformer: A Non-Autoregressive ASR System with Flexible and Effective Hotword Customization Ability},
  year={2023},
  booktitle={ICASSP2024}
}