# voiceapi - A simple and clean voice transcription/synthesis API with sherpa-onnx
Thanks to [k2-fsa/sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx), we can easily build a voice API with Python.

<img src="./screenshot.jpg" width="60%">
## Supported models
| Model | Language | Type | Description |
|---|---|---|---|
| zipformer-bilingual-zh-en-2023-02-20 | Chinese + English | Online ASR | Streaming Zipformer, Bilingual |
| sense-voice-zh-en-ja-ko-yue-2024-07-17 | Chinese + English | Offline ASR | SenseVoice, Bilingual |
| paraformer-trilingual-zh-cantonese-en | Chinese + Cantonese + English | Offline ASR | Paraformer, Trilingual |
| paraformer-en-2024-03-09 | English | Offline ASR | Paraformer, English |
| vits-zh-hf-theresa | Chinese | TTS | VITS, Chinese, 804 speakers |
| melo-tts-zh_en | Chinese + English | TTS | Melo, Chinese + English, 1 speaker |
## Run the app locally
Python 3.10+ is required.

```bash
python3 -m venv venv
. venv/bin/activate
pip install -r requirements.txt
python app.py
```
Visit http://localhost:8000/ to see the demo page.
## Build CUDA image (for Chinese users)
```bash
docker build -t voiceapi:cuda_dev -f Dockerfile.cuda.cn .
```
## Streaming API (via WebSocket)
### /asr
Send 16-bit PCM audio data to the server, and the server will return the transcription result.

`samplerate` can be set in the query string; the default is 16000.

The server returns the transcription result in JSON format, with the following fields:

- `text`: the transcription result
- `finished`: whether the segment is finished
- `idx`: the index of the segment
```javascript
const ws = new WebSocket('ws://localhost:8000/asr?samplerate=16000');
ws.onopen = () => {
    console.log('connected');
    ws.send('{"sid": 0}');
};
ws.onmessage = (e) => {
    const data = JSON.parse(e.data);
    const { text, finished, idx } = data;
    // do something with text
    // finished is true when the segment is finished
};

// Send audio data: 16-bit PCM at the samplerate given in the query string
ws.send(int16Array.buffer);
```
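For non-browser clients, the same endpoint can be driven from Python. Below is a minimal sketch, assuming the third-party `websockets` package (`pip install websockets`) and a 16 kHz, 16-bit mono WAV file named `sample.wav`; the file name, chunk size, pacing, and timeout are illustrative choices, not part of the API.

```python
import asyncio
import json
import wave

import websockets  # pip install websockets


async def transcribe(path: str = "sample.wav") -> None:
    uri = "ws://localhost:8000/asr?samplerate=16000"
    async with websockets.connect(uri) as ws:

        async def send_audio() -> None:
            with wave.open(path, "rb") as wav:
                # 1600 frames = 100 ms of 16 kHz mono 16-bit PCM
                while pcm := wav.readframes(1600):
                    await ws.send(pcm)          # raw PCM bytes, no extra framing
                    await asyncio.sleep(0.1)    # pace roughly in real time

        sender = asyncio.create_task(send_audio())
        try:
            while True:
                result = json.loads(await asyncio.wait_for(ws.recv(), timeout=10))
                marker = " (finished)" if result["finished"] else ""
                print(f"[{result['idx']}] {result['text']}{marker}")
        except asyncio.TimeoutError:
            pass  # no result for a while; treat the stream as drained
        await sender


asyncio.run(transcribe())
```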
### /tts
Send text to the server, and the server will return the synthesized audio data.

- `samplerate` can be set in the query string; the default is 16000.
- `sid` is the speaker ID; the default is 0.
- `speed` is the speed of the synthesized audio; the default is 1.0.
- `chunk_size` is the size of each audio chunk; the default is 1024.

The server returns the synthesized audio in binary format:

- The audio data is 16-bit PCM, sent as binary WebSocket messages.
- The server also returns the synthesis result in JSON format, with the following fields:
  - `elapsed`: the elapsed time
  - `progress`: the progress of the synthesis
  - `duration`: the duration of the synthesized audio
  - `size`: the size of the synthesized audio data
```javascript
const ws = new WebSocket('ws://localhost:8000/tts?samplerate=16000');
ws.onopen = () => {
    console.log('connected');
    ws.send('Your text here');
};
ws.onmessage = (e) => {
    if (e.data instanceof Blob) {
        // Chunked audio data: convert 16-bit PCM to float32 for playback
        e.data.arrayBuffer().then((arrayBuffer) => {
            const int16Array = new Int16Array(arrayBuffer);
            const float32Array = new Float32Array(int16Array.length);
            for (let i = 0; i < int16Array.length; i++) {
                float32Array[i] = int16Array[i] / 32768;
            }
            // playNode is an AudioWorkletNode set up elsewhere
            playNode.port.postMessage({ message: 'audioData', audioData: float32Array });
        });
    } else {
        // JSON message with the synthesis result
        const { elapsed, progress, duration, size } = JSON.parse(e.data);
        this.elapsedTime = elapsed;
    }
};
```
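The streaming endpoint can likewise be consumed from Python. Here is a minimal sketch, again assuming the `websockets` package; it writes the PCM chunks into a WAV file and stops once the JSON result reports full progress. The stop condition and output file name are assumptions for illustration; adapt them to the server's actual messages.

```python
import asyncio
import json
import wave

import websockets  # pip install websockets


async def synthesize(text: str, out_path: str = "tts.wav") -> None:
    uri = "ws://localhost:8000/tts?samplerate=16000&sid=0&speed=1.0"
    async with websockets.connect(uri) as ws:
        await ws.send(text)
        with wave.open(out_path, "wb") as wav:
            wav.setnchannels(1)        # mono
            wav.setsampwidth(2)        # 16-bit PCM
            wav.setframerate(16000)    # must match the samplerate parameter
            while True:
                msg = await ws.recv()
                if isinstance(msg, bytes):
                    wav.writeframes(msg)         # chunked PCM audio data
                else:
                    result = json.loads(msg)     # JSON synthesis result
                    print(f"progress={result['progress']} elapsed={result['elapsed']}")
                    if result["progress"] >= 1.0:   # assumed completion signal
                        break


asyncio.run(synthesize("Hello, world!"))
```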
## Non-streaming API
### /tts
Send text to the server, and the server will return the synthesized audio data.

- `text` is the text to be synthesized.
- `samplerate` can be set in the query string; the default is 16000.
- `sid` is the speaker ID; the default is 0.
- `speed` is the speed of the synthesized audio; the default is 1.0.
```bash
curl -X POST "http://localhost:8000/tts" \
    -H "Content-Type: application/json" \
    -d '{
        "text": "Hello, world!",
        "sid": 0,
        "samplerate": 16000
    }' -o helloworld.wav
```
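The same request from Python, as a small sketch using the third-party `requests` package; the parameters mirror the curl call above.

```python
import requests  # pip install requests

resp = requests.post(
    "http://localhost:8000/tts",
    json={"text": "Hello, world!", "sid": 0, "samplerate": 16000},
)
resp.raise_for_status()

with open("helloworld.wav", "wb") as f:
    f.write(resp.content)  # WAV audio from the response body
```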
## Download models
All models are stored in the `models` directory.

Only download the models you need. The default models are:

- ASR model: `sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20` (streaming, bilingual Chinese + English)
- TTS model: `vits-zh-hf-theresa` (Chinese + English)
### vits-zh-hf-theresa

```bash
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-zh-hf-theresa.tar.bz2
```

### vits-melo-tts-zh_en

```bash
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-melo-tts-zh_en.tar.bz2
```

### sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20

```bash
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20.tar.bz2
```

### silero_vad.onnx

```bash
curl -SL -O https://github.com/snakers4/silero-vad/raw/master/src/silero_vad/data/silero_vad.onnx
```

### sherpa-onnx-paraformer-trilingual-zh-cantonese-en

```bash
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-paraformer-trilingual-zh-cantonese-en.tar.bz2
```

### whisper

```bash
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-whisper-tiny.en.tar.bz2
```

### sensevoice

```bash
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17.tar.bz2
```

### sherpa-onnx-streaming-paraformer-bilingual-zh-en

```bash
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-paraformer-bilingual-zh-en.tar.bz2
```
### sherpa-onnx-paraformer-en

```bash
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-paraformer-en-2024-03-09.tar.bz2
```
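The `.tar.bz2` archives are plain bzip2 tarballs. As a convenience, here is a small sketch that unpacks every downloaded archive into the `models` directory; it assumes the archives sit in the current working directory (the directly downloaded `silero_vad.onnx` needs no extraction).

```python
import pathlib
import tarfile

models_dir = pathlib.Path("models")
models_dir.mkdir(exist_ok=True)

for archive in pathlib.Path(".").glob("*.tar.bz2"):
    print(f"extracting {archive} ...")
    with tarfile.open(archive, "r:bz2") as tar:
        tar.extractall(models_dir)  # each tarball contains its own model folder
```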