
<div class="title" align=center> <h1>vits-simple-api</h1> <div>Simply call the vits api</div> <br/> <br/> <p> <img src="https://img.shields.io/github/license/Artrajz/vits-simple-api"> <img src="https://img.shields.io/badge/python-3.10-green"> <a href="https://hub.docker.com/r/artrajz/vits-simple-api"> <img src="https://img.shields.io/docker/pulls/artrajz/vits-simple-api"></a> </p> <a href="https://github.com/Artrajz/vits-simple-api/blob/main/README.md">English</a>|<a href="https://github.com/Artrajz/vits-simple-api/blob/main/README_zh.md">中文文档</a> <br/> </div>

Feature

Online Demo

Hugging Face Spaces (thanks to Hugging Face!)

Colab Notebook

Please note that different speaker IDs may support different languages.

https://user-images.githubusercontent.com/73542220/237995061-c1f25b4e-dd86-438a-9363-4bb1fe65b425.mov

Deployment

There are two deployment options to choose from. Regardless of the option you select, you'll need to import the model after deployment to use the application.

Docker Deployment (Recommended for Linux)

Step 1: Pull the Docker Image

Run the following command and follow the prompts in the script to choose the necessary files to download; the script then pulls the image:

bash -c "$(wget -O- https://raw.githubusercontent.com/Artrajz/vits-simple-api/main/vits-simple-api-installer-latest.sh)"

The default path for the project configuration files and model folders is /usr/local/vits-simple-api/.

Step 2: Start

Run the following command to start the container:

docker-compose up -d

Image Update

To update the image, run the following commands:

docker-compose pull

Then, restart the container:

docker-compose up -d

Virtual Environment Deployment

Step 1: Clone the Project

Clone the project repository using the following command:

git clone https://github.com/Artrajz/vits-simple-api.git

Step 2: Install Python Dependencies

It is recommended to use a virtual environment with Python 3.10 for this project. Run the following command to install the required Python dependencies:

If you encounter issues installing certain dependencies, please refer to the common problems outlined below.

pip install -r requirements.txt

Step 3: Start

Run the following command to start the program:

python app.py

Windows Quick Deployment Package

Step 1: Download and Extract the Deployment Package

Go to the releases page and download the latest deployment package. Extract the downloaded files.

Step 2: Start

Run start.bat to launch the program.

Model Loading

Step 1: Download VITS Models

Download the VITS model files and place them in the data/models folder.

Step 2: Load the Models

Automatic Model Loading

Starting from version 0.6.6, the default behavior is to automatically load all models in the data/models folder, making it easier for beginners to get started.

Manual Model Loading

After the initial startup, a config.yaml configuration file will be generated. Change tts_config.auto_load to false to enable manual loading mode.

You can modify tts_config.models in config.yaml, or make the changes from the admin panel in the browser.

Note: After version 0.6.6, the model loading path has been modified. Please follow the steps below to configure the model path again!

The path can be absolute or relative. A relative path is resolved from the data/models folder in the project root directory.

For example, if the data/models folder has the following files:

├─model1
│  ├─G_1000.pth
│  └─config.json
└─model2
   ├─G_1000.pth
   └─config.json

Fill in the configuration like this in the YAML file:

tts_config:
  auto_load: false
  models:
  - config_path: model1/config.json
    model_path: model1/G_1000.pth
  - config_path: model2/config.json
    model_path: model2/G_1000.pth
  # GPT-SoVITS
  - sovits_path: gpt_sovits1/model1_e8_s11536.pth
    gpt_path: gpt_sovits1/model1-e15.ckpt
  - sovits_path: gpt_sovits2/model2_e8_s11536.pth
    gpt_path: gpt_sovits2/model2-e15.ckpt

Loading models through the admin panel is convenient, but models outside the data/models folder can only be loaded by editing the config.yaml file directly and providing an absolute path.

Absolute path example:

tts_config:
  auto_load: false
  models:
  - config_path: D:/model3/config.json
    model_path: D:/model3/G_1000.pth

Other Models

After downloading the BERT models and emotion models, place them in the data/bert and data/emotional folders respectively, in the subfolders matching the corresponding model names.

GPU Acceleration

Windows

Install CUDA

Check the highest version of CUDA supported by your graphics card:

nvidia-smi

Taking CUDA 11.8 as an example, download it from the official website.

Install the GPU Version of PyTorch

Find the installation command for your CUDA version at https://pytorch.org/. For CUDA 11.8:

pip install torch --index-url https://download.pytorch.org/whl/cu118
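
A quick way to confirm the GPU build is active (a minimal check using the standard PyTorch API):

import torch

# The version string should end in +cu118 for the CUDA 11.8 build
print(torch.__version__)
# True means PyTorch can see the GPU and inference will be accelerated
print(torch.cuda.is_available())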

Linux

The installation process is similar, but I don't have the environment to test it.

WebUI

Inference Frontend

http://127.0.0.1:23456

*The default port is 23456 and can be changed in config.yaml.

Admin Backend

The default address is http://127.0.0.1:23456/admin.

The initial username and password can be found by searching for 'admin' in the config.yaml file after the first startup.

Function Options Explanation

Disable the Admin Backend

The admin backend allows loading and unloading models. Although it has login authentication, you can disable it entirely in config.yaml for added security:

'IS_ADMIN_ENABLED': !!bool 'false'

Disabling it guarantees that the admin backend cannot be reached at all, which is the safest option when the service is exposed to a public network.

Bert-VITS2 Configuration and Language/Bert Model Usage

Starting from Bert-VITS2 v2.0, a model loads three language-specific Bert models. If you only need one or two of the languages, you can add a lang parameter to the data section of the model's config.json. The value ["zh"] means the model only uses Chinese, so only the Chinese Bert model will be loaded. The value ["zh", "ja"] means Chinese and Japanese are both used, so only the Chinese and Japanese Bert models will be loaded, and so on for other language combinations.

Example:

"data": {
  "lang": ["zh", "ja"],
  "training_files": "filelists/train.list",
  "validation_files": "filelists/val.list",
  "max_wav_value": 32768.0,
  ...

Custom Chinese Polyphonic Dictionary

If you encounter issues with incorrect pronunciation of polyphonic characters, you can try resolving it using the following method.

Create and open phrases_dict.txt in the data directory and add the polyphonic words:

{
"一骑当千": [["yí"], ["jì"], ["dāng"], ["qiān"]],
}

GPT-SoVITS Reference Audio Presets

Find the configuration for GPT-SoVITS in the config.yaml file. Add presets under the presets section. Multiple presets can be added, with keys serving as preset names. Below are two default presets, default and default2:

gpt_sovits_config:
  hz: 50
  is_half: false
  id: 0
  lang: auto
  format: wav
  segment_size: 50
  presets:
    default:
      refer_wav_path: null
      prompt_text: null
      prompt_lang: auto
    default2:
      refer_wav_path: null
      prompt_text: null
      prompt_lang: auto

Reading API

Tested in legado

Multiple models can be used for reading, including VITS, Bert-VITS2, and GPT-SoVITS. Parameters prefixed with in_ configure the speaker for quoted dialogue, while parameters prefixed with nr_ configure the narrator.

To use GPT-SoVITS, it is necessary to configure the reference audio in the presets section of the config.yaml file in advance and modify the preset in the URL below.

The IP in the URL is displayed after the API starts; it is generally a local area network IP starting with 192.168.

After modifying the URL, add a new reading engine in legado, paste the source, and enable the engine.

{
  "concurrentRate": "1",
  "contentType": "audio/wav",
  "enabledCookieJar": false,
  "header": "",
  "id": 1709643305070,
  "lastUpdateTime": 1709821070082,
  "loginCheckJs": "",
  "loginUi": "",
  "loginUrl": "",
  "name": "vits-simple-api",
  "url": "http://192.168.xxx.xxx:23456/voice/reading?text={{java.encodeURI(speakText)}}&in_model_type=GPT-SOVITS&in_id=0&in_preset=default&nr_model_type=BERT-VITS2&nr_id=0&nr_preset=default&format=wav&lang=zh"
}

Frequently Asked Questions

Bert-VITS2 Version Compatibility

To ensure compatibility with the Bert-VITS2 model, modify the config.json file by adding a version parameter "version": "x.x.x". For instance, if the model version is 1.0.1, the configuration file should be written as:

{
  "version": "1.0.1",
  "train": {
    "log_interval": 10,
    "eval_interval": 100,
    "seed": 52,
    ...

Please note that for the Chinese extra version, the version should be set to extra or zh-clap, and for the extra-fix version, the version should be 2.4 or extra-fix.

API

GET

- speakers list
- voice vits
- check

POST

API KEY

Set api_key_enabled: true in config.yaml to enable API key authentication, and set the key itself with api_key: api-key. Once enabled, GET requests must include an api_key parameter, and POST requests must include an X-API-KEY header.
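
As a rough sketch of both request styles (the /voice/speakers and /voice/vits paths follow the GET list above; treat them as assumptions and verify against api_test.py):

import requests

BASE_URL = "http://127.0.0.1:23456"  # default address; adjust host/port to your deployment
API_KEY = "api-key"                  # the api_key value from config.yaml

# GET requests pass the key as a query parameter
r = requests.get(f"{BASE_URL}/voice/speakers", params={"api_key": API_KEY})
print(r.json())

# POST requests pass the key in the X-API-KEY header
r = requests.post(
    f"{BASE_URL}/voice/vits",
    headers={"X-API-KEY": API_KEY},
    data={"text": "你好", "id": 0},
)
print(r.status_code)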

Parameter

VITS

| Name | Parameter | Is must | Default | Type | Instruction |
| --- | --- | --- | --- | --- | --- |
| Synthesized text | text | true | | str | Text needed for voice synthesis. |
| Speaker ID | id | false | From config.yaml | int | The speaker ID. |
| Audio format | format | false | From config.yaml | str | Support for wav, ogg, silk, mp3, flac |
| Text language | lang | false | From config.yaml | str | The language of the text to be synthesized. Available options include auto, zh, ja, and mix. When lang=mix, the text should be wrapped in [ZH] or [JA]. The default mode is auto, which automatically detects the language of the text. |
| Audio length | length | false | From config.yaml | float | Adjusts the length of the synthesized speech, which is equivalent to adjusting its speed. The larger the value, the slower the speed. |
| Noise | noise | false | From config.yaml | float | Sample noise, controlling the randomness of the synthesis. |
| SDP noise | noisew | false | From config.yaml | float | Stochastic Duration Predictor noise, controlling the length of phoneme pronunciation. |
| Segment Size | segment_size | false | From config.yaml | int | Divides the text into segments based on punctuation marks, combining them into one segment when the length exceeds segment_size. If segment_size<=0, the text will not be divided. |
| Streaming response | streaming | false | false | bool | Streamed synthesized speech with faster initial response. |
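
For reference, a minimal synthesis call with these parameters might look like the following (the /voice/vits path is an assumption based on the GET list above; see api_test.py for the authoritative version):

import requests

BASE_URL = "http://127.0.0.1:23456"

params = {
    "text": "你好，世界。",  # required
    "id": 0,                 # speaker ID
    "format": "wav",
    "lang": "auto",
    "length": 1.0,           # >1.0 slows the speech down
}
r = requests.get(f"{BASE_URL}/voice/vits", params=params)
with open("output.wav", "wb") as f:
    f.write(r.content)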

VITS voice conversion

| Name | Parameter | Is must | Default | Type | Instruction |
| --- | --- | --- | --- | --- | --- |
| Uploaded Audio | upload | true | | file | The audio file to be uploaded. It should be in wav or ogg format. |
| Source Role ID | original_id | true | | int | The ID of the role used to upload the audio file. |
| Target Role ID | target_id | true | | int | The ID of the target role to convert the audio to. |
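
Since this endpoint takes an uploaded file, the request must be a multipart POST. A hedged sketch (the /voice/conversion path is hypothetical; check api_test.py for the real route):

import requests

BASE_URL = "http://127.0.0.1:23456"

with open("input.wav", "rb") as audio:
    r = requests.post(
        f"{BASE_URL}/voice/conversion",  # hypothetical path
        files={"upload": audio},
        data={"original_id": 0, "target_id": 1},
    )

with open("converted.wav", "wb") as f:
    f.write(r.content)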

HuBert-VITS

| Name | Parameter | Is must | Default | Type | Instruction |
| --- | --- | --- | --- | --- | --- |
| Uploaded Audio | upload | true | | file | The audio file to be uploaded. It should be in wav or ogg format. |
| Target speaker ID | id | true | | int | The target speaker ID. |
| Audio format | format | true | | str | wav, ogg, silk |
| Audio length | length | true | | float | Adjusts the length of the synthesized speech, which is equivalent to adjusting its speed. The larger the value, the slower the speed. |
| Noise | noise | true | | float | Sample noise, controlling the randomness of the synthesis. |
| SDP noise | noisew | true | | float | Stochastic Duration Predictor noise, controlling the length of phoneme pronunciation. |

W2V2-VITS

| Name | Parameter | Is must | Default | Type | Instruction |
| --- | --- | --- | --- | --- | --- |
| Synthesized text | text | true | | str | Text needed for voice synthesis. |
| Speaker ID | id | false | From config.yaml | int | The speaker ID. |
| Audio format | format | false | From config.yaml | str | Support for wav, ogg, silk, mp3, flac |
| Text language | lang | false | From config.yaml | str | The language of the text to be synthesized. Available options include auto, zh, ja, and mix. When lang=mix, the text should be wrapped in [ZH] or [JA]. The default mode is auto, which automatically detects the language of the text. |
| Audio length | length | false | From config.yaml | float | Adjusts the length of the synthesized speech, which is equivalent to adjusting its speed. The larger the value, the slower the speed. |
| Noise | noise | false | From config.yaml | float | Sample noise, controlling the randomness of the synthesis. |
| SDP noise | noisew | false | From config.yaml | float | Stochastic Duration Predictor noise, controlling the length of phoneme pronunciation. |
| Segment Size | segment_size | false | From config.yaml | int | Divides the text into segments based on punctuation marks, combining them into one segment when the length exceeds segment_size. If segment_size<=0, the text will not be divided. |
| Dimensional emotion | emotion | false | 0 | int | The range depends on the emotion reference file in npy format, such as innnky's all_emotions.npy, whose range is 0-5457. |

Dimensional emotion

| Name | Parameter | Is must | Default | Type | Instruction |
| --- | --- | --- | --- | --- | --- |
| Uploaded Audio | upload | true | | file | The audio file to be uploaded; the endpoint returns the npy file storing its dimensional emotion vectors. |

Bert-VITS2

| Name | Parameter | Is must | Default | Type | Instruction |
| --- | --- | --- | --- | --- | --- |
| Synthesized text | text | true | | str | Text needed for voice synthesis. |
| Speaker ID | id | false | From config.yaml | int | The speaker ID. |
| Audio format | format | false | From config.yaml | str | Support for wav, ogg, silk, mp3, flac |
| Text language | lang | false | From config.yaml | str | "auto" is a mode for automatic language detection and is also the default mode. However, it currently only supports detecting the language of an entire text passage and cannot distinguish languages on a per-sentence basis. The other available language options are "zh" and "ja". |
| Audio length | length | false | From config.yaml | float | Adjusts the length of the synthesized speech, which is equivalent to adjusting its speed. The larger the value, the slower the speed. |
| Noise | noise | false | From config.yaml | float | Sample noise, controlling the randomness of the synthesis. |
| SDP noise | noisew | false | From config.yaml | float | Stochastic Duration Predictor noise, controlling the length of phoneme pronunciation. |
| Segment Size | segment_size | false | From config.yaml | int | Divides the text into segments based on punctuation marks, combining them into one segment when the length exceeds segment_size. If segment_size<=0, the text will not be divided. |
| SDP/DP mix ratio | sdp_ratio | false | From config.yaml | int | The theoretical proportion of SDP during synthesis; the higher the ratio, the larger the variance in synthesized voice tone. |
| Emotion | emotion | false | From config.yaml | int | Available for Bert-VITS2 v2.1, ranging from 0 to 9. |
| Emotion reference audio | reference_audio | false | None | | Bert-VITS2 v2.1 uses reference audio to control the synthesized audio's emotion. |
| Text Prompt | text_prompt | false | From config.yaml | str | Bert-VITS2 v2.2 text prompt used for emotion control. |
| Style Text | style_text | false | From config.yaml | str | Bert-VITS2 v2.3 text prompt used for emotion control. |
| Style Text Weight | style_weight | false | From config.yaml | float | Bert-VITS2 v2.3 text prompt weight used for prompt weighting. |
| Streaming response | streaming | false | false | bool | Streamed synthesized speech with faster initial response. |

GPT-SoVITS Speech Synthesis

| Name | Parameter | Is must | Default | Type | Instruction |
| --- | --- | --- | --- | --- | --- |
| Synthesized text | text | true | | str | Text needed for voice synthesis. |
| Speaker ID | id | false | From config.yaml | int | Speaker ID. In GPT-SoVITS, each model serves as a speaker ID, and the voice is switched via reference audio presets. |
| Audio format | format | false | From config.yaml | str | Support for wav, ogg, silk, mp3, flac |
| Text language | lang | false | From config.yaml | str | "auto" is the automatic language detection mode, which is also the default mode. However, it currently only supports recognizing the language of the entire text passage and cannot distinguish each sentence. |
| Reference Audio | reference_audio | false | None | | reference_audio is required, but it can be replaced by preset. |
| Reference Audio Text | prompt_text | false | From config.yaml | str | Needs to be consistent with the actual text of the reference audio. |
| Reference Audio Language | prompt_lang | false | From config.yaml | str | Defaults to auto for automatic text language recognition. If recognition fails, fill it in manually: zh for Chinese, ja for Japanese, en for English. |
| Reference Audio Preset | preset | false | default | str | Replaces the reference audio with a pre-set preset; multiple presets can be set. |
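
A sketch of a call that relies on a preset instead of uploading reference audio (the /voice/gpt-sovits path is an assumption; verify it against api_test.py):

import requests

BASE_URL = "http://127.0.0.1:23456"

params = {
    "text": "你好，世界。",
    "id": 0,
    "preset": "default",  # a reference-audio preset defined in config.yaml
    "lang": "auto",
    "format": "wav",
}
r = requests.get(f"{BASE_URL}/voice/gpt-sovits", params=params)
with open("output.wav", "wb") as f:
    f.write(r.content)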

SSML (Speech Synthesis Markup Language)

Supported Elements and Attributes

speak Element

| Attribute | Instruction | Is must |
| --- | --- | --- |
| id | Default value is retrieved from config.yaml | false |
| lang | Default value is retrieved from config.yaml | false |
| length | Default value is retrieved from config.yaml | false |
| noise | Default value is retrieved from config.yaml | false |
| noisew | Default value is retrieved from config.yaml | false |
| segment_size | Splits text into segments based on punctuation marks. When the accumulated segment length exceeds segment_size, it is treated as one segment. segment_size<=0 means no segmentation. The default value is 0. | false |
| model_type | Default is VITS. Options: W2V2-VITS, BERT-VITS2 | false |
| emotion | Only effective when using W2V2-VITS. The range depends on the npy emotion reference file. | false |
| sdp_ratio | Only effective when using BERT-VITS2. | false |

voice Element

Higher priority than speak.

| Attribute | Instruction | Is must |
| --- | --- | --- |
| id | Default value is retrieved from config.yaml | false |
| lang | Default value is retrieved from config.yaml | false |
| length | Default value is retrieved from config.yaml | false |
| noise | Default value is retrieved from config.yaml | false |
| noisew | Default value is retrieved from config.yaml | false |
| segment_size | Splits text into segments based on punctuation marks. When the accumulated segment length exceeds segment_size, it is treated as one segment. segment_size<=0 means no segmentation. The default value is 0. | false |
| model_type | Default is VITS. Options: W2V2-VITS, BERT-VITS2 | false |
| emotion | Only effective when using W2V2-VITS. The range depends on the npy emotion reference file. | false |
| sdp_ratio | Only effective when using BERT-VITS2. | false |

break Element

| Attribute | Instruction | Is must |
| --- | --- | --- |
| strength | x-weak, weak, medium (default), strong, x-strong | false |
| time | The absolute duration of a pause in seconds (such as 2s) or milliseconds (such as 500ms). Valid values range from 0 to 5000 milliseconds. If you set a value greater than the supported maximum, the service will use 5000ms. If the time attribute is set, the strength attribute is ignored. | false |

The strength values map to the following relative pause durations:

| Strength | Relative Duration |
| --- | --- |
| x-weak | 250 ms |
| weak | 500 ms |
| medium | 750 ms |
| strong | 1000 ms |
| x-strong | 1250 ms |
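
A sketch of an SSML request built from the elements above (the /voice/ssml path and the ssml form field are assumptions; confirm them in api_test.py):

import requests

BASE_URL = "http://127.0.0.1:23456"

# Element and attribute names follow the speak/voice/break tables above
ssml = """
<speak id="0" lang="zh" format="wav">
    你好。<break strength="strong"/>
    <voice id="1" model_type="BERT-VITS2">另一位说话人。</voice>
</speak>
"""

r = requests.post(f"{BASE_URL}/voice/ssml", data={"ssml": ssml})
with open("ssml_output.wav", "wb") as f:
    f.write(r.content)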

Reading

| Name | Parameter | Is must | Default | Type | Instruction |
| --- | --- | --- | --- | --- | --- |
| Synthesis Text | text | true | | str | The text to be synthesized into speech. |
| Interlocutor Model Type | in_model_type | false | From config.yaml | str | |
| Interlocutor ID | in_id | false | From config.yaml | int | |
| Interlocutor Reference Audio Preset | in_preset | false | default | str | Replaces the reference audio with a preset; multiple presets can be set in advance. |
| Narrator Model Type | nr_model_type | false | From config.yaml | str | |
| Narrator ID | nr_id | false | From config.yaml | int | |
| Narrator Reference Audio Preset | nr_preset | false | default | str | Replaces the reference audio with a preset; multiple presets can be set in advance. |
| Audio Format | format | false | From config.yaml | str | Supports wav, ogg, silk, mp3, flac |
| Text Language | lang | false | From config.yaml | str | auto is the automatic language detection mode, which is also the default. Currently it only recognizes the language of the entire text and cannot distinguish each sentence. |

All other parameters use the default values of the corresponding model in config.yaml.

Example

See api_test.py

Communication

For learning and communication; currently there is only a Chinese QQ group.

Acknowledgements

Thank You to All Contributors

<a href="https://github.com/artrajz/vits-simple-api/graphs/contributors" target="_blank"> <img src="https://contrib.rocks/image?repo=artrajz/vits-simple-api"/></a>