AI chat (LLM) with text to speech (TTS)

Testing xtts_v2 when chatting with a large language model (LLM). Ask the AI a question and the answer is read back to you.

https://github.com/Scthe/ai-chat-with-tts/assets/9325337/97c44378-b0b9-47cc-af5a-06c98d37bb5a

In the video, we ask the AI "Who is Michael Jordan?" and the answer is read back to us. It uses xtts_v2 with voice cloning for a custom voice, which is the most computationally expensive option. I think it sounds nice! It can even pronounce "NBA". I like the emphasis on "the greatest". Can't argue with that.

Usage

Install ollama to access LLM models

  1. Download ollama from https://ollama.com/download.
  2. ollama pull gemma:2b. Pull the model file, e.g. gemma:2b.
  3. Verification:
    1. ollama show gemma:2b --modelfile. Inspect model file data.
    2. ollama run gemma:2b. Open the chat in the console to check everything is OK.
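
If you prefer to script the check instead of using the interactive console, something like the sketch below works against Ollama's local REST API (default port 11434). It assumes the Ollama server is running and gemma:2b has been pulled.

```python
# Minimal sanity check against Ollama's local REST API (default port 11434).
# Assumes the Ollama server is running and gemma:2b has been pulled.
import json
import urllib.request

payload = json.dumps({
    "model": "gemma:2b",
    "prompt": "Reply with one short sentence: who is Michael Jordan?",
    "stream": False,  # ask for a single JSON object instead of a stream
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```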

Running this app

  1. pip install -r requirements.txt. Install dependencies.
  2. pip install TTS. You might already have TTS installed, since it's the same package used to train your own models. In that case, see inject_external_torch_into_path.py below. Otherwise, this step is required. Python 3.10 is recommended; TTS has version checks for this (it fails on 3.12.2).
  3. (Optional) Install CUDA-enabled PyTorch.
  4. python.exe main.py --config "config_xtts.yaml". Start the app using ./config_xtts.yaml.
  5. Go to http://localhost:8080/index.html.

Alternatively, use python.exe main.py --config "config.yaml" for a much smaller tacotron2-DDC TTS model.
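
For reference, this is roughly what the smaller model looks like through the Coqui TTS Python API. This is a sketch, not necessarily how the app wires it up internally; the model id below is the standard Coqui name for tacotron2-DDC.

```python
# Sketch: synthesize with the small single-speaker tacotron2-DDC model
# via the Coqui TTS Python API. Not the app's internal code path.
from TTS.api import TTS

tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello from tacotron2.", file_path="hello.wav")
```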

If you don't want to install PyTorch with CUDA again (2.7+ GB), add the correct directory to ./src/inject_external_torch_into_path.py.
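
The file already exists in the repo; conceptually it only needs to prepend your existing CUDA-enabled torch install to sys.path, roughly like the hypothetical sketch below (the path is a placeholder, use your own environment's site-packages).

```python
# Hypothetical sketch of the sys.path injection; edit the real
# ./src/inject_external_torch_into_path.py with your own directory.
import sys

# Placeholder: site-packages of an environment that already has CUDA torch.
EXTERNAL_SITE_PACKAGES = r"C:\path\to\other\venv\Lib\site-packages"

# Prepend so that torch install takes priority over a CPU-only copy.
sys.path.insert(0, EXTERNAL_SITE_PACKAGES)
```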

Config

You can find the config value descriptions in ./config.yaml. The most important are:

Enabling voice cloning

In ./config_xtts.yaml, set tts.sample_of_cloned_voice_wav to point to a WAV audio file, e.g. sample_of_cloned_voice_wav: 'voice_to_clone.wav'. Requirements:

Audacity for audio editing works fine.
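
For context, this is roughly what xtts_v2 voice cloning looks like through the Coqui TTS Python API with the sample configured above. A sketch only, not the app's internal wiring.

```python
# Sketch: xtts_v2 voice cloning via the Coqui TTS Python API.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="Who is Michael Jordan?",
    speaker_wav="voice_to_clone.wav",  # the sample_of_cloned_voice_wav file
    language="en",
    file_path="cloned.wav",
)
```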

Other commands

./main.py starts the server. Various util tools are in ./tts_scripts.py (see examples in ./makefile):
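
For example, listing the speakers bundled with a multi-speaker model boils down to something like this hypothetical stand-in for the actual tts_scripts.py utility:

```python
# Hypothetical stand-in for a tts_scripts.py utility: print the speakers
# bundled with a multi-speaker Coqui model. Single-speaker models like
# tacotron2-DDC have no speaker list.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
print(tts.speakers)
```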

FAQ

Q: Which models did you use?

LLM: gemma:2b via Ollama. TTS: xtts_v2 (config_xtts.yaml) or the much smaller tacotron2-DDC (config.yaml).

Q: Why xtts_v2?

It performed best for me in Hugging Face's blind tests.

Q: How is this app parallelized?

There are 2 variants, based on tts.chunk_size:

An interesting option is to use the GPU for the LLM and the CPU for TTS. Unfortunately, depending on the TTS model, the CPU might struggle. And while I'm not fluent in the Python threading model, even if you push TTS to a separate thread, it will affect/starve the event loop. Yes, the code is already fully async/await. No, it does not matter.
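
To illustrate the event loop point, here is a minimal sketch (not the app's code) of pushing a blocking TTS call onto a thread pool: the loop stays responsive to other requests, but the GIL still lets heavy CPU-bound synthesis starve everything else.

```python
# Minimal sketch of offloading a blocking TTS call so `await` doesn't
# freeze the asyncio event loop. Heavy CPU work still competes for the GIL.
import asyncio
import time

def synthesize_blocking(text: str) -> bytes:
    time.sleep(2)  # stand-in for a slow, synchronous TTS call
    return b"fake-wav-bytes"

async def handle_request(text: str) -> bytes:
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, synthesize_blocking, text)

print(len(asyncio.run(handle_request("Who is Michael Jordan?"))))
```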

Q: How fast is it?

The first response is always the slowest. The models for the LLM and TTS (and the optional voice converter) have to be loaded into memory. The LLM for me is near-instantaneous (RTX 3060 with 12GB VRAM and gemma:2b). TTS depends on the size of the model and voice cloning. xtts_v2 with voice cloning (as seen in the video above) is the most expensive option.

Q: I get nothing in response?

  1. Check that there are no other apps that have loaded models onto the GPU (video games, Stable Diffusion, etc.). Even if they are idle at the moment, they still take up VRAM.
  2. Close Ollama.
  3. Make sure VRAM usage is at 0 (see the snippet after this list).
  4. Start Ollama.
  5. Restart the app.
  6. Ask a question to load all models into VRAM.
  7. Check you are not running out of VRAM.
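
For step 3, a quick way to eyeball VRAM usage from Python (nvidia-smi in a terminal works just as well):

```python
# Print current VRAM usage on the default CUDA device.
import torch

if torch.cuda.is_available():
    free_b, total_b = torch.cuda.mem_get_info()
    used_gb = (total_b - free_b) / 1024**3
    print(f"VRAM in use: {used_gb:.1f} GiB of {total_b / 1024**3:.1f} GiB")
else:
    print("No CUDA device visible to PyTorch.")
```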

Q: How can I list the speakers available in my model?

See the "Other commands" section above.

References

I've copied the whole UI and a lot of the backend from my previous project: retrieval-augmented generation with context.

External packages: