
Open-LLM-VTuber

中文 (Chinese README)

GitHub release license

BuyMeACoffee <- (Clickable links)

(QQ group: 792615362) <- way more active than the Discord server, with over 700 members and most of the contributors

Common Issues doc (常见问题, written in Chinese): https://docs.qq.com/doc/DTHR6WkZ3aU9JcXpy

User Survey: https://forms.gle/w6Y6PiHTZr1nzbtWA

调查问卷 (Chinese survey, login no longer required): https://wj.qq.com/s2/16150415/f50a/

:warning: This project is in its early stages and is currently under active development. Features are unstable, code is messy, and breaking changes will occur. The main goal of this stage is to build a minimum viable prototype using technologies that are easy to integrate.

:warning: This project is NOT easy to install. Join the Discord server or QQ group if you need help or to get updates about this project.

:warning: If you want to run this program on a server and access it remotely from your laptop, the microphone on the frontend will only launch in a secure context (i.e., https or localhost). See the MDN Web Docs. Therefore, you should configure https with a reverse proxy to access the page on a remote (non-localhost) machine.

You are right if you think this README and the docs are super duper messy! A complete refactoring of the documentation is planned. In the meantime, you can watch the installation videos if you speak Chinese.

❓ What is this project?

Open-LLM-VTuber allows you to talk to (and interrupt!) any LLM locally by voice (hands-free) with a Live2D talking face. The LLM inference backend, speech recognition, and speech synthesizer are all designed to be swappable. This project can be configured to run offline on macOS, Linux, and Windows. Online LLM/ASR/TTS options are also supported.

Long-term memory with MemGPT can be configured to achieve perpetual chat, infinite* context length, and external data sources.

This project started as an attempt to recreate the closed-source AI VTuber neuro-sama with open-source alternatives that can run offline on platforms other than Windows.

<img width="500" alt="demo-image" src="https://github.com/t41372/Open-LLM-VTuber/assets/36402030/fa363492-7c01-47d8-915f-f12a3a95942c"/>

Demo

English demo:

https://github.com/user-attachments/assets/f13b2f8e-160c-4e59-9bdb-9cfb6e57aca9

English Demo: YouTube

Chinese demo:

BiliBili, YouTube

Why this project and not other similar projects on GitHub?

Basic Features

Target Platform

Recent Feature Updates

Check out the GitHub Releases page for update notes.

Implemented Features

Currently supported LLM backends

Currently supported speech recognition backends

Currently supported text-to-speech backends

Fast Text Synthesis

Live2D Talking face

Live2D technical details

Quick Start

If you speak Chinese, there are two installation videos for you.

If you don't speak Chinese, good luck. Let me know if you create one in another language so I can put it here.

New installation instructions are being created here

One-click gogo script

A new quick start script (experimental) was added in v0.4.0. This script allows you to get this project running without worrying (too much) about the dependencies. The only things you need for this script are Python, a good internet connection, and enough disk space.

This script will do the following:

Run the script with python start_webui.py. Note that you should always use start_webui.py as the entry point if you decide to use the auto-installation script because server.py doesn't start the conda environment for you.

Also note that if you want to install other dependencies, you need to enter the auto-configured conda environment first by running python activate_conda.py

Manual installation

In general, there are 4 steps involved in getting this project running:

  1. Basic setup
  2. Get the LLM (large language model)
  3. Get the TTS (text-to-speech)
  4. Get the ASR (speech recognition)

Requirements:

Clone this repository.

A virtual Python environment like conda or venv is strongly recommended (because the dependencies are a mess!).

Run the following in the terminal to install the basic dependencies.

pip install -r requirements.txt # Run this in the project directory 
# Install Speech recognition dependencies and text-to-speech dependencies according to the instructions below

Edit conf.yaml to configure the project. You can follow the configuration used in the demo video.

Once the Live2D model appears on the screen, it's ready to talk to you.

~~If you don't want the Live2D model, you can run main.py with Python for CLI mode.~~ (CLI mode is deprecated and will be removed in v1.0.0. If people still want a CLI mode, we may build a CLI client in the future, but the current architecture will be refactored very soon.)

Some models will be downloaded on your first launch, which may require an internet connection and may take a while.

Update

🎉 A new experimental update script was added in v0.3.0. Run python upgrade.py to update to the latest version.

Back up the configuration file conf.yaml if you've edited it, and then update the repo. Or just clone the repo again and make sure to transfer your configuration. The configuration file will sometimes change because this project is still in its early stages. Be cautious when updating the program.

Configure LLM

OpenAI-compatible LLMs such as Ollama, LM Studio, vLLM, Groq, ZhiPu, Gemini, OpenAI, and more

Set the LLM_PROVIDER option in conf.yaml to ollama and fill in the settings.

If you use the official OpenAI API, the base_url is https://api.openai.com/v1.
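
If you want to sanity-check an OpenAI-compatible endpoint before wiring it into conf.yaml, a minimal sketch with the official openai Python package looks roughly like this. The base_url and model name below are assumptions for a default local Ollama install; adjust them to whatever your backend exposes, and check the comments inside conf.yaml for the exact configuration keys.

```python
# Minimal sketch: verify an OpenAI-compatible endpoint outside of this project.
# Assumes a local Ollama server on its default port with a pulled "llama3" model;
# adjust base_url, api_key, and model to match your backend.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # e.g. Ollama's OpenAI-compatible endpoint
    api_key="not-needed-for-local",        # local backends usually ignore the key
)

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(response.choices[0].message.content)
```

If this script prints a reply, the same base_url and model should work in conf.yaml.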

Claude

Claude support was added in v0.3.1 in https://github.com/t41372/Open-LLM-VTuber/pull/35

Change the LLM_PROVIDER to claude and complete the settings under claude.

llama.cpp (added in v0.5.0-alpha.2)

Provides a way to run an LLM within this project without any external tools like Ollama. All you need is a .gguf model file.

Requirements

According to the llama-cpp-python repo:

This will also build llama.cpp from source and install it alongside this Python package.

If this fails, add --verbose to the pip install command to see the full cmake build log.

Installation

Find the pip install llama-cpp-python command for your platform here.

For example:

If you use an Nvidia GPU, run this:

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

If you use an Apple Silicon Mac (like I do), do this:

CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python

If you use an AMD GPU that supports ROCm:

CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python

If you want to use the CPU (OpenBLAS):

CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python

For more options, check here.
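
Once llama-cpp-python is installed, you can quickly confirm the build works before pointing conf.yaml at your model. This is a standalone sketch, not how this project loads the model internally; the .gguf path is a placeholder.

```python
# Standalone sanity check for a llama-cpp-python install.
# The model path below is a placeholder; point it at your own .gguf file.
from llama_cpp import Llama

llm = Llama(model_path="./models/your-model.gguf", n_ctx=2048)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=64,
)
print(output["choices"][0]["message"]["content"])
```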

MemGPT (Broken and will probably be removed rather than fixed)

:warning: MemGPT was renamed to Letta, and they changed their API. Currently, the integration of MemGPT in this project has not been updated with the latest changes, so the integration is broken. It probably won't get fixed because MemGPT (or Letta now) is quite slow and unstable for local LLMs. A new long-term memory solution is planned.

However, you can still get the old version of MemGPT and try it out. Here is the documentation.

MemGPT integration is very experimental and requires quite a lot of setup. In addition, MemGPT requires a powerful LLM (larger than 7B and with quantization above Q5) with a large token footprint, which means it's a lot slower. MemGPT does have its own free LLM endpoint, though, which you can use for testing. Check their docs.

This project can use MemGPT as its LLM backend. MemGPT gives the LLM long-term memory.

To use MemGPT, you need to have the MemGPT server configured and running. You can install it using pip or docker or run it on a different machine. Check their GitHub repo and official documentation.

:warning: I recommend installing MemGPT either in a separate Python virtual environment or in Docker because there is currently a dependency conflict between this project and MemGPT (on FastAPI, it seems). You can check this issue: Can you please upgrade typer version in your dependancies #1382.

Here is a checklist:

Mem0 (it turns out it's not very good for our use case, but the code is here...)

Another long-term memory solution. Still in development. Highly experimental.

Pros

Cons

Install Speech Recognition (ASR)

Edit the ASR_MODEL setting in conf.yaml to change the provider.

Here are the options you have for speech recognition:

sherpa-onnx (local, runs very fast) (added in v0.5.0-alpha.1 in https://github.com/t41372/Open-LLM-VTuber/pull/50)

FunASR (local) (Runs very fast even on CPU. Not sure how they did it)

Faster-Whisper (local)

WhisperCPP (local) (runs super fast on a Mac if configured correctly)

WhisperCPP coreML configuration:

Whisper (local)

GroqWhisperASR (online, API Key required)

AzureASR (online, API Key required)

Install Speech Synthesis (text to speech) (TTS)

Install the respective package and turn it on using the TTS_MODEL option in conf.yaml.

sherpa-onnx (local) (added in v0.5.0-alpha.1 in https://github.com/t41372/Open-LLM-VTuber/pull/50)

pyttsx3TTS (local, fast)

meloTTS (local, fast)

coquiTTS (local, can be fast or slow depending on the model you run)

GPT_Sovits (local, medium fast) (added in v0.4.0 in https://github.com/t41372/Open-LLM-VTuber/pull/40)

barkTTS (local, slow)

cosyvoiceTTS (local, slow)

xTTSv2 (local, slow) (added in v0.2.4 in https://github.com/t41372/Open-LLM-VTuber/pull/23)

edgeTTS (online, no API key required)

fishAPITTS (online, API key required) (added in v0.3.0-beta)

AzureTTS (online, API key required) (This is the exact same TTS used by neuro-sama)

If you're using macOS, you need to enable the microphone permission of your terminal emulator (you run this program inside your terminal, right? Enable the microphone permission for your terminal). If you fail to do so, the speech recognition will not be able to hear you because it does not have permission to use your microphone.

VAD Tuning

For the web interface, this project uses client-side Voice Activity Detection (VAD) via the ricky0123/vad-web library for efficient speech detection.

Web Interface Controls:

The following settings are available in the web interface to fine-tune the VAD:

Tuning Tips:

Experiment with these parameters to find the optimal balance between sensitivity and accuracy for your environment and speaking style.

Some other things

Translation

Translation was implemented to let the program speak in a language different from the conversation language. For example, the LLM might be thinking in English, the subtitle is in English, and you are speaking English, but the voice of the LLM is in Japanese. This is achieved by translating the sentence before it's sent for audio generation.

DeepLX is the only supported translation backend for now. You will need to deploy the DeepLX service and set the configuration in conf.yaml to use it.
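
For a rough idea of what the DeepLX backend does under the hood, here is a hedged sketch of a translation request against a locally deployed DeepLX service. The port and endpoint below follow DeepLX's commonly documented defaults and are not taken from this project's code, so adjust them to your deployment.

```python
# Rough sketch of a DeepLX translation request (not this project's translate module).
# Assumes DeepLX is running locally on its documented default port 1188.
import requests

resp = requests.post(
    "http://localhost:1188/translate",
    json={"text": "Hello there!", "source_lang": "EN", "target_lang": "JA"},
)
print(resp.json())  # on success, the translated text is in the "data" field
```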

If you want to add more translation providers, they are in the translate directory, and the steps are very similar to adding new TTS or ASR providers.

Enable Audio Translation

  1. Set TRANSLATE_AUDIO in conf.yaml to True
  2. Set DEEPLX_TARGET_LANG to your desired language. Make sure this language matches the language of the TTS speaker (for example, if DEEPLX_TARGET_LANG is "JA", which is Japanese, the TTS should also speak Japanese).

Issues

PortAudio Missing

Running in a Container [highly experimental]

:warning: This is highly experimental, but I think it works. Most of the time.

You can either build the image yourself or pull it from Docker Hub.

Current issues:

Most of the ASR and TTS options will be pre-installed. However, bark TTS and the original OpenAI Whisper (Whisper, not WhisperCPP) are NOT included in the default build process because they are huge (~8GB, which makes the whole container about 25GB) and they don't deliver the best performance either. To include bark and/or whisper in the image, add the arguments --build-arg INSTALL_ORIGINAL_WHISPER=true --build-arg INSTALL_BARK=true to the image build command.

Setup guide:

  1. Review conf.yaml before building (currently burned into the image, I'm sorry):

  2. Build the image:

docker build -t open-llm-vtuber .

(Grab a drink, this will take a while)

  3. Grab a conf.yaml configuration file from this repo, or get it directly from this link.

  4. Run the container:

In the command below, $(pwd)/conf.yaml should be the path to your conf.yaml file.

docker run -it --net=host --rm -v $(pwd)/conf.yaml:/app/conf.yaml -p 12393:12393 open-llm-vtuber

  5. Open localhost:12393 to test

🎉🎉🎉 Related Projects

ylxmf2005/LLM-Live2D-Desktop-Assitant

🛠️ Development

(this project is in the active prototyping stage, so many things will change)

Some abbreviations used in this project:

Regarding sample rates

You can assume that the sample rate is 16000 throughout this project. The frontend streams chunks of Float32Array with a sample rate of 16000 to the backend.
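
For reference, if you are writing backend code that consumes these chunks, a minimal sketch for turning the raw Float32 samples into 16-bit PCM (which many ASR backends expect) could look like the following. The little-endian byte order and the int16 target are assumptions for illustration, not a spec of the project's protocol.

```python
# Minimal sketch: convert one chunk of little-endian float32 samples
# (range -1.0..1.0, 16000 Hz) into 16-bit PCM.
import numpy as np

def float32_chunk_to_int16(raw_bytes: bytes) -> np.ndarray:
    samples = np.frombuffer(raw_bytes, dtype="<f4")  # little-endian float32
    samples = np.clip(samples, -1.0, 1.0)            # guard against overshoot
    return (samples * 32767).astype(np.int16)        # 16-bit PCM at 16 kHz
```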

Add support for new TTS providers

  1. Implement TTSInterface defined in tts/tts_interface.py (a rough sketch follows this list).
  2. Add your new TTS provider into tts_factory: the factory to instantiate and return the TTS instance.
  3. Add configuration to conf.yaml. The dict with the same name will be passed into the constructor of your TTSEngine as kwargs.
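
As a rough illustration of the pattern, here is a hypothetical provider skeleton. The generate_audio method name and signature are assumptions for illustration only; the real contract is whatever tts/tts_interface.py defines, so copy the abstract methods from there.

```python
# Hypothetical sketch of a new TTS provider. The generate_audio name and
# signature are placeholders; follow tts/tts_interface.py for the actual
# methods you must implement.
from typing import Optional

from tts.tts_interface import TTSInterface


class MyTTSEngine(TTSInterface):
    def __init__(self, voice: str = "default", **kwargs):
        # kwargs come from the section with the same name in conf.yaml
        self.voice = voice

    def generate_audio(self, text: str, file_name_no_ext: Optional[str] = None) -> str:
        # Synthesize `text` with your TTS backend, write it to an audio file,
        # and return the path so the server can play it back.
        audio_path = f"{file_name_no_ext or 'output'}.wav"
        ...  # call your TTS backend here and write the file at audio_path
        return audio_path
```

The same pattern applies to the ASR, LLM, and translation providers described below.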

Add support for new Speech Recognition provider

  1. Implement ASRInterface defined in asr/asr_interface.py.
  2. Add your new ASR provider into asr_factory: the factory to instantiate and return the ASR instance.
  3. Add configuration to conf.yaml. The dict with the same name will be passed into the constructor of your class as kwargs.

Add support for new LLM provider

  1. Implement LLMInterface defined in llm/llm_interface.py.
  2. Add your new LLM provider into llm_factory: the factory to instantiate and return the LLM instance.
  3. Add configuration to conf.yaml. The dict with the same name will be passed into the constructor of your class as kwargs.

Add support for new Translation providers

  1. Implement TranslateInterface defined in translate/translate_interface.py.
  2. Add your new translation provider into translate_factory: the factory to instantiate and return the translator instance.
  3. Add configuration to conf.yaml. The dict with the same name will be passed into the constructor of your translator as kwargs.

Acknowledgement

Awesome projects I learned from

Star History

Star History Chart