Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM

<p align="center"> <img src="./assets/logo.png" width="70%" height="70%"> </p>

<font size=7><div align='center'> [🍎 Project (Demo) Page] [📖 arXiv Paper] [🤗 Hugging Face] [💬 WeChat (微信)] </div></font>


🔥 News


👀 Freeze-Omni Overview

Freeze-Omni is a speech-to-speech dialogue model that stays "smart" because it is built on a "frozen" text-modality LLM. Freezing the backbone preserves the LLM's original intelligence and shields it from the forgetting problem that fine-tuning for speech-modality integration would otherwise induce. Specifically, Freeze-Omni pairs a speech encoder that supports streaming speech input with a speech decoder that generates streaming output speech. Three key strategies are adopted to implement the speech-to-speech dialogue system, as illustrated in the overview below:

<p align="center"> <img src="./assets/overview.png" width="88%" height="88%"> </p>
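As a rough illustration of the dataflow only (not the repo's actual API), the sketch below shows how streaming chunks could move through the three components; `SpeechEncoder`, `FrozenLLM`, `SpeechDecoder`, and `DialogueState` are hypothetical stand-ins:

```python
# Hypothetical sketch of the Freeze-Omni dataflow described above -- not the
# repo's actual API. Per-user caches are kept outside the models (see the
# "Model as a Server" note below).
from dataclasses import dataclass, field

@dataclass
class DialogueState:
    """Per-user inference caches: the models themselves hold no user state."""
    cnn_cache: list = field(default_factory=list)  # streaming speech-encoder cache
    kv_cache: list = field(default_factory=list)   # frozen-LLM attention cache

def respond(encoder, llm, decoder, state, speech_chunks):
    """Consume streaming input chunks, then stream synthesized speech out."""
    for chunk in speech_chunks:                               # streaming input
        feats = encoder.encode_chunk(chunk, state.cnn_cache)
        llm.prefill(feats, state.kv_cache)                    # LLM weights stay frozen
    for token in llm.generate(state.kv_cache):                # text-modality backbone
        yield from decoder.synthesize(token)                  # streaming output speech
```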

In addition, we implement a Model-as-a-Server strategy. Several models are started simultaneously and treated as a single server. When a user's VAD is triggered, the speech is sent to the server in chunks, and the server schedules an idle model to respond to the current chunk. Because all the KV cache and CNN cache of the speech encoder and LLM are kept separate from the models during inference, the server only needs to store the inference cache for each user. Any model in the server can therefore respond to any chunk from any user, and there is no need to designate a particular model as a monitor or a generator.
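A minimal sketch of that scheduling idea, assuming a hypothetical `worker.process(chunk, state)` interface; the point is only that per-user caches travel with the request, so any idle worker can serve it:

```python
# Hypothetical "Model as a Server" scheduler -- an illustration of the idea,
# not the repo's server.py. Per-user KV/CNN caches live in `states`, so any
# idle worker can pick up any user's next chunk.
import queue

class ModelServer:
    def __init__(self, workers):
        self.idle = queue.Queue()
        for w in workers:              # models are started up front
            self.idle.put(w)
        self.states = {}               # user_id -> per-user inference caches

    def on_chunk(self, user_id, chunk):
        """Called when a user's VAD fires and a speech chunk arrives."""
        state = self.states.setdefault(user_id, {"kv_cache": [], "cnn_cache": []})
        worker = self.idle.get()       # block until some model is free
        try:
            return worker.process(chunk, state)  # caches travel with the user
        finally:
            self.idle.put(worker)      # worker is immediately reusable
```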

📈 Experimental Results

<p align="center"> <img src="./assets/asr_res.png" width="70%" height="70%"> </p>
<p align="center"> <img src="./assets/out_cer.png" width="50%" height="50%"> </p>
<p align="center"> <img src="./assets/qa.png" width="70%" height="70%"> </p>
<p align="center"> <img src="./assets/latency.png" width="70%" height="70%"> </p>

📐 Inference

Requirements and Installation

Environment requirements and setup:

```bash
git clone https://github.com/VITA-MLLM/Freeze-Omni
cd Freeze-Omni
conda create -n freeze-omni python=3.10 -y
conda activate freeze-omni
pip install --upgrade pip
pip install -r requirements.txt
```
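After installation, a quick sanity check (this assumes requirements.txt pulls in PyTorch, which GPU inference needs) confirms that a GPU is visible:

```python
# Environment sanity check; assumes requirements.txt installs PyTorch.
import torch

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```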

Required weights: the Freeze-Omni checkpoints (expected under ./checkpoints in the commands below) and the Qwen2-7B-Instruct LLM (expected under ./Qwen2-7B-Instruct).
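One possible way to fetch them, assuming both repos are hosted on Hugging Face under the IDs below (verify against the 🤗 Hugging Face link above):

```python
# Hypothetical download helper; verify the repo IDs on Hugging Face first.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="VITA-MLLM/Freeze-Omni", local_dir="./checkpoints")
snapshot_download(repo_id="Qwen/Qwen2-7B-Instruct", local_dir="./Qwen2-7B-Instruct")
```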

Quick Start

From Python command

```bash
export PYTHONPATH=./:$PYTHONPATH
CUDA_VISIBLE_DEVICES=0 python3 bin/inference.py \
    --model_path ./checkpoints \
    --input_wav ./assets/question.wav \
    --output_wav ./assets/answer.wav \
    --llm_path ./Qwen2-7B-Instruct \
    --top_p 0.8 \
    --top_k 20 \
    --temperature 0.8
```
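To answer several recordings in one go, a small hypothetical wrapper can re-invoke the documented CLI per file with the same flags as above; adjust the paths to your checkout:

```python
# Hypothetical batch driver around bin/inference.py; mirrors the flags shown
# above and skips files it has already answered.
import os
import subprocess
from pathlib import Path

env = {**os.environ, "PYTHONPATH": "./", "CUDA_VISIBLE_DEVICES": "0"}
for wav in sorted(Path("assets").glob("*.wav")):
    if wav.stem.endswith("_answer"):   # don't re-process generated answers
        continue
    subprocess.run(
        ["python3", "bin/inference.py",
         "--model_path", "./checkpoints",
         "--input_wav", str(wav),
         "--output_wav", str(wav.with_name(wav.stem + "_answer.wav")),
         "--llm_path", "./Qwen2-7B-Instruct",
         "--top_p", "0.8", "--top_k", "20", "--temperature", "0.8"],
        check=True, env=env,
    )
```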

From script

```bash
sh scripts/run_inference.sh
```

Real-Time Interactive Demo

To have a good interactive experience, please pay attention to the following three points:

From Python command

```bash
export PYTHONPATH=./:$PYTHONPATH
CUDA_VISIBLE_DEVICES=0 python3 bin/server.py \
    --ip your_server_ip \
    --port your_server_port \
    --max_users 3 \
    --llm_exec_nums 1 \
    --timeout 180 \
    --model_path ./checkpoints \
    --llm_path ./Qwen2-7B-Instruct \
    --top_p 0.8 \
    --top_k 20 \
    --temperature 0.8
```
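Once the server is up, a minimal reachability check is possible; this assumes only that server.py listens on the address you passed via --ip/--port (the placeholders below are yours to fill in):

```python
# Minimal reachability check; assumes only that the demo server listens on
# the --ip/--port passed above. Replace the placeholder values with yours.
import socket

SERVER_IP = "your_server_ip"   # placeholder: same value as --ip
SERVER_PORT = 8081             # placeholder: same value as --port

with socket.create_connection((SERVER_IP, SERVER_PORT), timeout=5):
    print("demo server port is reachable")
```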

From script

Change the IP and port in scripts/run_demo_server.sh to yours and run:

```bash
sh scripts/run_demo_server.sh
```

✒️ Citation

If you find our work helpful for your research, please consider citing it:

```bibtex
@article{xiong2024freeze,
  title={Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM},
  author={Xiong Wang and Yangze Li and Chaoyou Fu and Yunhang Shen and Lei Xie and Ke Li and Xing Sun and Long Ma},
  journal={arXiv preprint arXiv:2411.00774},
  year={2024}
}
```

📣 Statement

Freeze-Omni is trained on a large-scale corpus, and its output has inherent randomness. Content generated by Freeze-Omni does not represent the views of the model developers. We are not responsible for any problems arising from the use, misuse, or dissemination of Freeze-Omni, including but not limited to public-opinion risks and data-security issues.

📜 Related Works

Explore our related research:

👍 Acknowledgement

Freeze-Omni is built with reference to the following outstanding works: Qwen2-7B-Instruct, TiCodec