OpenOmni Framework

Multimodal Open Source Framework for Conversational Agents Research and Development.




Motivation

The development of multimodal Large Language Models (LLMs) is opening new frontiers in conversational agent research and applications. Multimodal end-to-end conversational agents represent a significant advancement in our pursuit of general AI. This progress, however, is not without its challenges: balancing cost, accuracy, and latency remains a difficult task.

<div style="text-align: center;"> <img src="./docs/images/Triangle.jpg" alt="OpenOmni" width="300"/> </div>

GPT-4o has demonstrated the capability of a fully end-to-end multimodal model to handle complex multimodal inputs, including audio and images, and to generate coherent and contextually appropriate responses in a timely manner. Traditional approaches chain speech-to-text, text generation, and text-to-speech, and suffer from latency issues that hinder their real-world applicability; GPT-4o has shown the potential to overcome these challenges. This showcases the potential of advanced AI models to navigate the rough path towards a bright future in conversational agents.

Despite these advancements, achieving the right balance between cost, accuracy, and latency remains a significant hurdle. This difficulty is the motivation behind our project. We aim to establish an open-source framework that allows researchers to easily test their contributions and collaborate effectively. By doing so, we hope to facilitate the research and development process, accelerating the journey towards more efficient and capable multimodal conversational agent systems.


Video Demonstration

OpenOmni


Approaches

Traditional conversational AI

./docs/images/OpenOmni.jpg

The whole pipeline is divided into several components, mainly speech-to-text, text generation, and text-to-speech.

Before the era of LLMs, Speech2Text and Text2Speech were the most successful tasks, and they led to quite a few successful products. Generating in-context responses, however, was a challenging task and the bottleneck of the whole pipeline: this is the Accuracy issue. The development of LLMs addresses this problem directly.

However, one more problem, the Latency issue, still has to be solved before this pipeline can have a serious impact on the real world.

OpenAI GPT-4o

./docs/images/GPT-4o.jpg

The demonstration from OpenAI indicates that they have achieved acceptable Latency performance, at around 250 ms. The Accuracy is also impressive, according to their released benchmark: GPT-4o Benchmark.

However, we noticed that a cable was connected to the iPhone during the demonstration, described as a way to maintain a stable, high-speed internet connection. This means that reasonable latency is achievable but not yet scalable to the real world. The delayed public release of GPT-4o's audio and video abilities supports this point to some extent. Our testing with the current GPT-4o endpoints indicates that even the fastest pipeline combination built on the OpenAI API (Whisper, GPT-3.5, OpenAI TTS) still has 5-8 seconds of latency.

Hybrid Approach

./docs/images/VoiceE2E.jpg

Unlike GPT-4o, which takes audio, video, and/or text as input and is challenging to implement, some researchers are exploring a hybrid approach that takes audio and/or text as input and generates audio and/or text as output.

Moshi is one example. Images and videos are normally very large and require a lot of computation power to process; this approach avoids that issue. Another advantage is that, by not taking video as input, it avoids the privacy issue: it does not act like "Big Brother is watching you", and is potentially more acceptable to the public.

Summary

The research focus for multimodal conversational agents is shifting towards end-to-end models; however, it is a **money** game. To achieve that, you need intelligent people (which the research community has), data, and money. Data and money are not evenly distributed within the research community, which makes it hard for the wider community to catch up with cutting-edge research.

Sad to say, the reality is that the rest of the research community, without enough data and money, can work in a different direction: adopting and applying conversational agents in the real world and making them more accessible to the public, contributing to the bigger picture as a small piece of the puzzle.

This leads to the development and goals of the OpenOmni Framework project.

  1. Collaborative Efforts: Crowd efforts where everyone focuses on advancing their own part without reinventing the wheel.
  2. High Agility: A process that allows people to easily test their models, understand real application issues, and improve them.
  3. Gathering Data: More data to advance system development, especially for complex conversational scenarios. This includes understanding who is talking, the context, and the emotion, which is lacking in current datasets.

We want to:

Our ultimate goal is to:


System Design

Requirements

From the system perspective, we need the following functions:

  1. Data Collection: Video and audio inputs are collected from hardware devices.
  2. Data Transfer: Data is sent to the API for downstream processing.
  3. Data Processing: The API uses Agents and ML models to process the data and generate responses.
  4. Reaction: The client side is notified to play the speech.
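
The four functions above can be sketched as one pass through a loop. This is a minimal illustration under stated assumptions, not the OpenOmni implementation: the names `Capture`, `collect`, `transfer`, `process`, `react`, and `echo_model` are hypothetical, and both the hardware and the model are replaced by stubs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Capture:
    """Hypothetical container for one chunk of recorded input."""
    audio: bytes
    video: bytes


def collect() -> Capture:
    # 1. Data Collection: stand-in for reading from a microphone and camera.
    return Capture(audio=b"hello", video=b"")


def transfer(capture: Capture) -> dict:
    # 2. Data Transfer: stand-in for sending the data to the API.
    return {"audio": capture.audio, "video": capture.video}


def process(payload: dict, model: Callable[[bytes], str]) -> str:
    # 3. Data Processing: the API hands the data to an agent or ML model.
    return model(payload["audio"])


def react(response: str) -> str:
    # 4. Reaction: stand-in for notifying the client to play the speech.
    return f"PLAY: {response}"


def echo_model(audio: bytes) -> str:
    # Stub model: decodes the bytes instead of running real inference.
    return audio.decode("utf-8").upper()


def run_once() -> str:
    # Chain the four stages in order for a single input chunk.
    return react(process(transfer(collect()), echo_model))


print(run_once())
```

Keeping each stage behind its own function is what lets a real deployment move collection to a cheap device and processing to a GPU machine without changing the overall flow.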

There are two key metrics we are focusing on: latency and accuracy. Together they allow the system to react to the user's query in a timely manner and to provide the most in-context response.

System Architecture

Architecture

Main Components

As the system architecture shows, we have three main components: Client, API, and Agent.

The Client is in charge of data acquisition and data presentation: video and audio data are collected on the client side, and the speech generated by the Agent is played on the client side.

The API is the orchestrator: it manages the models and provides the API interface that the Client and Agent modules access. It also provides access to, and CRUD operations on, the data sources (Neo4j, PostgreSQL, and hard disk), and includes an interface for end users (researchers and annotators) to interact with the system.

The Agent module is separated out so that it can be scaled across different computation resources and so that new work is easy to integrate.

All modules are written in Python.

Client

Detailed information can be found in Client.

The client side is mainly in charge of data acquisition and data presentation.

API

Detailed information can be found in API.

This is the Brain of the system, which will be in charge of:

Agent

Detailed information can be found in Agent.


Benchmark Examples

We provide benchmarks for both latency and accuracy.

Latency

When we log time within the pipeline, we record both the timestamps of critical time points and the duration of each specific task. This lets us generate a latency report for each component in the pipeline and determine whether latency is caused by data transfer or by model inference, so that we can optimize the system further. The timestamps also make it easy to reconstruct the timeline of a run; this observability helps us debug and notice potential issues.
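
The logging idea can be sketched as follows. This is an illustrative sketch, not the framework's actual logger: the `LatencyLog` class, its `track` method, and the record field names are assumptions made for this example. Each record keeps wall-clock timestamps for the timeline and a monotonic duration for the report, and is tagged as either transfer or inference.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class LatencyLog:
    """Hypothetical collector of timestamped latency records."""

    def __init__(self):
        self.records = []

    @contextmanager
    def track(self, component, kind):
        # kind is "transfer" or "inference" in this sketch.
        started_at = time.time()      # wall-clock timestamp for the timeline
        t0 = time.perf_counter()      # monotonic clock for the duration
        try:
            yield
        finally:
            self.records.append({
                "component": component,
                "kind": kind,
                "started_at": started_at,
                "ended_at": time.time(),
                "duration": time.perf_counter() - t0,
            })

    def report(self):
        # Total duration per (component, kind) pair.
        totals = defaultdict(float)
        for record in self.records:
            totals[(record["component"], record["kind"])] += record["duration"]
        return dict(totals)

    def timeline(self):
        # Records ordered by when each task started.
        return sorted(self.records, key=lambda record: record["started_at"])


log = LatencyLog()
with log.track("speech2text", "transfer"):
    time.sleep(0.01)   # stand-in for uploading audio
with log.track("speech2text", "inference"):
    time.sleep(0.01)   # stand-in for running the model
print(log.report())
```

Separating the `kind` tag from the component name is what makes it possible to say whether a slow component is slow because of the network or because of the model.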

We provide two report interfaces for the latency:

You can configure the system to run the same datasets through different pipelines, then evaluate and compare their performance at the end.
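
A comparison of this kind can be sketched in a few lines. This is a hedged illustration only: `compare_pipelines` is a hypothetical helper, each pipeline is modelled as a plain callable, and the two pipelines shown are trivial stand-ins for real model combinations.

```python
import time
from statistics import mean
from typing import Callable, Dict, List


def compare_pipelines(
    pipelines: Dict[str, Callable[[str], str]],
    dataset: List[str],
) -> Dict[str, float]:
    """Run every sample through every pipeline; return mean latency in seconds."""
    results = {}
    for name, pipeline in pipelines.items():
        durations = []
        for sample in dataset:
            t0 = time.perf_counter()
            pipeline(sample)
            durations.append(time.perf_counter() - t0)
        results[name] = mean(durations)
    return results


# Stand-in pipelines; a real setup would chain actual models.
pipelines = {
    "pipeline_a": lambda text: text.upper(),
    "pipeline_b": lambda text: text[::-1],
}
latencies = compare_pipelines(pipelines, ["hello", "how are you?"])
print(sorted(latencies, key=latencies.get))  # pipeline names, fastest first
```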

Accuracy

Accuracy is probably not the most precise name: for the text generation task you want to see whether the response is in-context, and for the text2speech task you want to check whether the emotion is appropriate. However, using the general word "Performance" normally causes confusion between developers and researchers, so we stick with "Accuracy" to refer to model performance.

Unlike latency, however, accuracy is measured in diverse ways, both within a task and between different tasks. For example, you may want to evaluate whether the generated text is in-context from the knowledge, reasoning, and tone perspectives, so we need more freedom here. Automatically evaluating tasks from these different perspectives is also a challenging and underexplored area.

So we decided to provide a customisable interface that allows annotators to annotate a model's accuracy under different measurement metrics. A response can be evaluated by multiple annotators, and the system then provides a summary report of the performance. You can also evaluate the overall performance across several rounds of conversation and generate a report for that.
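
To make the summary step concrete, here is a minimal sketch of how scores from several annotators could be aggregated per metric. It is not the framework's reporting code: the `summarise` function, the record fields, and the 1-5 score scale are assumptions for illustration; the metric names follow the knowledge, reasoning, and tone perspectives mentioned above.

```python
from collections import defaultdict
from statistics import mean


def summarise(annotations):
    """Aggregate annotation records into a per-metric summary.

    Each record is a dict with "annotator", "metric", and "score" keys
    (a hypothetical schema; scores are assumed to be on a 1-5 scale).
    """
    scores = defaultdict(list)
    annotators = defaultdict(set)
    for record in annotations:
        scores[record["metric"]].append(record["score"])
        annotators[record["metric"]].add(record["annotator"])
    return {
        metric: {
            "mean": mean(values),
            "annotators": len(annotators[metric]),
        }
        for metric, values in scores.items()
    }


annotations = [
    {"annotator": "alice", "metric": "knowledge", "score": 4},
    {"annotator": "bob", "metric": "knowledge", "score": 2},
    {"annotator": "alice", "metric": "tone", "score": 5},
]
summary = summarise(annotations)
print(summary)
```

Because metrics are just keys in the records, adding a new measurement perspective does not require changing the aggregation code.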

We show how to do this evaluation annotation in the tutorial section.

You can track the evaluation process via our accuracy detail page, and then get a summary report after the evaluation annotation is done.


Deployment Options

Because the whole design is modular and decoupled, you can use our system in several ways:

The client side can be deployed to very cheap devices, such as a Raspberry Pi with a camera, microphone, and speaker, as shown in Raspberry Pi.

We have detailed instructions on how to deploy the system in different ways in Deployment.

Citation

If you find this package useful, please consider citing our work:

@misc{sun2024openomnicollaborativeopensource,
      title={OpenOmni: A Collaborative Open Source Tool for Building Future-Ready Multimodal Conversational Agents}, 
      author={Qiang Sun and Yuanyi Luo and Sirui Li and Wenxiao Zhang and Wei Liu},
      year={2024},
      eprint={2408.03047},
      archivePrefix={arXiv},
      primaryClass={cs.HC},
      url={https://arxiv.org/abs/2408.03047}, 
}