# SpeechAgents: Human-Communication Simulation with Multi-Modal Multi-Agent Systems

<a href='https://0nutation.github.io/SpeechAgents.github.io/'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2401.03945'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a>

<p align="center"> <img src="images/mas.png" width="60%"> <br> </p>

## Introduction

SpeechAgents is a multi-modal-LLM-based multi-agent system designed for simulating human communication. Unlike existing LLM-based multi-agent systems, SpeechAgents uses a multi-modal LLM as the central controller for each individual agent and employs multi-modal signals as the medium for the messages exchanged among agents. Additionally, we propose Multi-Agent Tuning to enhance the multi-agent capabilities of an LLM without compromising its general abilities. To strengthen and evaluate the effectiveness of human communication simulation, we build the Human-Communication Simulation Benchmark.<br> SpeechAgents demos are shown on our project page. As the demos show, SpeechAgents can generate human-like communication dialogues with consistent content, authentic rhythm, and rich emotions, and can accomplish tasks such as drama creation and audio novel generation.

<p align="center"> <img src="images/trajectory.png" width="95%"> <br> Illustration of the training and inference process of an individual agent in SpeechAgents. </p>
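To make the agent loop described above concrete, here is a minimal Python sketch of one plausible structure: each agent wraps a multi-modal LLM as its central controller, and agents pass multi-modal (text plus optional speech) messages to one another. Since our code is not yet released, every name here (`MultiModalMessage`, `Agent`, `simulate`, the `llm` callable) is an illustrative assumption, not the actual SpeechAgents API.

```python
from dataclasses import dataclass

# NOTE: all names below are hypothetical illustrations, not the SpeechAgents API.

@dataclass
class MultiModalMessage:
    """A message exchanged between agents: text plus an optional speech payload."""
    speaker: str
    text: str
    speech: bytes | None = None  # raw audio, if the agent speaks as well as writes


class Agent:
    """An agent whose central controller is a (stand-in for a) multi-modal LLM."""

    def __init__(self, name: str, role: str, llm):
        self.name = name
        self.role = role
        self.llm = llm  # any callable: (role, history) -> MultiModalMessage
        self.history: list[MultiModalMessage] = []

    def respond(self, incoming: MultiModalMessage) -> MultiModalMessage:
        # Record the incoming message, query the controller, record the reply.
        self.history.append(incoming)
        reply = self.llm(role=self.role, history=self.history)
        self.history.append(reply)
        return reply


def simulate(agents: list[Agent], opening: MultiModalMessage, turns: int):
    """Round-robin communication: each agent replies to the latest message."""
    message = opening
    for t in range(turns):
        speaker = agents[t % len(agents)]
        message = speaker.respond(message)
        yield message


if __name__ == "__main__":
    def stub_llm(role, history):
        # Placeholder standing in for a real multi-modal LLM call.
        last = history[-1]
        return MultiModalMessage(speaker=role, text=f"({role}) replying to: {last.text}")

    agents = [Agent("alice", "narrator", stub_llm), Agent("bob", "protagonist", stub_llm)]
    opening = MultiModalMessage(speaker="user", text="Begin the scene.")
    for msg in simulate(agents, opening, turns=4):
        print(msg.speaker, "->", msg.text)
```

The round-robin turn-taking here is only a simplification for illustration; the actual training and inference trajectory of an individual agent is shown in the figure above.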

## Code

We will open-source our code and models soon. Stay tuned!

## Demo

https://github.com/0nutation/SpeechAgents/assets/89269252/13112de9-5c1b-4e4d-9655-4499d9fc610a

## Citation