SocialBench: Sociality Evaluation of Role-Playing Conversational Agents

<div align="center"> Hongzhan Chen<sup>1</sup>, Hehong Chen<sup>2</sup>, Ming Yan<sup>2*</sup>, Wenshen Xu<sup>2</sup>, Xing Gao<sup>2</sup>, Weizhou Shen<sup>1</sup>, Xiaojun Quan<sup>1*</sup>, Chenliang Li<sup>2</sup>, Ji Zhang<sup>2</sup>, Fei Huang<sup>2</sup>, Jingren Zhou<sup>2</sup> </div>
<div align="center"> chenhzh59@mail2.sysu.edu.cn, ym119608@alibaba-inc.com, quanxj3@mail.sysu.edu.cn </div>
<div align="center"> <sup>1</sup>Sun Yat-sen University <sup>2</sup>Alibaba Group </div>
<div align="center"> *Corresponding authors </div>
<div align="center"> <a href="https://arxiv.org/pdf/2403.13679.pdf"><img src="assets/Paper-Arxiv-orange.svg" ></a> <a href="https://hits.seeyoufarm.com"><img src="https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2FX-PLUG%2FMulti-LLM-Agent&count_bg=%2379C83D&title_bg=%23555555&icon=&icon_color=%23E7E7E7&title=hits&edge_flat=false"/></a> </div>

News

Introduction

Large language models (LLMs) have advanced the development of role-playing agents that mimic diverse characters and human behaviors. While prior research has predominantly focused on enhancing the conversational capability, role-specific knowledge, and stylistic attributes of these agents, there has been a noticeable gap in assessing their social intelligence.

In this work, we introduce SocialBench, the first benchmark designed to evaluate the sociality of role-playing agents at both the individual and group levels of social interaction. As we dive into the society of role-playing conversational agents, we find that agents that excel at the individual level are not necessarily proficient at the group level. Moreover, the behavior of individual agents may drift under the influence of other agents in the group.

Evaluation Dimensions

The evaluation dimensions of SocialBench include:

Statistics of SocialBench

Personality Traits

From the collection of 638 personality descriptors created by Gunkel (1998), we select and extend diverse personality traits for SocialBench's role profile construction.

[Figure: personality traits used in SocialBench role profiles]

Dialogue Tokens

SocialBench contains more than 500 roles, more than 6,000 questions, and more than 30,800 utterances in total. The distribution of dialogue tokens is shown below:

[Figure: distribution of dialogue tokens in SocialBench]

Data Structure

SocialBench is stored in JSON format, with the entire file being a list where each element is a dictionary. Each dictionary may contain the following fields:
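A minimal sketch, assuming the benchmark has been downloaded to a local JSON file, of how the list-of-dictionaries layout can be loaded and inspected. The function names `load_records` and `field_coverage` are illustrative and are not part of the released code; no field names are assumed, since the sketch only counts whichever keys it finds.

```python
import json
from collections import Counter
from pathlib import Path


def load_records(path):
    """Load a JSON file whose top-level value is a list of dictionaries."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError("expected the top-level JSON value to be a list")
    # Keep only dictionary elements; anything else is skipped.
    return [item for item in data if isinstance(item, dict)]


def field_coverage(records):
    """Count how many records contain each field name.

    Fields are optional per record, so the counts show which ones
    are always present and which appear only in some records.
    """
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return counts
```

For example, `field_coverage(load_records("socialbench.json"))` (the path is a placeholder) returns a `Counter` mapping each field name to the number of records that include it.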

Evaluation Scripts

We have provided an example of SocialBench usage in dataset.py.

Experimental Results

We use zero-shot prompting for all experiments and, for open-source LLMs, consider only the chat versions.

Open-Source LLMs

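| Model | SA Style | SA Know | EP Situ. | EP Emo. | CM Short | CM Long | SP Pos. | SP Neu. | SP Neg. | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA-2-7B-Chat | 48.76 | 51.23 | 31.23 | 28.91 | 25.38 | 21.89 | 44.98 | 24.19 | 27.67 | 33.80 |
| LLaMA-2-13B-Chat | 57.62 | 65.51 | 37.12 | 32.56 | 30.43 | 29.82 | 66.38 | 42.25 | 26.27 | 43.11 |
| LLaMA-2-70B-Chat | 67.61 | 70.78 | 35.74 | 38.47 | 45.57 | 26.74 | 69.87 | 45.29 | 39.37 | 48.83 |
| Mistral-7B-Instruct-V0.2 | 50.12 | 61.17 | 36.48 | 31.72 | 31.78 | 25.42 | 65.67 | 46.34 | 28.96 | 41.96 |
| Qwen-7B-Chat | 66.44 | 71.16 | 41.68 | 40.68 | 67.45 | 53.45 | 75.61 | 52.78 | 43.11 | 56.93 |
| Qwen-14B-Chat | 77.06 | 86.15 | 45.71 | 43.78 | 65.32 | 51.37 | 78.32 | 58.25 | 59.21 | 62.80 |
| Qwen-72B-Chat | 83.87 | 90.64 | 53.10 | 52.89 | 83.29 | 73.15 | 91.53 | 73.44 | 63.82 | 73.97 |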
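The Avg column is consistent with an unweighted mean of the nine dimension scores, although the text here does not state the formula. A quick check on the LLaMA-2-7B-Chat row, with the helper name `mean_score` chosen only for illustration:

```python
def mean_score(scores):
    """Unweighted mean of the per-dimension scores."""
    return sum(scores.values()) / len(scores)


# Scores copied from the LLaMA-2-7B-Chat row of the table.
llama_2_7b_chat = {
    "SA Style": 48.76,
    "SA Know": 51.23,
    "EP Situ.": 31.23,
    "EP Emo.": 28.91,
    "CM Short": 25.38,
    "CM Long": 21.89,
    "SP Pos.": 44.98,
    "SP Neu.": 24.19,
    "SP Neg.": 27.67,
}
reported_avg = 33.80

# The nine scores sum to 304.24, and 304.24 / 9 rounds to 33.80.
matches = abs(mean_score(llama_2_7b_chat) - reported_avg) < 0.005
print(f"{mean_score(llama_2_7b_chat):.2f}", matches)
```

The same check on the GPT-4-Turbo row gives 695.23 / 9, which rounds to the reported 77.25.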

Closed-Source LLMs

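| Model | SA Style | SA Know | EP Situ. | EP Emo. | CM Short | CM Long | SP Pos. | SP Neu. | SP Neg. | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4-Turbo | 84.57 | 93.11 | 56.48 | 53.05 | 81.39 | 80.11 | 89.73 | 81.69 | 75.10 | 77.25 |
| GPT-3.5-Turbo | 73.17 | 73.82 | 52.44 | 45.49 | 73.03 | 59.72 | 81.59 | 76.79 | 54.16 | 65.58 |
| Qwen-Max | 82.04 | 93.34 | 61.14 | 52.36 | 76.45 | 72.65 | 87.22 | 72.14 | 52.19 | 72.17 |
| Xingchen-Plus | 85.43 | 91.60 | 55.44 | 60.73 | 82.43 | 80.69 | 94.27 | 86.69 | 77.26 | 79.39 |
| Baichuan-NPC-Turbo | 53.69 | 61.67 | 52.14 | 43.34 | 76.47 | 22.40 | 62.09 | 48.97 | 34.59 | 50.59 |
| Baichuan-2-Turbo | 77.75 | 83.35 | 55.70 | 47.38 | 80.11 | 78.91 | 87.37 | 74.71 | 68.50 | 72.64 |
| CharGLM-3 | 74.70 | 79.41 | 26.23 | 41.27 | 81.16 | 68.29 | 84.40 | 70.45 | 36.36 | 62.47 |
| GLM-3-Turbo | 77.85 | 84.62 | 35.58 | 53.05 | 74.64 | 71.68 | 84.41 | 67.47 | 54.55 | 67.09 |
| Minimax-abab5.5s-chat | 36.09 | 42.11 | 28.15 | 47.97 | 29.55 | 19.30 | 44.59 | 41.04 | 22.45 | 34.58 |
| Minimax-abab6-chat | 82.92 | 87.45 | 35.90 | 51.38 | 83.60 | 80.26 | 89.12 | 79.55 | 74.65 | 73.87 |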

Citation

@misc{chen2024socialbench,
      title={SocialBench: Sociality Evaluation of Role-Playing Conversational Agents}, 
      author={Hongzhan Chen and Hehong Chen and Ming Yan and Wenshen Xu and Xing Gao and Weizhou Shen and Xiaojun Quan and Chenliang Li and Ji Zhang and Fei Huang and Jingren Zhou},
      year={2024},
      eprint={2403.13679},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}