RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models
We introduce RoleEval, a bilingual benchmark designed to assess how well large language models memorize, utilize, and reason about role knowledge. RoleEval comprises RoleEval-Global (covering internationally recognized characters) and RoleEval-Chinese (covering characters popular in China), with 6,000 Chinese-English parallel multiple-choice questions about 300 influential people and fictional characters drawn from a variety of domains, including celebrities, anime, comics, movies, TV series, games, and fiction. The questions cover both basic knowledge and multi-hop reasoning, systematically probing aspects such as each character's personal information, relationships, abilities, and experiences. To maintain high quality, we apply a hybrid verification process combining automatic and human checks, ensuring that the questions are diverse, challenging, and discriminative.
NOTE: The English version of RoleEval is under internal review and will be released later.
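For illustration only, the sketch below shows how a 5-shot multiple-choice prompt could be assembled for RoleEval-style questions. The item schema (`question`, `options`, `answer`) and the example content are assumptions made for readability, not the official data format or evaluation script.

```python
# Minimal sketch: build a 5-shot multiple-choice prompt.
# The item fields below are hypothetical placeholders, not the released schema.

def format_item(item, include_answer=True):
    """Render one question with lettered options, optionally with its gold answer."""
    lines = [item["question"]]
    for letter, option in zip("ABCD", item["options"]):
        lines.append(f"{letter}. {option}")
    lines.append(f"Answer: {item['answer']}" if include_answer else "Answer:")
    return "\n".join(lines)

def build_5shot_prompt(demonstrations, test_item):
    """Concatenate five solved demonstrations followed by the unsolved test item."""
    parts = [format_item(d) for d in demonstrations[:5]]
    parts.append(format_item(test_item, include_answer=False))
    return "\n\n".join(parts)

if __name__ == "__main__":
    # Toy demonstration and test items (illustrative content only).
    demo = {
        "question": "Which series does the character Son Goku originate from?",
        "options": ["One Piece", "Dragon Ball", "Naruto", "Bleach"],
        "answer": "B",
    }
    test = {
        "question": "What is Sherlock Holmes's address in the original stories?",
        "options": ["10 Downing Street", "221B Baker Street",
                    "4 Privet Drive", "742 Evergreen Terrace"],
        "answer": "B",
    }
    print(build_5shot_prompt([demo] * 5, test))
```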
Leaderboard (5-shot)
If you would like to submit your model's predictions to our leaderboard, please contact us at thshen@tju.edu.cn for details.
NOTE: * indicates results computed from submitted predictions.
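The tables below report accuracy per domain plus an overall average. As a reference, here is a minimal sketch of how such scores could be computed from a model's predictions; the file paths and field names (`domain`, `answer`) are hypothetical placeholders, not our official scoring script.

```python
# Minimal sketch: per-domain and overall accuracy for aligned predictions.
# "roleeval_chinese.json" and "my_model_predictions.json" are hypothetical paths.
import json
from collections import defaultdict

def load_json(path):
    """Load a JSON file (a list of question dicts, or a list of predicted letters)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def accuracy_by_domain(items, predictions):
    """items: dicts with 'domain' and gold 'answer' (e.g. 'A'-'D');
    predictions: predicted option letters, aligned one-to-one with items."""
    correct, total = defaultdict(int), defaultdict(int)
    for item, pred in zip(items, predictions):
        domain = item["domain"]
        total[domain] += 1
        if pred == item["answer"]:
            correct[domain] += 1
    scores = {d: 100.0 * correct[d] / total[d] for d in total}
    scores["Avg."] = 100.0 * sum(correct.values()) / sum(total.values())
    return scores

if __name__ == "__main__":
    items = load_json("roleeval_chinese.json")
    predictions = load_json("my_model_predictions.json")
    for domain, score in accuracy_by_domain(items, predictions).items():
        print(f"{domain}: {score:.2f}")
```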
RoleEval (zh)
RoleEval-Chinese (2,000 questions)
Model | Celebrities | Anime and Comics | Movies and TV Series | Games | Fiction | Avg. |
---|---|---|---|---|---|---|
Qwen-72B | 70.00 | 59.75 | 66.00 | 61.25 | 74.00 | 66.20 |
Baichuan-NPC-Turbo* | 66.25 | 61.00 | 71.50 | 54.25 | 76.25 | 65.85 |
Yi-34B | 65.50 | 54.50 | 70.00 | 56.00 | 77.00 | 64.60 |
GPT-4-1106 | 62.50 | 63.25 | 63.00 | 62.00 | 63.00 | 62.75 |
GPT-4-0613 | 57.75 | 60.25 | 57.75 | 60.00 | 58.00 | 58.75 |
Yi-6B | 59.25 | 46.00 | 61.50 | 47.75 | 62.00 | 55.30 |
Baichuan-NPC-Lite* | 56.00 | 51.75 | 56.75 | 47.50 | 62.00 | 54.80 |
MiniMax | 54.00 | 55.00 | 52.75 | 57.50 | 54.00 | 54.65 |
Qwen-14B | 56.25 | 45.50 | 54.75 | 51.50 | 56.75 | 52.95 |
Baichuan2-13B | 54.75 | 47.75 | 54.00 | 47.50 | 60.00 | 52.80 |
Skywork-13B | 55.25 | 45.75 | 56.00 | 48.50 | 57.50 | 52.60 |
Baichuan2-7B | 52.25 | 43.75 | 49.00 | 47.25 | 55.00 | 49.45 |
ChatGLM3-6B | 50.00 | 44.50 | 48.00 | 44.25 | 58.00 | 48.95 |
Qwen-7B | 49.00 | 42.00 | 47.50 | 44.75 | 51.25 | 46.90 |
GPT-3.5-1106 | 47.50 | 46.75 | 41.75 | 44.75 | 38.75 | 43.90 |
GPT-3.5-0613 | 42.25 | 43.50 | 39.75 | 43.75 | 39.00 | 41.65 |
Chinese-LLaMA-2-13B | 36.50 | 36.50 | 34.00 | 34.00 | 40.50 | 36.30 |
LLaMA-2-70B | 36.00 | 38.00 | 36.25 | 36.25 | 34.75 | 36.25 |
Chinese-LLaMA-2-7B | 34.50 | 29.00 | 33.00 | 30.25 | 36.25 | 32.60 |
Mistral-7B | 32.50 | 37.50 | 26.25 | 33.25 | 31.50 | 32.20 |
Falcon-40B | 28.25 | 33.00 | 30.25 | 29.25 | 38.50 | 31.85 |
LLaMA-65B | 30.00 | 32.25 | 29.00 | 35.50 | 29.00 | 31.15 |
LLaMA-2-7B | 25.75 | 28.00 | 33.75 | 29.75 | 34.50 | 30.35 |
LLaMA-30B | 30.00 | 28.75 | 26.00 | 31.75 | 28.00 | 28.90 |
LLaMA-2-13B | 28.75 | 30.50 | 25.25 | 29.75 | 28.25 | 28.50 |
Falcon-7B | 24.75 | 30.50 | 31.50 | 29.75 | 25.25 | 28.35 |
LLaMA-13B | 27.25 | 29.75 | 27.25 | 26.00 | 29.00 | 27.85 |
LLaMA-7B | 28.50 | 24.75 | 20.50 | 27.75 | 29.00 | 26.10 |
RoleEval-Global (4,000 questions)
Model | Celebrities | Anime and Comics | Movies and TV Series | Games | Fiction | Avg. |
---|---|---|---|---|---|---|
GPT-4-1106 | 74.75 | 73.62 | 74.38 | 72.50 | 71.62 | 73.38 |
GPT-4-0613 | 73.38 | 72.12 | 74.25 | 72.25 | 69.62 | 72.32 |
Qwen-72B | 72.88 | 63.88 | 70.38 | 56.75 | 73.50 | 67.47 |
Baichuan-NPC-Turbo* | 72.25 | 65.25 | 64.62 | 55.50 | 72.75 | 66.07 |
Yi-34B | 72.38 | 60.62 | 69.75 | 53.25 | 73.12 | 65.83 |
Baichuan-NPC-Lite* | 60.62 | 56.62 | 51.88 | 48.25 | 62.12 | 55.90 |
MiniMax | 51.75 | 54.50 | 62.62 | 56.75 | 52.75 | 55.67 |
Qwen-14B | 62.50 | 52.38 | 55.00 | 45.50 | 58.00 | 54.67 |
Yi-6B | 61.88 | 51.38 | 52.38 | 45.38 | 60.75 | 54.35 |
Baichuan2-13B | 60.25 | 52.38 | 51.00 | 46.88 | 60.75 | 54.25 |
Skywork-13B | 59.13 | 51.75 | 51.88 | 44.50 | 58.75 | 53.20 |
GPT-3.5-1106 | 48.75 | 51.88 | 51.25 | 49.88 | 48.38 | 50.02 |
ChatGLM3-6B | 56.50 | 47.62 | 48.38 | 41.88 | 54.50 | 49.78 |
Baichuan2-7B | 56.00 | 49.62 | 45.50 | 40.50 | 52.38 | 48.80 |
GPT-3.5-0613 | 46.62 | 48.38 | 51.75 | 49.50 | 47.38 | 48.73 |
Qwen-7B | 54.75 | 44.38 | 44.62 | 42.75 | 53.00 | 47.90 |
LLaMA-2-70B | 53.50 | 43.25 | 39.25 | 40.25 | 47.25 | 44.70 |
Chinese-LLaMA-2-13B | 45.38 | 38.25 | 39.88 | 31.87 | 42.12 | 39.50 |
Falcon-40B | 39.62 | 32.25 | 32.38 | 30.00 | 45.00 | 35.85 |
Chinese-LLaMA-2-7B | 35.62 | 36.75 | 35.62 | 35.38 | 34.38 | 35.55 |
LLaMA-2-7B | 37.00 | 29.88 | 28.75 | 34.50 | 38.25 | 33.67 |
LLaMA-2-13B | 36.50 | 34.00 | 33.00 | 31.87 | 31.75 | 33.42 |
Mistral-7B | 36.12 | 33.50 | 32.00 | 30.25 | 35.00 | 33.38 |
LLaMA-65B | 32.12 | 31.87 | 32.75 | 31.00 | 34.88 | 32.52 |
LLaMA-30B | 24.88 | 31.13 | 30.25 | 27.75 | 28.62 | 28.52 |
LLaMA-13B | 28.50 | 28.50 | 28.25 | 26.50 | 27.75 | 27.90 |
LLaMA-7B | 25.50 | 31.87 | 25.87 | 26.00 | 28.88 | 27.62 |
Falcon-7B | 23.88 | 28.12 | 24.50 | 28.00 | 28.12 | 26.52 |
RoleEval (en)
RoleEval-Chinese (2,000 questions)
Model | Celebrities | Anime and Comics | Movies and TV Series | Games | Fiction | Avg. |
---|---|---|---|---|---|---|
GPT-4-0613 | 54.25 | 61.75 | 63.00 | 63.00 | 63.00 | 61.00 |
GPT-4-1106 | 57.50 | 63.50 | 60.00 | 62.50 | 58.00 | 60.30 |
Yi-34B | 56.00 | 52.00 | 47.50 | 55.00 | 57.00 | 53.50 |
Qwen-72B | 52.75 | 47.50 | 46.50 | 54.25 | 50.50 | 50.30 |
GPT-3.5-0613 | 42.00 | 47.75 | 42.50 | 42.25 | 45.50 | 44.00 |
GPT-3.5-1106 | 38.25 | 45.50 | 44.00 | 44.50 | 46.00 | 43.65 |
LLaMA-2-70B | 43.25 | 41.50 | 40.25 | 47.50 | 43.50 | 43.20 |
Yi-6B | 42.25 | 38.50 | 41.50 | 44.25 | 45.00 | 42.30 |
Qwen-14B | 41.00 | 38.75 | 38.25 | 43.25 | 41.00 | 40.45 |
LLaMA-65B | 41.50 | 38.50 | 33.50 | 43.25 | 37.50 | 38.85 |
ChatGLM3-6B | 36.25 | 36.25 | 35.25 | 42.25 | 43.50 | 38.70 |
Skywork-13B | 39.25 | 34.50 | 38.25 | 41.75 | 38.50 | 38.45 |
MiniMax | 34.00 | 39.50 | 40.75 | 38.25 | 39.00 | 38.30 |
Qwen-7B | 36.25 | 36.00 | 36.25 | 42.25 | 40.00 | 38.15 |
Baichuan2-7B | 37.25 | 35.75 | 33.00 | 40.25 | 37.00 | 36.65 |
Mistral-7B | 35.75 | 42.00 | 30.00 | 41.75 | 31.50 | 36.20 |
Baichuan2-13B | 35.50 | 36.50 | 31.25 | 42.25 | 34.75 | 36.05 |
Falcon-40B | 34.00 | 38.25 | 30.75 | 38.75 | 35.25 | 35.40 |
LLaMA-30B | 34.75 | 35.75 | 30.75 | 40.00 | 35.00 | 35.25 |
Chinese-LLaMA-2-13B | 34.00 | 38.50 | 27.75 | 37.50 | 34.00 | 34.35 |
LLaMA-2-13B | 30.50 | 36.50 | 33.25 | 36.50 | 33.25 | 34.00 |
LLaMA-13B | 32.75 | 31.75 | 30.75 | 38.50 | 32.00 | 33.15 |
LLaMA-2-7B | 28.75 | 29.25 | 32.75 | 37.50 | 32.25 | 32.10 |
Chinese-LLaMA-2-7B | 30.50 | 27.75 | 33.00 | 30.50 | 27.75 | 29.90 |
LLaMA-7B | 24.00 | 27.50 | 29.75 | 33.00 | 29.25 | 28.70 |
Falcon-7B | 27.25 | 27.75 | 27.75 | 29.75 | 28.50 | 28.20 |
RoleEval-Global (4,000 questions)
Model | Celebrities | Anime and Comics | Movies and TV Series | Games | Fiction | Avg. |
---|---|---|---|---|---|---|
GPT-4-0613 | 77.62 | 79.50 | 73.12 | 74.88 | 75.00 | 76.02 |
GPT-4-1106 | 75.12 | 78.75 | 75.00 | 76.12 | 75.00 | 76.00 |
Yi-34B | 73.12 | 61.75 | 67.88 | 57.12 | 67.25 | 65.42 |
Qwen-72B | 70.12 | 62.00 | 69.00 | 55.75 | 69.50 | 65.27 |
LLaMA-2-70B | 63.25 | 57.38 | 59.00 | 50.00 | 63.25 | 58.58 |
GPT-3.5-0613 | 57.38 | 59.62 | 58.13 | 59.50 | 57.50 | 58.43 |
GPT-3.5-1106 | 58.75 | 56.62 | 55.75 | 58.00 | 55.00 | 56.82 |
MiniMax | 54.87 | 56.38 | 53.50 | 54.12 | 51.38 | 54.05 |
Yi-6B | 59.25 | 52.00 | 54.12 | 47.50 | 56.25 | 53.82 |
Qwen-14B | 61.12 | 49.00 | 53.87 | 45.38 | 56.12 | 53.10 |
LLaMA-65B | 58.13 | 50.50 | 54.37 | 47.62 | 54.50 | 53.02 |
Baichuan2-13B | 56.12 | 47.50 | 51.50 | 45.62 | 54.00 | 50.95 |
Skywork-13B | 56.25 | 46.75 | 51.62 | 44.38 | 53.62 | 50.52 |
Mistral-7B | 54.87 | 46.75 | 49.62 | 44.25 | 52.25 | 49.55 |
ChatGLM3-6B | 55.12 | 46.62 | 49.25 | 43.25 | 52.62 | 49.37 |
LLaMA-30B | 51.62 | 46.88 | 48.62 | 43.12 | 52.62 | 48.57 |
Qwen-7B | 53.87 | 46.12 | 48.12 | 40.00 | 51.12 | 47.85 |
Baichuan2-7B | 51.00 | 45.12 | 49.00 | 42.12 | 50.00 | 47.45 |
Falcon-40B | 47.38 | 45.00 | 49.62 | 43.12 | 50.00 | 47.02 |
Chinese-LLaMA-2-13B | 47.75 | 46.00 | 46.88 | 45.00 | 48.38 | 46.80 |
LLaMA-2-13B | 49.38 | 43.50 | 46.50 | 44.25 | 48.25 | 46.38 |
LLaMA-13B | 39.38 | 40.25 | 39.88 | 40.62 | 43.00 | 40.63 |
LLaMA-2-7B | 38.88 | 37.00 | 37.50 | 41.62 | 42.38 | 39.48 |
Chinese-LLaMA-2-7B | 36.50 | 30.75 | 31.75 | 36.25 | 39.50 | 34.95 |
LLaMA-7B | 29.38 | 30.50 | 29.25 | 33.50 | 28.50 | 30.23 |
Falcon-7B | 26.25 | 27.75 | 28.50 | 29.38 | 31.00 | 28.58 |
Citation
If you find our work useful, please cite our paper:
@article{shen2023roleeval,
  title={RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models},
  author={Tianhao Shen and Sun Li and Deyi Xiong},
  year={2023},
  eprint={2312.16132},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}