RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models

We introduce RoleEval, a bilingual benchmark designed to assess how well large language models memorize, utilize, and reason about role knowledge. RoleEval comprises RoleEval-Global (covering internationally recognized characters) and RoleEval-Chinese (covering characters popular in China), with 6,000 Chinese-English parallel multiple-choice questions about 300 influential people and fictional characters drawn from a variety of domains, including celebrities, anime, comics, movies, TV series, games, and fiction. The questions test both basic knowledge and multi-hop reasoning, systematically probing aspects such as each character's personal information, relationships, abilities, and experiences. To maintain high quality, we apply a hybrid quality-check process that combines automatic and human verification, ensuring that the questions are diverse, challenging, and discriminative.

NOTE: The English version of RoleEval is under internal review and will be released later.
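
As a concrete illustration, the sketch below shows how such parallel multiple-choice items might be stored and loaded once the data is released. The file name and the field names (`character`, `domain`, `question`, `options`, `answer`) are assumptions for illustration only, not the official schema.

```python
# Minimal sketch of loading RoleEval-style multiple-choice items.
# NOTE: the file name and all field names below are assumptions for
# illustration; they are not the official RoleEval data schema.
import json
from collections import Counter

def load_items(path):
    """Read one question per line from a (hypothetical) JSONL file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Each line is assumed to look roughly like:
# {"character": "Sherlock Holmes", "domain": "fiction", "question": "...",
#  "options": ["...", "...", "...", "..."], "answer": "A"}
items = load_items("roleeval_global_zh.jsonl")  # hypothetical file name
print(Counter(item["domain"] for item in items))  # question count per domain
```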

Leaderboard (5-shot)

If you would like to submit your model's predictions to our leaderboard, please contact us at thshen@tju.edu.cn for details.

NOTE: * indicates results calculated from submitted predictions.
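
The tables below report per-domain accuracy (%) together with the average of the five domain scores. As a reference, here is a minimal sketch of how such numbers could be computed from a prediction file; the record fields (`domain`, `answer`, `prediction`) are assumptions for illustration, not the required submission format.

```python
# Sketch of turning a list of prediction records into per-domain accuracy
# and an average, mirroring the layout of the leaderboard tables below.
# NOTE: the field names ("domain", "answer", "prediction") are assumptions
# for illustration, not the official submission format.
from collections import defaultdict

def category_accuracies(records):
    """Return accuracy (%) per domain and the unweighted mean over domains."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["domain"]] += 1
        correct[r["domain"]] += int(r["prediction"] == r["answer"])
    per_domain = {d: 100.0 * correct[d] / total[d] for d in total}
    avg = sum(per_domain.values()) / len(per_domain)
    return per_domain, avg
```

If the five domains are equally represented within a split, this unweighted mean also equals overall accuracy on that split.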

RoleEval (zh)

RoleEval-Chinese (2,000 questions)

| Model | Celebrities | Anime and Comics | Movies and TV Series | Games | Fiction | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen-72B | 70.00 | 59.75 | 66.00 | 61.25 | 74.00 | 66.20 |
| Baichuan-NPC-Turbo* | 66.25 | 61.00 | 71.50 | 54.25 | 76.25 | 65.85 |
| Yi-34B | 65.50 | 54.50 | 70.00 | 56.00 | 77.00 | 64.60 |
| GPT-4-1106 | 62.50 | 63.25 | 63.00 | 62.00 | 63.00 | 62.75 |
| GPT-4-0613 | 57.75 | 60.25 | 57.75 | 60.00 | 58.00 | 58.75 |
| Yi-6B | 59.25 | 46.00 | 61.50 | 47.75 | 62.00 | 55.30 |
| Baichuan-NPC-Lite* | 56.00 | 51.75 | 56.75 | 47.50 | 62.00 | 54.80 |
| MiniMax | 54.00 | 55.00 | 52.75 | 57.50 | 54.00 | 54.65 |
| Qwen-14B | 56.25 | 45.50 | 54.75 | 51.50 | 56.75 | 52.95 |
| Baichuan2-13B | 54.75 | 47.75 | 54.00 | 47.50 | 60.00 | 52.80 |
| Skywork-13B | 55.25 | 45.75 | 56.00 | 48.50 | 57.50 | 52.60 |
| Baichuan2-7B | 52.25 | 43.75 | 49.00 | 47.25 | 55.00 | 49.45 |
| ChatGLM3-6B | 50.00 | 44.50 | 48.00 | 44.25 | 58.00 | 48.95 |
| Qwen-7B | 49.00 | 42.00 | 47.50 | 44.75 | 51.25 | 46.90 |
| GPT-3.5-1106 | 47.50 | 46.75 | 41.75 | 44.75 | 38.75 | 43.90 |
| GPT-3.5-0613 | 42.25 | 43.50 | 39.75 | 43.75 | 39.00 | 41.65 |
| Chinese-LLaMA-2-13B | 36.50 | 36.50 | 34.00 | 34.00 | 40.50 | 36.30 |
| LLaMA-2-70B | 36.00 | 38.00 | 36.25 | 36.25 | 34.75 | 36.25 |
| Chinese-LLaMA-2-7B | 34.50 | 29.00 | 33.00 | 30.25 | 36.25 | 32.60 |
| Mistral-7B | 32.50 | 37.50 | 26.25 | 33.25 | 31.50 | 32.20 |
| Falcon-40B | 28.25 | 33.00 | 30.25 | 29.25 | 38.50 | 31.85 |
| LLaMA-65B | 30.00 | 32.25 | 29.00 | 35.50 | 29.00 | 31.15 |
| LLaMA-2-7B | 25.75 | 28.00 | 33.75 | 29.75 | 34.50 | 30.35 |
| LLaMA-30B | 30.00 | 28.75 | 26.00 | 31.75 | 28.00 | 28.90 |
| LLaMA-2-13B | 28.75 | 30.50 | 25.25 | 29.75 | 28.25 | 28.50 |
| Falcon-7B | 24.75 | 30.50 | 31.50 | 29.75 | 25.25 | 28.35 |
| LLaMA-13B | 27.25 | 29.75 | 27.25 | 26.00 | 29.00 | 27.85 |
| LLaMA-7B | 28.50 | 24.75 | 20.50 | 27.75 | 29.00 | 26.10 |

RoleEval-Global (4,000 questions)

| Model | Celebrities | Anime and Comics | Movies and TV Series | Games | Fiction | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4-1106 | 74.75 | 73.62 | 74.38 | 72.50 | 71.62 | 73.38 |
| GPT-4-0613 | 73.38 | 72.12 | 74.25 | 72.25 | 69.62 | 72.32 |
| Qwen-72B | 72.88 | 63.88 | 70.38 | 56.75 | 73.50 | 67.47 |
| Baichuan-NPC-Turbo* | 72.25 | 65.25 | 64.62 | 55.50 | 72.75 | 66.07 |
| Yi-34B | 72.38 | 60.62 | 69.75 | 53.25 | 73.12 | 65.83 |
| Baichuan-NPC-Lite* | 60.62 | 56.62 | 51.88 | 48.25 | 62.12 | 55.90 |
| MiniMax | 51.75 | 54.50 | 62.62 | 56.75 | 52.75 | 55.67 |
| Qwen-14B | 62.50 | 52.38 | 55.00 | 45.50 | 58.00 | 54.67 |
| Yi-6B | 61.88 | 51.38 | 52.38 | 45.38 | 60.75 | 54.35 |
| Baichuan2-13B | 60.25 | 52.38 | 51.00 | 46.88 | 60.75 | 54.25 |
| Skywork-13B | 59.13 | 51.75 | 51.88 | 44.50 | 58.75 | 53.20 |
| GPT-3.5-1106 | 48.75 | 51.88 | 51.25 | 49.88 | 48.38 | 50.02 |
| ChatGLM3-6B | 56.50 | 47.62 | 48.38 | 41.88 | 54.50 | 49.78 |
| Baichuan2-7B | 56.00 | 49.62 | 45.50 | 40.50 | 52.38 | 48.80 |
| GPT-3.5-0613 | 46.62 | 48.38 | 51.75 | 49.50 | 47.38 | 48.73 |
| Qwen-7B | 54.75 | 44.38 | 44.62 | 42.75 | 53.00 | 47.90 |
| LLaMA-2-70B | 53.50 | 43.25 | 39.25 | 40.25 | 47.25 | 44.70 |
| Chinese-LLaMA-2-13B | 45.38 | 38.25 | 39.88 | 31.87 | 42.12 | 39.50 |
| Falcon-40B | 39.62 | 32.25 | 32.38 | 30.00 | 45.00 | 35.85 |
| Chinese-LLaMA-2-7B | 35.62 | 36.75 | 35.62 | 35.38 | 34.38 | 35.55 |
| LLaMA-2-7B | 37.00 | 29.88 | 28.75 | 34.50 | 38.25 | 33.67 |
| LLaMA-2-13B | 36.50 | 34.00 | 33.00 | 31.87 | 31.75 | 33.42 |
| Mistral-7B | 36.12 | 33.50 | 32.00 | 30.25 | 35.00 | 33.38 |
| LLaMA-65B | 32.12 | 31.87 | 32.75 | 31.00 | 34.88 | 32.52 |
| LLaMA-30B | 24.88 | 31.13 | 30.25 | 27.75 | 28.62 | 28.52 |
| LLaMA-13B | 28.50 | 28.50 | 28.25 | 26.50 | 27.75 | 27.90 |
| LLaMA-7B | 25.50 | 31.87 | 25.87 | 26.00 | 28.88 | 27.62 |
| Falcon-7B | 23.88 | 28.12 | 24.50 | 28.00 | 28.12 | 26.52 |

RoleEval (en)

RoleEval-Chinese (2,000 questions)

| Model | Celebrities | Anime and Comics | Movies and TV Series | Games | Fiction | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4-0613 | 54.25 | 61.75 | 63.00 | 63.00 | 63.00 | 61.00 |
| GPT-4-1106 | 57.50 | 63.50 | 60.00 | 62.50 | 58.00 | 60.30 |
| Yi-34B | 56.00 | 52.00 | 47.50 | 55.00 | 57.00 | 53.50 |
| Qwen-72B | 52.75 | 47.50 | 46.50 | 54.25 | 50.50 | 50.30 |
| GPT-3.5-0613 | 42.00 | 47.75 | 42.50 | 42.25 | 45.50 | 44.00 |
| GPT-3.5-1106 | 38.25 | 45.50 | 44.00 | 44.50 | 46.00 | 43.65 |
| LLaMA-2-70B | 43.25 | 41.50 | 40.25 | 47.50 | 43.50 | 43.20 |
| Yi-6B | 42.25 | 38.50 | 41.50 | 44.25 | 45.00 | 42.30 |
| Qwen-14B | 41.00 | 38.75 | 38.25 | 43.25 | 41.00 | 40.45 |
| LLaMA-65B | 41.50 | 38.50 | 33.50 | 43.25 | 37.50 | 38.85 |
| ChatGLM3-6B | 36.25 | 36.25 | 35.25 | 42.25 | 43.50 | 38.70 |
| Skywork-13B | 39.25 | 34.50 | 38.25 | 41.75 | 38.50 | 38.45 |
| MiniMax | 34.00 | 39.50 | 40.75 | 38.25 | 39.00 | 38.30 |
| Qwen-7B | 36.25 | 36.00 | 36.25 | 42.25 | 40.00 | 38.15 |
| Baichuan2-7B | 37.25 | 35.75 | 33.00 | 40.25 | 37.00 | 36.65 |
| Mistral-7B | 35.75 | 42.00 | 30.00 | 41.75 | 31.50 | 36.20 |
| Baichuan2-13B | 35.50 | 36.50 | 31.25 | 42.25 | 34.75 | 36.05 |
| Falcon-40B | 34.00 | 38.25 | 30.75 | 38.75 | 35.25 | 35.40 |
| LLaMA-30B | 34.75 | 35.75 | 30.75 | 40.00 | 35.00 | 35.25 |
| Chinese-LLaMA-2-13B | 34.00 | 38.50 | 27.75 | 37.50 | 34.00 | 34.35 |
| LLaMA-2-13B | 30.50 | 36.50 | 33.25 | 36.50 | 33.25 | 34.00 |
| LLaMA-13B | 32.75 | 31.75 | 30.75 | 38.50 | 32.00 | 33.15 |
| LLaMA-2-7B | 28.75 | 29.25 | 32.75 | 37.50 | 32.25 | 32.10 |
| Chinese-LLaMA-2-7B | 30.50 | 27.75 | 33.00 | 30.50 | 27.75 | 29.90 |
| LLaMA-7B | 24.00 | 27.50 | 29.75 | 33.00 | 29.25 | 28.70 |
| Falcon-7B | 27.25 | 27.75 | 27.75 | 29.75 | 28.50 | 28.20 |

RoleEval-Global (4,000 questions)

| Model | Celebrities | Anime and Comics | Movies and TV Series | Games | Fiction | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4-0613 | 77.62 | 79.50 | 73.12 | 74.88 | 75.00 | 76.02 |
| GPT-4-1106 | 75.12 | 78.75 | 75.00 | 76.12 | 75.00 | 76.00 |
| Yi-34B | 73.12 | 61.75 | 67.88 | 57.12 | 67.25 | 65.42 |
| Qwen-72B | 70.12 | 62.00 | 69.00 | 55.75 | 69.50 | 65.27 |
| LLaMA-2-70B | 63.25 | 57.38 | 59.00 | 50.00 | 63.25 | 58.58 |
| GPT-3.5-0613 | 57.38 | 59.62 | 58.13 | 59.50 | 57.50 | 58.43 |
| GPT-3.5-1106 | 58.75 | 56.62 | 55.75 | 58.00 | 55.00 | 56.82 |
| MiniMax | 54.87 | 56.38 | 53.50 | 54.12 | 51.38 | 54.05 |
| Yi-6B | 59.25 | 52.00 | 54.12 | 47.50 | 56.25 | 53.82 |
| Qwen-14B | 61.12 | 49.00 | 53.87 | 45.38 | 56.12 | 53.10 |
| LLaMA-65B | 58.13 | 50.50 | 54.37 | 47.62 | 54.50 | 53.02 |
| Baichuan2-13B | 56.12 | 47.50 | 51.50 | 45.62 | 54.00 | 50.95 |
| Skywork-13B | 56.25 | 46.75 | 51.62 | 44.38 | 53.62 | 50.52 |
| Mistral-7B | 54.87 | 46.75 | 49.62 | 44.25 | 52.25 | 49.55 |
| ChatGLM3-6B | 55.12 | 46.62 | 49.25 | 43.25 | 52.62 | 49.37 |
| LLaMA-30B | 51.62 | 46.88 | 48.62 | 43.12 | 52.62 | 48.57 |
| Qwen-7B | 53.87 | 46.12 | 48.12 | 40.00 | 51.12 | 47.85 |
| Baichuan2-7B | 51.00 | 45.12 | 49.00 | 42.12 | 50.00 | 47.45 |
| Falcon-40B | 47.38 | 45.00 | 49.62 | 43.12 | 50.00 | 47.02 |
| Chinese-LLaMA-2-13B | 47.75 | 46.00 | 46.88 | 45.00 | 48.38 | 46.80 |
| LLaMA-2-13B | 49.38 | 43.50 | 46.50 | 44.25 | 48.25 | 46.38 |
| LLaMA-13B | 39.38 | 40.25 | 39.88 | 40.62 | 43.00 | 40.63 |
| LLaMA-2-7B | 38.88 | 37.00 | 37.50 | 41.62 | 42.38 | 39.48 |
| Chinese-LLaMA-2-7B | 36.50 | 30.75 | 31.75 | 36.25 | 39.50 | 34.95 |
| LLaMA-7B | 29.38 | 30.50 | 29.25 | 33.50 | 28.50 | 30.23 |
| Falcon-7B | 26.25 | 27.75 | 28.50 | 29.38 | 31.00 | 28.58 |

Citation

If you find our work useful, please cite our paper:

@article{shen2023roleeval,
      title={RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models}, 
      author={Tianhao Shen and Sun Li and Deyi Xiong},
      year={2023},
      eprint={2312.16132},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}