<p align="center"> <img src="resources/title.png" style="width: 100%;" id="title-icon"> </p>Xiezhi (獬豸) is a comprehensive evaluation suite for Language Models (LMs). It consists of 249587 multi-choice questions spanning 516 diverse disciplines and four difficulty levels, as shown below. Please check our paper for more details, and our website will be open later on.
We hope Xiezhi could help developers track the progress and analyze the important strengths/shortcomings of their LMs.
<img src="resources/overview.png" style="zoom: 80%;" >Table of Contents
## Leaderboard
Below is the ranking of models under 0-shot learning in our experiment setting. The metric we use is the MRR score. For details, please refer to [Experiment Setting](#experiment-setting). ✓ denotes that human performance still exceeds the state-of-the-art LLMs, whereas ✗ signifies that LLMs have surpassed human performance.
<p align="center"> <img src="resources/ModelsRanking.png" style="width: 100%;" id="model-rank"></p>Experiment Setting
### Options Setting
All tested LLMs need to choose the best-fit answer from 50 options for each question. Each question is set up with 3 confusing options in addition to the correct answer, and the remaining 46 options are randomly sampled from the options of all questions in Xiezhi. It is worth noting that WordNet, open-source synonym databases, or other word-construction methods could be used to generate more confusing options. However, our experiments show that the performance of all LLMs declines dramatically as the number of options increases, even when the added options are non-confusing. This achieves our goal of widening the performance gap between LLMs under the new experimental setting.
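For illustration, the sketch below shows one way such a 50-option question could be assembled: keep the correct answer and the 3 curated confusing options, then pad with options sampled from the rest of the benchmark. The field names (`answer`, `confusing_options`) and the sampling routine are assumptions for this example, not the repository's actual implementation.

```python
import random

def build_option_set(question, all_options, n_total=50, seed=0):
    """Illustrative sketch of the 50-option setting described above.

    `question` is assumed to be a dict with hypothetical keys
    'answer' (the correct option) and 'confusing_options' (3 distractors);
    the remaining slots are filled by sampling from the global option pool.
    """
    rng = random.Random(seed)
    options = [question["answer"]] + list(question["confusing_options"])
    # Candidates are options from the whole benchmark that are not already used.
    pool = [o for o in all_options if o not in options]
    options += rng.sample(pool, n_total - len(options))
    rng.shuffle(options)  # avoid positional bias toward the correct answer
    return options

# Hypothetical usage
question = {"answer": "Confucianism",
            "confusing_options": ["Legalism", "Taoism", "Mohism"]}
pool = [f"distractor_{i}" for i in range(1000)]  # stands in for all Xiezhi options
print(len(build_option_set(question, pool)))  # -> 50
```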
### Metric
In this section, we present two main experiment results: the overall performance of all LLMs across various benchmarks, and the ranking of the top eight 0-shot LLMs in 12 non-sensitive domain categories of the Xiezhi Benchmark, together with the scores of top and average practitioners. For the 45 open-source models assessed in our evaluation, we calculate the generative probability of each option and rank all options accordingly. Due to legal considerations, we only display the results of two publicly recognized API-based LLMs, ChatGPT and GPT-4, which we instruct to rank all given options. To summarize the ranking outcomes, we use Mean Reciprocal Rank (MRR) as the metric, which computes the reciprocal rank of the correct answer. An MRR closer to 1 indicates that the model tends to place the correct answer near the top of the ranking, while an MRR closer to 0 suggests that it tends to place the correct answer near the bottom.
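As a reference, here is a minimal sketch of how MRR can be computed from per-option scores (e.g., generative probabilities). The scoring step is model-specific and omitted; this function is illustrative rather than the exact evaluation code used in the paper.

```python
def mean_reciprocal_rank(option_scores, answer_indices):
    """Compute MRR given, for each question, a score for every option
    and the index of the correct answer.
    """
    total = 0.0
    for scores, answer_idx in zip(option_scores, answer_indices):
        # Rank options from highest to lowest score; ranks are 1-based.
        ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        rank = ranking.index(answer_idx) + 1
        total += 1.0 / rank
    return total / len(option_scores)

# Toy example: two questions with 4 options each
scores = [[0.1, 0.7, 0.05, 0.15],   # correct answer (index 1) ranked 1st -> 1/1
          [0.3, 0.4, 0.2, 0.1]]     # correct answer (index 2) ranked 3rd -> 1/3
print(mean_reciprocal_rank(scores, [1, 2]))  # -> (1 + 1/3) / 2 ≈ 0.667
```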
## Data
Example of a question in Xiezhi-Speciality:
<p align="center"> <img src="resources/question_spec.png" style="width: 100%;" id="question-spec"></p>Example of question in Xiezhi Interdiscipline:
<p align="center"> <img src="resources/question_inter.png" style="width: 100%;" id="question-inter"></p>Example of our few-shot learning setting:
<p align="center"> <img src="resources/question-3shot.png" style="width: 100%;" id="question-3shot"></p>How To Run Your Own Test
- The testing can be performed on a set of benchmarks including C-Eval, M3KE, MMLU, Xiezhi-Inter, and Xiezhi-Spec, all of which are handled within the `./Tester/model_test.py` file.
- Anyone can simply run `./Tester/test.sh` to do the evaluation.
- For your own data, you need to rewrite the `_get_data` function in `./Tester/model_test.py`; a sketch is shown below.
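Since the expected return format of `_get_data` is defined in `./Tester/model_test.py` and not documented here, the following is only a hypothetical sketch assuming your data is a JSON-lines file with `question`, `options`, and `answer` fields; adapt it to the actual interface in the repository.

```python
import json

def _get_data(self, data_path):
    """Hypothetical override of `_get_data` for custom data.

    Assumes each line of `data_path` is a JSON object with
    'question', 'options', and 'answer' fields; the real expected
    return format lives in ./Tester/model_test.py.
    """
    samples = []
    with open(data_path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            samples.append({
                "question": record["question"],
                "options": record["options"],   # list of option strings
                "answer": record["answer"],     # the correct option
            })
    return samples
```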
## TODO
- Add results of the traditional 4-option experiment setting
- Add results of more API-based models
## Licenses
This work is licensed under the MIT License.
The Xiezhi dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
## Citation
Please cite our paper if you use our dataset.
```bibtex
@article{gu2023xiezhi,
  title={Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation},
  author={Gu, Zhouhong and Zhu, Xiaoxuan and Ye, Haoning and Zhang, Lin and Wang, Jianchen and Jiang, Sihang and Xiong, Zhuozhi and Li, Zihan and He, Qianyu and Xu, Rui and Huang, Wenhao and Zheng, Weiguo and Feng, Hongwei and Xiao, Yanghua},
  journal={arXiv preprint arXiv:2304.11679},
  year={2023}
}
```