
Overview

<p align="center"> <img src="image/title.png" width="800px"/> </p> <p align="center"> 🌐 <a href="https://openstellarteam.github.io/ChineseSimpleQA/" target="_blank">Website</a> • 🤗 <a href="https://huggingface.co/datasets/OpenStellarTeam/Chinese-SimpleQA" target="_blank">Hugging Face</a> • ⏬ <a href="#data" target="_blank">Data</a> • 📃 <a href="https://arxiv.org/abs/2411.07140" target="_blank">Paper</a> • 📊 <a href="http://47.109.32.164/" target="_blank">Leaderboard</a> <br> <a href="https://github.com/OpenStellarTeam/ChineseSimpleQA/blob/master/README_zh.md">中文</a> | <a href="https://github.com/OpenStellarTeam/ChineseSimpleQA/blob/master/README.md">English</a> </p>

Chinese SimpleQA is the first comprehensive Chinese benchmark for evaluating the factuality of language models when answering short questions. It has five key properties: Chinese, Diverse, High-quality, Static, and Easy-to-evaluate. Specifically, the benchmark covers 6 major topics and 99 diverse subtopics.

Please visit our website or check our paper for more details.

This is the evaluation repository for Chinese SimpleQA. It is forked from OpenAI's simple-evals and follows the MIT License.

<p align="center"> <img src="image/category_en.png" width="700px"/> </p>

🆕 News

💫 Introduction

📊 Leaderboard

For details, see the 📊 <a href="http://47.109.32.164/" target="_blank">Leaderboard</a>.

<p align="center"> <img src="image/leaderboard1.png" width="800px"/> </p>

🛠️ Setup

Due to the optional dependencies, we're not providing a unified setup mechanism. Instead, we're providing instructions for each eval and sampler.

For HumanEval (Python programming):

git clone https://github.com/openai/human-eval
pip install -e human-eval
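
For a quick sanity check of the HumanEval setup, the human-eval package provides `read_problems` and `write_jsonl` helpers for building a samples file. The following is a minimal sketch; `generate_one_completion` is a hypothetical stand-in for whatever model call you actually use:

```python
from human_eval.data import read_problems, write_jsonl

def generate_one_completion(prompt: str) -> str:
    # Hypothetical placeholder: call your model of choice here and
    # return only the code completion for the given prompt.
    raise NotImplementedError

problems = read_problems()
samples = [
    dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
]
# The resulting samples.jsonl can then be scored with the human-eval tooling.
write_jsonl("samples.jsonl", samples)
```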

For the OpenAI API:

pip install openai
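
As a quick check that the client is installed correctly, a minimal chat-completion call looks roughly like this (the model name below is only an example, and the key is read from the `OPENAI_API_KEY` environment variable):

```python
from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name; use whichever model you want to evaluate
    messages=[{"role": "user", "content": "中国的首都是哪座城市？"}],
)
print(response.choices[0].message.content)
```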

For the Anthropic API:

pip install anthropic
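
Similarly, a minimal Anthropic call looks roughly like this (the model name is only an example; the key is read from the `ANTHROPIC_API_KEY` environment variable):

```python
import anthropic

# The client reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # example model name
    max_tokens=256,
    messages=[{"role": "user", "content": "中国的首都是哪座城市？"}],
)
print(message.content[0].text)
```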

⚖️ Evals

We provide three evaluation methods.

(1) The first method is based on the simple-evals framework. The startup command is as follows:

python -m simple-evals.demo

This will launch evaluations through the OpenAI API.
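
The demo talks to the OpenAI API through the official Python client, which reads its key from the `OPENAI_API_KEY` environment variable, so make sure the variable is set before launching. If you prefer to launch from Python, a small wrapper along these lines (a sketch, not part of this repository) checks for the key first:

```python
import os
import subprocess
import sys

# Hypothetical convenience wrapper: verify the OpenAI key is set, then launch the demo.
if not os.environ.get("OPENAI_API_KEY"):
    sys.exit("Please set the OPENAI_API_KEY environment variable before running the demo.")

subprocess.run([sys.executable, "-m", "simple-evals.demo"], check=True)
```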

(2) The second is a simple standalone evaluation script that we wrote from scratch; see the repository for its startup command. A rough sketch of this style of evaluation is shown below.
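
For orientation, the sketch below shows the general shape of this kind of script, assuming a JSONL file of question/answer pairs and an LLM judge that labels each model answer as CORRECT or INCORRECT. The file name, prompts, and model names are illustrative assumptions, not the repository's actual script:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {prediction}\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)

def ask_model(question: str) -> str:
    # Query the model under evaluation (example model name).
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

def judge(question: str, reference: str, prediction: str) -> bool:
    # Use a judge model to decide whether the prediction matches the reference answer.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, prediction=prediction)}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("CORRECT")

correct = total = 0
with open("chinese_simpleqa.jsonl", encoding="utf-8") as f:  # hypothetical data file
    for line in f:
        item = json.loads(line)
        prediction = ask_model(item["question"])
        correct += judge(item["question"], item["answer"], prediction)
        total += 1

print(f"Accuracy: {correct / total:.3f}")
```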

(3) We also integrated the Chinese SimpleQA benchmark into our fork of OpenCompass. You can refer to the OpenCompass configuration script for evaluation.
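
For orientation only, an OpenCompass evaluation config generally has the shape sketched below; the dataset import path and model config here are hypothetical placeholders, so please follow the actual configuration script in our fork:

```python
from mmengine.config import read_base

with read_base():
    # Hypothetical import paths; use the actual config names from our OpenCompass fork.
    from .datasets.chinese_simpleqa.chinese_simpleqa_gen import chinese_simpleqa_datasets
    from .models.openai.gpt_4o import models

datasets = chinese_simpleqa_datasets
```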

Citation

Please cite our paper if you use our dataset.

@misc{he2024chinesesimpleqachinesefactuality,
      title={Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models}, 
      author={Yancheng He and Shilong Li and Jiaheng Liu and Yingshui Tan and Weixun Wang and Hui Huang and Xingyuan Bu and Hangyu Guo and Chengwei Hu and Boren Zheng and Zhuoran Lin and Xuepeng Liu and Dekai Sun and Shirong Lin and Zhicheng Zheng and Xiaoyong Zhu and Wenbo Su and Bo Zheng},
      year={2024},
      eprint={2411.07140},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.07140}, 
}