<h1 align="center">CLEVA: Chinese Language Models EVAluation Platform</h1>

<div align="center">


🌐 Website • 📜 Paper [EMNLP 2023 Demo] • 📌 Instructions • ✉️ <a href="mailto:clevaplat@gmail.com">Email</a>

English | 简体中文

</div>

## 🎯 Introduction

CLEVA is a Chinese Language Models EVAluation Platform developed by the CUHK LaVi Lab. CLEVA thanks Shanghai AI Lab for the great collaboration throughout the project. The main features of CLEVA include:

- A leaderboard that is evaluated and maintained by CLEVA using new test data. Past leaderboard data (processed test samples, annotated prompt templates, etc.) are made available to users for local evaluation runs.

*(Figure: overview of the CLEVA platform)*

## 🔥 News

<a id="instructions"></a>

## 📌 Instructions

CLEVA has been integrated into HELM; CLEVA thanks the Stanford CRFM HELM team for their support. Users can employ CLEVA's datasets, prompt templates, perturbations, and Chinese automatic metrics for local evaluations via HELM.

> **Note**<br />
> If you want to evaluate your models on CLEVA online, please contact us via clevaplat@gmail.com for authentication and check out 📘 Documentation for API development.

## 🛠️ Installation

Users can refer to the installation guide of HELM for setting up the Python environment and dependencies (Python>=3.8).

<details> <summary><b>Installation Using Anaconda</b></summary>

Here is an example for installation using Anaconda:

Create the environment first:

```sh
# Create virtual environment
# Only need to run once
conda create -n cleva python=3.8 pip

# Activate the virtual environment
conda activate cleva
```

Then install the dependencies:

```sh
pip install crfm-helm
```
</details>
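As a quick sanity check after installing (an optional step, not from the original instructions), you can confirm that HELM's command-line entry point is available:

```sh
# Verify that the HELM CLI installed correctly and is on the PATH
helm-run --help
```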

## ⚖️ Evaluation

Example command to evaluate gpt-3.5-turbo-0613 on CLEVA's Chinese-to-English translation task using HELM:

```sh
helm-run \
-r "cleva:model=openai/gpt-3.5-turbo-0613,task=translation,subtask=zh2en,prompt_id=0,version=v1,data_augmentation=cleva" \
--num-train-trials <num_trials> \
--max-eval-instances <max_eval_instances> \
--suite <suite_id>
```

The run configuration passed via `-r` specifies the `model` to evaluate, the CLEVA `task` and (optional) `subtask`, the prompt template via `prompt_id`, the data `version`, and the `data_augmentation` setting (CLEVA's perturbations).

For other parameters, please refer to HELM's tutorial.
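As an illustration (not part of the original docs), switching to another CLEVA scenario only changes the task-related fields of the run spec; for example, the fact checking task that also appears in the reference results below could be run as follows, assuming `prompt_id=0` exists for that task (check the .conf file mentioned next for the valid values):

```sh
# Illustrative sketch: same model, different CLEVA scenario (no subtask for fact_checking)
# prompt_id=0 is an assumption; consult run_specs_cleva_v1.conf for the available prompt_ids
helm-run \
-r "cleva:model=openai/gpt-3.5-turbo-0613,task=fact_checking,prompt_id=0,version=v1,data_augmentation=cleva" \
--num-train-trials <num_trials> \
--max-eval-instances <max_eval_instances> \
--suite <suite_id>
```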

The full list of available `task`, `subtask`, and `prompt_id` values of CLEVA (`version=v1`) can be found in HELM's .conf file. Users can run the entire CLEVA evaluation suite using the following command (the running time for reproducing CLEVA results can be found in the paper):

```sh
helm-run \
-c src/helm/benchmark/presentation/run_specs_cleva_v1.conf \
--num-train-trials <num_trials> \
--max-eval-instances <max_eval_instances> \
--suite <suite_id>
```

Generally, setting `--max-eval-instances` to over 5000 ensures all CLEVA task data are used for evaluation.
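Once the runs finish, the results can be aggregated and browsed with HELM's standard tooling (generic HELM commands rather than anything CLEVA-specific; see HELM's tutorial for details):

```sh
# Aggregate the results of all runs in the suite
helm-summarize --suite <suite_id>

# Launch a local web server to browse the aggregated results
helm-server
```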

## 📊 Reference Results

A comparison between the results of evaluating `gpt-3.5-turbo-0613` on selected CLEVA tasks (`version=v1`) via HELM and those from the CLEVA platform:

| Scenario | Metric | Reproduced in HELM | Evaluated by CLEVA |
| --- | --- | --- | --- |
| `task=summarization,subtask=dialogue_summarization` | ROUGE-2 | 0.3045 | 0.3065 |
| `task=translation,subtask=en2zh` | SacreBLEU | 60.48 | 59.23 |
| `task=fact_checking` | Exact Match | 0.4595 | 0.4528 |
| `task=bias,subtask=dialogue_region_bias` | Micro F1 | 0.5656 | 0.5589 |

> **Note**<br />
> The differences are mainly due to different random seeds (which result in different in-context demonstrations) and to the fact that the ChatGPT versions used by CLEVA and HELM are not completely aligned.

## ⏬ Download Data

If you want to use CLEVA data for evaluation with your own code, you can download the data by running:

```sh
bash download_data.sh
```

After a successful run, a folder named after the data version (v1 by default) is created in the current directory, containing the data for each CLEVA task. You can specify the data version by passing an argument to download_data.sh.
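To get a quick look at the layout of the downloaded data, the following is a minimal sketch assuming the default v1 version; the actual folder and file names depend on the data release, so adjust accordingly:

```sh
# List the per-task folders in the downloaded data (assumes the default v1 version)
ls v1/

# Show a handful of data files to check their names and formats
find v1 -type f | head -n 5
```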

## 🛂 License

CLEVA is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

You should have received a copy of the license along with this work. If not, see https://creativecommons.org/licenses/by-nc-nd/4.0/.

## 🖊️ Citation

Please cite our paper if you use CLEVA in your work:

```bibtex
@misc{li2023cleva,
      title={CLEVA: Chinese Language Models EVAluation Platform},
      author={Yanyang Li and Jianqiao Zhao and Duo Zheng and Zi-Yuan Hu and Zhi Chen and Xiaohui Su and Yongfeng Huang and Shijia Huang and Dahua Lin and Michael R. Lyu and Liwei Wang},
      year={2023},
      eprint={2308.04813},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```