<div align="center"> <img src="assets/calm_logo.png" width="300px"/> <br /> <br />

🌐 Website | 📃 Report | 📧 Welcome to join us by email at causalai@pjlab.org.cn

</div>

Causal Evaluation of Language Models (CaLM)

We introduce Causal evaluation of Language Models (CaLM), which, to the best of our knowledge, is the first comprehensive benchmark for evaluating the causal reasoning capabilities of language models. The CaLM framework establishes a foundational taxonomy consisting of four modules: causal target (i.e., what to evaluate), adaptation (i.e., how to obtain the results), metric (i.e., how to measure the results), and error (i.e., how to analyze the bad results).

<div align="center"> <img src="assets/calm_suite.png" width="800px"/> </div>

📣 News

[2024.8.8] CaLM Lite, a lightweight version of CaLM, is now available on OpenCompass and this repository. It utilizes a dataset of 9,200 items, approximately one-tenth the size of the original CaLM dataset. Users can now evaluate their model performance on CaLM Lite independently. For more detailed information, please refer to CaLM Lite.

[2024.5.1] Causal Evaluation of Language Models (CaLM) is released, including the technical report, evaluation dataset, and codebase.

🤩 Participate by Submitting Your Results!

We invite you to contribute to our project by submitting your model-generated results.

Additionally, we welcome contributions such as new models, prompts, datasets, and metrics. For more information, please contact us at causalai@pjlab.org.cn.

⌨️ Quick Start

Installation

git clone https://github.com/OpenCausaLab/CaLM.git
conda create -n calm python=3.8
conda activate calm
pip install -r requirements.txt

Run Models and Save Results

First, download the model you want to run if it is open-source, or obtain an API key if it is a limited-access model. Then put your model's directory or API key in model_configs: either add it to default.json or create a file named {model}.json. The model details, including where to download the open-source models, are specified in model details.
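
As a rough illustration only, the snippet below writes a hypothetical config file for a locally downloaded model; the key names (model_path, api_key) are assumptions, so check default.json in model_configs for the actual schema expected by the repository.

```python
# Hypothetical sketch: write model_configs/vicuna_33b.json for a locally
# downloaded checkpoint. The key names ("model_path", "api_key") are
# assumptions -- consult model_configs/default.json for the real schema.
import json
import os

config = {
    "model_path": "/path/to/vicuna-33b",   # directory of the open-source checkpoint
    # "api_key": "YOUR_API_KEY",           # or an API key for a hosted model
}

os.makedirs("model_configs", exist_ok=True)
with open("model_configs/vicuna_33b.json", "w") as f:
    json.dump(config, f, indent=2)
```

Then run the model, for example: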

python calm/run.py --models vicuna_33b -p zero-shot-IcL -t PCD-B_E-CARE_EN  -mcfg ./model_configs -o ./output

For the CaLM Lite version, add the argument -l or --lite_version:

python calm/run.py --models vicuna_33b -p zero-shot-IcL -t PCD-B_E-CARE_EN  -mcfg ./model_configs -o ./output -l

Required Arguments

Optional Arguments

Evaluate Results

python calm/evaluate.py --models vicuna_33b -p zero-shot-IcL -t PCD-B_E-CARE_EN -cm -ea -am -o ./output

Similarly, for the CaLM Lite version, add the argument -l or --lite_version.

Required Arguments

Optional Arguments

The CaLM Lite version supports metric computation for all tasks in its dataset. Model developers who would like to evaluate on the whole CaLM dataset should reach out to us by email at causalai@pjlab.org.cn; we need the generated model-response JSON files (responses.json) for evaluation. We will get back to you within three days and send the evaluation results afterwards. For more details, refer to our submission guideline.
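
Before reaching out, it can help to sanity-check the generated responses.json files. The sketch below assumes each file is valid JSON somewhere under the ./output directory used above; the exact folder layout is an assumption.

```python
# Sanity-check generated response files before submission.
# Assumes each responses.json is valid JSON; the layout under ./output
# is an assumption -- adjust the search root to your own run.
import json
from pathlib import Path

for path in Path("./output").rglob("responses.json"):
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    print("{}: {} entries".format(path, len(data)))
```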

πŸ–ŒοΈ Available Models

Currently, we support the following 18 models. You can use them by entering their corresponding API names after -m, --models in the command line. Note that models such as ada (0.35B), babbage (1.3B), curie (6.7B), and davinci (175B) are excluded because their APIs are no longer supported by OpenAI.

To add your model, please submit a pull request and email us at causalai@pjlab.org.cn. For details on adding models to our benchmarks, see model details.

πŸ—„οΈ Available Datasets (Causal Tasks)

<img src="assets/causal_task.png">

We provide 92 datasets for causal evaluation, stored in the calm_dataset folder. The directory structure is:

├── README.md
├── LICENSE
├── calm_dataset
│   ├── causal_discovery                  # Rung of the causal ladder
│   │   ├── abstract_reasoning            # Causal scenario
│   │   │   ├── AR-B_(CaLM-AR)_CN.json    # Causal task
│   │   │   └── AR-B_(CaLM-AR)_EN.json    # Causal task
│   │   └── ...
│   └── ...
├── calm                                  # calm packages
└── ...

Each dataset represents a specific causal target in either English or Chinese. For an overview of the datasets, see tasks.
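
To take a quick look at a dataset locally, standard JSON tooling is enough. Below is a minimal sketch using the AR-B file from the tree above; the per-item schema differs across tasks, so the snippet only reports the item count and the fields of the first item.

```python
# Peek at one causal-task dataset file (path taken from the tree above).
# The per-item schema varies across tasks, so we only report the number
# of items and the fields of the first one.
import json

path = "calm_dataset/causal_discovery/abstract_reasoning/AR-B_(CaLM-AR)_EN.json"
with open(path, encoding="utf-8") as f:
    data = json.load(f)

items = data if isinstance(data, list) else list(data.values())
print("Loaded {} items from {}".format(len(items), path))
if items and isinstance(items[0], dict):
    print("Fields of the first item:", sorted(items[0].keys()))
```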

Note: A task dataset whose name contains "natural" is identical to the corresponding dataset whose name contains "basic"; e.g., "NDE-B_NDE-natural_CN" and "NDE-P_NDE-basic_CN" have the same content. We split them into two files for user convenience during running and evaluation.

For CaLM, we provide the ground truth (GT) labels for public datasets with references, but keep the GT labels of our own datasets unreleased for future use. If you want to evaluate your model on the whole dataset, kindly reach out to us by email at causalai@pjlab.org.cn; we will reply within three days. The responses.json files generated by your model will be required. For details, see the submission guideline. For CaLM Lite, this repository provides the GT labels for all of its tasks.

If you want to add your own dataset, please submit a pull request and email us at causalai@pjlab.org.cn.

🔉 Available Prompt Styles (Adaptation)

For prompts in English (default), we use names such as basic and zero-shot-IcL. For prompts in Chinese, we append "-CN" to the prompt name, e.g., basic-CN, zero-shot-IcL-CN.

Supported prompt styles for most tasks include:

For certain tasks (ATE, CDE, ETT), due to token limitations, we removed three-shot-IcL-CN, resulting in the following supported prompts:

For the NIE, NDE, PN, and PS tasks, we removed three-shot-IcL-CN and replaced three-shot-IcL with two-shot-IcL, so the supported prompts are:

We welcome contributions of new prompt styles via pull requests. Please also inform us at causalai@pjlab.org.cn.

📊 Available Metrics and Errors

Currently, we provide evaluation scripts for 7 metrics and 5 quantitative error types.
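
The precise metric definitions are given in the report and implemented in the calm package. Purely as a toy illustration (not the repository's implementation), an accuracy-style metric over already-parsed model answers and ground-truth labels could look like this:

```python
# Toy accuracy-style metric, NOT the calm package's implementation.
# Assumes model answers have already been parsed into short label strings.
from typing import List


def accuracy(predictions: List[str], ground_truth: List[str]) -> float:
    """Fraction of predictions that exactly match the ground-truth label."""
    assert len(predictions) == len(ground_truth), "length mismatch"
    if not ground_truth:
        return 0.0
    correct = sum(p.strip().lower() == g.strip().lower()
                  for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)


print(accuracy(["yes", "no", "yes"], ["yes", "yes", "yes"]))  # 0.666...
```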

Metrics

Errors

We welcome contributions of new metrics or error types through pull requests. Please also email us at causalai@pjlab.org.cn when submitting your pull request.

πŸ–‡οΈ Citation

@misc{chen2024causal,
      title={Causal Evaluation of Language Models}, 
      author={Sirui Chen and Bo Peng and Meiqi Chen and Ruiqi Wang and Mengying Xu and Xingyu Zeng and Rui Zhao and Shengjie Zhao and Yu Qiao and Chaochao Lu},
      year={2024},
      eprint={2405.00622},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}