<div align="center"> <h2><i>ConsisEval:</i> A Hard-to-Easy Consistency Evaluation<br>Benchmark for Large Language Models</h2> </div> <!-- <p align="center"> | <b>Paper</b> | <b>Leaderboard</b> | </p> --> <!-- This is the repo for our paper: Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones? -->

## Overview

ConsisEval is developed to systematically evaluate the hard-to-easy consistency of LLMs. Here, hard-to-easy inconsistency refers to the counter-intuitive phenomenon where LLMs, despite being capable of solving hard problems, paradoxically fail at easier ones.

ConsisEval includes 732 pairs of questions from the code (164), mathematics (298), and instruction-following (270) domains. Note that ConsisEval contains only pairwise data: each datum comprises two questions (an easy question and a harder one), and there is a strict order of difficulty between them.

## Data
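
For illustration only, a pairwise datum can be thought of as an easy/hard question pair plus its domain. The sketch below uses hypothetical field names and toy questions to convey the structure; it is not the actual data schema.

```python
# Hypothetical illustration of one pairwise datum (math domain); the field
# names and questions are assumptions for readability, not the real schema.
datum = {
    "easy_question": "Compute 15 + 27.",
    "hard_question": "Compute the sum of all integers from 1 to 100.",
    "domain": "math",
}
```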

## Evaluation Metric

<!-- - Relative Consistency Score (RCS): the rank of CS among a series of models with similar capabilities, indicating the potential for consistency improvement at current capability. -->

The main metric is the Consistency Score (CS): the probability that a model correctly answers the easy question conditioned on it correctly answering the corresponding hard question. For more details about the metrics, please refer to our paper.
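
As a minimal sketch, assuming CS is estimated as the fraction of pairs whose easy question is answered correctly among pairs whose hard question is answered correctly (the actual implementation lives in analysis.py and may differ in details):

```python
# Minimal sketch of estimating CS from pairwise results; see analysis.py
# for the actual computation used in this repository.
def consistency_score(results):
    """results: list of (easy_correct, hard_correct) booleans, one entry per pair."""
    easy_given_hard = [easy for easy, hard in results if hard]
    if not easy_given_hard:
        return 0.0  # no hard question solved; CS is not well defined in this case
    return sum(easy_given_hard) / len(easy_given_hard)

# Toy example: the hard question is solved in 3 pairs, the easy one in 2 of them.
print(consistency_score([(True, True), (False, True), (True, True), (True, False)]))  # ~0.667
```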

## Environments

All required Python packages are listed in `requirements.txt` and can be installed with:

```bash
pip install -r requirements.txt
```

For evaluation on the instruction-following domain, run the following Python code to download the punkt tokenizer:

```python
>>> import nltk
>>> nltk.download('punkt')
```

## Evaluation

### Answer Generation

The code and script for answer generation are `eval.py` and `eval.sh`, and the proper arguments should be set before launching.

After running `bash eval.sh`, the answers generated by the evaluated model will be stored in `./log`.

### Metric Computation

The code and script for metric computation are `analysis.py` and `analysis.sh`.

`--model_name` and `--task` should be specified for metric computation.

<!-- Besides, please make sure model-generated answers are in [log](./log) directory. -->

## Citation

<!-- If you find the resources in this repository useful, please cite our paper: -->
```bibtex
@misc{yang2024large,
      title={Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?},
      author={Zhe Yang and Yichang Zhang and Tianyu Liu and Jian Yang and Junyang Lin and Chang Zhou and Zhifang Sui},
      year={2024},
      eprint={2406.12809},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```