<div align="center"> <h2><i>ConsisEval:</i> A Hard-to-Easy Consistency Evaluation<br>Benchmark for Large Language Models</h2> </div>

This is the repo for our paper: *Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?*
## Overview
ConsisEval is developed to systematically evaluate the hard-to-easy consistency of LLMs. Here, hard-to-easy inconsistency refers to the counter-intuitive phenomenon where LLMs, although capable of solving hard problems, paradoxically fail on easier ones.
ConsisEval includes 732 pairs of questions from the code (164), mathematics (298), and instruction-following (270) domains. Note that ConsisEval contains only pairwise data: each datum comprises two questions (an easy question and a harder one), with a strict order of difficulty between them.
## Data
- Easy data is collected from GSM8K, IFEval, and HumanEval.
- Hard data is derived from the easy data via automatic generation and human annotation.
- ConsisEval (the combination of easy and hard data) is in the `data` directory.
## Evaluation Metric
- **Consistency Score (CS)**: the conditional probability that a model answers an easy question correctly, given that it has correctly answered the corresponding harder question.

For more details about the metrics, please refer to our paper; a minimal sketch of how CS can be estimated is shown below.
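The following is a minimal sketch of this estimation, not the actual implementation in `analysis.py`; the record fields and toy results are illustrative assumptions only.

```python
# Minimal sketch, not the repository's actual analysis code.
# Each record marks whether the model answered the easy / hard question of a pair correctly.
pairs = [
    {"easy_correct": True,  "hard_correct": True},   # consistent
    {"easy_correct": False, "hard_correct": True},   # hard solved, easy failed (inconsistent)
    {"easy_correct": True,  "hard_correct": False},  # hard unsolved, not counted for CS
]

# CS = P(easy correct | hard correct): restrict to pairs whose hard question was solved.
hard_solved = [p for p in pairs if p["hard_correct"]]
consistency_score = sum(p["easy_correct"] for p in hard_solved) / len(hard_solved)
print(f"CS = {consistency_score:.2f}")  # 0.50 for the toy records above
```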
## Environments
All Python packages required to run the code are listed in `requirements.txt` and can be installed by:

```bash
pip install -r requirements.txt
```
For evaluation on the instruction-following domain, please run the following Python code to download `punkt`:

```python
>>> import nltk
>>> nltk.download('punkt')
```
## Evaluation
### Answer Generation
The code and script for answer generation are in `eval.py` and `eval.sh`, and the following arguments should be set properly before launching:

- `--model_name`: the name of the evaluated model
- `--model_path`: the path to the evaluated model (if not specified, the model will be downloaded from HuggingFace)
- `--task`: the evaluated domain (only `code`, `math`, and `instruction_following` are supported)
- `--sampling_times`: the number of times each question is repeatedly sampled
After running `bash eval.sh`, the answers generated by the evaluated model will be stored in `./log`.
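For example, a run on the math domain might look like the command below. The model name, path, and sampling count are placeholders, and it assumes `eval.py` accepts these flags directly; in practice you would typically set them inside `eval.sh` and run `bash eval.sh`.

```bash
python eval.py \
    --model_name llama-2-7b-chat \
    --model_path /path/to/llama-2-7b-chat \
    --task math \
    --sampling_times 10
```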
### Metric Computation
The code and script for metric computation are in `analysis.py` and `analysis.sh`; `--model_name` and `--task` should be specified.
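For example (the model name below is a placeholder, and this assumes `analysis.py` accepts the flags directly; alternatively, set them in `analysis.sh` and run `bash analysis.sh`):

```bash
python analysis.py --model_name llama-2-7b-chat --task math
```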
## Citation
If you find the resources in this repository useful, please cite our paper:

```bibtex
@misc{yang2024large,
      title={Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?},
      author={Zhe Yang and Yichang Zhang and Tianyu Liu and Jian Yang and Junyang Lin and Chang Zhou and Zhifang Sui},
      year={2024},
      eprint={2406.12809},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```