<div align="center"> <h2><i>ConsisEval:</i> A Hard-to-Easy Consistency Evaluation<br>Benchmark for Large Language Models</h2> </div>

This is the repo for our paper: *Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?*
## Overview
ConsisEval is developed to systematically evaluate the hard-to-easy consistency of LLMs. Here, hard-to-easy inconsistency refers to the counter-intuitive phenomenon where LLMs, although capable of solving hard problems, paradoxically fail on easier ones.
ConsisEval includes 732 pairs of questions from the code (164), mathematics (298), and instruction-following (270) domains. Note that ConsisEval contains only pairwise data: each datum comprises two questions (an easy question and a harder one), with a strict order of difficulty between them.
## Data
- Easy data is collected from GSM8K, IFEval, and HumanEval.
- Hard data is derived from the easy data via automatic generation and human annotation.
- ConsisEval (the combination of easy and hard data) is in the `data` directory.
## Evaluation Metric
- **Consistency Score (CS)**: the conditional probability that a model answers an easy question correctly, given that it has correctly answered the corresponding harder question.

For more details about the metrics, please refer to our paper; a minimal sketch of how CS can be estimated is shown below.
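The following is a minimal sketch of this estimation, not the actual implementation in `analysis.py`; the record fields and toy results are illustrative assumptions only.

```python
# Minimal sketch, not the repository's actual analysis code.
# Each record marks whether the model answered the easy / hard question of a pair correctly.
pairs = [
    {"easy_correct": True,  "hard_correct": True},   # consistent
    {"easy_correct": False, "hard_correct": True},   # hard solved, easy failed (inconsistent)
    {"easy_correct": True,  "hard_correct": False},  # hard unsolved, not counted for CS
]

# CS = P(easy correct | hard correct): restrict to pairs whose hard question was solved.
hard_solved = [p for p in pairs if p["hard_correct"]]
consistency_score = sum(p["easy_correct"] for p in hard_solved) / len(hard_solved)
print(f"CS = {consistency_score:.2f}")  # 0.50 for the toy records above
```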
## Environments
All Python packages required to run the code are listed in `requirements.txt` and can be installed by:

```bash
pip install -r requirements.txt
```
For evaluation on the instruction-following domain, please run the following Python code to download `punkt`:

```python
>>> import nltk
>>> nltk.download('punkt')
```
## Evaluation
### Answer Generation
The code and script for answer generation are in `eval.py` and `eval.sh`, and the following arguments should be set properly before launching:

- `--model_name`: the name of the evaluated model
- `--model_path`: the path to the evaluated model (if not specified, the model will be downloaded from HuggingFace)
- `--task`: the evaluated domain (only `code`, `math`, and `instruction_following` are supported)
- `--sampling_times`: the number of times each question is repeatedly sampled
After running `bash eval.sh`, the answers generated by the evaluated model will be stored in `./log`.
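For example, a run on the math domain might look like the command below. The model name, path, and sampling count are placeholders, and it assumes `eval.py` accepts these flags directly; in practice you would typically set them inside `eval.sh` and run `bash eval.sh`.

```bash
python eval.py \
    --model_name llama-2-7b-chat \
    --model_path /path/to/llama-2-7b-chat \
    --task math \
    --sampling_times 10
```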
### Metric Computation
The code and script for metric computation are in `analysis.py` and `analysis.sh`; `--model_name` and `--task` should be specified.
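For example (the model name below is a placeholder, and this assumes `analysis.py` accepts the flags directly; alternatively, set them in `analysis.sh` and run `bash analysis.sh`):

```bash
python analysis.py --model_name llama-2-7b-chat --task math
```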
## Citation
If you find the resources in this repository useful, please cite our paper:

```bibtex
@misc{yang2024large,
      title={Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?},
      author={Zhe Yang and Yichang Zhang and Tianyu Liu and Jian Yang and Junyang Lin and Chang Zhou and Zhifang Sui},
      year={2024},
      eprint={2406.12809},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```