MC Evaluation

This is the repo for the paper Multiple-Choice Questions are Efficient and Robust LLM Evaluators.

Data

In data.tar.gz, there are three folders and five jsonl files:

In each of the three folders, there is one test.jsonl and one train.jsonl, which contain the multiple-choice questions used in our paper.

The other five jsonl files contain the complete candidate pool we generated for each problem, stored as a list of strings:

{
    "question":"Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
    "candidates":["72","84","96","292","4896","36","60","144","48","6","1800","30040"]
}

For all questions, the first candidate in the list is the ground-truth answer.
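
As an illustration, the snippet below is a minimal sketch of how one might read a candidate-pool file and assemble a multiple-choice item from it. The file name, the helper names, and the choice of four options are assumptions for illustration, not part of the repo:

import json
import random

def load_candidate_pool(path):
    # Yield (question, candidates) pairs from a candidate-pool jsonl file.
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            yield entry["question"], entry["candidates"]

def make_mc_item(question, candidates, num_options=4, seed=0):
    # Build one multiple-choice item: the first candidate is the ground truth,
    # the remaining candidates serve as distractors.
    rng = random.Random(seed)
    answer, distractors = candidates[0], candidates[1:]
    options = [answer] + rng.sample(distractors, num_options - 1)
    rng.shuffle(options)
    labels = "ABCD"[:num_options]
    gold = labels[options.index(answer)]
    prompt = question + "\n" + "\n".join(
        f"{label}. {opt}" for label, opt in zip(labels, options)
    )
    return prompt, gold

# "gsm8k_candidates.jsonl" is a placeholder -- use whichever candidate-pool file you extracted.
for question, candidates in load_candidate_pool("gsm8k_candidates.jsonl"):
    prompt, gold = make_mc_item(question, candidates)
    print(prompt, "\nGold:", gold)
    break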

Evaluation

To evaluate a model on one of the three datasets, run:

python run_mc.py --dataset gsm8k --model google/flan-t5-small

where the dataset can be one of gsm8k, math, and pythonio. The model argument can be a model name on Hugging Face or a local directory.
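
For intuition, the sketch below shows one common way to score a multiple-choice question with a Hugging Face seq2seq model such as google/flan-t5-small: compare the model's log-likelihood of each option letter and pick the highest. This is only an illustration of the general idea, not necessarily the exact procedure implemented in run_mc.py:

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.eval()

def score_option(prompt, option):
    # Log-likelihood of `option` given `prompt` (negated mean cross-entropy).
    inputs = tokenizer(prompt, return_tensors="pt")
    labels = tokenizer(option, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(**inputs, labels=labels)
    return -out.loss.item()

prompt = ("Natalia sold clips to 48 of her friends in April, and then she sold "
          "half as many clips in May. How many clips did Natalia sell altogether "
          "in April and May?\nA. 72\nB. 84\nC. 96\nD. 36\nAnswer:")
options = ["A", "B", "C", "D"]
prediction = max(options, key=lambda o: score_option(prompt, o))
print("Predicted option:", prediction)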

Citation

@misc{zhang2024multiplechoice,
      title={Multiple-Choice Questions are Efficient and Robust LLM Evaluators}, 
      author={Ziyin Zhang and Lizhen Xu and Zhaokun Jiang and Hongkun Hao and Rui Wang},
      year={2024},
      eprint={2405.11966},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}