
A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets

This is the repository for our ACL'23 Findings paper, A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets.

(Figure: overview of the evaluation benchmarks.)

Yet another ChatGPT evaluation? What's new this time? The evaluation is not fully automatic: we keep a human in the loop. Our ACL'23 paper covers a full evaluation on benchmarks that actually matter.

This is the largest ChatGPT evaluation so far: 255K responses across 140 tasks.

(Figure: the datasets covered in the evaluation.)

Please consider citing the following if you use the data or results from this paper:

@misc{laskar2023systematic,
      title={A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets}, 
      author={Md Tahmid Rahman Laskar and M Saiful Bari and Mizanur Rahman and Md Amran Hossen Bhuiyan and Shafiq Joty and Jimmy Xiangji Huang},
      year={2023},
      eprint={2305.18486},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Data

All the data can be downloaded from here.

Findings

Here is a short summary of the findings from the paper:

(Result figures: super_glue, human_in_a_loop, bigbench, inverse, polyquery, summarization, math, truthfulqa, ethics, utilitarian.)

Data Generation Process

We used promptsource to generate our evaluation data; the data shared in this repo are already in prompted format. If you compare against these numbers in your work, please use the same data for a fair comparison. A minimal example of the prompting step is sketched below.
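
For reference, here is a minimal sketch of how promptsource renders a raw example into a prompted input/target pair. The dataset (SuperGLUE BoolQ) and the template picked are illustrative assumptions, not necessarily the exact ones used in the paper; the shared data already reflects the templates we actually applied.

```python
# Minimal sketch of rendering evaluation data with promptsource.
# Assumes `pip install promptsource datasets`; the dataset and template
# choices below are illustrative, not the exact ones used in the paper.
from datasets import load_dataset
from promptsource.templates import DatasetTemplates

# Load one SuperGLUE task (BoolQ) and its promptsource templates.
dataset = load_dataset("super_glue", "boolq", split="validation")
templates = DatasetTemplates("super_glue", "boolq")

# Pick the first available template by name (any template works the same way).
template = templates[templates.all_template_names[0]]

# Applying a template turns a raw example into (prompted input, gold target).
prompted_input, target = template.apply(dataset[0])
print(prompted_input)
print("Gold answer:", target)
```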