A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets
This is the repository for our ACL'23 Findings paper, A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets.
Yet another ChatGPT evaluation? What's new this time? The evaluation is not fully automatic: there is a human in the loop. Our ACL'23 paper covers a full evaluation on benchmarks that actually matter!
We present the largest ChatGPT evaluation to date: 255K responses across 140 tasks.
Please consider citing if you use the data or results from this paper.
```bibtex
@misc{laskar2023systematic,
      title={A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets},
      author={Md Tahmid Rahman Laskar and M Saiful Bari and Mizanur Rahman and Md Amran Hossen Bhuiyan and Shafiq Joty and Jimmy Xiangji Huang},
      year={2023},
      eprint={2305.18486},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
Data
All the data can be downloaded from here.
Findings
Here is a short summary of the findings from the paper:
- As a general-purpose instruction-following multitask model, ChatGPT performs worse than SOTA single-task fine-tuned models; for targeted tasks, a fine-tuned model may still be preferable.
- The evaluation of ChatGPT-like LLMs should include human intervention rather than relying on fully automatic evaluation.
- ChatGPT can often perform on par with an average human on algorithmic tasks.
- For the same input prompt, different versions of ChatGPT may yield significantly different results.
- Though ChatGPT's basic reasoning capability is exceptional with Chain-of-Thought (CoT) prompting, it sometimes suffers severe catastrophic forgetting on newly defined reasoning tasks when CoT prompting is not used.
- We also identify an interesting capability, which we name PolyQuery Synthesis: ChatGPT can attend to multiple questions in a single query (e.g., several unrelated questions packed into one prompt) and respond to each accordingly, although adding too many questions may reduce the model's performance. This feature shows a sharp trend across model scales.
- Though ChatGPT has multilingual capabilities, its performance on underrepresented languages is very low.
- Though ChatGPT's open-domain knowledge capability is extremely high, it often underperforms on several commonsense reasoning tasks (e.g., PIQA, SIQA, HellaSwag, WinoGrande) compared to competing models such as PaLM 540B and LLaMA 65B.
- For text summarization, ChatGPT cannot outperform the current SOTA models on the ROUGE metric (see the scoring sketch after this list). However, our annotators prefer ChatGPT's summaries over those of the SOTA models: 78% of the time on CNN/DM and 92% of the time on XSUM. This suggests that we may need a new summarization metric to evaluate instruction-tuned LLMs like ChatGPT.
- ChatGPT has very strong zero-shot mathematical and coding capabilities in comparison to other LLMs.
- ChatGPT is found to be more ethical than prior SOTA models, while being less biased and more truthful.
- ChatGPT sometimes considers utilitarian morality and can respond to ethical dilemma-related queries.
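To make the ROUGE side of the summarization finding concrete, here is a minimal scoring sketch using the Hugging Face `evaluate` library. The library choice and the example strings are our assumptions for illustration; the paper does not prescribe this exact implementation.

```python
# Minimal sketch of ROUGE scoring with Hugging Face `evaluate` (an assumption;
# any standard ROUGE implementation works the same way).
import evaluate

rouge = evaluate.load("rouge")

# Hypothetical model summary and reference summary, for illustration only.
predictions = ["the cat sat on the mat and watched the birds outside"]
references = ["a cat was sitting on a mat, watching birds outside the window"]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # e.g. {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
```

ROUGE only measures n-gram overlap with the reference, which is exactly why a fluent, abstractive ChatGPT summary can score poorly while still being preferred by human annotators.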
Data Generation Process
We used promptsource to generate our evaluation data. The data shared in this repo is already in prompted format. If you compare against these numbers in your work, please use the same data for a fair comparison. A minimal usage sketch follows below.
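Here is a minimal sketch of how promptsource renders a dataset example into prompted format. The dataset (`ag_news`) and the first-template choice are illustrative assumptions, not the exact configuration used to produce this repo's data.

```python
# Minimal sketch: render one dataset example through a promptsource template.
# `ag_news` and picking the first template are illustrative assumptions.
from datasets import load_dataset
from promptsource.templates import DatasetTemplates

dataset = load_dataset("ag_news", split="test")
templates = DatasetTemplates("ag_news")
print(templates.all_template_names)  # available prompt templates for this dataset

template = templates[templates.all_template_names[0]]
result = template.apply(dataset[0])  # typically [prompted_input, target]
print(result[0])   # the prompted input that would be sent to the model
print(result[-1])  # the expected target string
```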