A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets
This is the repository for our ACL'23 Findings paper, A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets.
Yet another ChatGPT evaluation? What's new this time? The evaluation is not fully automatic: there is a human in the loop. Our ACL'23 paper covers a full evaluation on benchmarks that actually matter!
We present the largest ChatGPT evaluation to date: 255K responses across 140 tasks.
Please consider citing if you use the data or results from this paper.
```bibtex
@misc{laskar2023systematic,
      title={A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets},
      author={Md Tahmid Rahman Laskar and M Saiful Bari and Mizanur Rahman and Md Amran Hossen Bhuiyan and Shafiq Joty and Jimmy Xiangji Huang},
      year={2023},
      eprint={2305.18486},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
Data
All the data can be downloaded from here.
Findings
Here is a short summary of the findings from the paper:
- As a general-purpose instruction-following multitask model, ChatGPT performs worse than SOTA single-task fine-tuned models; for targeted tasks, a fine-tuned model may still be preferable.
- The evaluation of ChatGPT-like LLMs should include human intervention rather than relying on fully automatic evaluation.
- ChatGPT can often perform on par with an average human on algorithmic tasks.
- For the same input prompt, different versions of ChatGPT may yield significantly different results.
- Though ChatGPT's basic reasoning capability is exceptional with Chain-of-Thought (CoT) prompting, it sometimes suffers severe catastrophic forgetting on newly defined reasoning tasks when CoT prompting is not used.
- We also identify an interesting capability, which we name PolyQuery Synthesis: ChatGPT can attend to multiple questions in a single query (e.g., several unrelated questions packed into one prompt) and respond to each accordingly, although adding too many questions may reduce the model's performance. This feature shows a sharp trend across model scales.
- Though ChatGPT has multilingual capabilities, its performance on underrepresented languages is very low.
- Though ChatGPT's open-domain knowledge capability is extremely high, it often underperforms on several commonsense reasoning tasks (e.g., PIQA, SIQA, HellaSwag, WinoGrande) compared to competing models such as PaLM 540B and LLaMA 65B.
- For text summarization, ChatGPT cannot outperform the current SOTA models on the ROUGE metric (see the scoring sketch after this list). However, our annotators prefer ChatGPT's summaries over those of the SOTA models: 78% of the time on CNN/DM and 92% of the time on XSUM. This suggests that we may need a new summarization metric to evaluate instruction-tuned LLMs like ChatGPT.
- ChatGPT has very strong zero-shot mathematical and coding capabilities in comparison to other LLMs.
- ChatGPT is found to be more ethical than prior SOTA models, while being less biased and more truthful.
- ChatGPT sometimes considers utilitarian morality and can respond to ethical dilemma-related queries.
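To make the ROUGE side of the summarization finding concrete, here is a minimal scoring sketch using the Hugging Face `evaluate` library. The library choice and the example strings are our assumptions for illustration; the paper does not prescribe this exact implementation.

```python
# Minimal sketch of ROUGE scoring with Hugging Face `evaluate` (an assumption;
# any standard ROUGE implementation works the same way).
import evaluate

rouge = evaluate.load("rouge")

# Hypothetical model summary and reference summary, for illustration only.
predictions = ["the cat sat on the mat and watched the birds outside"]
references = ["a cat was sitting on a mat, watching birds outside the window"]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # e.g. {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
```

ROUGE only measures n-gram overlap with the reference, which is exactly why a fluent, abstractive ChatGPT summary can score poorly while still being preferred by human annotators.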
Data Generation Process
We used promptsource to generate our evaluation data. The data shared in this repo is already in prompted format. If you compare against these numbers in your work, please use the same data for a fair comparison. A minimal usage sketch follows below.
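Here is a minimal sketch of how promptsource renders a dataset example into prompted format. The dataset (`ag_news`) and the first-template choice are illustrative assumptions, not the exact configuration used to produce this repo's data.

```python
# Minimal sketch: render one dataset example through a promptsource template.
# `ag_news` and picking the first template are illustrative assumptions.
from datasets import load_dataset
from promptsource.templates import DatasetTemplates

dataset = load_dataset("ag_news", split="test")
templates = DatasetTemplates("ag_news")
print(templates.all_template_names)  # available prompt templates for this dataset

template = templates[templates.all_template_names[0]]
result = template.apply(dataset[0])  # typically [prompted_input, target]
print(result[0])   # the prompted input that would be sent to the model
print(result[-1])  # the expected target string
```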