# OOP Benchmark

Object-Oriented Programming Evaluation Benchmark for LLMs.
OOP is a code generation benchmark that <b>quantifies the object-oriented programming ability</b> of large language models (LLMs); details can be found in our paper "OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models" | [HuggingFace Link]. We collect code snippets from LeetCode, open-source repositories on GitHub, Stack Overflow, and Codewars, and all test samples have undergone carefully designed post-processing.
We show that 🔎:
- ⚠️ Despite excelling at functional programming (FP) benchmarks such as HumanEval and MBPP, code-specialized LLMs like WizardCoder lag behind proprietary models like ChatGPT on our OOP benchmark;
- 🚀 The poor performance of all advanced LLMs on our OOP benchmark highlights a critical need for improvements in this field.
## Basic Statistics
- OOP consists of 431 instances;
- OOP contains three difficulty levels: Simple-level OOP, Moderate-level OOP, and Difficult-level OOP (see the loading sketch after this list).
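
A minimal sketch of inspecting these statistics, assuming the benchmark is published on the Hugging Face Hub and that each instance carries a difficulty label. The dataset ID `alphadl/OOP` and the `difficulty` field name are assumptions, not confirmed by this README; check the HuggingFace link above for the actual schema.

```python
# Sketch only: the dataset ID "alphadl/OOP" and the "difficulty" field are
# assumptions; consult the HuggingFace page linked above for the real schema.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("alphadl/OOP", split="test")  # hypothetical hub ID
print(len(ds))  # should be 431 per the statistics above

# Tally instances per difficulty level (Simple / Moderate / Difficult)
print(Counter(example["difficulty"] for example in ds))
```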
## Performance of widely-used LLMs
<div align="center"> <img width="80%" alt="image" src="https://github.com/alphadl/OOP-eval/blob/main/img/results.jpg"> </div>
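
For context on how numbers like those in the figure are typically produced: code generation benchmarks usually report pass@k over sampled completions. The function below is the standard unbiased pass@k estimator from Chen et al. (2021, the Codex paper), shown purely as an illustration; it is not taken from this repo, and the paper's own evaluation metric may differ.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total completions sampled per problem
    c: number of completions that passed all tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per problem, 37 correct -> estimate pass@10
print(pass_at_k(n=200, c=37, k=10))
```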
## Citations

Please cite the paper and star this repo if you use OOP and find it helpful. Feel free to contact wangshuai123@whu.edu.cn or open an issue if you have any questions.
```bibtex
@article{wang2024oop,
  title={OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models},
  author={Wang, Shuai and Ding, Liang and Shen, Li and Luo, Yong and Du, Bo and Tao, Dacheng},
  journal={arXiv preprint arXiv:2401.06628},
  year={2024}
}
```