OlympiadBench

<p align="center"> <img src="resources/title.png" style="width: 95%;" id="title-icon"> </p> <p align="center"> 📄 <a href="https://arxiv.org/abs/2402.14008" target="_blank">Paper</a> &nbsp; | &nbsp; 🤗 <a href="https://huggingface.co/datasets/Hothan/OlympiadBench" target="_blank">Hugging Face</a> </p>

This repo contains the evaluation code for the paper "OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems".

News!

Leaderboard

Experiment with full benchmark

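| Model | Math | Physics | Avg. |
| --- | --- | --- | --- |
| GPT-4o | 32.48 | 13.10 | 25.89 |
| GPT-4V | 21.70 | 10.74 | 17.97 |
| Qwen-VL-Max | 12.65 | 5.09 | 10.09 |
| Claude3-Opus | 9.06 | 4.93 | 7.65 |
| Gemini-Pro-Vision | 5.14 | 2.45 | 4.22 |
| Yi-VL-34B | 4.23 | 1.46 | 3.42 |
| LLaVA-NeXT-34B | 4.30 | 2.08 | 3.65 |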

Experiment with text-only problems

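| Model | Math | Physics | Avg. |
| --- | --- | --- | --- |
| GPT-4o | 41.54 | 27.64 | 39.72 |
| GPT-4 | 32.00 | 16.24 | 29.93 |
| GPT-4V | 31.01 | 16.24 | 29.07 |
| Qwen-VL-Max | 19.70 | 8.83 | 18.27 |
| Claude3-Opus | 13.43 | 10.83 | 13.09 |
| Gemini-Pro-Vision | 7.63 | 5.41 | 7.34 |
| Llama-3-70B-Instruct | 20.92 | 15.95 | 20.27 |
| DeepSeekMath-7B-RL | 18.09 | 9.97 | 17.02 |
| Yi-VL-34B | 6.24 | 2.28 | 5.72 |
| LLaVA-NeXT-34B | 6.29 | 3.13 | 5.87 |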

Overview

We introduce OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark. Notably, the best-performing model in the paper's experiments, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics, highlighting the benchmark's rigor and the intricacy of physical reasoning. The leaderboard above also lists models evaluated later, such as GPT-4o.

<p align="center"><img src="resources/imo_example.png" style="width: 85%;"></p>

Data process

<p align="center"><img src="resources/data_process.png" style="width: 85%;"></p>

This collection comprises 8,476 math and physics problems sourced from Olympiad-level competitions and the Chinese College Entrance Exam.

<!-- Comparisons with related benchmarks are as follows, which show OlympiadBench has a significant advantage. <p align="center"> <img src="resources/comparison.png" style="width: 85%;"> </p> -->

We use Mathpix OCR to parse the official PDFs, then meticulously inspect, clean, revise, and deduplicate the data. Finally, we annotate the data with crucial information such as answer types and subfields, yielding a dataset that is clean, accurate, and detailed. OlympiadBench includes open-ended questions and proof problems. For the open-ended questions, we standardize the answer format and develop an automated scoring pipeline, which is part of the evaluation code in this repo. For the proof problems, we conduct sample assessments.
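To illustrate why a standardized answer format makes automated scoring feasible, here is a minimal sketch of a numerical comparison. It is not the scoring pipeline used for the benchmark: the helper names `to_number` and `numeric_match`, the tolerance values, and the choice to handle only plain numbers and simple `\frac{a}{b}` expressions are all assumptions made for this example.

```python
import math
import re
from fractions import Fraction
from typing import Optional

# Matches a simple LaTeX fraction with integer numerator and denominator.
FRAC = re.compile(r"^\\frac\{(-?\d+)\}\{(-?\d+)\}$")


def to_number(answer: str) -> Optional[float]:
    """Convert an answer string to a float, or return None if it is not numeric."""
    text = answer.strip().strip("$").strip()
    match = FRAC.match(text)
    if match:
        numerator, denominator = int(match.group(1)), int(match.group(2))
        if denominator == 0:
            return None
        return float(Fraction(numerator, denominator))
    try:
        return float(text)
    except ValueError:
        return None


def numeric_match(prediction: str, reference: str, rel_tol: float = 1e-8) -> bool:
    """Compare two answers numerically; fall back to exact string equality."""
    pred, ref = to_number(prediction), to_number(reference)
    if pred is None or ref is None:
        # Hypothetical fallback: non-numeric answers must match as strings.
        return prediction.strip() == reference.strip()
    return math.isclose(pred, ref, rel_tol=rel_tol, abs_tol=1e-12)


print(numeric_match("0.5", "$\\frac{1}{2}$"))
print(numeric_match("0.51", "$\\frac{1}{2}$"))
```

A real pipeline would also need to handle expressions, intervals, units, and multiple answers, which this sketch deliberately leaves out.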

<!-- ![statistics of olympiadbench](resources/Statistics_of_OlympiadBench.png) --> <p align="center"><img src="resources/new_Statistics_of_OlympiadBench.png" style="width: 85%;"></p> <p align="center"><img src="resources/pipeline.png" style="width: 85%;"></p>

The downloaded dataset contains two folders, `data` and `images`. The `data` folder holds the problems, split into categorized JSON files such as `OE_MM_physics_en_COMP.json` and `TP_TO_maths_zh_CEE.json`. The parts of each file name have the following meanings:

  * OE: Open-ended questions
  * TP: Theorem proof problems
  * MM: Multimodal
  * TO: Text-only
  * physics: Physics problems
  * maths: Math problems
  * en: English
  * zh: Chinese
  * COMP: Competition problems
  * CEE: Chinese College Entrance Exam problems

The `images` folder contains the images referenced by the problems in `data`.
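The file-name convention above can be decoded mechanically. The following is a small sketch, assuming every file name has exactly five underscore-separated fields in the order shown in the two examples. The function name `parse_dataset_filename` and the dictionary keys are hypothetical and not part of the repo.

```python
from typing import Dict

# Codes taken from the naming list above; the descriptions are paraphrased.
QUESTION_TYPE = {"OE": "open-ended", "TP": "theorem proof"}
MODALITY = {"MM": "multimodal", "TO": "text-only"}
SUBJECT = {"physics": "physics", "maths": "math"}
LANGUAGE = {"en": "English", "zh": "Chinese"}
SOURCE = {"COMP": "competition", "CEE": "Chinese College Entrance Exam"}


def parse_dataset_filename(filename: str) -> Dict[str, str]:
    """Decode a name such as 'OE_MM_physics_en_COMP.json' into its attributes."""
    stem = filename.rsplit("/", 1)[-1]  # drop any leading directory
    if stem.endswith(".json"):
        stem = stem[: -len(".json")]
    parts = stem.split("_")
    if len(parts) != 5:
        raise ValueError(f"expected 5 underscore-separated fields, got {parts!r}")
    qtype, modality, subject, language, source = parts
    return {
        "question_type": QUESTION_TYPE[qtype],
        "modality": MODALITY[modality],
        "subject": SUBJECT[subject],
        "language": LANGUAGE[language],
        "source": SOURCE[source],
    }


print(parse_dataset_filename("data/OE_MM_physics_en_COMP.json"))
```

An unknown code raises a `KeyError`, which makes unexpected file names visible early instead of silently mislabeling a subset.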

The data format for all datasets is as follows:

  {
        "id": 2231,
        "subfield": "Geometry",
        "context": null,
        "question": "Turbo the snail sits on a point on a circle with circumference 1. Given an infinite sequence of positive real numbers $c_{1}, c_{2}, c_{3}, \\ldots$. Turbo successively crawls distances $c_{1}, c_{2}, c_{3}, \\ldots$ around the circle, each time choosing to crawl either clockwise or counterclockwise.\n\nFor example, if the sequence $c_{1}, c_{2}, c_{3}, \\ldots$ is $0.4,0.6,0.3, \\ldots$, then Turbo may start crawling as follows:\n<img_3362>\n\nDetermine the largest constant $C>0$ with the following property: for every sequence of positive real numbers $c_{1}, c_{2}, c_{3}, \\ldots$ with $c_{i}<C$ for all $i$, Turbo can (after studying the sequence) ensure that there is some point on the circle that it will never visit or crawl across.",
        "solution": [
        "The largest possible $C$ is $C=\\frac{1}{2}$.\n\nFor $0<C \\leqslant \\frac{1}{2}$, ...... that we cannot force Chet out of $[-1+\\varepsilon, 1-\\varepsilon]$. Hence $M \\geqslant 2$ as needed."
        ],
        "final_answer": [
        "$\\frac{1}{2}$"
        ],
        "is_multiple_answer": false,
        "unit": null,
        "answer_type": "Numerical",
        "error": null
  }
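A record in this format can be inspected with a few lines of Python. The sketch below is illustrative only: the helper names `image_ids` and `reference_answers` are hypothetical, and the sample record is an abbreviated copy of the example above. How an `<img_N>` placeholder maps to a file name inside `images` is not specified here, so the code extracts only the numeric ids.

```python
import re
from typing import List

# Placeholders such as <img_3362> mark where an image appears in the text.
IMG_TAG = re.compile(r"<img_(\d+)>")


def image_ids(record: dict) -> List[int]:
    """Return the numeric ids of all image placeholders in context and question."""
    text = (record.get("context") or "") + "\n" + (record.get("question") or "")
    return [int(number) for number in IMG_TAG.findall(text)]


def reference_answers(record: dict) -> List[str]:
    """Return the list of reference answers, or an empty list if none are given."""
    return list(record.get("final_answer") or [])


# Abbreviated version of the example record shown above.
record = {
    "id": 2231,
    "subfield": "Geometry",
    "context": None,
    "question": "Turbo may start crawling as follows:\n<img_3362>\n\nDetermine the largest constant $C>0$ ...",
    "final_answer": ["$\\frac{1}{2}$"],
    "is_multiple_answer": False,
    "unit": None,
    "answer_type": "Numerical",
    "error": None,
}

print(image_ids(record))          # ids of the images the problem refers to
print(reference_answers(record))  # reference answers as LaTeX strings
```

Treating `null` fields as empty strings or empty lists keeps the helpers usable on text-only problems and on records without a `context`.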

Experiments

We consider both open- and closed-source LLMs and LMMs, such as GPT-4V, Gemini-Pro-Vision, Yi-VL-34B, and DeepSeekMath-7B-RL. We evaluate the models in a zero-shot setting; the prompt template for English and Chinese open-ended questions is shown below.

<p align="center"><img src="resources/prompt.png" style="width: 85%;"></p>

The key results are as follows:

<p align="center"><img src="resources/new_results.png" style="width: 85%;"></p>

Contact

If interested in our work, please contact us at:

Citation

BibTeX:

@misc{he2024olympiadbench,
      title={OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems}, 
      author={Chaoqun He and Renjie Luo and Yuzhuo Bai and Shengding Hu and Zhen Leng Thai and Junhao Shen and Jinyi Hu and Xu Han and Yujie Huang and Yuxiang Zhang and Jie Liu and Lei Qi and Zhiyuan Liu and Maosong Sun},
      year={2024},
      eprint={2402.14008},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}