MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark

<div align="center">
  <img src="https://github.com/open-compass/opencompass/assets/28834990/c285f051-f6cb-4425-8045-863bb94095ed" width="400">
</div>

<div align="center">

[📄Paper] [📚LeaderBoard]

</div>

โ˜€๏ธIntroduction

MathBench is an all-in-one math dataset for language model evaluation, with:

- a hierarchical, five-stage difficulty design, from basic arithmetic up to college-level mathematics;
- separate application (MathBench-A) and theory (MathBench-T) subsets;
- bilingual (English and Chinese) questions;
- a Circular Evaluation (CE) protocol for robust multiple-choice scoring.

CE mitigates a model's biased tendencies, such as consistently favoring option A or giving entirely different answers across repeated queries. When evaluating a multiple-choice question, CE performs several passes: after each question-answer interaction, the options are reordered "circularly" (for instance, ABCD becomes BCDA), and a question is counted as correct only if every pass is answered correctly. MathBench uses CE-4, i.e. each question is evaluated over four rotations.
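As a concrete illustration of CE-4, here is a minimal sketch of how a single four-option question could be scored. It is not the MathBench/OpenCompass implementation; `ask_model` is a hypothetical callable standing in for whatever inference interface you use.

```python
# Minimal sketch of Circular Evaluation (CE-4) for one multiple-choice question.
# `ask_model(question, options)` is a placeholder that should return the letter
# of the option the model picks ("A".."D"); it is not part of MathBench itself.
from typing import Callable, Sequence

def circular_eval(question: str,
                  options: Sequence[str],          # options in the original A, B, C, D order
                  correct_index: int,              # index of the correct option in `options`
                  ask_model: Callable[[str, Sequence[str]], str]) -> bool:
    """Return True only if the model answers correctly under every rotation."""
    n = len(options)
    for shift in range(n):                         # rotations: ABCD, BCDA, CDAB, DABC
        rotated = [options[(i + shift) % n] for i in range(n)]
        # After rotating by `shift`, the originally-correct option sits at this letter:
        gold = chr(ord("A") + (correct_index - shift) % n)
        if ask_model(question, rotated).strip().upper() != gold:
            return False                           # one miss fails the whole question under CE
    return True
```

Standard accuracy, by contrast, does not require a question to be answered correctly under every rotation.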

🚀 What's New

🌲Dataset Structure

<div align="center"> <img src="https://github.com/hpcaitech/ColossalAI/assets/28834990/866e88d6-4d4f-4e19-aadb-bcb047fffe76" width="800"/> </div>

📒Model Performance

We use a zero-shot CoT setting for multiple-choice questions and a few-shot (8-shot) CoT setting for all textual questions. Results are reported with both standard Accuracy and Circular Evaluation (CE) metrics.

Below are the CE results on MathBench.

MathBench-A demonstrates the performance of the model on application problems.

| Models | Arith | Primary | Middle | High | College | Avg. |
|---|---|---|---|---|---|---|
| **Closed-source Models** | | | | | | |
| GPT-3.5-Turbo-0125 | 72.7 | 72.3 | 27.3 | 18.3 | 14.3 | 41.0 |
| GLM4 | 61.7 | 80.0 | 55.7 | 38.7 | 20.7 | 51.3 |
| GPT-4-0125-Preview | 76.0 | 82.3 | 59.0 | 41.3 | 35.3 | 58.8 |
| Qwen-Max-0428 | 72.3 | 86.3 | 65.0 | 45.0 | 27.3 | 59.2 |
| DeepSeek-V2-API | 82.7 | 89.3 | 59.0 | 39.3 | 29.3 | 59.9 |
| Claude-3-Opus | 85.7 | 85.0 | 58.0 | 42.7 | 43.7 | 63.0 |
| GPT-4o-2024-05-13 | 77.7 | 87.7 | 76.3 | 59.0 | 54.0 | 70.9 |
| **Open-source Chat Models** | | | | | | |
| ***~7B*** | | | | | | |
| Yi-6B-Chat | 35.3 | 36.3 | 7.0 | 3.0 | 4.3 | 17.2 |
| ChatGLM3-6B | 38.0 | 41.0 | 13.7 | 5.3 | 1.7 | 19.9 |
| DeepSeek-7B-Chat | 48.3 | 47.7 | 8.7 | 4.3 | 2.7 | 22.3 |
| Qwen-7B-Chat | 50.7 | 50.7 | 22.0 | 9.3 | 6.0 | 27.7 |
| InternLM2-Chat-7B | 52.0 | 66.3 | 30.0 | 13.7 | 8.7 | 34.1 |
| Llama-3-8B-Instruct | 54.7 | 71.0 | 25.0 | 19.0 | 14.0 | 36.7 |
| GLM-4-9B-Chat | 55.0 | 69.0 | 54.0 | 40.3 | 20.0 | 47.7 |
| Yi-1.5-9B-Chat | 72.7 | 80.7 | 45.3 | 34.3 | 22.7 | 51.1 |
| Qwen2-7B-Instruct | 69.7 | 80.7 | 53.3 | 38.3 | 25.0 | 53.4 |
| ***10~34B*** | | | | | | |
| Baichuan2-13B-Chat | 40.0 | 44.7 | 13.7 | 4.7 | 1.7 | 20.9 |
| Yi-34B-Chat | 50.7 | 62.0 | 23.0 | 14.7 | 7.7 | 31.6 |
| Qwen-14B-Chat | 63.7 | 61.7 | 39.0 | 21.0 | 12.0 | 39.5 |
| InternLM2-Chat-20B | 62.3 | 72.7 | 37.7 | 24.7 | 13.0 | 42.1 |
| Yi-1.5-34B-Chat | 69.7 | 82.3 | 50.0 | 35.3 | 23.7 | 52.2 |
| ***~70B*** | | | | | | |
| DeepSeek-67B-Chat | 62.0 | 72.7 | 33.3 | 21.3 | 12.0 | 40.3 |
| Qwen-72B-Chat | 72.0 | 71.7 | 53.7 | 32.0 | 19.0 | 49.7 |
| Llama-3-70B-Instruct | 70.3 | 86.0 | 53.0 | 38.7 | 34.0 | 56.4 |
| Qwen1.5-110B-Chat | 70.3 | 82.3 | 64.0 | 47.3 | 28.0 | 58.4 |
| Qwen2-72B-Instruct | 76.3 | 89.0 | 71.7 | 51.7 | 46.3 | 67.0 |
| **Mathematical Models** | | | | | | |
| MammoTH-7B | 27.0 | 24.3 | 2.7 | 1.7 | 0.7 | 11.3 |
| MammoTH-13B | 35.0 | 43.0 | 5.0 | 4.7 | 5.0 | 18.5 |
| MammoTH-70B | 35.7 | 60.0 | 11.0 | 10.7 | 6.0 | 24.7 |
| Metamath-Llemma-7B | 51.7 | 51.0 | 8.3 | 8.3 | 5.0 | 24.9 |
| InternLM2-Chat-Math-7B | 53.7 | 67.0 | 41.3 | 18.3 | 8.0 | 37.7 |
| DeepSeek-Math-7B-Instruct | 61.0 | 74.0 | 30.3 | 24.7 | 14.3 | 40.9 |
| InternLM2-Chat-Math-20B | 58.7 | 70.0 | 43.7 | 24.7 | 12.7 | 41.9 |
| DeepSeek-Math-7B-RL | <u>68.0</u> | <u>83.3</u> | <u>44.3</u> | <u>33.0</u> | <u>23.0</u> | <u>50.3</u> |

MathBench-T demonstrates the performance of the model on theoretical problems.

| Models | Primary | Middle | High | College | Avg. |
|---|---|---|---|---|---|
| **Closed-source Models** | | | | | |
| GPT-3.5-Turbo-0125 | 70.1 | 56.7 | 47.3 | 52.5 | 56.7 |
| GLM4 | 88.6 | 79.5 | 63.7 | 60.6 | 73.1 |
| GPT-4-0125-Preview | 87.2 | 81.0 | 72.0 | 73.3 | 78.4 |
| Claude-3-Opus | 86.0 | 79.0 | 72.6 | 77.4 | 78.7 |
| DeepSeek-V2-API | 88.9 | 83.7 | 70.3 | 76.3 | 79.8 |
| Qwen-Max-0428 | 90.4 | 83.2 | 73.4 | 74.8 | 80.4 |
| GPT-4o-2024-05-13 | 92.2 | 88.3 | 82.0 | 85.6 | 87.0 |
| **Open-source Chat Models** | | | | | |
| ***~7B*** | | | | | |
| DeepSeek-7B-Chat | 33.3 | 26.0 | 14.4 | 13.6 | 21.8 |
| ChatGLM3-6B | 41.6 | 32.4 | 20.2 | 12.0 | 26.6 |
| Yi-6B-Chat | 48.0 | 33.5 | 21.8 | 23.9 | 31.8 |
| Qwen-7B-Chat | 53.1 | 43.5 | 32.9 | 31.2 | 40.2 |
| GLM-4-9B-Chat | 85.0 | 78.0 | 65.8 | 71.4 | 75.0 |
| Llama-3-8B-Instruct | 60.2 | 51.3 | 43.5 | 53.6 | 52.1 |
| InternLM2-Chat-7B | 67.3 | 55.8 | 45.4 | 42.7 | 52.8 |
| Yi-1.5-9B-Chat | 81.6 | 74.0 | 62.7 | 69.8 | 72.0 |
| Qwen2-7B-Instruct | 89.4 | 82.9 | 71.2 | 70.1 | 78.4 |
| ***10~34B*** | | | | | |
| Baichuan2-13B-Chat | 45.4 | 36.9 | 24.1 | 21.0 | 31.9 |
| InternLM2-Chat-20B | 64.5 | 56.2 | 49.9 | 43.2 | 53.4 |
| Yi-34B-Chat | 70.9 | 57.0 | 43.6 | 46.8 | 54.6 |
| Qwen-14B-Chat | 71.6 | 64.0 | 49.7 | 49.4 | 58.7 |
| Yi-1.5-34B-Chat | 86.3 | 80.8 | 69.3 | 73.2 | 77.4 |
| ***~70B*** | | | | | |
| DeepSeek-67B-Chat | 78.1 | 65.7 | 55.6 | 64.6 | 66.0 |
| Llama-3-70B-Instruct | 71.4 | 64.3 | 62.1 | 71.2 | 67.2 |
| Qwen-72B-Chat | 90.9 | 80.9 | 67.1 | 69.8 | 77.2 |
| Qwen-1.5-110B-Chat | 93.4 | 85.0 | 76.5 | 81.5 | 84.1 |
| Qwen2-72B-Instruct | 93.4 | 89.9 | 84.4 | 87.7 | 88.8 |
| **Mathematical Models** | | | | | |
| MammoTH-7B | 11.6 | 9.1 | 8.4 | 6.3 | 8.8 |
| MammoTH-13B | 27.5 | 18.6 | 15.0 | 17.1 | 19.5 |
| MetaMath-Llemma-7B | 36.6 | 33.5 | 28.8 | 25.9 | 31.2 |
| MammoTH-70B | 58.1 | 47.1 | 39.3 | 44.6 | 47.3 |
| InternLM2-Chat-Math-7B | 65.6 | 60.2 | 51.7 | 46.5 | 56.0 |
| DeepSeek-Math-7B-Instruct | 73.3 | 58.4 | 49.3 | 50.3 | 57.8 |
| InternLM2-Chat-Math-20B | 73.2 | 70.5 | 60.6 | 53.0 | 64.3 |
| DeepSeek-Math-7B-RL | <u>79.6</u> | <u>72.0</u> | <u>61.3</u> | <u>68.7</u> | <u>70.4</u> |

🔊Average Application Scores with Stages

Models exhibit similar performance in the Arithmetic and Primary stages, but show a clear decline from the Primary stage through to the College stage.

<div align="center"> <img src="https://github.com/open-compass/MathBench/assets/28834990/f7d83014-f4c1-45d5-bf3b-386c95c032f9" width="800"/> </div>


📊Model Size vs. Average Score

The chart below compares model parameter size with average MathBench score for selected representative models; models from the same series are connected by lines of the same color. The horizontal red dotted line marks the score of GPT-4-0125-Preview.

<div align="center"> <img src="https://github.com/open-compass/opencompass/assets/28834990/f00ec39b-5c8f-4990-82fc-7fca826c3c64" width="800"/> </div>

📈Bilingual Performance

Below is a comparison of the bilingual (English and Chinese) results of a range of chat models on MathBench, sorted in ascending order of their average score across the two languages.

🖋Inference MathBench with OpenCompass

OpenCompass is a toolkit for evaluating the performance of large language models (LLMs). Follow these steps to run MathBench inference with OpenCompass:

1. Install OpenCompass:

   ```bash
   conda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
   conda activate opencompass
   git clone https://github.com/open-compass/opencompass opencompass
   cd opencompass
   pip install -e .
   ```

2. Prepare the dataset: download the data from the release file and copy it into the `data/` folder (a quick check of the unpacked layout is sketched after this list):

   ```bash
   # Download the dataset from the release file and copy it to the data/ folder
   mkdir data
   cp -rf mathbench_v1 ./data/
   ```

3. Run inference on MathBench:

   ```bash
   # Inference MathBench with the hf_llama2_7b_chat model
   python run.py --models hf_llama2_7b_chat --datasets mathbench_gen
   ```
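Before running inference, it can help to confirm that the data landed where OpenCompass expects it. The short sketch below (referenced in step 2) only assumes the release unpacks to JSON Lines files under `data/mathbench_v1/`; it makes no assumption about field names and is not part of the OpenCompass API.

```python
# Quick sanity check of the unpacked MathBench data (a sketch, not part of OpenCompass).
# Assumes the release contains *.jsonl files under data/mathbench_v1/;
# adjust the glob if your copy uses a different layout.
import json
from pathlib import Path

root = Path("data/mathbench_v1")
for path in sorted(root.rglob("*.jsonl")):
    with path.open(encoding="utf-8") as f:
        record = json.loads(f.readline())   # peek at the first record only
    print(f"{path.relative_to(root)}: fields = {sorted(record)}")
```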

You can also evaluate HuggingFace models directly from the command line:

```bash
# --hf-path:          HuggingFace model path
# --model-kwargs:     arguments for model construction
# --tokenizer-kwargs: arguments for tokenizer construction
# --max-seq-len:      maximum sequence length the model can accept
# --batch-size:       batch size
# --no-batch-padding: disable batch padding and infer sample by sample to avoid performance loss
# --num-gpus:         minimum number of required GPUs
# --summarizer:       summarizer for MathBench-A and MathBench-T
python run.py --datasets mathbench_gen \
    --hf-path meta-llama/Llama-2-7b-chat-hf \
    --model-kwargs device_map='auto' \
    --tokenizer-kwargs padding_side='left' truncation='left' use_fast=False \
    --max-seq-len 2048 \
    --batch-size 8 \
    --no-batch-padding \
    --num-gpus 1 \
    --summarizer summarizers.groups.mathbench_v1_2024
```

If you want bilingual results for MathBench-A&T, replace `summarizers.groups.mathbench_v1_2024` with `summarizers.groups.mathbench_v1_2024_lang`. To access detailed results for each sub-dataset, use `summarizers.mathbench_v1`.

You can pass the `-r` flag to reuse existing predictions and display different results when changing the summarizer.

Citation and Tech Report

If you use MathBench in your research, please cite the following paper:

@misc{liu2024mathbench,
      title={MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark}, 
      author={Hongwei Liu and Zilong Zheng and Yuxuan Qiao and Haodong Duan and Zhiwei Fei and Fengzhe Zhou and Wenwei Zhang and Songyang Zhang and Dahua Lin and Kai Chen},
      year={2024},
      eprint={2405.12209},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}