LLM Math Evaluation Harness

A unified, precise, and extensible toolkit to benchmark LLMs on various mathematical tasks 🧮✨.

🔴🚀 Important Notice: We have observed score differences above 5% when the same models are evaluated with different math evaluation frameworks. To ensure fair and standardized comparisons across research, this toolkit harmonizes evaluation methods to promote consistent and reliable math evaluation.

🌟 In Practice: Esteemed projects like ToRA (ICLR'24) and DeepSeek-Coder have leveraged this suite!

Features:

🚀 Getting Started

⚙️ Environment Setup

Option 1: Conda

conda create -n math_eval python=3.10
conda activate math_eval

Option 2: Docker

We suggest using the vLLM Docker image directly:

docker run --network host --cap-add=SYS_ADMIN --privileged -d \
    --entrypoint '' --name vllm \
    --runtime nvidia --gpus all \
    --security-opt apparmor:unconfined \
    --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /mnt:/mnt \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    sleep infinity
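
Once the container is up, open a shell inside it and run the install steps from the next section there. The container name vllm comes from the docker run command above; adjust the working directory to wherever you cloned the repository under the mounted path.

# Open an interactive shell in the running vLLM container started above.
docker exec -it vllm bash
# Inside the container, follow the Install steps below (git clone + pip install).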

Install

git clone https://github.com/ZubinGou/math-evaluation-harness.git
cd math-evaluation-harness
pip install -r requirements.txt
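
Optionally, sanity-check the setup before launching an evaluation. The commands below assume an NVIDIA GPU with drivers installed and that requirements.txt pulls in PyTorch; adjust the import if your pinned dependencies differ.

# Confirm the GPU is visible to the driver.
nvidia-smi
# Confirm PyTorch sees CUDA (assumes torch is installed via requirements.txt).
python -c "import torch; print(torch.cuda.is_available())"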

⚖️ Evaluation

  1. Configure model and data settings in scripts/run_math_eval.sh, and set the PROMPT_TYPE variable accordingly:
    • For base models, choose from: direct, cot, pal, or tool-integrated.
    • For SFT models, your options include: tora, wizard_zs, deepseek-math, etc.
      • To add new models, update the construct_prompt function in utils.py to include your new prompt template.
  2. Run the script:
bash scripts/run_eval.sh $PROMPT_TYPE $MODEL_NAME_OR_PATH
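
For example, to evaluate a base model with chain-of-thought prompting (the model ID below is only an illustration; substitute any local checkpoint or Hugging Face model):

# Illustrative invocation; swap in your own prompt type and model path.
PROMPT_TYPE=cot
MODEL_NAME_OR_PATH=meta-llama/Llama-2-7b-hf
bash scripts/run_eval.sh $PROMPT_TYPE $MODEL_NAME_OR_PATH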

📊 Results

Base Models (CoT)

PROMPT_TYPE=cot

| Model | Size | Data | Uniq. Token | Train Token | GSM8K | MATH¹ | SVAMP | ASDiv | MAWPS | TAB² | MQA | MMLU STEM | SAT | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1-2B Base Models |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Tinyllama | 1.1B | - | - | - | 2.9 | 3.2 | 11.0 | 18.1 | 20.4 | 12.5 | 14.6 | 16.1 | 21.9 | 13.4 |
| Phi-1.5 | 1.3B | - | - | - | 32.4 | 4.2 | 43.4 | 53.1 | 66.2 | 24.4 | 14.3 | 21.8 | 18.8 | 31.0 |
| Qwen1.5 | 1.8B | - | - | - | 36.1 | 6.8 | 48.5 | 63.6 | 79.0 | 29.2 | 25.1 | 31.3 | 40.6 | 40.0 |
| Gemma | 2.0B | - | - | - | 18.8 | 11.4 | 38.0 | 56.6 | 72.5 | 36.9 | 26.8 | 34.4 | 50.0 | 38.4 |
| DeepSeekLLM | 1.3B | OWM | 14B | 150B | 11.5 | 8.9 | - | - | - | - | - | 29.6 | 31.3 | - |
| DeepSeekMath | 1.3B | - | 120B | 150B | 23.8 | 13.6 | - | - | - | - | - | 33.1 | 56.3 | - |
| Rho-Math | 1.1B | OWM | 14B | 30B | 36.2 | 15.6 | 52.1 | 67.0 | 83.9 | 29.0 | 32.5 | 23.3 | 28.1 | 40.9 |
| >= 7B Base Models |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| LLaMA-2 | 7B | - | - | - | 14.0 | 3.6 | 39.5 | 51.7 | 63.5 | 30.9 | 12.4 | 32.7 | 34.4 | 31.4 |
| Mistral | 7B | - | - | - | 41.2 | 11.6 | 64.7 | 68.5 | 87.5 | 52.9 | 33.0 | 49.5 | 59.4 | 52.0 |
| Minerva | 8B | - | 39B | 164B | 16.2 | 14.1 | - | - | - | - | - | 35.6 | - | - |
| Minerva | 62B | - | 39B | 109B | 52.4 | 27.6 | - | - | - | - | - | 53.9 | - | - |
| Minerva | 540B | - | 39B | 26B | 58.8 | 33.6 | - | - | - | - | - | 63.9 | - | - |
| LLemma | 7B | PPile | 55B | 200B | 38.8 | 17.2 | 56.1 | 69.1 | 82.4 | 48.7 | 41.0 | 45.4 | 59.4 | 50.9 |
| LLemma | 34B | PPile | 55B | 50B | 54.2 | 23.0 | 67.9 | 75.7 | 90.1 | 57.0 | 49.8 | 54.7 | 68.8 | 60.1 |
| Intern-Math | 7B | - | 31B | 125B | 41.8 | 14.4 | 61.6 | 66.8 | 83.7 | 50.0 | 57.3 | 24.8 | 37.5 | 48.7 |
| Intern-Math | 20B | - | 31B | 125B | 65.4 | 30.0 | 75.7 | 79.3 | 94.0 | 50.9 | 38.5 | 53.1 | 71.9 | 62.1 |
| DeepSeekMath | 7B | - | 120B | 500B | 64.1 | 34.2 | 74.0 | 83.9 | 92.4 | 63.4 | 62.4 | 56.4 | 84.4 | 68.4 |
| Rho-Math | 7B | OWM | 14B | 10.5B | 66.9 | 31.0 | 77.8 | 79.0 | 93.9 | 49.9 | 58.7 | 54.6 | 84.4 | 66.2 |

SFT Models (Code Interpreter)

PROMPT_TYPE=tora

| Model | Size | SFT Data | GSM8k | MATH | SVAMP | ASDiv | MAWPS | TAB | GSM-Hard | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT4-early (PAL) | - | - | 94.2 | 51.8 | 94.8 | 92.6 | 97.7 | 95.9 | 77.6 | 86.4 |
| MAmmoTH | 70B | MI-260k | 76.9 | 41.8 | 82.4 | - | - | - | - | - |
| ToRA | 7B | ToRA-69k | 68.8 | 40.1 | 68.2 | 73.9 | 88.8 | 42.4 | 54.6 | 62.4 |
| ToRA | 70B | ToRA-69k | 84.3 | 49.7 | 82.7 | 86.8 | 93.8 | 74.0 | 67.2 | 76.9 |
| DeepSeekMath | 7B | ToRA-69k | 79.8 | 52.0 | 80.1 | 87.1 | 93.8 | 85.8 | 63.1 | 77.4 |
| Rho-Math | 1B | ToRA-69k | 59.4 | 40.6 | 60.7 | 74.2 | 88.6 | 26.7 | 48.1 | 56.9 |
| Rho-Math | 7B | ToRA-69k | 81.3 | 51.8 | 80.8 | 85.5 | 94.5 | 70.1 | 63.1 | 75.3 |

SFT Models (CoT)

PROMPT_TYPE=deepseek-math

| Size | Model | GSM8k | MATH | SVAMP | ASDiv | MAWPS | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 7B | DeepSeek-Math-Instruct | 82.4 | 45.8 | 83.5 | 90.1 | 95.7 | 79.5 |
| 7B | DeepSeek-Math-RL | 88.3 | 50.0 | 87.2 | 92.0 | 95.5 | 82.6 |

🍀 Contributing

This project is still under active development. We welcome any contributions, including bug reports, feature requests, and pull requests.

☕️ References

Footnotes

  1. We suggest using the OpenAI test subset for evaluating MATH performance, since the original MATH test set has already been included in public training sets such as PRM800K. We use the minerva_math prompt.

  2. Abbreviations: TAB = tabmwp, MQA = mathqa, SAT = sat_math.