math401-llm

Source code and datasets for the paper "How well do Large Language Models perform in Arithmetic tasks?"

[Figure: Main]

Full evaluation of models of all sizes.

[Figure: Full]

Dataset

MATH 401 = 1 Euler equation + 16 groups * 25 problems (1 + 400 = 401 problems in total)

Metric

Accuracy

If the absolute difference between the decoded number and the target number is less than $10^{-3}$, we consider it a correct prediction. Accuracy is the number of correct predictions divided by the total number of problems.
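The accuracy rule above can be sketched as follows. This is a minimal illustration, not the repository's evaluation script: the function names `is_correct` and `accuracy` are hypothetical, and representing "no number decoded" as `None` (counted as incorrect) is an assumption.

```python
def is_correct(decoded, target, tol=1e-3):
    """True when a number was decoded and it lies within tol of the target.

    Assumption: a missing number is passed as None and counts as incorrect.
    """
    if decoded is None:
        return False
    return abs(decoded - target) < tol


def accuracy(decoded_values, targets, tol=1e-3):
    """Fraction of problems whose decoded number is within tol of the target."""
    if len(decoded_values) != len(targets):
        raise ValueError("decoded_values and targets must have the same length")
    if not targets:
        raise ValueError("at least one problem is required")
    correct = sum(is_correct(d, t, tol) for d, t in zip(decoded_values, targets))
    return correct / len(targets)


if __name__ == "__main__":
    # Two of four predictions fall within the tolerance.
    print(accuracy([2.0, 3.0005, None, 7.0], [2.0, 3.0, 5.0, 8.0]))
```

The tolerance is an absolute one, so it is applied identically to small and large targets.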

Relative error

We denote the decoded number by $\hat{y}$ and the target by $y$. The relative error is calculated as:

$RE = \min(10, \frac{|\hat{y}-y|}{\max(|y|, 1)})$

If the LLM does not decode any number, we set $RE=10$. We cap the relative error at 10 to prevent a single large mistake from dominating the average relative error.
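A minimal sketch of the relative error formula above, assuming a missing number is passed as `None`. The names `relative_error` and `mean_relative_error` are hypothetical and are not taken from the repository.

```python
def relative_error(decoded, target, cap=10.0):
    """RE = min(cap, |decoded - target| / max(|target|, 1)).

    Assumption: decoded is None when the model produced no number,
    in which case the error is set to the cap.
    """
    if decoded is None:
        return cap
    return min(cap, abs(decoded - target) / max(abs(target), 1.0))


def mean_relative_error(decoded_values, targets, cap=10.0):
    """Average of the capped relative errors over all problems."""
    if len(decoded_values) != len(targets):
        raise ValueError("decoded_values and targets must have the same length")
    if not targets:
        raise ValueError("at least one problem is required")
    errors = [relative_error(d, t, cap) for d, t in zip(decoded_values, targets)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    print(relative_error(110.0, 100.0))  # error relative to |target| = 100
    print(relative_error(0.5, 0.0))      # denominator floors at 1 for small targets
    print(relative_error(1000.0, 2.0))   # large miss is capped
    print(relative_error(None, 5.0))     # no number decoded
```

The $\max(|y|, 1)$ denominator keeps the error finite when the target is zero or close to it, and the cap bounds the contribution of any single problem to the average.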

Non-number ratio

If the decoded content does not contain any numbers, we consider it a failure. The non-number ratio is the fraction of outputs that fail in this way.
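The non-number ratio can be sketched as below. The text does not say how numbers are detected in the decoded content, so the regular expression `NUMBER_PATTERN` is an assumption chosen for illustration, as are the function names.

```python
import re

# Assumed pattern: optional sign, digits with an optional decimal part
# (or a bare fractional part), and an optional exponent.
NUMBER_PATTERN = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def contains_number(text):
    """True when the decoded text contains at least one number."""
    return NUMBER_PATTERN.search(text) is not None


def non_number_ratio(outputs):
    """Fraction of decoded outputs that contain no number at all."""
    if not outputs:
        raise ValueError("at least one output is required")
    failures = sum(not contains_number(text) for text in outputs)
    return failures / len(outputs)


if __name__ == "__main__":
    samples = ["The answer is 42.", "I cannot compute that.", "-3.5e2", "N/A"]
    print(non_number_ratio(samples))
```

Under the metric definitions above, every output counted here as a failure would also receive $RE=10$.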

Citation

@misc{yuan2023large,
      title={How well do Large Language Models perform in Arithmetic tasks?}, 
      author={Zheng Yuan and Hongyi Yuan and Chuanqi Tan and Wei Wang and Songfang Huang},
      year={2023},
      eprint={2304.02015},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}