# Llemma evaluation harness
Fork of the Eleuther LM Evaluation Harness used in Azerbayev et al. (2023).
## Running the evaluation
See `eval_scripts/generic_run.sh` for an entrypoint to running evaluation on a model from the HuggingFace Hub.
The script shows the set of non-theorem-proving tasks:

```bash
SYMBOLIC=minerva_math*,gsm8k,ocw_courses
MUL_CHOICE=minerva-hendrycksTest*,math_sat_cot
TOOLS=sympy_math*,python_gsm8k
```
Refer to the `lm_eval/tasks` directory for their associated implementations.
### Theorem proving task
The informal-to-formal theorem proving task is kept in the `minif2f-isabelle` branch. Please see the README on that branch for further instructions.
## Additions
This Llemma evaluation harness implemented several extensions of the Eleuther LM Evaluation Harness at the time of development. Note that some of these may since have been implemented in the upstream Harness. An incomplete list includes:
- Support for vLLM
- Saving generated sequences and metadata
- Majority voting (see `configs/majk.json` for example usage, and the sketch after this list)
- Temperature and top-p sampling
- Domain-specific evaluation (e.g., SymPy equivalence)
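As a rough illustration of the majority-voting addition, the sketch below shows the maj@K idea in miniature: sample K completions, extract an answer from each, and take the most common one. The helper name `extract_answer` and the answer format here are hypothetical, not the harness's actual implementation.

```python
from collections import Counter

def majority_vote(completions, extract_answer):
    """Pick the most common extracted answer among K sampled completions.

    `completions`: list of K generated strings.
    `extract_answer`: hypothetical task-specific parser for the final answer.
    """
    answers = [extract_answer(c) for c in completions]
    answers = [a for a in answers if a is not None]  # drop unparseable samples
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]

# Toy usage: three of four samples agree on "42".
samples = [
    "... Therefore the answer is 42",
    "... Therefore the answer is 41",
    "... Therefore the answer is 42",
    "... Therefore the answer is 42",
]
print(majority_vote(samples, lambda t: t.split("the answer is")[-1].strip()))  # 42
```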
## Tasks Supported
Below, we detail all evaluations implemented and reported in our paper.
- `math_sat_cot`: A small test set of SAT questions from the May 2023 College Board SAT examination, which occurred after the knowledge cutoff for Llemma's training set. Evaluated via chain-of-thought in natural language.
- `hendrycks_math_ppl`: Perplexity evaluation on reference answers of sub-tasks of the MATH dataset.
- `minif2f_isabelle`: Proof autoformalization in Isabelle on the miniF2F benchmark, based on Draft-Sketch-Prove, with a Portal-to-Isabelle proof checker.
- `minerva_math*`: The MATH benchmark with the prompt and SymPy evaluation from Minerva.
- `minerva-hendrycksTest*`: MMLU-STEM tasks with prompting and chain-of-thought, following Minerva.
- `ocw_courses`: The OCW Courses task from Minerva.
- `python_gsm8k`: GSM8k solved by writing Python programs that return the numeric answer, based on PAL (see the example sketch after this list).
- `sympy_math`: MATH evaluation, with SymPy or the Python `math` module used to write a programmatic solution.
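To make the program-aided tasks concrete, here is a hypothetical example of the kind of program a model is prompted to write for `python_gsm8k` (and, with SymPy, for `sympy_math`). The entry-point name `solution()` and the problem are illustrative only; the exact prompt format used by these tasks may differ.

```python
import sympy

def solution():
    """Hypothetical PAL-style program for a GSM8k-like word problem:
    'A baker bakes 3 batches of 12 cookies and sells 10. How many remain?'"""
    baked = 3 * 12
    remaining = baked - 10
    return remaining

# For MATH-style problems, a program may instead return a symbolic answer, e.g.:
x = sympy.symbols("x")
roots = sympy.solve(sympy.Eq(x**2 - 4, 0), x)

print(solution())  # 26
print(roots)       # [-2, 2]
```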
We additionally implement the following tasks in this fork, though we do not report them in our paper due to time and space limitations:
- `lila_*`: Evaluation on the Lila dataset. Note that this requires executing model-generated code.
- `proofnet*`: Evaluation on the ProofNet dataset for both autoformalization and informalization. Informalization requires GPT-3.5 evaluation and an OpenAI API key.
## Quick Replication Instructions
### Maj@1
To run the model on the desired tasks with one attempt per problem, run the following sample command with your model:
```bash
MODEL=EleutherAI/llemma_7b  # your HF Hub model path here
TASK=minerva_math*  # select tasks as desired; this codebase supports wildcard task names
OUT=</path/to/save/outputs>

python main.py --no_cache --model vllm --model_args pretrained=${MODEL} --tasks $TASK --output_path ${OUT} --tp_degree ${TP_DEGREE}
```
### Maj@K
To replicate Maj@K task results, additionally pass `--description_dict_path configs/majk.json` to run majority voting with K attempts:
```bash
MODEL=EleutherAI/llemma_7b  # your HF Hub model path here
TASK=minerva_math*  # select tasks as desired; this codebase supports wildcard task names
OUT=</path/to/save/outputs>

python main.py --no_cache --model vllm --model_args pretrained=${MODEL} --tasks $TASK --output_path ${OUT} --tp_degree ${TP_DEGREE} --description_dict_path ${HARNESS_DIR}/configs/majk.json
```
`TP_DEGREE` can be set as needed to determine how many GPUs will be used by vLLM. Be sure to set `$OUT` to the desired save location for scores and model output text.
### Answer Checking + Scoring
Due to the heavy CPU burden, we do not calculate metrics during the evaluation run for tasks like `minerva_math` that rely on checking correctness via SymPy equivalence, or for tasks like `sympy_math` or `python_gsm8k` that require executing model-generated Python code.
After running the model on one of these tasks, we provide utilities to perform answer checking. Note that SymPy answer checking can be quite CPU-heavy and time-consuming for Maj@K at high K.
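For intuition about why this step is expensive, the sketch below shows SymPy-based equivalence checking in its simplest form. It is only illustrative: the Minerva-style checker used by these tasks performs substantially more answer normalization (LaTeX stripping, units, and so on) than this.

```python
import sympy
from sympy.parsing.sympy_parser import parse_expr

def sympy_equivalent(predicted: str, reference: str) -> bool:
    """Illustrative check of whether two answer strings are symbolically equal."""
    try:
        diff = parse_expr(predicted) - parse_expr(reference)
        # simplify() can be very slow on complicated expressions, which is why
        # scoring at high K is CPU-intensive.
        return sympy.simplify(diff) == 0
    except Exception:
        # fall back to exact string comparison if parsing fails
        return predicted.strip() == reference.strip()

print(sympy_equivalent("1/2", "0.5"))            # True
print(sympy_equivalent("2*x + 2", "2*(x + 1)"))  # True
print(sympy_equivalent("3", "4"))                # False
```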
:rotating_light: **WARNING:** `unsafe_score_sympy_math.py` and `unsafe_score_python_gsm.py` will execute model-written Python code! Please use them in a sandbox and at your own risk.
:rotating_light: **WARNING:** scoring scripts modify eval-harness output JSONs in place. Back up your results files and use with caution!
To score `sympy_math` outputs, run:

```bash
python unsafe_score_sympy_math.py --output <path-to-results-with-sympy_math>
```
To score `python_gsm8k` outputs, run:

```bash
python unsafe_score_python_gsm.py --output <path-to-results-with-python_gsm8k>
```
These scripts take in a results file, read the LM's generated programs, and execute them to check for correctness. They then incorporate the per-sample and full-task accuracies into the results file and rewrite the entire file with these values added.
All scripts accept a `--limit X` flag to score only the first X documents.
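The general execute-and-check pattern behind these scripts looks roughly like the sketch below. It is a hedged illustration rather than the scripts' actual internals, and the timeout only limits runaway programs; it is not a sandbox, hence the warning above.

```python
import subprocess
import sys
from typing import Optional

def run_generated_program(program: str, timeout_s: int = 10) -> Optional[str]:
    """Run model-written Python in a subprocess and capture its stdout.

    WARNING: this is NOT a security sandbox; run it inside a disposable
    container or VM. Illustrative only, not the repo's exact code.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout.strip() if result.returncode == 0 else None

generated = "def solution():\n    return 3 * 12 - 10\nprint(solution())"
print(run_generated_program(generated) == "26")  # True if the printed answer matches
```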
Due to the high resource cost of scoring MATH with SymPy, `unsafe_score_minerva_math.py` has additional options.
To run SymPy scoring using multiprocessing, run:

```bash
python unsafe_score_minerva_math.py --output <path-to-math-result.json>
```
To run in a single process, run:

```bash
python unsafe_score_minerva_math.py --output <path-to-math-result.json> --no_multiprocessing
```
Additionally, MATH scoring with SymPy is resumable: results and pass rates / accuracies for each document are saved to the results file in place, and by default the script will not rescore already-scored documents.
### Aggregation
Finally, we provide utilities for combining MMLU and MATH scores across subtasks, aggregating at the sample level rather than the subset level.
To aggregate MMLU-STEM scores, run:

```bash
python score_mmlu.py <path-to-mmlu-results.json>
```
To aggregate MATH scores, run:

```bash
python score_math.py <path-to-math-subtask-1.json>,<path-to-math-subtask-2-and-3.json>,...
```
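To make the sample-level vs. subset-level distinction concrete, here is a minimal sketch with made-up counts (the numbers and subtask names are illustrative, not the harness's actual results schema):

```python
# (correct, total) counts per subtask -- made-up numbers for illustration.
subtasks = {
    "minerva_math_algebra":     (250, 500),
    "minerva_math_geometry":    ( 60, 300),
    "minerva_math_precalculus": ( 30, 200),
}

# Subset-level (macro) average: mean of per-subtask accuracies.
macro = sum(c / n for c, n in subtasks.values()) / len(subtasks)

# Sample-level (micro) average: pool every sample before dividing,
# so larger subtasks carry proportionally more weight.
micro = sum(c for c, _ in subtasks.values()) / sum(n for _, n in subtasks.values())

print(f"macro: {macro:.3f}  micro: {micro:.3f}")  # macro: 0.283  micro: 0.340
```

The aggregation scripts report the sample-level figure, per the description above.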
## Citation
Please cite the Llemma paper if you find code from this fork useful to your work:
```bibtex
@article{azerbayev2023llemma,
  title   = {Llemma: An Open Language Model for Mathematics},
  author  = {Azerbayev, Zhangir and Schoelkopf, Hailey and Paster, Keiran and Dos Santos, Marco and McAleer, Stephen and Jiang, Albert Q. and Deng, Jia and Biderman, Stella and Welleck, Sean},
  journal = {arXiv preprint arXiv:2310.10631},
  year    = {2023}
}
```
Please cite the Eleuther LM Evaluation Harness using:
```bibtex
@software{eval-harness,
author = {Gao, Leo and
Tow, Jonathan and
Biderman, Stella and
Black, Sid and
DiPofi, Anthony and
Foster, Charles and
Golding, Laurence and
Hsu, Jeffrey and
McDonell, Kyle and
Muennighoff, Niklas and
Phang, Jason and
Reynolds, Laria and
Tang, Eric and
Thite, Anish and
Wang, Ben and
Wang, Kevin and
Zou, Andy},
title = {A framework for few-shot language model evaluation},
month = sep,
year = 2021,
publisher = {Zenodo},
version = {v0.0.1},
doi = {10.5281/zenodo.5371628},
url = {https://doi.org/10.5281/zenodo.5371628}
}
```