ZeroEval: A Unified Framework for Evaluating Language Models

ZeroEval is a simple, unified framework for evaluating (large) language models on various tasks. This repository aims to evaluate instruction-tuned LLMs for their zero-shot performance on reasoning tasks such as MMLU and GSM. We evaluate LLMs with a unified setup by controlling factors such as prompting, sampling, and output parsing. In ZeroEval, we perform zero-shot prompting and instruct the LM to output both its reasoning and its final answer in a JSON-formatted output. We are actively adding new tasks. Contributions are welcome!
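
For illustration, the JSON output that models are prompted to produce looks roughly like the sketch below. The exact field names and prompt wording are defined by the prompts in this repository and may differ; this example only assumes a `reasoning` and an `answer` field.

```json
{
    "reasoning": "12 apples minus 5 apples leaves 7 apples.",
    "answer": "7"
}
```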

Todo

Installation

<details> <summary> Click to expand </summary>
conda create -n zeroeval python=3.10
conda activate zeroeval
pip install vllm -U  # or install vLLM from a local source checkout with `pip install -e vllm`
pip install -r requirements.txt
# export HF_HOME=/path/to/your/custom/cache_dir/  # optional: use a custom Hugging Face cache directory
</details>
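
If you plan to run `zero_eval_api.sh`, you also need credentials for the API backend selected with `-f`. The environment-variable names below follow the standard OpenAI/Anthropic SDK conventions and are an assumption about this repository's setup, not something documented here:

```bash
# Assumption: the API engines read the standard SDK environment variables
export OPENAI_API_KEY=your_key_here       # for -f openai
export ANTHROPIC_API_KEY=your_key_here    # for -f anthropic
```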

Tasks

The currently supported tasks (dataset names used with the `-d` argument; see `src/task_configs.py` for the full list):

- MMLU-Redux (`-d mmlu-redux`)
- GSM (`-d gsm`)
- MATH (Level 5) (`-d math-l5`)
- ZebraLogic (`-d zebra-grid`)
- CRUX (`-d crux`)

<!-- - AlpacaEval (`-d alpaca-eval`) -->

Usage

`zero_eval_local.sh` (for models run locally via vLLM or Hugging Face) and `zero_eval_api.sh` (for API-based models such as OpenAI and Anthropic) are the two main scripts for running the evaluation. For example:
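
The commands below are illustrative sketches: the local model name is an example Hugging Face model ID, and the API command assumes your OpenAI credentials are already set in the environment; adjust `-d`, `-m`, and `-p` to your own setup.

```bash
# Local evaluation with vLLM (example model ID; replace with the model you want to test)
bash zero_eval_local.sh -d gsm -m meta-llama/Meta-Llama-3-8B-Instruct -p Llama-3-8B-Instruct -s 1

# API-based evaluation via OpenAI (model ID taken from the examples elsewhere in this README)
bash zero_eval_api.sh -d mmlu-redux -f openai -m openai/gpt-4o-mini-2024-07-18 -p gpt-4o-mini-2024-07-18 -s 1
```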

Examples

More examples can be found in the `scripts` folder, e.g., `scripts/_MMLU_redux.md` and `scripts/_GSM.md`, as well as `scripts/local/crux.sh`.

Arguments

<details> <summary>Command Line Arguments</summary>

| Arguments | Description | Default |
| --------- | ----------- | ------- |
| `-d` | `DATA_NAME`: `mmlu-redux`, `gsm`, `math-l5`, `zebra-grid`, `alpaca_eval`, ... (see `src/task_configs.py`) | |
| `-m` | `model_name` | |
| `-p` | `model_pretty_name` | |
| `-s` | number of shards (with `-s 1`, all of your GPUs are used to load the model and run the inference; with `-s K`, K GPUs are used and the data is divided into K shards, each GPU runs inference on a single shard, and the results are merged at the end) | 1 |
| `-f` | engine (`vllm` by default for `zero_eval_local.sh`, can be changed to `hf`; for `zero_eval_api.sh`, it can be `openai`, `anthropic`, ...) | `vllm`/`openai` for `zero_eval_local/api.sh` |
| `-r` | `run_name` (when specified, the results are saved in a sub-folder named after the run_name) | `"default"` |
| `-t` | temperature | 0 (greedy decoding) |
| `-o` | `top_p` for nucleus sampling | 1.0 |
| `-e` | repetition penalty | 1.0 |
| `-b` | batch size | 4 |
| `-x` | `max_length` | 4096 |
</details>
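
To make the sharding and sampling arguments concrete, here are two sketches: the local model ID is illustrative, and the API command mirrors the commented examples at the bottom of this README.

```bash
# Split the data into 4 shards and run them on 4 GPUs in parallel, merging results at the end
bash zero_eval_local.sh -d gsm -m meta-llama/Meta-Llama-3-8B-Instruct -p Llama-3-8B-Instruct -s 4

# Sample with temperature 0.5 instead of greedy decoding and save results under the run_name "sampling"
bash zero_eval_api.sh -d zebra-grid -f openai -m openai/gpt-4o-mini-2024-07-18 -p gpt-4o-mini-2024-07-18 -s 1 -r "sampling" -t 0.5
```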

Results

🚨 View results on our Leaderboard: https://hf.co/spaces/allenai/ZeroEval
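
To score finished runs locally rather than on the leaderboard, the repository includes per-task evaluation scripts under `src/evaluation/`. The commands below mirror the commented examples kept in this README; exact arguments may differ across versions.

```bash
# Score individual tasks (the dataset name is passed as an argument where the script expects one)
python src/evaluation/mcqa_eval.py mmlu-redux
python src/evaluation/math_eval.py gsm
python src/evaluation/math_eval.py math-l5
python src/evaluation/zebra_grid_eval.py
python src/evaluation/crux_eval.py

# Aggregate the results into a summary
python src/evaluation/summarize.py
```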

<!--
python src/evaluation/mcqa_eval.py mmlu-redux
python src/evaluation/math_eval.py math-l5
python src/evaluation/zebra_grid_eval.py
python src/evaluation/crux_eval.py
python src/evaluation/summarize.py
python src/evaluation/math_eval.py gsm
-->
<!-- ### Changelogs
- 08/02/2024: added Gemini 1.5 Pro Exp 0801 and CRUX results
- 07/31/2024: added Meta-Llama-3.1-70B-Instruct and gemma-2-2b-it
- 07/29/2024: added Llama-3.1-8B, Mistral-Large-2, and deepseek-coder-v2-0724
-->

Citation

If you find ZeroEval useful, please cite it in your publications as follows:

@software{Lin_ZeroEval_A_Unified_2024,
    author = {Lin, Bill Yuchen},
    month = jul,
    title = {{ZeroEval: A Unified Framework for Evaluating Language Models}},
    url = {https://github.com/WildEval/ZeroEval},
    year = {2024}
}

Star History

Star History Chart

<!--
bash zero_eval_api.sh -f openai -d zebra-grid -m openai/o1-mini-2024-09-12 -p o1-mini-2024-09-12-v2 -s 4
wait
bash zero_eval_api.sh -f openai -d zebra-grid -m openai/o1-preview-2024-09-12 -p o1-preview-2024-09-12-v2 -s 4
wait
bash zero_eval_api.sh -d zebra-grid -f openai -m openai/gpt-4o-mini-2024-07-18 -p gpt-4o-mini-2024-07-18 -s 1 -n 32 -r "sampling" -t 0.5
-->