LightEval 🌤️

A lightweight framework for LLM evaluation

Context

LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron.

We're releasing it with the community in the spirit of building in the open.

Note that it is still at a very early stage, so don't expect 100% stability ^^' If you run into problems or have questions, feel free to open an issue!

Installation

Clone the repo:

git clone https://github.com/huggingface/lighteval.git
cd lighteval

Create a virtual environment using virtualenv or conda depending on your preferences. We require Python 3.10 or above:

conda create -n lighteval python=3.10 && conda activate lighteval

Install the dependencies. For the default installation, you just need:

pip install .

If you want to evaluate models with frameworks like accelerate or peft, you will need to specify the optional dependencies group that fits your use case (accelerate,tgi,optimum,quantization,adapters,nanotron,tensorboardX):

pip install '.[optional1,optional2]'

The setup tested most is:

pip install '.[accelerate,quantization,adapters]'

If you want to push your results to the Hugging Face Hub, don't forget to add your access token to the environment variable HF_TOKEN. You can do this by running:

huggingface-cli login

and pasting your access token.

Optional steps

Lastly, if you intend to push to the code base, you'll need to install the pre-commit hook for styling tests:

pip install '.[dev]'
pre-commit install

Usage

We provide two main entry points to evaluate models:

For most users, we recommend using the 🤗 Accelerate backend - see below for specific commands.

Evaluate a model on one or more GPUs (recommended)

To evaluate a model on one or more GPUs, first create a multi-GPU config by running:

accelerate config

You can then evaluate a model using data parallelism as follows:

accelerate launch --multi_gpu --num_processes=<num_gpus> -m \
    lighteval accelerate \
    --model_args="pretrained=<path to model on the hub>" \
    --tasks <task parameters> \
    --output_dir output_dir

Here, --tasks refers to either a comma-separated list of supported tasks from the metadata table in the format:

suite|task|num_few_shot|{0 or 1 to automatically reduce `num_few_shot` if prompt is too long}

or a file path like examples/tasks/recommended_set.txt which specifies multiple task configurations. For example, to evaluate GPT-2 on the TruthfulQA benchmark, run:

accelerate launch --multi_gpu --num_processes=8 -m \
    lighteval accelerate \
    --model_args "pretrained=gpt2" \
    --tasks "lighteval|truthfulqa:mc|0|0" \
    --override_batch_size 1 \
    --output_dir="./evals/"

Here, --override_batch_size defines the batch size per device, so the effective batch size will be override_batch_size x num_gpus. To evaluate on multiple benchmarks, separate each task configuration with a comma, e.g.

accelerate launch --multi_gpu --num_processes=8 -m \
    lighteval accelerate \
    --model_args "pretrained=gpt2" \
    --tasks "leaderboard|truthfulqa:mc|0|0,leaderboard|gsm8k|0|0" \
    --override_batch_size 1 \
    --output_dir="./evals/"

See the examples/tasks/recommended_set.txt file for a list of recommended task configurations.
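To make the task format described above concrete, here is a minimal Python sketch that splits a --tasks string into its four fields. The helper parse_task_specs, the TaskSpec type, and its field names are hypothetical and only illustrate the format; this is not how lighteval parses the argument internally.

```python
from typing import NamedTuple


class TaskSpec(NamedTuple):
    suite: str
    task: str
    num_few_shot: int
    # Hypothetical name for the trailing 0/1 flag that allows lighteval to
    # reduce num_few_shot automatically when the prompt is too long.
    reduce_few_shot: bool


def parse_task_specs(tasks: str) -> list[TaskSpec]:
    """Split a comma-separated --tasks string into suite|task|num_few_shot|flag."""
    specs = []
    for raw in tasks.split(","):
        parts = raw.strip().split("|")
        if len(parts) != 4:
            raise ValueError(f"expected suite|task|num_few_shot|0 or 1, got {raw!r}")
        suite, task, few_shot, flag = parts
        if flag not in ("0", "1"):
            raise ValueError(f"last field must be 0 or 1, got {flag!r}")
        specs.append(TaskSpec(suite, task, int(few_shot), flag == "1"))
    return specs


specs = parse_task_specs("leaderboard|truthfulqa:mc|0|0,leaderboard|gsm8k|0|0")
for spec in specs:
    print(spec)
```

Note that the task name itself may contain a colon (as in truthfulqa:mc), so only the pipe and the comma act as separators in this sketch.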

Evaluating a model with a complex configuration

If you want to evaluate a model by spinning up inference endpoints, using adapter/delta weights, or relying on more complex configuration options, you can load models using a configuration file. This is done as follows:

accelerate launch --multi_gpu --num_processes=<num_gpus> -m \
    lighteval accelerate \
    --model_config_path="<path to your model configuration>" \
    --tasks <task parameters> \
    --output_dir output_dir

You can find the template of the expected model configuration in examples/model_configs/base_model.yaml.

Evaluating a large model with pipeline parallelism

To evaluate models larger than ~40B parameters in 16-bit precision, you will need to shard the model across multiple GPUs to fit it in VRAM. You can do this by passing model_parallel=True and setting --num_processes to the number of processes to use for data parallelism. For example, on a single node of 8 GPUs, you can run:

# PP=2, DP=4 - good for models < 70B params
accelerate launch --multi_gpu --num_processes=4 -m \
    lighteval accelerate \
    --model_args="pretrained=<path to model on the hub>,model_parallel=True" \
    --tasks <task parameters> \
    --output_dir output_dir

# PP=4, DP=2 - good for huge models >= 70B params
accelerate launch --multi_gpu --num_processes=2 -m \
    lighteval accelerate \
    --model_args="pretrained=<path to model on the hub>,model_parallel=True" \
    --tasks <task parameters> \
    --output_dir output_dir
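The relationship between --num_processes and the PP/DP values in the comments above can be sketched in a few lines of Python. The helper parallel_layout is hypothetical, and it assumes that the GPUs of the node are split evenly between the data-parallel replicas, which is what the two examples above suggest.

```python
def parallel_layout(total_gpus: int, num_processes: int) -> tuple[int, int]:
    """Return (pipeline_parallel, data_parallel) degrees for a single node.

    Assumption: each of the num_processes data-parallel replicas shards the
    model over an equal share of the node's GPUs.
    """
    if num_processes < 1 or total_gpus % num_processes != 0:
        raise ValueError("num_processes must evenly divide the number of GPUs")
    return total_gpus // num_processes, num_processes


for procs in (4, 2):
    pp, dp = parallel_layout(total_gpus=8, num_processes=procs)
    print(f"--num_processes={procs}: PP={pp}, DP={dp}")
```

On an 8-GPU node this reproduces the two layouts used in the commands: 4 processes give PP=2 and DP=4, while 2 processes give PP=4 and DP=2.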

Evaluate a model on the Open LLM Leaderboard benchmarks

To evaluate a model on all the benchmarks of the Open LLM Leaderboard using a single node of 8 GPUs, run:

accelerate launch --multi_gpu --num_processes=8 -m \
    lighteval accelerate \
    --model_args "pretrained=<model name>" \
    --tasks examples/tasks/open_llm_leaderboard_tasks.txt \
    --override_batch_size 1 \
    --output_dir="./evals/"

Evaluate a model on CPU

You can also use lighteval to evaluate models on CPU, although note this will typically be very slow for large models. To do so, run:

lighteval accelerate \
    --model_args="pretrained=<path to model on the hub>" \
    --tasks <task parameters> \
    --output_dir output_dir

Evaluate a model on a server/container

An alternative to launching the evaluation locally is to serve the model on a TGI-compatible server/container and then run the evaluation by sending requests to the server. The command is the same as before, except you specify a path to a YAML config file (detailed below):

lighteval accelerate \
    --model_config_path="/path/to/config/file" \
    --tasks <task parameters> \
    --output_dir output_dir

There are two types of configuration files that can be provided for running on the server:

  1. endpoint_model.yaml: This configuration allows you to launch the model using Hugging Face's Inference Endpoints. You can specify all the relevant parameters in the configuration file, and lighteval will then automatically deploy the endpoint, run the evaluation, and finally delete the endpoint (unless you specify an endpoint that was already launched, in which case the endpoint won't be deleted afterwards).

  2. tgi_model.yaml: This configuration lets you specify the URL of a model running in a TGI container, such as one deployed on Hugging Face's serverless inference.

Templates for these configurations can be found in examples/model_configs.

Evaluate a model on extended, community, or custom tasks

Independently of the default tasks provided in lighteval that you will find in the tasks_table.jsonl file, you can use lighteval to evaluate models on tasks that require special processing (or have been added by the community). These tasks have their own evaluation suites and are defined as follows:

For example, to run an extended task like ifeval, you can run:

# --use_chat_template is optional: pass it to run the evaluation with the chat template
lighteval accelerate \
    --model_args "pretrained=HuggingFaceH4/zephyr-7b-beta" \
    --use_chat_template \
    --tasks "extended|ifeval|0|0" \
    --output_dir "./evals"

To run a community or custom task, you can use (note the custom_tasks flag):

lighteval accelerate \
    --model_args="pretrained=<path to model on the hub>" \
    --tasks <task parameters> \
    --custom_tasks <path to your custom or community task> \
    --output_dir output_dir

For example, to launch lighteval on arabic_mmlu:abstract_algebra for HuggingFaceH4/zephyr-7b-beta, run:

# --use_chat_template is optional: pass it to run the evaluation with the chat template
lighteval accelerate \
    --model_args "pretrained=HuggingFaceH4/zephyr-7b-beta" \
    --use_chat_template \
    --tasks "community|arabic_mmlu:abstract_algebra|5|1" \
    --custom_tasks "community_tasks/arabic_evals" \
    --output_dir "./evals"

Using the dummy model

To debug or obtain random baseline scores for a given set of tasks, you can use the dummy model:

lighteval accelerate \
    --model_args "dummy" \
    --tasks <task parameters> \
    --output_dir output_dir

This "model" randomly generates logprobs (for selection/accuracy tasks) and the string "random baseline" for generation tasks. You can also select a specific seed for the random logprob values generated by the dummy model: --model_args "dummy,seed=123".
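To illustrate why random logprobs yield a baseline score, here is a small self-contained Python sketch. It is not the dummy model's actual implementation: the function random_baseline_accuracy and the way the logprobs are drawn are assumptions made for the illustration.

```python
import math
import random


def random_baseline_accuracy(num_docs: int, num_choices: int, seed: int = 123) -> float:
    """Score multiple-choice documents with random logprobs and return the accuracy."""
    rng = random.Random(seed)  # a fixed seed makes the baseline reproducible
    correct = 0
    for _ in range(num_docs):
        # One random log-probability per choice: log of a uniform draw in (0, 1].
        logprobs = [math.log(1.0 - rng.random()) for _ in range(num_choices)]
        prediction = max(range(num_choices), key=logprobs.__getitem__)
        gold_index = 0  # the gold position is arbitrary for a random scorer
        correct += prediction == gold_index
    return correct / num_docs


accuracy = random_baseline_accuracy(num_docs=10_000, num_choices=4)
print(f"random baseline accuracy: {accuracy:.3f}")
```

With four choices the accuracy lands close to 1/4, which is the chance level any real model should beat on such a task.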

Deep thanks

lighteval was originally built on top of the great Eleuther AI Harness (we use the latter to power the Open LLM Leaderboard). We also took a lot of inspiration from the amazing HELM, notably for metrics.

Through adding more and more logging functionalities, and making it compatible with increasingly different workflows and model codebases (including 3D parallelism) as well as allowing custom evaluation experiments, metrics and benchmarks, we ended up needing to change the code more and more deeply until lighteval became the small standalone library that it is now.

However, we are very grateful to the Harness and HELM teams for their continued work on better evaluations.

How to navigate this project

lighteval is supposed to be used as a standalone evaluation library.

Customization

If your new task or metric has requirements, add a specific requirements.txt file with your evaluation.

Adding a new task

To add a new task, first open an issue to determine whether it will be integrated into the core evaluations of lighteval, the extended tasks, or the community tasks, and add its dataset to the Hub.

A popular community evaluation can move to become an extended or core evaluation over time.

Core evaluations

Prompt function: find a suitable prompt function in src/lighteval/tasks/task_prompt_formatting.py, or code your own. This function must output a Doc object, which should contain query (your prompt) and either gold (the gold output) or choices and gold_index (the list of choices and the index or indices of the correct answers). If your query contains an instruction that should not be repeated in a few-shot setup, add it to an instruction field.

Summary: create a LightevalTaskConfig summary of your evaluation, in src/lighteval/tasks/default_tasks.py. This summary should contain the following fields:

Make sure you can launch your model with your new task using --tasks lighteval|yournewtask|2|0.

Community evaluations

Copy the community_tasks/_template.py to community_tasks/yourevalname.py and edit it to add your custom tasks (the parameters you can use are explained above). It contains an interesting mechanism if the dataset you are adding contains a lot of subsets.

Make sure you can launch your model with your new task using --tasks community|yournewtask|2|0 --custom_tasks community_tasks/yourevalname.py.

Adding a new metric

First, check if you can use one of the parametrized functions in src.lighteval.metrics.metrics_corpus or src.lighteval.metrics.metrics_sample.

If not, you can use the custom_task system to register your new metric:

from aenum import extend_enum
from lighteval.metrics import Metrics

# Define your metric here, along with any other class you need,
# depending on whether it is a sample-level or a corpus-level metric.
# `metric_function` below is a placeholder for that definition.

# Add the metric to the list of available metrics.
extend_enum(Metrics, "metric_name", metric_function)

if __name__ == "__main__":
    print("Imported metric")

You can then give your custom metric to lighteval by using --custom_tasks path_to_your_file when launching it.

To see an example of a custom metric added along with a custom task, look at examples/tasks/custom_tasks_with_custom_metrics/ifeval/ifeval.py.

Available metrics

Metrics for multiple choice tasks

These metrics use log-likelihood of the different possible targets.
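As a minimal sketch of how such a score can be derived from per-choice log-likelihoods, the Python snippet below computes an accuracy and a reciprocal rank for one document. The function names and the numbers are illustrative assumptions, and the code does not reproduce lighteval's own metric implementations (for instance, it applies no normalization).

```python
def loglikelihood_accuracy(choice_logprobs: list[float], gold_index: int) -> int:
    """Return 1 if the most likely choice is the gold one, else 0."""
    best = max(range(len(choice_logprobs)), key=choice_logprobs.__getitem__)
    return int(best == gold_index)


def reciprocal_rank(choice_logprobs: list[float], gold_index: int) -> float:
    """Return 1 / rank of the gold choice when choices are sorted by likelihood."""
    ranking = sorted(
        range(len(choice_logprobs)), key=choice_logprobs.__getitem__, reverse=True
    )
    return 1.0 / (ranking.index(gold_index) + 1)


# Made-up log-likelihoods of four answer choices given the same prompt.
logprobs = [-2.3, -0.4, -1.1, -3.0]
print(loglikelihood_accuracy(logprobs, gold_index=2))  # 0: choice 1 is most likely
print(reciprocal_rank(logprobs, gold_index=2))  # 0.5: the gold choice is ranked second
```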

All these metrics also exist in a "single token" version (loglikelihood_acc_single_token, loglikelihood_acc_norm_single_token, loglikelihood_f1_single_token, mcc_single_token, recall@2_single_token and mrr_single_token). When the multichoice option compares only one token (e.g. "A" vs "B" vs "C" vs "D", or "yes" vs "no"), using these metrics in the single token version will divide the time spent by the number of choices. Single token evals also include:

Metrics for perplexity and language modeling

These metrics use the log-likelihood of the prompt.
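For reference, the usual way to turn token log-likelihoods into a perplexity is sketched below. The per-token normalization is an assumption on our side: perplexity can also be normalized per word or per byte, and this sketch does not claim to match lighteval's implementation.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Exponential of the average negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A model that assigns probability 0.25 to every token has a perplexity close to 4.
uniform = [math.log(0.25)] * 4
print(perplexity(uniform))
```

A lower perplexity means the model finds the text less surprising; a perplexity of 1 would mean every token was predicted with probability 1.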

Metrics for generative tasks

These metrics need the model to generate an output. They are therefore slower.

Metrics for specific tasks

To keep compatibility with the Harness for some specific tasks, we ported their evaluations more or less as is. They include drop (for the DROP dataset) and truthfulqa_mc_metrics (for TruthfulQA). In general, except for tasks where the dataset has very different formatting than usual (another language, programming language, math, ...), we want to use standard implementations of the above metrics. It makes little sense to have 10 different versions of an exact match depending on the task. However, most of the above metrics are parametrizable so that you can easily change the normalization applied for experimental purposes.

Not working yet

These metrics need both the generation and its logprob. They are not working at the moment, as this function is not in the AI Harness.

Examples of scripts to launch lighteval on the cluster

Evaluate a whole suite on one node, 8 GPUs

  1. Create a config file for accelerate
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
  2. Create a slurm file
#!/bin/bash
#SBATCH --job-name=kirby-one-node
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24
#SBATCH --gres=gpu:8
#SBATCH --mem-per-cpu=11G # This is essentially 1.1T / 96
#SBATCH --partition=production-cluster
#SBATCH --mail-type=ALL
#SBATCH --mail-user=clementine@huggingface.co

set -x -e
export TMPDIR=/scratch

echo "START TIME: $(date)"

# Activate your relevant virtualenv
source <path_to_your_venv>/activate #or conda activate yourenv

cd <path_to_your_lighteval>/lighteval

export CUDA_LAUNCH_BLOCKING=1
srun accelerate launch --multi_gpu --num_processes=8 -m \
    lighteval accelerate \
    --model_args "pretrained=your model name" \
    --tasks examples/tasks/open_llm_leaderboard_tasks.txt \
    --override_batch_size 1 \
    --save_details \
    --output_dir="your output dir"

Releases

Building the package

pip install build
python3 -m build .

Cite as

@misc{lighteval,
  author = {Fourrier, Clémentine and Habib, Nathan and Wolf, Thomas and Tunstall, Lewis},
  title = {LightEval: A lightweight framework for LLM evaluation},
  year = {2023},
  version = {0.3.0},
  url = {https://github.com/huggingface/lighteval}
}