<p align="center">
    <br>
    <img src="https://huggingface.co/datasets/evaluate/media/resolve/main/evaluate-banner.png" width="400"/>
    <br>
</p>

<p align="center">
    <a href="https://github.com/huggingface/evaluate/actions/workflows/ci.yml?query=branch%3Amain">
        <img alt="Build" src="https://github.com/huggingface/evaluate/actions/workflows/ci.yml/badge.svg?branch=main">
    </a>
    <a href="https://github.com/huggingface/evaluate/blob/master/LICENSE">
        <img alt="GitHub" src="https://img.shields.io/github/license/huggingface/evaluate.svg?color=blue">
    </a>
    <a href="https://huggingface.co/docs/evaluate/index">
        <img alt="Documentation" src="https://img.shields.io/website/http/huggingface.co/docs/evaluate/index.svg?down_color=red&down_message=offline&up_message=online">
    </a>
    <a href="https://github.com/huggingface/evaluate/releases">
        <img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/evaluate.svg">
    </a>
    <a href="CODE_OF_CONDUCT.md">
        <img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg">
    </a>
</p>

> **Tip:** For more recent evaluation approaches, for example for evaluating LLMs, we recommend our newer and more actively maintained library [LightEval](https://github.com/huggingface/lighteval).
🤗 Evaluate is a library that makes evaluating and comparing models and reporting their performance easier and more standardized.
It currently contains:
- **implementations of dozens of popular metrics**: the existing metrics cover a variety of tasks spanning from NLP to Computer Vision, and include dataset-specific metrics. With a simple command like `accuracy = load("accuracy")`, you get any of these metrics ready to use for evaluating an ML model in any framework (NumPy/Pandas/PyTorch/TensorFlow/JAX); see the short example after this list.
- **comparisons and measurements**: comparisons are used to measure the difference between models, and measurements are tools to evaluate datasets.
- **an easy way of adding new evaluation modules to the 🤗 Hub**: you can create new evaluation modules and push them to a dedicated Space in the 🤗 Hub with `evaluate-cli create [metric name]`, which lets you easily compare different metrics and their outputs for the same sets of references and predictions.
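For instance, loading and computing the accuracy metric takes only a few lines. The snippet below is a minimal sketch; the predictions and references are toy values:

```python
from evaluate import load

# Load the accuracy metric from the Hub and score a toy example.
accuracy = load("accuracy")
results = accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0])
print(results)  # {'accuracy': 0.75}
```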
🔎 Find a metric, comparison, measurement on the Hub
🌟 Add a new evaluation module
🤗 Evaluate also has lots of useful features like:
- Type checking: the input types are checked to make sure that you are using the right input formats for each metric.
- Metric cards: each metric comes with a card that describes the values, limitations and their ranges, as well as providing examples of their usage and usefulness (see the snippet after this list).
- Community metrics: metrics live on the Hugging Face Hub and you can easily add your own metrics for your project or to collaborate with others.
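As a rough illustration of metric cards and type checking, a loaded module carries its documentation and expected input types with it. This is a sketch based on our reading of the current `evaluate` API (attributes such as `description` and `features`); the metric card on the Hub remains the authoritative reference:

```python
import evaluate

accuracy = evaluate.load("accuracy")

# The metric card's documentation ships with the module itself.
print(accuracy.description)  # what the metric measures, its range and caveats
print(accuracy.features)     # the expected types of `predictions` and `references`

# Inputs are checked against these features, so passing the wrong format
# raises an informative error instead of silently producing a wrong score.
```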
## Installation

### With pip

🤗 Evaluate can be installed from PyPI and has to be installed in a virtual environment (venv or conda, for instance):

```bash
pip install evaluate
```
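To verify the installation, you can load a small metric and score a trivial example from the command line (a quick sanity check; it should print `{'exact_match': 1.0}`):

```bash
python -c "import evaluate; print(evaluate.load('exact_match').compute(references=['hello'], predictions=['hello']))"
```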
## Usage

🤗 Evaluate's main methods are:

- `evaluate.list_evaluation_modules()` to list the available metrics, comparisons and measurements
- `evaluate.load(module_name, **kwargs)` to instantiate an evaluation module
- `results = module.compute(**kwargs)` to compute the result of an evaluation module
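Putting the three methods together, here is a minimal end-to-end sketch (`bleu` is a real module on the Hub; the printed values are illustrative):

```python
import evaluate

# 1. Discover which metrics, comparisons and measurements are available.
print(evaluate.list_evaluation_modules()[:5])

# 2. Instantiate an evaluation module by name.
bleu = evaluate.load("bleu")

# 3. Compute the result from predictions and references.
results = bleu.compute(
    predictions=["the cat sat on the mat"],
    references=[["the cat sat on the mat"]],
)
print(results["bleu"])  # 1.0 for an exact match
```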
## Adding a new evaluation module

First install the necessary dependencies to create a new metric with the following command:

```bash
pip install evaluate[template]
```

Then you can get started with the following command, which will create a new folder for your metric and display the necessary steps:

```bash
evaluate-cli create "Awesome Metric"
```
See this step-by-step guide in the documentation for detailed instructions.
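For orientation, the generated module is essentially a small class with an `_info` method that declares the expected inputs and a `_compute` method that returns a dictionary of scores. The sketch below shows the rough shape under that assumption; the class name, features and scoring logic are illustrative, and the generated template plus the step-by-step guide remain the authoritative starting point:

```python
import datasets
import evaluate


class AwesomeMetric(evaluate.Metric):
    """Illustrative skeleton of a new metric module."""

    def _info(self):
        # Describes the metric and declares the expected input types.
        return evaluate.MetricInfo(
            description="Toy accuracy-style metric used as an example.",
            citation="",
            inputs_description="Integer predictions and references of equal length.",
            features=datasets.Features(
                {
                    "predictions": datasets.Value("int64"),
                    "references": datasets.Value("int64"),
                }
            ),
        )

    def _compute(self, predictions, references):
        # Returns a dictionary of scores; keys become the result fields.
        score = sum(p == r for p, r in zip(predictions, references)) / len(references)
        return {"awesome_score": score}
```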
## Credits

Thanks to @marella for letting us use the `evaluate` namespace on PyPI, previously used by his library.