The LLM Evaluation guidebook ⚖️

If you've ever wondered how to make sure an LLM performs well on your specific task, this guide is for you! It covers the different ways you can evaluate a model, guides on designing your own evaluations, and tips and tricks from practical experience.

Whether you work with models in production, do research, or are a hobbyist, I hope you'll find what you need; and if not, open an issue (to suggest improvements or missing resources) and I'll complete the guide!

How to read this guide

Throughout the text, links prefixed by ⭐ are ones I particularly enjoyed and recommend reading.

Table of contents

If you want an intro to the topic, you can read this blog post on how and why we do evaluation!

Automatic benchmarks

Human evaluation

LLM-as-a-judge

Troubleshooting

The most densely practical part of this guide.

General knowledge

These are mostly beginner guides to LLM basics, but they still contain some tips and cool references! If you're an advanced user, I suggest skimming them and jumping straight to the Going further sections.

Examples

You'll also find examples as Jupyter notebooks, to get a more hands-on experience of evaluation if that's how you learn!

Planned next articles

Resources

Links I like

Thanks

This guide has been heavily inspired by the ML Engineering Guidebook by Stas Bekman! Thanks for this cool resource!

Many thanks also to all the people who inspired this guide through discussions, either at events or online, including but not limited to:

Citation

CC BY-NC-SA 4.0

@misc{fourrier2024evaluation,
  author = {Clémentine Fourrier and The Hugging Face Community},
  title = {LLM Evaluation Guidebook},
  year = {2024},
  journal = {GitHub repository},
  url = {https://github.com/huggingface/evaluation-guidebook}
}