<div align="center"> <h1>🧐 LLM AutoEval</h1> <p> 🐦 <a href="https://twitter.com/maximelabonne">Follow me on X</a> • 🤗 <a href="https://huggingface.co/mlabonne">Hugging Face</a> • 💻 <a href="https://mlabonne.github.io/blog">Blog</a> • 📙 <a href="https://github.com/PacktPublishing/Hands-On-Graph-Neural-Networks-Using-Python">Hands-on GNN</a> </p> <p><em>Simplify LLM evaluation using a convenient Colab notebook.</em></p> <a href="https://colab.research.google.com/drive/1Igs3WZuXAIv9X0vwqiE90QlEPys8e8Oa?usp=sharing"><img src="img/colab.svg" alt="Open In Colab"></a> </div> <br/> <p align="center"> <img src='img/llmautoeval.png'> </p>

🔍 Overview

LLM AutoEval simplifies the process of evaluating LLMs using a convenient Colab notebook. You just need to specify the name of your model, a benchmark, a GPU, and press run!
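In practice, an evaluation run boils down to filling in a handful of notebook parameters. As an illustration only (the field names and values below are hypothetical — check the notebook itself for the exact ones), the configuration looks roughly like this:

```python
# Hypothetical sketch of the parameters the Colab notebook asks for.
# Names and values are illustrative, not the notebook's exact fields.
config = {
    "MODEL_ID": "mlabonne/NeuralBeagle14-7B",  # Hugging Face model name
    "BENCHMARK": "nous",                       # "nous", "lighteval", or "openllm"
    "GPU": "NVIDIA GeForce RTX 3090",          # cloud GPU to rent
    "NUMBER_OF_GPUS": 1,
    "DEBUG": False,
}

assert config["BENCHMARK"] in {"nous", "lighteval", "openllm"}
```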

Key Features

View a sample summary here.

Note: This project is in the early stages and primarily designed for personal use. Use it carefully and feel free to contribute.

⚡ Quick Start

Evaluation

Cloud GPU

Tokens

Tokens are stored in Colab's Secrets tab. Create two secrets named "runpod" and "github" and add the corresponding tokens, which you can find as follows:
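Inside a notebook, secrets created in the Secrets tab are read with Colab's `userdata` API. A minimal sketch (the environment-variable fallback for running outside Colab is my addition, not part of the project):

```python
import os

def get_token(name):
    """Fetch a secret from Colab's Secrets tab ("runpod" or "github").
    Falls back to an environment variable when not running in Colab."""
    try:
        from google.colab import userdata  # only available inside Colab
        return userdata.get(name)
    except ImportError:
        return os.environ.get(f"{name.upper()}_TOKEN")

runpod_token = get_token("runpod")
github_token = get_token("github")
```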

📊 Benchmark suites

Nous

You can compare your results with:

Lighteval

You can compare your results on a case-by-case basis, depending on the tasks you have selected.

Open LLM

You can compare your results with those listed on the Open LLM Leaderboard.

🏆 Leaderboard

I use the summaries produced by LLM AutoEval to create YALL - Yet Another LLM Leaderboard, which visualizes the results with plots.


Let me know if you're interested in creating your own leaderboard from your gists in one click. The summary gists can easily be turned into a small notebook that builds a similar space.
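A leaderboard of this kind essentially lists a user's public gists through the GitHub API and keeps the evaluation summaries. Here is a rough sketch; the GitHub endpoint is real, but the filtering heuristic (Markdown gists whose description mentions a benchmark suite) is my assumption, not YALL's actual logic:

```python
import json
from urllib.request import urlopen

def list_gists(user):
    """Fetch a user's public gists from the GitHub API."""
    with urlopen(f"https://api.github.com/users/{user}/gists") as resp:
        return json.load(resp)

def summary_gists(gists, suites=("Nous", "Lighteval", "Open LLM")):
    """Heuristic: keep gists that look like LLM AutoEval summaries,
    i.e. Markdown files whose description mentions a benchmark suite."""
    return [
        g for g in gists
        if any(s.lower() in (g.get("description") or "").lower() for s in suites)
        and any(f.endswith(".md") for f in g.get("files", {}))
    ]
```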

🛠️ Troubleshooting

Acknowledgements

Special thanks to burtenshaw for integrating lighteval, EleutherAI for the lm-evaluation-harness, dmahan93 for his fork that adds agieval to the lm-evaluation-harness, Hugging Face for the lighteval library, NousResearch and Teknium for the Nous benchmark suite, and vllm for the additional inference speed.