
<div align="center"> <picture> <source media="(prefers-color-scheme: dark)" srcset="assets/Evalverse_White.png" width=300> <source media="(prefers-color-scheme: light)" srcset="assets/Evalverse_Color.png" width=300> <img alt="Evalverse" src="assets/Evalverse_Color.png" width=300> </picture>

The Universe of Evaluation. All about evaluation for LLMs. <br> Upstage Solar is powered by Evalverse! Try it at Upstage Console!

🤗HuggingFace Space • 📚Docs • 📄Paper

Examples • FAQ • Contribution Guide • Contact • Discord

</div>

🚀 Newly updated

<div align="center"><img alt="overview" src="assets/overview.png" width=500></div>

👋 Welcome to Evalverse!

Evalverse is a freely accessible, open-source project designed to support your LLM (Large Language Model) evaluation needs. It provides a simple, standardized, and user-friendly way to run and manage LLM evaluations, serving AI research engineers and scientists, and it also supports no-code evaluation for people with less hands-on LLM experience. On top of that, you receive a well-organized report with figures summarizing the evaluation results.

With Evalverse, you are empowered to

Architecture of Evalverse

<div align="center"><img alt="architecture" src="assets/architecture.png" width=700></div>

Key Features of Evalverse

If you want to know more about Evalverse, please check out our docs. <br> Clicking the image below will take you to a short intro video! Brief Introduction <br>

🌌 Installation

🌠 Option 1: Git clone

Before cloning, please make sure you've registered proper SSH keys linked to your GitHub account.

1. Clone the Evalverse repository

```shell
git clone --recursive https://github.com/UpstageAI/evalverse.git
```

2. Install the required packages

```shell
cd evalverse
pip install -e .
```
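As a quick sanity check that the editable install succeeded, you can verify the package is importable. This is a generic sketch, not part of Evalverse itself:

```python
# Generic helper to check whether a package is importable in the
# current environment; an editable install should make it visible.
import importlib.util

def is_installed(package: str) -> bool:
    """Return True if `package` can be imported from this environment."""
    return importlib.util.find_spec(package) is not None

# After `pip install -e .`, is_installed("evalverse") should return True.
```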

🌠 Option 2: Install via PyPI (WIP)

Installation via PyPI is not supported yet. Please install Evalverse with Option 1 for now.

<br>

🌌 Configuration

To use all features of Evalverse, set your API key and/or tokens in the .env file (rename .env_sample to .env).

```
OPENAI_API_KEY=sk-...

SLACK_BOT_TOKEN=xoxb-...
SLACK_APP_TOKEN=xapp-...
```
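For reference, a .env file like the one above is just plain KEY=VALUE lines. Evalverse loads it for you, so the following parser is purely illustrative (it assumes simple KEY=VALUE entries with no quoting or `export` prefixes):

```python
# Illustrative sketch only: how KEY=VALUE pairs in a .env file can be read.
# Evalverse handles this internally; this just shows the expected format.
def load_env_file(path: str = ".env") -> dict:
    values = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # skip blank lines, comments, and malformed entries
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values
```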
<br>

🌌 Quickstart

More detailed tutorials are here.

🌠 Evaluation

💫 Evaluation with Library

The following code is a simple example to evaluate the SOLAR-10.7B-Instruct-v1.0 model on the h6_en (Open LLM Leaderboard) benchmark.

```python
import evalverse as ev

evaluator = ev.Evaluator()

model = "upstage/SOLAR-10.7B-Instruct-v1.0"
benchmark = "h6_en"

evaluator.run(model=model, benchmark=benchmark)
```

💫 Evaluation with CLI

Here is a CLI script that produces the same result as the above code:

```shell
cd evalverse

python3 evaluator.py \
  --h6_en \
  --ckpt_path upstage/SOLAR-10.7B-Instruct-v1.0
```

🌠 Report

Currently, generating a report is only available through the library. We will work on a Command Line Interface (CLI) version as soon as possible.

```python
import evalverse as ev

db_path = "./db"
output_path = "./results"
reporter = ev.Reporter(db_path=db_path, output_path=output_path)

reporter.update_db(save=True)

model_list = ["SOLAR-10.7B-Instruct-v1.0", "Llama-2-7b-chat-hf"]
benchmark_list = ["h6_en"]
reporter.run(model_list=model_list, benchmark_list=benchmark_list)
```
<img alt="architecture" src="assets/sample_report.png" width=700>
| Model | Ranking | total_avg | H6-ARC | H6-Hellaswag | H6-MMLU | H6-TruthfulQA | H6-Winogrande | H6-GSM8k |
|---|---|---|---|---|---|---|---|---|
| SOLAR-10.7B-Instruct-v1.0 | 1 | 74.62 | 71.33 | 88.19 | 65.52 | 71.72 | 83.19 | 67.78 |
| Llama-2-7b-chat-hf | 2 | 53.51 | 53.16 | 78.59 | 47.38 | 45.31 | 72.69 | 23.96 |
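As a sanity check on the report above, total_avg appears to be the plain mean of the six H6 sub-benchmark scores. This is an inference from the numbers in the table, not documented behavior:

```python
# Reproducing total_avg for SOLAR-10.7B-Instruct-v1.0 from the report:
# the six H6 sub-scores averaged and rounded to two decimals.
scores = {
    "H6-ARC": 71.33,
    "H6-Hellaswag": 88.19,
    "H6-MMLU": 65.52,
    "H6-TruthfulQA": 71.72,
    "H6-Winogrande": 83.19,
    "H6-GSM8k": 67.78,
}
total_avg = round(sum(scores.values()) / len(scores), 2)  # 74.62
```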
<br>

🌌 Supported Evaluations

We currently support four evaluation methods. If you have suggestions for new methods, we welcome your input!

| Evaluation | Original Repository |
|---|---|
| H6 (Open LLM Leaderboard) | EleutherAI/lm-evaluation-harness |
| MT-Bench | lm-sys/FastChat |
| IFEval | google-research/instruction_following_eval |
| EQ-Bench | EQ-bench/EQ-Bench |
<br>

🌌 Evalverse use-cases

If you have any use-cases of your own, please feel free to let us know. <br>We would love to hear about them and possibly feature your case.

✨ Upstage is using Evalverse for evaluating Solar. <br> ✨ Upstage is using Evalverse for evaluating models on the Open Ko-LLM Leaderboard.

<br>

🌌 Contributors

<a href="https://github.com/UpstageAI/evalverse/graphs/contributors"> <img src="https://contrib.rocks/image?repo=UpstageAI/evalverse"/> </a>

🌌 Acknowledgements

Evalverse is an open-source project orchestrated by the Data-Centric LLM Team at Upstage, designed as an ecosystem for LLM evaluation. Launched in April 2024, the project aims to advance how evaluations of large language models (LLMs) are handled.

🌌 License

Evalverse is freely accessible, open-source software licensed under the Apache License 2.0.

🌌 Citation

If you want to cite our 🌌 Evalverse project, feel free to use the following BibTeX. You can check out our paper via this link.

```bibtex
@misc{kim2024evalverse,
      title={Evalverse: Unified and Accessible Library for Large Language Model Evaluation},
      author={Jihoo Kim and Wonho Song and Dahyun Kim and Yunsu Kim and Yungi Kim and Chanjun Park},
      year={2024},
      eprint={2404.00943},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```