Home

Awesome

<div align="center"> <img src="./static/assets/th1-icon-round.svg" width="100px"/> <br /> <br /> </div>

Welcome to LLM Benchmarker Suite!

Navigating the complex landscape of evaluating large-scale language models (LLMs) has never been more important. As the demand for cutting-edge language AI continues to grow, the need for a comprehensive and optimized approach to assessing their underlying models becomes more and more apparent.

To do this, we combine best practice with our fine-tuned optimizations to achieve a one-stop-shop approach that provides a way to holistically evaluate large language models.

To use learn how to use the tools directly jump to Tools Overview

<a id="table-of-contents"></a> Table of Contents

<a id="introduction"></a> Introduction

In recent years, the field of natural language processing (NLP) has witnessed an unprecedented surge in innovation, largely fueled by the advancements in large language models (LLMs) such as GPT, BERT, and their derivatives. These models exhibit an impressive ability to understand and generate human-like text, revolutionizing various applications such as content generation, sentiment analysis, language translation, and more. As their potential continues to unfold, researchers and practitioners alike are increasingly invested in benchmarking the performance of these LLMs to better comprehend their capabilities and limitations.

Despite the widespread interest in benchmarking LLMs, the process itself has remained remarkably non-standardized and often lacks a coherent framework. This lack of standardization introduces significant challenges, making it difficult to compare results across studies, impeding the reproducibility of findings and hindering the overall progress of the field. Consequently, a pressing need arises to establish a comprehensive and standardized approach to benchmarking LLMs, ensuring that evaluations are meaningful, consistent, and comparable.

Motivated by the need for unified benchmarking of large language models, we introduce the LLM Benchmarking Suite. This open-source effort aims to address fragmentation and ambiguity in LLM benchmarking. The suite provides a structured methodology, a collection of diverse benchmarks, and toolkits to streamline assessing LLM performance. By offering a common platform, this project seeks to promote collaboration, transparency, and quality research in NLP.

<a id="building-blocks-of-llm-evaluation"></a> Building Blocks of LLM Evaluation

For manually conducting evaluations, you generally take the following steps:

<a id="load-the-model"></a> Load the Model

Loading models locally enables efficient and rapid inferences by eliminating the delays associated with remote network-based inferences, which can be particularly sluggish for sizable models. Local loading optimally harnesses local hardware resources such as GPUs, ensuring swift and accelerated inferences. Among the available libraries facilitating local model loading, vLLM stands out for its exceptional ability to maximize resource utilization and attain peak throughput.

vLLM is a library that allows for fast inference on large language models. It is a fork of the popular HuggingFace Transformers library. It has a feature called paged attention - vLLM which can be seen as a buffer and shared memory feature that can be used to drastically increase the response time for a model improving the number of tokens the model can generate in response each second. Overall it gives a significant speedup over directly using the logic of from_pretrained from AutoClasses of HuggingFace which most benchmarking tools directly use.

<a id="load-an-appropriate-benchmarking-dataset"></a> Load an appropriate benchmarking dataset

Datasets mainly fall under multiple categories which aim to test proficiency of a model at a particular task such as question answering or summarization such as Squad and RACE. Hybrid benchmarks such as SuperGlue and LAMBADA also include a variety of test to get a more holistic understanding of the a large language model's capabilities. The popularity and relevance of a dataset can be gauged from what the latest foundational models use such as:

We also prefer minimally curated datasets to ensure that the dataset is not biased towards a particular model.

image This visualization from Google shows how the complexity and score of an LLM can grow within different areas with the number of parameters

<a id="select-a-relevant-metric-for-evaluation"></a> Select a relevant metric for evaluation

The most common metrics used for evaluation are:

<a id="leaderboard"></a> Leaderboard

We provide LLM Benchmarker Suite Leaderboard for the community to rank all public models and API models. If you would like to join the evaluation, please provide the model repository URL or a standard API interface to the email address abhijoy.sarkar@theoremone.co.

To access the locally created metrics dashboard, initiate a server by navigating to the static directory and running the command python3 -m http.server 8000. Afterward, open your web browser and visit http://<Host IP>:8000/. Alternatively, you can compare your evaluations with ours on the LLM Benchmarker Suite Leaderboard.

ModelMMLUTriviaQANatural QuestionsGSM8KHumanEvalAGIEvalBoolQHellaSWAGOpenBookQAQuACWinogrande
MPT (7B)26.859.617.86.818.323.57576.451.437.768.3
Falcon (7B)26.256.818.16.8nan21.267.574.151.618.866.3
LLaMA-2 (7B)45.368.922.714.612.829.377.477.258.639.769.2
Llama-2 (13B)54.877.22828.718.339.181.780.75744.872.8
MPT (30B)46.971.32315.22533.87979.95241.171
Falcon (40B)55.478.629.519.6nan3783.183.656.643.376.9
LLaMA-1 (65B)63.484.53150.923.747.685.384.260.239.877
LLaMA-2 (70B)68.9853356.829.954.28585.360.249.380.2

A quick look at the above table reveals several interesting observations:

We can draw the following conclusions from these observations:

  1. Model Size Impact: Larger models consistently demonstrate improved performance across most benchmarks. Llama-2 70B particularly stands out, securing the highest scores on 10 out of the 11 benchmarks. This suggests that model size plays a crucial role in achieving superior results across a range of tasks.
  2. Size-Performance Exceptions: Despite the general trend, there are instances where smaller models outshine their larger counterparts. Notably, Llama-1 65B performs better than Llama-2 70B on BoolQ. This intriguing phenomenon raises the question of whether task-specific nuances contribute to these exceptions.
  3. Benchmark-Specific Trends: The relative performance of models varies significantly across different benchmarks. For instance, on MMLU, Llama-2 70B demonstrates exceptional supremacy, while on BoolQ, models exhibit closely competitive performance. This observation indicates that a model's effectiveness is contingent upon the specific characteristics of the task at hand. For example, BoolQ represents a comparatively straightforward dataset, while MMLU assesses the yes/no question answering proficiency of LLMs, demanding less intricate language comprehension.
  4. Performance Gains with Model Size: The gains achieved by increasing model size are not uniform across benchmarks. For instance, the transition from MPT 7B to MPT 30B yields substantial improvements on MMLU, whereas the gains are relatively smaller on BoolQ. This implies that the relationship between model size and performance enhancement is intricate and possibly task-dependent.
  5. Trade-offs and Task Complexity: The contrasting performance improvements from model size expansion indicate that there might be a trade-off between model complexity and task specificity. Smaller models might excel in tasks with well-defined patterns, while larger models could excel in more complex, nuanced tasks that require broader context understanding.

<a id="tools-overview"></a> Tools Overview

There are many packages that assist in evaluation of Large Language Models (LLMs). We take the best practices available along with our own optimizations to create one-stop method to evaluate LLMs holistically aiming to provide a fair, open, and reproducible benchmark for large model evaluation. Its main features include:

Get Started

<a id="prerequisites"></a> Pre-requisites

The LLM Benchmarking Suite will NOT run if CUDA toolkit is not configured for your machine/cloud instance.

Run the following to command in a linux machine to check CUDA toolkit and cuDNN is correctly configured.

nvidia-smi
nvcc --version

Refer to the following links to install CUDA.

In case you do not have access to a Linux machine, we recommend using a cloud GPU instance provider such as Vast.ai Console. As an example, Vast.ai can be connected in the following way:

  1. Go to templates and select the latest version nvidia/cuda image to create a new instance.
  2. Select the GPU type and the number of GPUs you want to use. (Recommended for benchmarking: 1x A100 SXM4) and select Rent.
  3. Then go to the Instances tab and press ► to start the instance.
  4. SSH into the instance using:- ssh -p \<Instance Port Range start\> root@\<public ip address\> -L 8080:localhost:8080

<a id="environment-setup"></a>Environment Setup

  1. Clone the repository
git clone https://github.com/TheoremOne/llm-benchmarker-suite.git
cd llm-benchmarker-suite
  1. We recommend using a virtual environment
python3 -m venv venv
source ./venv/bin/activate
  1. Install Poetry if not already installed (Ensure Poetry is added to your system's PATH) Run the following command if you haven't installed Poetry yet: curl -sSL https://install.python-poetry.org | python3 -

We use Poetry to ensure robust dependency management across various machines. Poetry provides advanced features for managing dependencies, project packaging, and publishing, making it a powerful choice for managing project dependencies.

  1. Install dependencies and submodules
poetry install
git submodule init && git submodule update
  1. Install the main package and submodules in editable mode
pip install -e .
cd opencompass && pip install -e .
cd ../FastChat && pip install -e ".[eval]"

<a id="important-suite-tools"></a> Important Suite Tools

The Suite consists of various tools designed to assist you in conducting metrics analysis on large language models. These tools offer diverse approaches tailored to your specific use case such as doing a static evaluation on stabndard dataset or using another LLM as a judge to check your inferencing capabilities.

<a id="opencompass"></a> Static Evaluations

This is a static evaluation package designed to assess the capabilities of a model through predefined measures and scenarios.

In the realm of model evaluation, "static evaluation" refers to an approach where assessments are performed on a fixed set of tasks, data, or benchmarks. These assessments provide a snapshot of a model's performance under specific conditions. Static evaluation contrasts with dynamic evaluation, where models are tested in more interactive, real-world scenarios.

Benefits of static evaluation:

  1. Controlled Environment: Static evaluations provide a controlled and repeatable testing environment, making it easier to compare different models objectively.
  2. Benchmarking: They enable direct comparison against established benchmarks, aiding in gauging a model's performance relative to others.
  3. Simplicity: Static evaluations can be simpler to set up and execute, requiring less complex infrastructure and data handling.

Downsides of static evaluation:

  1. Limited Realism: Since static evaluations operate within predefined scenarios, they might not fully capture a model's behavior in dynamic, real-world contexts.
  2. Lack of Adaptability: Static evaluations may not account for a model's ability to adapt or learn from ongoing interactions, which is essential for many applications.
  3. Potential Bias: The fixed nature of static evaluations might inadvertently introduce bias if the scenarios don't adequately represent the diversity of potential use cases.

Static evaluation offers controlled and comparable assessments of a model's performance within specific conditions. However, it may not capture the full range of a model's capabilities in real-world, dynamic settings. The choice between static and dynamic evaluation depends on the intended goals and the context of the evaluation.

python opencompass/run.py configs/eval_demo.py -w outputs/demo

<a id="llm-as-a-judge"></a> LLM-as-a-judge:

PaperLeaderboardMT-bench Human Annotation DatasetChatbot Arena Conversation Dataset

In this package, you can use MT-bench questions and prompts to evaluate your models with LLM-as-a-judge.

MT-bench is a set of challenging multi-turn open-ended questions for evaluating chat assistants.

To automate the evaluation process, we prompt strong LLMs like GPT-4 to act as judges and assess the quality of the models' responses.

This package introduces a novel approach to model evaluation, leveraging the MT-bench framework for evaluating chat assistants through the lens of a Language Model (LLM)-as-a-judge paradigm.

Traditionally, model evaluation has often involved static benchmarks or human evaluators. While valuable, these approaches can have limitations in capturing the complexity of dynamic conversational interactions and may involve subjective biases. The MT-bench methodology seeks to address these challenges by presenting a set of intricate multi-turn open-ended questions tailored to test the abilities of chat assistants comprehensively.

Incorporating LLMs, such as GPT-4, as judges introduces a unique and automated dimension to the evaluation process. Here's the tradeoff and benefits of this innovative approach:

Benefits and Problem Solving:

  1. Enhanced Dynamic Assessment: Unlike static benchmarks, MT-bench questions simulate real-world conversational scenarios, allowing for a more dynamic assessment of a model's performance in multi-turn interactions.

  2. Objective and Consistent Judgment: By employing LLMs as judges, the evaluation process gains objectivity and consistency. LLMs, trained on vast amounts of text, can provide an unbiased and standardized measure of response quality.

  3. Efficiency and Automation: Using LLMs as judges automates the evaluation process, enabling rapid and scalable assessments. This is especially advantageous when dealing with a large number of models or frequent evaluations.

  4. Insights into Model Behavior: LLM judges can offer insights into a model's behavior and thought process during evaluation, shedding light on strengths and weaknesses that might not be apparent through other methods.

  5. Reduced Human Bias: Human evaluators may introduce subjective biases, whereas LLM judges are not influenced by external factors, leading to fairer and more consistent evaluations.

However, there are considerations to bear in mind:

Tradeoffs:

  1. Contextual Understanding: While LLMs excel in many language tasks, they might not fully understand nuanced context or domain-specific intricacies, potentially affecting their judgment accuracy.

  2. Interpreting Open-ended Responses: LLM judges may sometimes provide responses that are insightful but not entirely aligned with human intuition, requiring careful interpretation.

  3. Generalization: LLM judges' behavior might differ from human evaluators, necessitating efforts to ensure that the judgments align with human standards.

This innovative evaluation approach combines the MT-bench framework's rich and dynamic assessment with the objectivity and scalability of LLMs as judges. By automating the process and minimizing biases, this approach addresses challenges that traditional evaluation methods may encounter, providing a valuable tool for comprehensively evaluating chat assistants.

Evaluate a model on MT-bench

mt-bench-browser View data locally using python3 qa_browser.py --share

Step 1. Generate model answers to MT-bench questions
python gen_model_answer.py --model-path [MODEL-PATH] --model-id [MODEL-ID]

Arguments

Example

python gen_model_answer.py \
  --model-path lmsys/vicuna-7b-v1.3 \
  --model-id vicuna-7b-v1.3

The answers will be saved to data/mt_bench/model_answer/[MODEL-ID].jsonl.

To make sure FastChat loads the correct prompt template, see the supported models and how to add a new model here.

You can also specify --num-gpus-per-model for model parallelism (needed for large 65B models) and --num-gpus-total to parallelize answer generation with multiple GPUs.

Step 2. Generate GPT-4 judgments

There are several options to use GPT-4 as a judge, such as pairwise winrate and single-answer grading.

In MT-bench, we recommend single-answer grading as the default mode.

This mode asks GPT-4 to grade and give a score to model's answer directly without pairwise comparison.

For each turn, GPT-4 will give a score on a scale of 10. We then compute the average score on all turns.

python gen_judgment.py --model-list [LIST-OF-MODEL-ID] --parallel [num-concurrent-api-call]

Example

python gen_judgment.py --parallel 2 --model-list \
  vicuna-13b-v1.3 \
  alpaca-13b \
  llama-13b \
  claude-v1 \
  gpt-3.5-turbo \
  gpt-4

The judgments will be saved to data/mt_bench/model_judgment/gpt-4_single.jsonl

Step 3. Show MT-bench scores

Show the scores for selected models

python show_result.py --model-list \
    vicuna-13b-v1.3 \
    alpaca-13b \
    llama-13b \
    claude-v1 \
    gpt-3.5-turbo \
    gpt-4

Show all scores

python show_result.py

mt-bench-browser

For more information on usage details, refer to the following docs.


<a id="openai-evals"></a> OpenAI Evals (Optional)

This is mostly useful for running evaluations on LLMs that can possibly generate untrusted code to prompts.

Evals is a framework for evaluating LLMs (large language models) or systems built using LLMs as components. It also includes an open-source registry of challenging evals.

An “eval” refers to a specific evaluation task that is used to measure the performance of a language model in a particular area, such as question answering or sentiment analysis. These evals are typically standardized benchmarks that allow for the comparison of different language models. The Eval framework provides a standardized interface for running these evals and collecting the results.

At its core, an eval is a dataset and an eval class that is defined in a YAML file. An example of an eval is shown below:

test-match:
id: test-match.s1.simple-v0
description: Example eval that checks sampled text matches the expected output.
disclaimer: This is an example disclaimer.
metrics: [accuracy]
test-match.s1.simple-v0:
class: evals.elsuite.basic.match:Match
args:
  samples_jsonl: test_match/samples.jsonl

We can run the above eval with a simple command:

oaieval gpt-3.5-turbo test-match

oaievals

Here we’re using the oaieval CLI to run this eval. We’re specifying the name of the completion function (gpt-3.5-turbo) and the name of the eval (test-match)

With Evals, we aim to make it as simple as possible to build an eval while writing as little code as possible. An "eval" is a task used to evaluate the quality of a system's behavior. To get started, we recommend that you follow these steps:

To get set up with evals, follow the setup instructions.

Refer to the following Jupyter Notebooks for example of usages.

For more information on usage details, refer to the following docs.

<a id="running-evals"></a> Running evals

This concept allows for multiple levels of evaluating the effectiveness of a large language models.

Metrics

The metrics package is a Python library that provides various evaluation metrics commonly used to assess the performance of large language models. It includes functions to calculate metrics such as F1 score, accuracy, and BLEU score.

Installation

The metrics package is not available on PyPI and can be used as a standalone package. To integrate it into your project, you can directly copy the individual metric files from the metrics directory or clone the entire repository.

Usage

To use the metrics package, follow these steps:

Import the specific metric functions into your Python script or notebook:

from metrics.f1_score import calculate_f1_score
from metrics.accuracy import calculate_accuracy
from metrics.bleu_score import calculate_bleu_score
from metrics.utils import count_true_positives_negatives

Use the imported functions to evaluate your large language models. For example, if you have predictions and true labels for a binary classification task:

# Assume we have predictions and true labels as follows:
predictions = [1, 0, 1, 1, 0]
true_labels = [1, 0, 0, 1, 1]

# Calculate true positives and true negatives for binary classification
true_positives, true_negatives = count_true_positives_negatives(predictions, true_labels, positive_label=1)

# Calculate F1 score and accuracy
f1_score = calculate_f1_score(true_positives, false_positives, false_negatives)
accuracy = calculate_accuracy(true_positives, true_negatives, len(predictions))

print("F1 Score:", f1_score)
print("Accuracy:", accuracy)

Additionally, the calculate_bleu_score function can be used to compute the BLEU score for evaluating language generation tasks:

# Example usage for BLEU score calculation
reference_sentences = ["This is a reference sentence.", "Another reference sentence."]
predicted_sentence = "This is a predicted sentence."

bleu_score = calculate_bleu_score(reference_sentences, predicted_sentence)
print("BLEU Score:", bleu_score)

Notes

<a id="datasets"></a> Datasets

This is a utility package that allows efficient loading of popular datasets for evaluation. They use HuggingFace loaders by default.

from dataset.hellaswag import load_hellaswag_dataset
from dataset.race import load_race_dataset

# Example usage:
hellaswag_data = load_hellaswag_dataset()
race_data = load_race_dataset()

print("HellaSWAG dataset:", hellaswag_data)
print("RACE dataset:", race_data)

Notes

<a id="dataset-support"></a> Dataset Support

<table align="center"> <tbody> <tr align="center" valign="bottom"> <td> <b>Language</b> </td> <td> <b>Knowledge</b> </td> <td> <b>Reasoning</b> </td> <td> <b>Comprehensive Examination</b> </td> <td> <b>Understanding</b> </td> </tr> <tr valign="top"> <td> <details open> <summary><b>Word Definition</b></summary> </details> <details open> <summary><b>Idiom Learning</b></summary> </details> <details open> <summary><b>Semantic Similarity</b></summary> </details> <details open> <summary><b>Coreference Resolution</b></summary> </details> <details open> <summary><b>Translation</b></summary> </details> </td> <td> <details open> <summary><b>Knowledge Question Answering</b></summary> </details> <details open> <summary><b>Multi-language Question Answering</b></summary> </details> </td> <td> <details open> <summary><b>Textual Entailment</b></summary> </details> <details open> <summary><b>Commonsense Reasoning</b></summary> </details> <details open> <summary><b>Mathematical Reasoning</b></summary> </details> <details open> <summary><b>Theorem Application</b></summary> </details> <details open> <summary><b>Code</b></summary> </details> <details open> <summary><b>Comprehensive Reasoning</b></summary> </details> </td> <td> <details open> <summary><b>Junior High, High School, University, Professional Examinations</b></summary> </details> </td> <td> <details open> <summary><b>Reading Comprehension</b></summary> </details> <details open> <summary><b>Content Summary</b></summary> </details> <details open> <summary><b>Content Analysis</b></summary> </details> </td> </tr> </td> </tr> </tbody> </table>

<a id="model-support"></a> Model Support

<table align="center"> <tbody> <tr align="center" valign="bottom"> <td> <b>Open-source Models</b> </td> <td> <b>API Models</b> </td> <!-- <td> <b>Custom Models</b> </td> --> </tr> <tr valign="top"> <td> </td> <td> </td> <!-- - GLM - ... </td> --> </tr> </tbody> </table>