LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking

This repository contains code for the LLMeBench framework (described in <a href="https://arxiv.org/abs/2308.04945" target="_blank">this paper</a>). The framework currently supports evaluation of a variety of NLP tasks using three model providers: OpenAI (e.g., GPT), HuggingFace Inference API, and Petals (e.g., BLOOMZ); it can be seamlessly customized for any NLP task, LLM, and dataset, regardless of language.

<p align="center"> <picture> <img alt = "The architecture of the LLMeBench framework." src="https://github.com/qcri/LLMeBench/assets/3918663/7f7a0da8-cd73-49d5-90d6-e5c62781b5c3" width="400" height="250"/> </picture> </p>

Recent Updates

Overview

<p align="center"> <picture> <img alt = "Summary and examples of the 53 datasets, 31 tasks, 3 model providers and metrics currently implemented and validated in LLMeBench." src="https://github.com/qcri/LLMeBench/assets/3918663/8a0ddf60-5d2f-4e8c-a7d9-de37cdeac104" width="510" height="160"/> </picture> </p>

Developing LLMeBench is an ongoing effort, and the framework will be continuously expanded.

Quick Start!

  1. Install LLMeBench: pip install 'llmebench[fewshot]'

  2. Download the current assets: python -m llmebench assets download. This will fetch assets and place them in the current working directory.

  3. Download one of the datasets, e.g., ArSAS: python -m llmebench data download ArSAS. This will download the data into the data folder inside the current working directory.

  4. Evaluate!

    For example, to evaluate the performance of a random baseline for sentiment analysis on the ArSAS dataset, run:

    python -m llmebench --filter 'sentiment/ArSAS_Random*' assets/ results/
    

    which uses the ArSAS_Random "asset": a file that specifies the dataset, model, and task to evaluate. Here, ArSAS_Random is the asset name, referring to the ArSAS dataset and the Random model, and assets/ar/sentiment_emotion_others/sentiment/ is the directory containing the benchmarking asset for the sentiment analysis task on the Arabic ArSAS dataset. Results will be saved in a directory called results.
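At its core, an asset is a small Python file that wires a prompt builder and a response post-processor to a dataset, model, and task. The sketch below is hypothetical: the prompt wording, response shape, and label set are illustrative assumptions, not the framework's exact API.

```python
# Hypothetical sketch of the two core functions an asset typically defines.
# The prompt text, response structure, and label set below are assumptions
# for illustration only.

def prompt(input_sample):
    # Turn one dataset sample into a chat-style request payload.
    return [{
        "role": "user",
        "content": "Classify the sentiment of the following tweet as "
                   "Positive, Negative, Neutral, or Mixed:\n" + input_sample,
    }]

def post_process(response):
    # Extract the predicted label from a (hypothetical) provider response
    # and normalize it; return None for unparsable outputs.
    label = response["choices"][0]["message"]["content"].strip().capitalize()
    return label if label in {"Positive", "Negative", "Neutral", "Mixed"} else None
```

The framework then feeds each dataset sample through prompt, sends the result to the configured model provider, and scores the output of post_process against the gold labels.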

Get the Benchmark Data

In addition to supporting users in implementing their own LLM evaluation and benchmarking experiments, the framework comes equipped with benchmarking assets for a large variety of datasets and NLP tasks. To benchmark models on the same datasets, the framework automatically downloads them when possible. They can also be downloaded manually (for example, to explore the data before running any assets) as follows:

python -m llmebench data download <DatasetName>

Voilà! You are all set to start evaluating...

Note: Some datasets and their associated assets are implemented in LLMeBench, but the dataset files cannot be re-distributed; it is the responsibility of the framework user to acquire them from their original sources. The metadata for each dataset includes a link to its primary page, which can be used to obtain the data. The data should be downloaded and placed in a folder under data/<DatasetName>, where <DatasetName> matches the dataset's implementation under llmebench.datasets. For instance, the ADIDataset should have its data under data/ADI/.

Disclaimer: The datasets associated with the current version of LLMeBench are either existing datasets or processed versions thereof. We refer users to the original license accompanying each dataset, as provided in the metadata of each dataset script. It is our understanding that these licenses allow for dataset use and redistribution for research or non-commercial purposes.

Usage

To run the benchmark,

python -m llmebench --filter '*benchmarking_asset*' --limit <k> --n_shots <n> --ignore_cache <benchmark-dir> <results-dir>

Parameters

Outputs Format

<results-dir>: This folder will contain the outputs resulting from running assets. It follows this structure:
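As a hypothetical illustration (exact file names and layout may differ across versions), the output for a single asset run might look like:

```
<results-dir>/
└── ArSAS_Random/            # one sub-directory per asset run
    ├── summary.jsonl        # one line per sample: input, raw response, prediction
    └── results.json         # final metric(s) for the run
```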

jq is a helpful command-line utility for analyzing the resulting JSON files. The simplest usage is jq . summary.jsonl, which pretty-prints all samples and model responses in a readable form.
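The file is also easy to analyze programmatically. A minimal Python sketch, assuming each line of summary.jsonl is a JSON object carrying (hypothetical) ground_truth and prediction keys:

```python
import json

def accuracy_from_summary(path):
    """Compute simple accuracy over a summary.jsonl-style file, assuming
    each line carries (hypothetical) 'ground_truth' and 'prediction' keys."""
    total = correct = 0
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            total += 1
            if record.get("prediction") == record.get("ground_truth"):
                correct += 1
    return correct / total if total else 0.0
```

This is a sketch of the idea only; the real files report the framework's configured metrics in results output, so a script like this is mainly useful for ad-hoc inspection.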

Caching

The framework caches all model responses (unless --ignore_cache is passed). This allows interrupted or partial runs to resume without re-querying the model for already-processed samples, and post-processing to be fixed and re-run without incurring new model calls.
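The mechanism can be pictured as a response cache keyed by a hash of the request. The following is a rough, framework-independent sketch, not LLMeBench's actual implementation:

```python
import hashlib
import json

class ResponseCache:
    """Toy response cache keyed by a hash of the request payload.
    Illustrative only; the framework's real cache layout may differ."""

    def __init__(self):
        self._store = {}  # key -> cached model response

    def _key(self, request):
        # Stable hash of the request so identical prompts hit the cache.
        blob = json.dumps(request, sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()

    def get_or_call(self, request, call_model, ignore_cache=False):
        key = self._key(request)
        if not ignore_cache and key in self._store:
            return self._store[key]  # cache hit: no model call needed
        response = call_model(request)
        self._store[key] = response
        return response
```

Because the key depends only on the request, re-running a benchmark replays cached responses for already-seen samples and only calls the model for new ones.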

Running Few Shot Assets

The framework has preliminary support for automatically selecting n examples per test sample using a maximal marginal relevance (MMR)-based approach (via langchain's implementation). This will be expanded in the future with more few-shot example selection mechanisms (e.g., random or class-based selection).
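MMR greedily trades off relevance to the test sample against redundancy among the examples already picked. A self-contained sketch of the idea (LLMeBench delegates this to langchain; the lam weight and the embedding vectors here are placeholders):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors (plain lists).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mmr_select(query_vec, example_vecs, n, lam=0.5):
    """Greedy maximal marginal relevance: pick n example indices that are
    similar to the query but dissimilar to each other. lam balances
    relevance (lam=1) against diversity (lam=0)."""
    selected = []
    remaining = list(range(len(example_vecs)))
    while remaining and len(selected) < n:
        def score(i):
            relevance = cosine(query_vec, example_vecs[i])
            redundancy = max((cosine(example_vecs[i], example_vecs[j])
                              for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With lam close to 1 the selector behaves like plain nearest-neighbor retrieval; lowering lam penalizes picking two near-duplicate examples for the same prompt.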

To run few-shot assets, supply the --n_shots <n> option to the benchmarking script. It defaults to 0, which runs only zero-shot assets. If --n_shots is greater than zero, only few-shot assets are run.

Tutorial

The tutorials directory provides tutorials on the following: updating an existing asset, advanced usage commands for different benchmarking use cases, and extending the framework with at least one of its components (e.g., a task, dataset, or model).

Citation

Please cite our papers when referring to this framework:

@inproceedings{abdelali-2024-larabench,
  title = "{{LAraBench}: Benchmarking Arabic AI with Large Language Models}",
  author = {Ahmed Abdelali and Hamdy Mubarak and Shammur Absar Chowdhury and Maram Hasanain and Basel Mousi and Sabri Boughorbel and Samir Abdaljalil and Yassine El Kheir and Daniel Izham and Fahim Dalvi and Majd Hawasly and Nizi Nazar and Yousseif Elshahawy and Ahmed Ali and Nadir Durrani and Natasa Milic-Frayling and Firoj Alam},
  booktitle = {Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers},
  month = mar,
  year = {2024},
  address = {Malta},
  publisher = {Association for Computational Linguistics},
}

@inproceedings{dalvi2023llmebench,
  title = {{LLMeBench}: A Flexible Framework for Accelerating LLMs Benchmarking},
  author = {Fahim Dalvi and Maram Hasanain and Sabri Boughorbel and Basel Mousi and Samir Abdaljalil and Nizi Nazar and Ahmed Abdelali and Shammur Absar Chowdhury and Hamdy Mubarak and Ahmed Ali and Majd Hawasly and Nadir Durrani and Firoj Alam},
  booktitle = {Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations},
  month = mar,
  year = {2024},
  address = {Malta},
  publisher = {Association for Computational Linguistics},
}