<div align="center"><h1>MT-Bench-101</h1></div>

<p align="center">[ACL 2024] MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues</p>

<p align="center">📃 <a href="https://arxiv.org/pdf/2402.14762" target="_blank">Paper</a> • 🏆 <a href="" target="_blank">Leaderboard (WIP)</a></p>

💥 What's New

About MT-Bench-101

MT-Bench-101 is specifically designed to evaluate the fine-grained abilities of LLMs in multi-turn dialogues. By conducting a detailed analysis of real multi-turn dialogue data, we construct a three-tier hierarchical ability taxonomy comprising 4208 turns across 1388 multi-turn dialogues in 13 distinct tasks.

Installation

We integrated our MT-Bench-101 benchmark into OpenCompass via a pull request; the instructions below use our fork of OpenCompass. OpenCompass is a comprehensive platform for large-model evaluation that provides a unified interface for a wide range of tasks and is easy to use.


Create virtual env

Create a virtual environment for OpenCompass (see the OpenCompass documentation if you run into problems), then clone the OpenCompass code:

```bash
conda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
conda activate opencompass
git clone https://github.com/sefira/opencompass opencompass
cd opencompass
pip install -e .
```
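As an optional sanity check, you can confirm OpenCompass is importable from the new environment:

```python
# Sanity check: OpenCompass should be importable after 'pip install -e .'
import opencompass

print("OpenCompass loaded from:", opencompass.__file__)
```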

Data Preparation

Our data is stored at the following path in this repo.

```bash
# Dataset file under this repo
data/subjective/mtbench101.jsonl
```

Copy the data file from this repo to the same path inside your OpenCompass checkout.

```bash
# Download this repo and copy the dataset into the OpenCompass folder
# After 'cd opencompass'
mkdir -p data/subjective/
cp $PATH_THIS_REPO/data/subjective/mtbench101.jsonl data/subjective/
```
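To confirm the file landed in the right place and parses cleanly, a quick check like the one below works; it only counts records and lists the fields of the first one, so it assumes nothing about the data beyond one JSON object per line.

```python
# Quick sanity check on the copied dataset (run from the opencompass root).
import json

path = "data/subjective/mtbench101.jsonl"
with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"{len(records)} records in {path}")
print("fields of first record:", sorted(records[0].keys()))
```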

Evaluation

```bash
# run the evaluation
python run.py configs/eval_subjective_mtbench101.py
# run in debug mode for more verbose output
python run.py configs/eval_subjective_mtbench101.py --debug
```
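Which models are evaluated, and which judge model grades the dialogues, is controlled by `configs/eval_subjective_mtbench101.py`. To evaluate your own model, the usual OpenCompass pattern is to add an entry to the `models` list in that config. The sketch below follows common OpenCompass conventions but is only illustrative; the model name, paths, and field values are assumptions, so mirror the entries already present in the shipped config.

```python
# Illustrative sketch, not the shipped config: one extra entry for the
# 'models' list in configs/eval_subjective_mtbench101.py. All names and
# values below are placeholders following common OpenCompass conventions.
from opencompass.models import HuggingFaceCausalLM

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr="my-chat-model",                  # hypothetical display name
        path="my-org/my-chat-model",           # hypothetical HF model id
        tokenizer_path="my-org/my-chat-model",
        max_out_len=1024,
        max_seq_len=4096,
        batch_size=8,
        run_cfg=dict(num_gpus=1, num_procs=1),
    ),
]
```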

Leaderboard

(Leaderboard results image; see the paper for the full results.)

Citation

If you find our work helpful, feel free to cite it:

```bibtex
@article{bai2024mt,
  title={MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues},
  author={Bai, Ge and Liu, Jie and Bu, Xingyuan and He, Yancheng and Liu, Jiaheng and Zhou, Zhanhui and Lin, Zhuoran and Su, Wenbo and Ge, Tiezheng and Zheng, Bo and others},
  journal={arXiv preprint arXiv:2402.14762},
  year={2024}
}
```