MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models

<div align="center"><img src="https://img.shields.io/badge/Data%20License-MIT-blue" alt=""> <img src="https://img.shields.io/badge/Code%20License-MIT-green" alt=""> <img src="https://img.shields.io/badge/python-3.10+-red" alt=""> <br> <a href="https://arxiv.org/abs/2401.16745"> <strong>πŸ“ƒ Paper</strong> </a> β€’ <a href="https://huggingface.co/datasets/wckwan/MT-Eval"> <strong>πŸ€— Dataset</strong> </a></div>
<span id="content"> </span>

πŸ“š Content

- πŸ“˜ 1. Introduction
- πŸ“Š 2. Benchmark Statistics
- πŸ† 3. Leaderboard
- πŸ› οΈ 4. Setup
- πŸ—‚οΈ 5. Data
- 🧠 6. Inference
- πŸ§ͺ 7. Ablation Study
- πŸ“ˆ 8. Evaluation
- πŸ“„ Citation

<span id="introduction"> </span>

πŸ“˜ 1. Introduction [Back to Top]

Large language models (LLMs) are increasingly relied upon for complex multi-turn conversations across diverse real-world applications. However, existing benchmarks predominantly focus on single-turn evaluations, overlooking the models' capabilities in multi-turn interactions. To address this gap, we introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities. By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up. We construct multi-turn queries for each category either by augmenting existing datasets or by creating new examples with GPT-4 to avoid data leakage. To study the factors impacting multi-turn abilities, we create single-turn versions of the 1170 multi-turn queries and compare performance. Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks. We observe significant performance degradation in multi-turn settings compared to single-turn settings in most models, which is not correlated with the models' fundamental capabilities. Moreover, we identify the distance to relevant content and susceptibility to error propagation as the key factors influencing multi-turn performance.

<div align="center"><img src="figures/main_figure.svg" style="text-align:left;" alt="Overview of MT-Eval" width="80%"> <br><figcaption style="text-align:left;">Illustration of the four dialogue tasks in MT-Eval: Recollection, Expansion, Refinement, and Follow-up. Recollection accesses the model’s ability to recall information from previous conversations. Expansion evaluates the model’s capacity to address queries surrounding the same topic. Refinement gauges the model’s adherence to progressively complex instructions. Follow-up examines the model’s proficiency in responding to queries that build upon its preceding response. A more detailed description of these tasks can be found in Section 3 of the paper.</figcaption></div>
<span id="statistics"> </span>

πŸ“Š 2. Benchmark Statistics [Back to Top]

| Statistics | Recollection | Expansion | Refinement | Follow-up | All |
|---|---|---|---|---|---|
| Avg. # Turns per Dialogue | 10 | 7.00 | 12.00 | 3.00 | 6.96 |
| Avg. # Words in Prompt $\dagger$ | 693.09 | 539.60 | 882.85 | 686.82 | 760.41 |
| Max. # Words in Prompt $\dagger$ | 2331 | 838 | 2574 | 1932 | 2574 |
| Avg. # Words in Response $\dagger$ | 72.07 | 24.41 | 78.50 | 205.88 | 99.31 |
| Max. # Words in Response $\dagger$ | 289 | 107 | 430 | 444 | 444 |
| Avg. # Words per Turn | 54.49 | 156.77 | 65.89 | 31.78 | 60.63 |
| Max. # Words per Turn | 330 | 474 | 449 | 262 | 474 |
| Total # Dialogues | 38 | 10 | 40 | 80 | 168 |
| Total # Turns | 380 | 70 | 480 | 240 | 1170 |

$\dagger$: Estimated using GPT-4 responses.


<span id="leaderboard"> </span>

πŸ† 3. Leaderboard [Back to Top]

| Model | Avg. | Recollection | Expansion | Refinement | Follow-up |
|---|---|---|---|---|---|
| GPT-3.5-Turbo | 7.72 | 6.90 | 7.87 | 6.92 | 9.21 |
| GPT-4 | 9.03 | 9.61 | 9.07 | 7.85 | 9.60 |
| ChatGLM3-6B | 5.49 | 2.92 | 5.90 | 4.73 | 8.39 |
| Vicuna-7B-v1.5 | 6.44 | 5.45 | 6.70 | 5.31 | 8.31 |
| Vicuna-13B-v1.5 | 7.01 | 6.27 | 6.70 | 6.37 | 8.68 |
| Llama-2-chat-7B | 6.11 | 3.86 | 5.87 | 6.20 | 8.53 |
| Llama-2-chat-13B | 6.31 | 3.66 | 6.37 | 6.37 | 8.82 |
| Qwen-chat-7B | 6.55 | 5.25 | 7.02 | 5.47 | 8.49 |
| Qwen-chat-14B | 7.26 | 6.21 | 7.58 | 6.11 | 9.12 |
| Mistral-Instruct-7B | 7.46 | 7.22 | 6.98 | 6.58 | 9.05 |
| Mixtral-Instruct-8x7B | 7.47 | 6.17 | 7.42 | 6.77 | 9.52 |

<span id="setup"> </span>

πŸ› οΈ 4. Setup [Back to Top]

Execute the following command to create the conda environment for inference and evaluation. The environment installs PyTorch 1.13.1 with CUDA 11.6. If your system requires a different CUDA version, adjust the "pytorch-cuda=11.6" entry in the environment.yml file accordingly.

conda env create --file environment.yml

For enhanced performance, we recommend installing Flash-Attention. This step is not mandatory but can improve processing speed.

pip install flash-attn --no-build-isolation
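
After setting up the environment, a quick sanity check (optional, not part of the repository) can confirm that PyTorch sees your GPU and that flash-attn, if installed, is importable:

import torch

print("PyTorch:", torch.__version__)            # expected: 1.13.1
print("CUDA runtime:", torch.version.cuda)      # expected: 11.6 unless adjusted
print("GPU available:", torch.cuda.is_available())

try:
    import flash_attn  # present only if the optional install step was run
    print("flash-attn import OK")
except ImportError:
    print("flash-attn not installed (optional)")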

<span id="data"> </span>

πŸ—‚οΈ 5. Data [Back to Top]

<span id="load_data"> </span>

5.1. Load Data

Data can be loaded from Hugging Face as demonstrated by the following Python code:

from datasets import load_dataset

tasks = [
  "refinement_single",
  "refinement_multi",
  "expansion_single",
  "expansion_multi",
  "follow-up_single",
  "follow-up_multi",
  "recollection_single_cls",
  "recollection_multi_cls",
  "recollection_single_global-inst",
  "recollection_multi_global-inst",
]

for task in tasks:
    data = load_dataset('wckwan/MT-Eval', task, split='test')

Task Descriptions:

- refinement_single / refinement_multi: single-turn and multi-turn versions of the Refinement task (following progressively more complex instructions).
- expansion_single / expansion_multi: single-turn and multi-turn versions of the Expansion task (addressing queries surrounding the same topic).
- follow-up_single / follow-up_multi: single-turn and multi-turn versions of the Follow-up task (responding to queries that build on the preceding response).
- recollection_single_cls / recollection_multi_cls: single-turn and multi-turn versions of the Recollection task based on document classification.
- recollection_single_global-inst / recollection_multi_global-inst: single-turn and multi-turn versions of the Recollection task based on global instruction following.

data is a list of dialogue instances. Each dialogue instance follows this format:

{
    "conv" : [
        {
            "user": "<str: User utterance>",
            "sys": "<str: System response>",
            "id": "<str: Turn ID>", 
            "inst": "<str: Instruction in user utterance>",
            "do_inference": "<bool: Indicate if inference is required>",
        },
        {
          ...
        },
    ],
    "id": "<str: Dialogue ID>", 
}
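
As a minimal sketch of how this format can be consumed (assuming only the fields shown above), the loop below walks each dialogue and generates responses only for turns flagged with do_inference; generate_response is a hypothetical placeholder for your own model call, and turns not requiring inference reuse the provided response.

from datasets import load_dataset


def generate_response(history, user_utterance):
    # Hypothetical stand-in for an actual model call conditioned on history.
    return "<model response>"


data = load_dataset("wckwan/MT-Eval", "refinement_multi", split="test")

for dialogue in data:
    history = []
    for turn in dialogue["conv"]:
        if turn["do_inference"]:
            response = generate_response(history, turn["user"])
        else:
            # Turns that do not require inference keep the provided response.
            response = turn["sys"]
        history.append({"user": turn["user"], "sys": response})
    print(dialogue["id"], "-", len(history), "turns processed")
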
<span id="data_creation"> </span>

5.2. Data Creation

The full data is available on Hugging Face as described in the previous section. The process used to construct it is outlined below.

The raw data and prompts used for generating the dataset are organized as follows:

raw_data/
β”œβ”€β”€ documents.jsonl                # The 100 documents used in various tasks.
β”œβ”€β”€ global_inst.jsonl              # Instructions subset from IFEval and queries. 
β”œβ”€β”€ mt-bench_extended.jsonl        # Extended MT-Bench with three extra turns.
β”œβ”€β”€ refinement_multi_inst.jsonl    # Instructions for the multi-turn refinement task. 
└── refinement_single_inst.jsonl   # Instructions for the single-turn refinement task.

prompts/
β”œβ”€β”€ construct_sum.txt              # Generates document summary. 
β”œβ”€β”€ construct_ner_pos.txt          # Generates named-entity recognition or part-of-speech queries.
β”œβ”€β”€ construct_qa.txt               # Generates question and answer pairs. 
β”œβ”€β”€ construct_rel.txt              # Generates relations.
β”œβ”€β”€ construct_translation.txt      # Generates translation queries and answers. 
β”œβ”€β”€ construct_mt_bench.txt         # Generates additional turns for MT-Bench.
β”œβ”€β”€ construct_paragraph.txt        # Generates documents. 
...

To generate the dataset, run the following script:

python create_data.py

<span id="inference"> </span>

🧠 6. Inference [Back to Top]

<span id="open_source_inference"> </span>

6.1. Open-source Model Setup

For inference with open-source models, configure the settings in utils/misc.py as follows:

config = {
  "<model_alias>": {
    "path": <str: HuggingFace model name or local path>,
    "max_context_len": <int: Maximum context length>,
    "chat_template": <Conversation: Chat prompt from FastChat library>
    "use_flash_attn": <bool: Support for flash attention>
    "end_tokens": <list of str: Additional end tokens to cut off>
  },
  ...
}

Settings for models used in our paper (vicuna-7b, vicuna-13b, llama2-chat-7b, llama2-chat-13b, qwen-chat-7b, qwen-chat-14b, chatglm3-6b, mixtral-instruct-v0.1, mistral-instruct-v0.2) are already specified.
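
For illustration, a hypothetical entry for a Vicuna-style model might look like the sketch below; the alias, path, and values are placeholders, and the exact conversation template name depends on your installed FastChat version.

from fastchat.conversation import get_conv_template

config = {
    "vicuna-7b": {
        "path": "lmsys/vicuna-7b-v1.5",          # HuggingFace model name or local path
        "max_context_len": 4096,                 # maximum context length in tokens
        "chat_template": get_conv_template("vicuna_v1.1"),  # FastChat conversation template
        "use_flash_attn": True,                  # set to False if flash-attn is not installed
        "end_tokens": ["</s>"],                  # additional end tokens to cut off
    },
}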

<span id="openai_inference"> </span>

6.2. OpenAI Model Setup

For inference with OpenAI models, add your API keys to utils/api_keys.json:

[
  {
    "key": "<key1>"
  },
  {
    "key": "<key2>"
  },
  ...
]
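
For illustration only, a minimal sketch of how such a key file could be read and rotated on the client side is shown below; this is not the repository's own utility, just one way to cycle through multiple keys.

import itertools
import json

# Read all keys from the file and rotate through them round-robin,
# e.g. to spread requests across several accounts or rate limits.
with open("utils/api_keys.json") as f:
    keys = [entry["key"] for entry in json.load(f)]

key_cycle = itertools.cycle(keys)
current_key = next(key_cycle)  # pick the next key before each request
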
<span id="inference_script"> </span>

6.3. Inference Script

Run the script below to perform inference on tasks from the main experiments:

for task in "refinement_multi" "expansion_multi" "follow-up_multi" "recollection_multi_cls" "recollection_multi_global-inst"
do
  python inference.py \
  --model_name <model_alias>  \
  --task ${task}
done

Arguments:

- model_name: alias of the model to run (see the model setup above).
- task: name of the task to run (see the task list in the Data section).
- use_gold_history: optional flag to condition on the human-verified GPT-4 responses instead of the model's own outputs (used in the ablation studies below).

Inference results are saved in inference_outputs/.


<span id="ablation"> </span>

πŸ§ͺ 7. Ablation Study [Back to Top]

<span id="ablation_single_turn"> </span>

7.1. Single-Turn Setting

Run the script below to evaluate the model in a single-turn setting across four dialogue tasks:

for task in "refinement_single" "expansion_single" "follow-up_single" "recollection_single_cls" "recollection_single_global-inst"
do
  python inference.py \
  --model_name <model_alias>  \
  --task ${task}
done

For more details on the inference script, refer to the Inference section.

<span id="ablation_gold_context"> </span>

7.2. Gold Context Setting

To perform inference using human-verified GPT-4 outputs as the dialogue history, run the following script:

for task in "refinement_multi" "expansion_multi" "follow-up_multi" "recollection_multi_cls" "recollection_multi_global-inst"
do
  python inference.py \
  --model_name <model_alias>  \
  --use_gold_history \
  --task ${task}
done
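
Conceptually, the gold context setting replaces the model's own earlier replies with the human-verified GPT-4 responses stored in the dataset's sys fields when assembling the dialogue history. The sketch below only illustrates the idea; it is not the logic of inference.py.

def build_history(turns_so_far, model_responses, use_gold_history):
    # turns_so_far: earlier turns from a dialogue's "conv" list, whose "sys"
    # fields hold the human-verified GPT-4 responses.
    # model_responses: the model's own replies to those earlier turns.
    history = []
    for turn, own_reply in zip(turns_so_far, model_responses):
        # Gold context: condition on the verified response; otherwise
        # condition on what the model itself generated (error propagation).
        reply = turn["sys"] if use_gold_history else own_reply
        history.append({"user": turn["user"], "sys": reply})
    return history
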
<span id="ablation_cls"> </span>

7.3. Classification with Various Contexts

<span id="ablation_cls"> </span>

Run the script below to perform document classification under four different context settings:

for task in "cls_ablation_gold" "cls_ablation_dgc" "cls_ablation_sgc" "cls_ablation_rc"
do
  python inference.py \
  --model_name <model_alias>  \
  --use_gold_history \
  --task ${task}
done

<span id="ablation_irrelevant"> </span>

7.4. Irrelevant Context

Run the following script to perform inference on the refinement task with irrelevant turns inserted:

for task in "refinement_ablation_irrelevant-front" "refinement_ablation_irrelevant-between" 
  do
  python inference.py \
  --model_name <model_alias>  \
  --task ${task} \
  done
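
To make the two settings concrete, the sketch below shows one plausible reading of the task names: "irrelevant-front" places distractor turns before the refinement turns, while "irrelevant-between" interleaves them. The helper is a hypothetical illustration, not the dataset construction code.

def insert_irrelevant(relevant_turns, irrelevant_turns, position):
    if position == "front":
        # irrelevant-front: all distractor turns precede the refinement turns.
        return irrelevant_turns + relevant_turns
    if position == "between":
        # irrelevant-between: distractor turns interleaved with refinement
        # turns, increasing the distance to the relevant content.
        mixed = []
        for i, turn in enumerate(relevant_turns):
            mixed.append(turn)
            if i < len(irrelevant_turns):
                mixed.append(irrelevant_turns[i])
        return mixed
    raise ValueError(f"unknown position: {position}")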

This ablation study requires the GPT-4 evaluation outlined below.


<span id="evaluation"> </span>

πŸ“ˆ 8. Evaluation [Back to Top]

<span id="gpt4_evaluation"> </span>

8.1. GPT-4 Evaluation

To use GPT-4 for evaluating responses, first add one or more API keys to utils/api_keys.json. Then execute the script below:

python evaluation.py \
  --model_name <model_alias> \
  --task_names [<task A>, <task B>]

Arguments:

- model_name: alias of the model whose inference outputs should be evaluated.
- task_names: one or more task names whose outputs should be scored.

Evaluation results will be stored in evaluation_outputs/.
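
For reference, a stripped-down sketch of a GPT-4 judging call is shown below. It only illustrates the general pattern (a judging prompt, deterministic decoding, one score per response) and is not the prompt or rubric used by evaluation.py.

from openai import OpenAI

client = OpenAI(api_key="<key1>")  # or load a key from utils/api_keys.json

judge_prompt = (
    "Rate the assistant's final response on a scale of 1 to 10, considering "
    "correctness, helpfulness, and adherence to the instructions.\n\n"
    "<dialogue history and response to be judged go here>"
)

completion = client.chat.completions.create(
    model="gpt-4",
    temperature=0,  # deterministic judging
    messages=[{"role": "user", "content": judge_prompt}],
)
print(completion.choices[0].message.content)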

<span id="calculate_score"> </span>

8.2. Calculate Score

To calculate scores for the tasks, use the following command:

python calculate_score.py

Scores for various tasks and settings will be saved in results/result.md.


<span id="citation"> </span>

πŸ“„ Citation

If you find our paper and resources useful, please consider citing our work:

@misc{kwan2024mteval,
      title={MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models}, 
      author={Wai-Chung Kwan and Xingshan Zeng and Yuxin Jiang and Yufei Wang and Liangyou Li and Lifeng Shang and Xin Jiang and Qun Liu and Kam-Fai Wong},
      year={2024},
      eprint={2401.16745},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}