<div align="center">

Benchmarking LLMs via Uncertainty Quantification

Question Answering RC CI DRS DS
Llama-2 Mistral Falcon MPT Yi Qwen DeepSeek InternLM

📰 Paper, :card_file_box: Datasets

</div>

1. Introduction

The proliferation of open-source Large Language Models (LLMs) from various institutions has highlighted the urgent need for comprehensive evaluation methods. However, current evaluation platforms, such as the widely recognized HuggingFace open LLM leaderboard, neglect a crucial aspect -- uncertainty, which is vital for thoroughly assessing LLMs.

<p align="center"> <img src="images/intro_exp.jpg" width="45%" /> <p align="center">Two LLMs can achieve the same accuracy score but demonstrate different levels of uncertainty.</p> </p>

To bridge this gap, we introduce a new benchmarking approach for LLMs that integrates uncertainty quantification. Our examination involves eight LLMs (LLM series) spanning five representative natural language processing tasks. Additionally, we introduce an uncertainty-aware evaluation metric, UAcc, which takes into account both prediction accuracy and prediction uncertainty. Our findings reveal that:

By taking uncertainty into account, our new UAcc metric can either amplify or diminish the relative improvement of one LLM over another and may even change the relative ranking of two LLMs, thus underscoring the significance of incorporating uncertainty in the evaluation of LLMs.

2. Uncertainty Quantification

We propose the utilization of conformal prediction for uncertainty quantification in LLMs. Compared to other methods, conformal prediction offers multiple advantages including ease of implementation, high efficiency, and a statistically rigorous estimation of uncertainty rather than a heuristic approximation.

<p align="center"> <img src="images/diagram.png" width="90%" /> <p align="center">The overall process of applying conformal prediction for uncertainty quantification in LLMs.</p> </p>
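As a rough illustration of the process in the figure, the sketch below implements split conformal prediction with one common conformal score: one minus the probability assigned to the true option. The score function, the names `calibrate_threshold` and `prediction_set`, and the toy numbers are assumptions made for illustration; they are not taken from this repository's code.

```python
import math

def calibrate_threshold(cal_probs, cal_labels, alpha):
    """Return a conformal threshold from calibration data (hypothetical helper).

    cal_probs: list of per-option probability lists.
    cal_labels: index of the true option for each calibration instance.
    """
    # Conformal score: 1 - probability assigned to the true option.
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: keep every option.
        return float("inf")
    return scores[rank - 1]

def prediction_set(probs, threshold):
    """Keep every option whose score 1 - p does not exceed the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]

# Toy calibration data: 10 instances with 3 options each (made-up numbers).
cal_probs = [
    [0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.3, 0.1, 0.6],
    [0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.2, 0.5, 0.3], [0.4, 0.35, 0.25],
    [0.5, 0.3, 0.2], [0.6, 0.2, 0.2],
]
cal_labels = [0, 1, 0, 2, 0, 0, 1, 0, 1, 2]

threshold = calibrate_threshold(cal_probs, cal_labels, alpha=0.2)
test_set = prediction_set([0.5, 0.35, 0.15], threshold)
print(threshold, test_set)
```

When calibration and test instances are exchangeable, sets built this way contain the true answer with probability at least 1 - alpha, and a larger set signals higher uncertainty.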

3. Evaluation Tasks and Datasets

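To evaluate LLMs comprehensively, we consider five typical NLP tasks: question answering (QA), reading comprehension (RC), commonsense inference (CI), dialogue response selection (DRS), and document summarization (DS). We prepare a dataset with 10,000 instances for each task.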

We formulate each task as a multiple-choice question answering (MCQA) task, where the objective is to select the single correct answer from the available options.

4. Evaluation Results

Pretrained Base LLMs

We first compare the performance of various LLMs in terms of prediction accuracy (Acc), which measures the proportion of test instances whose true label has the highest predicted probability.
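The definition of Acc can be sketched in a few lines. This is a minimal illustration and not the repository's implementation: the function name `accuracy` and the toy probabilities are assumptions.

```python
def accuracy(probs, labels):
    """Percentage of instances whose true option has the highest probability."""
    correct = 0
    for p, y in zip(probs, labels):
        # Index of the option with the largest predicted probability.
        predicted = max(range(len(p)), key=lambda k: p[k])
        correct += int(predicted == y)
    return 100.0 * correct / len(labels)

# Four toy instances with three options each (made-up numbers).
toy_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.4, 0.35, 0.25],
    [0.1, 0.2, 0.7],
]
toy_labels = [0, 2, 0, 2]

acc = accuracy(toy_probs, toy_labels)
print(acc)  # 75.0: three of the four top-ranked options match the label
```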

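| LLMs | QA | RC | CI | DRS | DS | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| Yi-34B | 71.24 | 94.48 | 93.98 | 76.12 | 71.47 | 81.46 |
| Qwen-72B | 72.53 | 91.86 | 88.09 | 77.13 | 60.63 | 78.05 |
| Qwen-14B | 64.25 | 91.52 | 91.00 | 73.90 | 49.33 | 74.00 |
| Llama-2-70B | 65.86 | 89.30 | 81.62 | 67.03 | 57.41 | 72.24 |
| DeepSeek-67B | 67.97 | 88.66 | 70.82 | 74.67 | 56.20 | 71.66 |
| Yi-6B | 57.57 | 85.99 | 76.50 | 58.72 | 66.06 | 68.97 |
| Mistral-7B | 60.44 | 81.94 | 62.93 | 53.21 | 62.16 | 64.14 |
| Llama-2-13B | 52.52 | 77.23 | 59.66 | 52.65 | 60.05 | 60.42 |
| Qwen-7B | 55.21 | 83.89 | 63.70 | 64.04 | 32.53 | 59.87 |
| InternLM-7B | 48.37 | 73.86 | 46.21 | 43.72 | 34.38 | 49.31 |
| Llama-2-7B | 45.60 | 65.79 | 43.05 | 32.61 | 45.60 | 46.53 |
| DeepSeek-7B | 45.65 | 65.39 | 42.66 | 33.50 | 42.15 | 45.87 |
| Qwen-1.8B | 44.78 | 64.14 | 36.53 | 35.48 | 30.77 | 42.34 |
| Falcon-40B | 40.16 | 48.11 | 25.98 | 27.25 | 31.01 | 34.50 |
| MPT-7B | 29.49 | 31.69 | 25.50 | 24.38 | 24.86 | 27.18 |
| Falcon-7B | 23.75 | 24.98 | 24.91 | 25.86 | 24.69 | 24.84 |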

We then compare the performance of various LLMs in terms of prediction uncertainty, which is measured as the average size of prediction sets of all test instances (SS). Note that a larger set size indicates higher uncertainty.
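To make the SS metric concrete, the sketch below averages prediction-set sizes for two hypothetical models. The function name `average_set_size` and the example sets are assumptions chosen for illustration only.

```python
def average_set_size(prediction_sets):
    """Mean number of options per prediction set (the SS metric)."""
    return sum(len(s) for s in prediction_sets) / len(prediction_sets)

# Prediction sets of two hypothetical models on the same four test instances.
sets_model_a = [["A"], ["B"], ["A", "C"], ["D"]]
sets_model_b = [["A", "B", "C"], ["B", "D"], ["A", "C", "E"], ["D", "F"]]

ss_a = average_set_size(sets_model_a)
ss_b = average_set_size(sets_model_b)
print(ss_a, ss_b)  # 1.25 2.5: model B is the more uncertain of the two
```

Both toy models include the same top answers, so they could share an accuracy score while differing clearly in SS.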

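| LLMs | QA | RC | CI | DRS | DS | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| Yi-34B | 2.60 | 1.71 | 1.90 | 1.77 | 1.69 | 1.93 |
| Qwen-72B | 2.45 | 1.90 | 1.80 | 2.09 | 2.06 | 2.06 |
| Qwen-14B | 2.80 | 1.74 | 2.02 | 1.94 | 2.37 | 2.17 |
| Llama-2-70B | 2.62 | 1.78 | 1.82 | 2.34 | 2.25 | 2.16 |
| DeepSeek-67B | 2.65 | 1.54 | 2.43 | 1.89 | 2.25 | 2.15 |
| Yi-6B | 3.20 | 1.92 | 1.88 | 2.85 | 1.96 | 2.36 |
| Mistral-7B | 2.80 | 1.75 | 2.48 | 2.71 | 2.40 | 2.43 |
| Llama-2-13B | 3.06 | 2.24 | 2.72 | 2.55 | 2.24 | 2.56 |
| Qwen-7B | 3.26 | 2.15 | 2.28 | 2.51 | 2.92 | 2.63 |
| InternLM-7B | 3.49 | 2.19 | 3.28 | 3.63 | 4.47 | 3.41 |
| Llama-2-7B | 3.20 | 2.39 | 3.27 | 3.26 | 3.30 | 3.09 |
| DeepSeek-7B | 3.34 | 2.77 | 3.06 | 3.40 | 3.08 | 3.13 |
| Qwen-1.8B | 3.20 | 2.58 | 3.49 | 3.45 | 4.18 | 3.38 |
| Falcon-40B | 3.25 | 3.12 | 3.54 | 3.59 | 3.89 | 3.48 |
| MPT-7B | 3.53 | 3.46 | 3.60 | 3.59 | 3.66 | 3.57 |
| Falcon-7B | 3.90 | 3.60 | 3.66 | 3.64 | 3.92 | 3.75 |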

In addition, we propose a new evaluation metric, Uncertainty-aware Accuracy (UAcc), which takes into account both prediction accuracy and prediction uncertainty.

<p align="center"> <img src="https://latex.codecogs.com/svg.image?UAcc=\frac{Acc}{SS}\sqrt{|\mathcal{Y}|},~\mathcal{Y}~denotes~the~option~set." /> </p>

Note that UAcc can take values greater than 1.

| LLMs | QA | RC | CI | DRS | DS | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| Yi-34B | 71.10 | 163.56 | 156.37 | 108.12 | 106.31 | 121.09 |
| Qwen-72B | 80.24 | 152.50 | 146.12 | 96.04 | 74.92 | 109.96 |
| Qwen-14B | 57.83 | 157.52 | 147.13 | 97.70 | 51.22 | 102.28 |
| Llama-2-70B | 65.20 | 149.22 | 124.20 | 71.98 | 62.50 | 94.62 |
| DeepSeek-67B | 66.38 | 153.27 | 73.10 | 100.97 | 61.64 | 91.07 |
| Yi-6B | 45.18 | 132.61 | 103.41 | 50.97 | 85.49 | 83.53 |
| Mistral-7B | 54.60 | 124.71 | 62.45 | 48.18 | 64.25 | 70.84 |
| Llama-2-13B | 42.53 | 92.46 | 53.82 | 50.52 | 66.02 | 61.07 |
| Qwen-7B | 42.45 | 118.10 | 69.47 | 64.42 | 27.28 | 64.34 |
| InternLM-7B | 34.17 | 86.84 | 34.56 | 29.73 | 18.87 | 40.83 |
| Llama-2-7B | 34.97 | 67.92 | 32.25 | 24.50 | 33.91 | 38.71 |
| DeepSeek-7B | 33.63 | 58.50 | 34.23 | 24.11 | 33.52 | 36.80 |
| Qwen-1.8B | 34.36 | 61.55 | 25.59 | 25.22 | 18.04 | 32.95 |
| Falcon-40B | 30.32 | 37.86 | 17.98 | 18.60 | 19.52 | 24.85 |
| MPT-7B | 20.44 | 22.43 | 17.36 | 16.66 | 16.63 | 18.70 |
| Falcon-7B | 14.90 | 17.01 | 16.66 | 17.41 | 15.42 | 16.28 |
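The UAcc formula can be checked with a small worked example. The sketch below is illustrative only: the function name `uacc` and all input numbers are hypothetical and are not results from the tables. It shows a value above 1 and a pair of models whose ranking flips once uncertainty is taken into account.

```python
import math

def uacc(acc, ss, num_options):
    """UAcc = Acc / SS * sqrt(|Y|), with |Y| the number of options."""
    return acc / ss * math.sqrt(num_options)

# A confident, accurate model: the result exceeds 1 (made-up inputs).
high = uacc(acc=0.90, ss=1.2, num_options=6)

# Model A is slightly more accurate but far less certain than model B.
uacc_a = uacc(acc=0.72, ss=3.0, num_options=6)
uacc_b = uacc(acc=0.68, ss=2.0, num_options=6)

print(round(high, 4), round(uacc_a, 4), round(uacc_b, 4))
print("B ranks above A under UAcc:", uacc_b > uacc_a)
```

Applying the formula to the averaged Acc and SS columns need not reproduce the UAcc column exactly. For Qwen-72B on QA it gives roughly 72.5, whereas the table reports 80.24. One possible explanation, stated here as an assumption, is that the reported figures are aggregated over several settings, and a ratio of averages generally differs from an average of ratios.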

Instruction-Finetuned Chat LLMs

We adopt two methods to prepare the prompt input for instruction-finetuned LLMs. The first method aligns with the format of the instruction data (denoted as Chat-V1). This method aims to evaluate the LLM’s proficiency in adhering to instructions to accomplish tasks. The second method employs the same prompt format as the base version (denoted as Chat-V2). This method aims to assess the extent of the base LLM’s capabilities retained after instruction-finetuning.

<p align="center"> <img src="images/Llama-2.png" width="90%" /> <p align="center">Mean performance outcomes of the Llama-2 series’ pretrained base model and the instruction-finetuned chat model across five tasks.</p> </p> <details> <summary>Click to view results of the Yi series</summary> <p align="center"> <img src="images/Yi.png" width="90%" /> </p> </details> <details> <summary>Click to view results of the DeepSeek series</summary> <p align="center"> <img src="images/deepseek.png" width="90%" /> </p> </details> <details> <summary>Click to view results of the Falcon series</summary> <p align="center"> <img src="images/falcon.png" width="90%" /> </p> </details>

5. Usage

Installation

We used Python 3.10.13. Install the dependencies with:

pip install -r requirements.txt

Get Option Logits from LLMs

We prompt LLMs to obtain the logit outputs corresponding to all options (i.e., A, B, C, D, E, and F).

python generate_logits.py \
  --model={path to model directory} \
  --data_path={path to data directory} \
  --file={name of dataset} \
  --prompt_method={base/shared/task} \
  --output_dir={output directory} \
  --few_shot={1 for few-shot and 0 for zero-shot}

or

python generate_logits_chat.py \
  --model={path to model directory} \
  --data_path={path to data directory} \
  --file={name of dataset} \
  --prompt_method={base/shared/task} \
  --output_dir={output directory} \
  --few_shot={1 for few-shot and 0 for zero-shot}

for the chat version.
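Once option logits are available, they are typically turned into a probability distribution over the options with a softmax before any uncertainty analysis. The sketch below assumes the logits of one instance are held in a dictionary keyed by option letter. That storage format, the function name `softmax_over_options`, and the numbers are hypothetical, since the scripts' output format is not described here.

```python
import math

OPTIONS = ["A", "B", "C", "D", "E", "F"]

def softmax_over_options(option_logits):
    """Map raw option logits to probabilities that sum to 1."""
    values = [option_logits[o] for o in OPTIONS]
    m = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return {o: e / total for o, e in zip(OPTIONS, exps)}

# Made-up logits for a single question.
toy_logits = {"A": 2.0, "B": 0.5, "C": 0.1, "D": -1.0, "E": -0.3, "F": -2.0}
toy_option_probs = softmax_over_options(toy_logits)
top_option = max(toy_option_probs, key=toy_option_probs.get)
print(top_option, round(sum(toy_option_probs.values()), 6))
```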

Apply Conformal Prediction for Uncertainty Quantification

We split each dataset into a calibration set and a test set, and apply conformal prediction to obtain prediction sets for all test set instances.

python uncertainty_quantification_via_cp.py \
  --model={model name} \
  --raw_data_dir={path to data directory} \
  --logits_data_dir={path to directory where option logits are stored} \
  --data_names={list of datasets to be evaluated} \
  --cal_ratio={how much data to be used as the calibration data, e.g., 0.5} \
  --alpha={error rate, e.g., 0.1}

Taking Qwen-72B as an example, we have

python uncertainty_quantification_via_cp.py \
  --model=Qwen-72B \
  --raw_data_dir=data \
  --logits_data_dir=outputs_base \
  --cal_ratio=0.5 \
  --alpha=0.1

mmlu_10k_Acc: 72.53
cosmosqa_10k_Acc: 91.86 
hellaswag_10k_Acc: 88.09 
halu_dialogue_Acc: 77.13 
halu_summarization_Acc: 60.63 
Average acc: 78.05 
mmlu_10k_SS: 2.45 
cosmosqa_10k_SS: 1.90 
hellaswag_10k_SS: 1.80 
halu_dialogue_SS: 2.09 
halu_summarization_SS: 2.06 
Average SS: 2.06 
mmlu_10k_Coverage Rate: 93.43 
cosmosqa_10k_Coverage Rate: 95.79 
hellaswag_10k_Coverage Rate: 93.99 
halu_dialogue_Coverage Rate: 93.02 
halu_summarization_Coverage Rate: 90.41 
Average Coverage Rate: 93.33 
mmlu_10k_UAcc: 80.24 
cosmosqa_10k_UAcc: 152.50 
hellaswag_10k_UAcc: 146.12 
halu_dialogue_UAcc: 96.04 
halu_summarization_UAcc: 74.92 
Average UAcc: 109.96
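The coverage rate reported in this output is the share of test instances whose prediction set contains the true label. With `--alpha=0.1`, conformal prediction targets a coverage of at least 1 - alpha, i.e., 90%. The sketch below shows how such a rate could be computed and checks the reported Qwen-72B values against that target. The function name `coverage_rate` and the toy sets are assumptions; the reported rates are copied from the output above.

```python
def coverage_rate(prediction_sets, labels):
    """Percentage of instances whose prediction set contains the true label."""
    hits = sum(1 for s, y in zip(prediction_sets, labels) if y in s)
    return 100.0 * hits / len(labels)

# Toy check: three of the four sets contain the true label.
toy_sets = [["A"], ["B", "C"], ["D"], ["A", "E"]]
toy_true = ["A", "C", "B", "E"]
toy_coverage = coverage_rate(toy_sets, toy_true)  # 75.0

# Coverage rates copied from the Qwen-72B output above.
reported = {
    "mmlu_10k": 93.43,
    "cosmosqa_10k": 95.79,
    "hellaswag_10k": 93.99,
    "halu_dialogue": 93.02,
    "halu_summarization": 90.41,
}
alpha = 0.1
target = 100.0 * (1 - alpha)
meets_target = {name: rate >= target for name, rate in reported.items()}
print(toy_coverage, meets_target)
```

All five reported rates are at or above the 90% target, which is consistent with the coverage guarantee.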

6. Citation

@article{ye2024llm_uq,
  title={Benchmarking LLMs via Uncertainty Quantification},
  author={Ye, Fanghua and Yang, MingMing and Pang, Jianhui and Wang, Longyue and Wong, Derek F and Yilmaz, Emine and Shi, Shuming and Tu, Zhaopeng},
  journal={arXiv preprint arXiv:2401.12794},
  year={2024}
  }

7. Contact

If you have any questions, feel free to raise an issue or contact us at fanghua.ye.21@gmail.com.