Evaluating the Moral Beliefs Encoded in LLMs

Authors: Nino Scherrer*, Claudia Shi*, Amir Feder and David Blei

Paper: Evaluating the Moral Beliefs Encoded in LLMs (NeurIPS 2023 - Spotlight).

Dataset: https://huggingface.co/datasets/ninoscherrer/moralchoice

[Figure 1]

@inproceedings{scherrer2023evaluating,
  title={Evaluating the Moral Beliefs Encoded in LLMs},
  author={Nino Scherrer and Claudia Shi and Amir Feder and David Blei},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023},
  url={https://openreview.net/forum?id=O06z2G18me}
}

tl;dr

This paper presents a case study on the design, administration, post-processing, and evaluation of surveys on large language models (LLMs). It comprises two components:

A statistical method for eliciting beliefs encoded in LLMs. We introduce statistical measures and evaluation metrics that quantify the probability of an LLM "making a choice", the associated uncertainty, and the consistency of that choice (see the illustrative sketch below).

We apply this method to study what moral beliefs are encoded in different LLMs, especially in ambiguous cases where the right choice is not obvious. We design a large-scale survey comprising 680 high-ambiguity moral scenarios (e.g., "Should I tell a white lie?") and 687 low-ambiguity moral scenarios (e.g., "Should I stop for a pedestrian on the road?"). Each scenario includes a description, two possible actions, and auxiliary labels indicating violated rules (e.g., "do not kill"). We administer the survey to 28 open- and closed-source LLMs. We find that:
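To make the elicitation measures concrete, here is a minimal, illustrative sketch of how the choice probability, its uncertainty, and a consistency score could be estimated from repeated samples that have already been mapped to actions. This is a simplified sketch, not the exact estimators used in the paper; all function and variable names are illustrative.

```python
from collections import Counter
from math import log2

def action_likelihood(answers):
    """Empirical distribution over matched answers
    (e.g., 'action1', 'action2', 'refusal') for one scenario."""
    counts = Counter(answers)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

def entropy(dist):
    """Shannon entropy (bits) of the answer distribution --
    a simple proxy for the model's uncertainty."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def consistency(per_template_answers):
    """Fraction of question templates whose majority answer agrees
    with the overall majority answer -- a simple consistency proxy."""
    majorities = [Counter(a).most_common(1)[0][0] for a in per_template_answers.values()]
    overall = Counter(majorities).most_common(1)[0][0]
    return sum(m == overall for m in majorities) / len(majorities)

# Example: 5 already-matched samples per question template for one scenario
samples = {
    "ab":      ["action1", "action1", "action1", "action2", "action1"],
    "repeat":  ["action1", "action1", "refusal", "action1", "action1"],
    "compare": ["action1", "action2", "action1", "action1", "action1"],
}
pooled = [a for answers in samples.values() for a in answers]
print(action_likelihood(pooled))             # e.g., {'action1': 0.8, ...}
print(entropy(action_likelihood(pooled)))    # uncertainty of the pooled distribution
print(consistency(samples))                  # e.g., 1.0 if all templates agree
```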

Overview (ToC)

Install

General Requirements:

>> python -m pip install -r requirements.txt

API Keys: You must add your API keys to the corresponding files in api_keys.

Data

All available data is also shared through HuggingFace: https://huggingface.co/datasets/ninoscherrer/moralchoice
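A minimal sketch of loading the data with the Hugging Face datasets library (configuration and split names are assumptions; check the dataset card for the exact ones):

```python
from datasets import load_dataset

# NOTE: configuration and split names may differ -- check the dataset card.
# If the dataset defines multiple configurations (e.g., low-/high-ambiguity),
# pass the configuration name as the second argument to load_dataset.
dataset = load_dataset("ninoscherrer/moralchoice")
print(dataset)   # inspect available splits and columns
```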

Model Evaluations

You can run a new evaluation on supported models by running:

>> python -m src.evaluate \
    --experiment-name "moraltest" \
    --dataset "low" \
    --model "openai/gpt-4" \
    --question-types "ab" "repeat" "compare" \
    --eval-nb-samples 5

This will generate multiple pickle files, i.e., one per (scenario, model, question_type) combination, which allows interrupted experiments to be resumed efficiently.
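As a sketch of the resume logic, one can simply skip (scenario, model, question_type) combinations whose pickle file already exists. The directory layout and file naming below are hypothetical, not the exact paths written by src.evaluate:

```python
import pickle
from pathlib import Path

# Hypothetical layout: one pickle per (scenario, model, question_type).
# The actual paths produced by src.evaluate may differ.
results_dir = Path("data/responses/moraltest")

def is_done(scenario_id: str, model: str, question_type: str) -> bool:
    """Skip combinations that already have a result on disk."""
    out_file = results_dir / f"{model.replace('/', '_')}_{scenario_id}_{question_type}.pkl"
    return out_file.exists()

def load_result(path: Path):
    """Load a single stored evaluation result."""
    with path.open("rb") as f:
        return pickle.load(f)

# Example: count how many results have been collected so far
print(sum(1 for _ in results_dir.glob("*.pkl")))
```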

Finally, you can collect and aggregate all results into a single CSV file per model using:

>> python -m src.collect \
    --experiment-name "moraltest" \
    --dataset "low"
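Once collected, the per-model CSV files can be inspected with pandas. The output location below is an assumption; adjust it to wherever src.collect writes its files:

```python
import pandas as pd
from pathlib import Path

# Assumed output location -- adjust to the directory used by src.collect.
results = {
    csv_path.stem: pd.read_csv(csv_path)
    for csv_path in Path("data/responses/moraltest").glob("*.csv")
}
for model_name, df in results.items():
    print(model_name, df.shape)
```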

Visualizations

Models

Supported Models


Adding New Models

Adding More Question Templates

You can add further question templates to our evaluation procedure by:

Note:

🚧 Limitation 🚧: Semantic Matching

To map LLM outputs (i.e., sequences of tokens) to actions, we employ a rule-based matching function. This matching procedure is primarily based on common answer variations/patterns of the evaluated LLMs under the specific question templates. On average, we are able to match ~97% of the answers to action1, action2, or refusal, and classify only ~3% of the answers as invalid. However, as this matching procedure does not account for all possible answer paraphrases, some responses that contain an obvious answer may still be classified as invalid.
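For illustration, a heavily simplified sketch of what such a rule-based matcher could look like (the patterns in the actual pipeline are more extensive; everything below is illustrative):

```python
import re

def match_answer(response: str, action1: str, action2: str) -> str:
    """Map a raw model response to 'action1', 'action2', 'refusal', or 'invalid'
    using simple, order-sensitive rules. Illustrative only."""
    text = response.strip().lower()

    # Refusals (e.g., "I cannot answer", "as an AI ...")
    if re.search(r"\b(i cannot|i can't|i won't|as an ai)\b", text):
        return "refusal"

    # A/B style answers (e.g., for an 'ab' question template)
    if re.match(r"^\(?a\)?\b", text):
        return "action1"
    if re.match(r"^\(?b\)?\b", text):
        return "action2"

    # Repeated action text (e.g., for a 'repeat' question template)
    if action1.lower() in text and action2.lower() not in text:
        return "action1"
    if action2.lower() in text and action1.lower() not in text:
        return "action2"

    return "invalid"

print(match_answer("(B) I would not tell the lie.", "Tell the lie.", "Do not tell the lie."))
```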

In future work, we intend to improve the current pipeline to account for a broader set of answer variations by using a prompted LLM or a bidirectional entailment algorithm (such as the one introduced in Kuhn et al., 2023) for matching.
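A hedged sketch of what an entailment-based matcher could look like, using an off-the-shelf NLI model (the model choice, label handling, and the bidirectional criterion below are assumptions, not part of the current pipeline):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Any NLI model can be used; "roberta-large-mnli" is just one common choice.
tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def entails(premise: str, hypothesis: str) -> bool:
    """Return True if the NLI model predicts 'entailment' for (premise, hypothesis)."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted = logits.argmax(dim=-1).item()
    return model.config.id2label[predicted].upper() == "ENTAILMENT"

def bidirectional_match(response: str, action_description: str) -> bool:
    """Treat a response as matching an action only if the response and the
    action description entail each other (bidirectional entailment)."""
    return entails(response, action_description) and entails(action_description, response)
```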