Evaluating the Moral Beliefs Encoded in LLMs
Authors: Nino Scherrer*, Claudia Shi*, Amir Feder and David Blei
Paper: Evaluating the Moral Beliefs Encoded in LLMs (NeurIPS 2023 - Spotlight).
Dataset: https://huggingface.co/datasets/ninoscherrer/moralchoice
@inproceedings{scherrer2023evaluating,
  title={Evaluating the Moral Beliefs Encoded in LLMs},
  author={Nino Scherrer and Claudia Shi and Amir Feder and David Blei},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023},
  url={https://openreview.net/forum?id=O06z2G18me}
}
tl;dr
This paper presents a case study on the design, administration, post-processing, and evaluation of surveys on large language models (LLMs). It comprises two components:
1. A statistical method for eliciting beliefs encoded in LLMs. We introduce statistical measures and evaluation metrics that quantify the probability of an LLM "making a choice", the associated uncertainty, and the consistency of that choice (a minimal sketch of estimating such a choice probability follows the findings list below).
2. An application of this method to study which moral beliefs are encoded in different LLMs, especially in ambiguous cases where the right choice is not obvious. We design a large-scale survey comprising 680 high-ambiguity moral scenarios (e.g., "Should I tell a white lie?") and 687 low-ambiguity moral scenarios (e.g., "Should I stop for a pedestrian on the road?"). Each scenario includes a description, two possible actions, and auxiliary labels indicating violated rules (e.g., "do not kill"). We administer the survey to 28 open- and closed-source LLMs. We find that:
- In unambiguous scenarios, most models "choose" actions that align with commonsense. In ambiguous cases, most models express uncertainty.
- Some models are uncertain about choosing the commonsense action because their responses are sensitive to the question-wording.
- Some models reflect clear preferences in ambiguous scenarios. Specifically, OpenAI's gpt-4, Anthropic's claude-v1.1 and claude-instant-v1.1, and Google's text-bison-001 (PaLM 2) show strong agreement.
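As referenced above, here is a minimal sketch (not the paper's exact estimator) of how the probability of "making a choice" could be estimated by querying a model repeatedly across question-form variations and counting the matched actions; the function and label names are illustrative.

from collections import Counter

def choice_probabilities(matched_answers):
    # matched_answers: labels ("action1", "action2", "refusal") obtained by
    # querying a model repeatedly across question-form variations of one scenario.
    counts = Counter(matched_answers)
    total = sum(counts.values())
    return {label: counts[label] / total for label in ("action1", "action2", "refusal")}

# Hypothetical example: 15 matched answers for a single high-ambiguity scenario.
samples = ["action1"] * 9 + ["action2"] * 4 + ["refusal"] * 2
print(choice_probabilities(samples))  # {'action1': 0.6, 'action2': ~0.27, 'refusal': ~0.13}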
Overview (ToC)
- Install
- Data
- Model Evaluations
- Visualizations
- Models
- Adding New Models
- Adding More Question Templates
- Limitation: Semantic Matching
Install
General Requirements:
>> python -m pip install -r requirements.txt
API Keys: You must add your API keys to the corresponding files in api_keys.
Data
- Survey:
  - Low- and High-Ambiguity Scenarios in data/scenarios
  - Question Templates in data/question_templates
  - Response Templates in data/response_templates
- Survey Responses:
  - Responses of 28 LLMs in data/responses/paper
All available data is also shared through HuggingFace: https://huggingface.co/datasets/ninoscherrer/moralchoice
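For example, the scenarios can be pulled directly from the Hub with the datasets library; the exact configuration and split names are not spelled out here, so consult the dataset card linked above.

from datasets import load_dataset

# Load the MoralChoice data from the Hugging Face Hub. A configuration name may
# be required (e.g., scenarios vs. responses); see the dataset card for details.
moralchoice = load_dataset("ninoscherrer/moralchoice")
print(moralchoice)  # inspect the available splits and columns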
Model Evaluations
You can run a new evaluation on supported models by running:
>> python -m src.evaluate \
     --experiment-name "moraltest" \
     --dataset "low" \
     --model "openai/gpt-4" \
     --question-types "ab" "repeat" "compare" \
     --eval-nb-samples 5
This will generate multiple pickle files, i.e., one per (scenario, model, question_type) combination, which allows interrupted experiments to be resumed efficiently.
Finally, you can collect and aggregate all results into a single CSV file per model using:
>> python -m src.collect \
     --experiment-name "moraltest" \
     --dataset "low"
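Once collected, the per-model CSV can be inspected with standard tools. The path and column names below are assumptions about a hypothetical run, not the repo's documented output; adjust them to the files your run actually produces.

import pandas as pd

# Hypothetical path to the aggregated CSV produced by src.collect for one model.
results = pd.read_csv("data/responses/moraltest/openai_gpt-4.csv")
print(results.head())

# If the collected file exposes a column with the matched decision, its
# distribution can be tabulated directly (the column name is an assumption).
if "decision" in results.columns:
    print(results["decision"].value_counts(normalize=True))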
Visualizations
- Jupyter Notebooks to be uploaded
Models
Supported Models
- AI21
  - J2-Instruct (API): ai21/j2-grande-instruct, ai21/j2-jumbo-instruct
- Cohere
  - Command (API): cohere/command-xlarge, cohere/command-medium
- OpenAI
  - GPT-3 (API): openai/text-ada-001, openai/text-babbage-001, openai/text-curie-001, openai/text-davinci-001
  - GPT-3.5 (API): openai/text-davinci-002, openai/text-davinci-003, openai/gpt-3.5-turbo
  - GPT-4 (API): openai/gpt-4
- Anthropic
  - Claude (API): anthropic/claude-v1.2, anthropic/claude-v1.3, anthropic/claude-v2.0
  - Claude Instant (API): anthropic/claude-instant-v1.0, anthropic/claude-instant-v1.1
- Google
  - Flan-T5 (HF-Hub): google/flan-t5-small, google/flan-t5-base, google/flan-t5-large, google/flan-t5-xl, google/flan-t5-xxl
  - PaLM 2 (API): google/text-bison-001
- Meta
  - OPT-IML-Regular (HF-Hub): meta/opt-iml-regular-small, meta/opt-iml-regular-large
  - OPT-IML-Max (HF-Hub): meta/opt-iml-max-small, meta/opt-iml-max-large
- BigScience
  - Bloomz (HF-Hub): bigscience/bloomz-560m, bigscience/bloomz-1b1, bigscience/bloomz-1b7, bigscience/bloomz-3b, bigscience/bloomz-7b1, bigscience/bloomz-7b1-mt, bigscience/bloomz-7b1-p3
Adding New Models
- For most API-based models from supported providers, you can simply extend the MODELS dict in models.py (see the sketch below).
- Otherwise, you can integrate the required model handler in models.py.
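As an illustration only, adding an API-based model might look roughly like the following; the entry format is a guess, so mirror an existing entry in models.py rather than this sketch.

# In models.py -- hypothetical new entry; the real dict structure may differ.
MODELS = {
    # ... existing entries ...
    "openai/gpt-4-0613": {        # assumed model identifier
        "provider": "openai",     # assumed field names and values
        "model_id": "gpt-4-0613",
    },
}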
Adding More Question Templates
You can add further question templates to our evaluation procedure by:
- Adding <question_name>.json to data/question_templates with the question template
- Adding <question_name>.json to data/response_templates with common answer patterns
Note:
- To retrieve common answer patterns for a specific question template, it is beneficial to generate a set of answers and then cluster common prefixes (see the sketch below).
- The current evaluation workflow only supports binary decision settings, but the pipeline can be extended to more options if needed.
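A simple way to surface such common answer patterns is to normalize a batch of sampled answers and count short prefixes; the helper below is a sketch, not part of the repo.

from collections import Counter

def common_prefixes(answers, n_words=4):
    # Count normalized n-word prefixes across sampled answers to surface
    # candidate answer patterns for a new question/response template.
    prefixes = Counter()
    for answer in answers:
        words = answer.lower().strip().split()
        prefixes[" ".join(words[:n_words])] += 1
    return prefixes.most_common()

# Hypothetical sampled answers for a new question template.
samples = [
    "I would tell the truth.",
    "I would tell a white lie.",
    "As an AI language model, I cannot decide.",
]
for prefix, count in common_prefixes(samples):
    print(f"{count:3d}  {prefix}")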
🚧 Limitation 🚧: Semantic Matching
To map LLM outputs (i.e., sequences of tokens) to actions, we employ a rule-based matching function. This matching procedure is mainly based on common answer variations/patterns of the evaluated LLMs under the specific question templates. On average, we are able to match ~97% of the answers to action1, action2, or refusal, and classify only ~3% of the answers as invalid. However, as this matching procedure does not account for all possible answer paraphrases, some responses that are obvious matches to a human reader may still be classified as invalid.
In future work, we intend to improve the current pipeline to account for a broader set of answer variations, e.g., by using a prompted LLM or a bidirectional entailment algorithm (such as the one introduced in Kuhn et al., 2023) for matching.
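To illustrate the general shape of such a rule-based matcher, here is a simplified sketch; the patterns and labels are hypothetical, and the repo's actual matching rules are more extensive.

import re

# Hypothetical answer patterns for an A/B-style question template.
PATTERNS = {
    "action1": [r"^a\b", r"^answer\s*:?\s*a\b", r"\bi would choose a\b"],
    "action2": [r"^b\b", r"^answer\s*:?\s*b\b", r"\bi would choose b\b"],
    "refusal": [r"\bas an ai\b", r"\bi cannot (answer|decide)\b"],
}

def match_answer(text):
    # Map a raw model answer to action1 / action2 / refusal, else "invalid".
    normalized = text.strip().lower()
    for label, patterns in PATTERNS.items():
        if any(re.search(p, normalized) for p in patterns):
            return label
    return "invalid"

print(match_answer("Answer: A"))                   # -> action1
print(match_answer("As an AI, I cannot decide."))  # -> refusal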