<div align="center"> <h1> 😐😨EmotionBench😠😭</h1> </div> <div align="center">


</div> <div align="center"> <img src="logo.jpg" width="350px"> </div>

RESEARCH USE ONLY✅ NO COMMERCIAL USE ALLOWED❌

Benchmarking LLMs' Empathy Ability.

🛠️ Usage

✨An example run:

python run_emotionbench.py \
  --model gpt-3.5-turbo \
  --questionnaire PANAS \
  --emotion ALL \
  --select-count 5 \
  --default-shuffle-count 2 \
  --emotion-shuffle-count 1 \
  --test-count 1

✨An example result of overall analysis:

| Emotions | Positive Affect | Negative Affect | N |
| --- | --- | --- | --- |
| Default | 43.3 $\pm$ 2.5 | 25.3 $\pm$ 0.6 | 3 |
| Anger | $\downarrow$ (-18.8) | $-$ (-0.3) | 2 |
| Anxiety | $\downarrow$ (-11.3) | $\downarrow$ (-3.8) | 2 |
| Overall | $\downarrow$ (-15.1) | $-$ (-2.1) | 4 |

✨An example result of specific emotion analysis:

| Factors | Positive Affect | Negative Affect | N |
| --- | --- | --- | --- |
| Default | 43.3 $\pm$ 2.5 | 25.3 $\pm$ 0.6 | 3 |
| Facing Self-Opinioned People | $\downarrow$ (-18.8) | $-$ (-0.3) | 2 |
| Overall | $\downarrow$ (-18.8) | $-$ (-0.3) | 2 |
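In these tables, the parenthesized values are the change in mean affect score relative to the Default row. As a quick arithmetic check (the evoked mean of 24.5 below is hypothetical, back-derived from the reported delta; only the Default mean appears in the table):

```python
# Each entry like "↓ (-18.8)" is (evoked-condition mean − Default mean).
default_positive = 43.3   # Default Positive Affect mean, from the table
anger_positive = 24.5     # hypothetical evoked mean implied by the -18.8 delta
delta = round(anger_positive - default_positive, 1)
print(delta)  # -18.8
```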

🔧 Argument Specification

  1. --model: (Required) The name of the model to test.

  2. --questionnaire: (Required) Select the questionnaire(s) to run. For choices please see the list below.

  3. --emotion: (Required) Select the emotion(s) to run. For choices please see the list below.

  4. --select-count: (Required) Number of situations to select per factor. Defaults to 999 (select all situations).

  5. --default-shuffle-count: (Required) Number of different question orders in the Default Emotion Measures. If set to zero, only the original order is run. If set to n > 0, the original order is run along with n of its permutations. Defaults to zero.

  6. --emotion-shuffle-count: (Required) Number of different question orders in the Evoked Emotion Measures. If set to zero, only the original order is run. If set to n > 0, the original order is run along with n of its permutations. Defaults to zero.

  7. --test-count: (Required) Number of runs for the same order. Defaults to one.

  8. --name-exp: Name of this run, used to name the result files.

  9. --significance-level: The significance level for testing the difference of means between human and LLM. Defaults to 0.01.

  10. --mode: For debugging; chooses which part of the code to run.
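As an illustration of how --significance-level could be applied, here is a hedged sketch of a one-sided two-sample test on the difference of means. This is not the repository's actual analysis code: it uses a normal approximation rather than a t-distribution, and the function name and scores are made up.

```python
from statistics import NormalDist, mean, stdev

def significant_drop(default_scores, evoked_scores, alpha=0.01):
    """Illustrative: is the evoked mean significantly lower at level alpha?"""
    m1, m2 = mean(default_scores), mean(evoked_scores)
    # Standard error of the difference of means (Welch-style).
    se = (stdev(default_scores) ** 2 / len(default_scores)
          + stdev(evoked_scores) ** 2 / len(evoked_scores)) ** 0.5
    z = (m2 - m1) / se
    p = NormalDist().cdf(z)  # one-sided: probability of a drop this large by chance
    return p < alpha

print(significant_drop([43, 44, 43.5], [24, 25, 24.5]))  # True: a clear drop
```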

Arguments related to the OpenAI API (can be omitted when users customize models):

  1. --openai-organization: Your organization ID. Can be found in Manage account -> Settings -> Organization ID.

  2. --openai-key: Your API key. Can be found in View API keys -> API keys.
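The flags above could be parsed roughly as follows. This is a hedged sketch of the interface, not the actual definitions in run_emotionbench.py:

```python
import argparse

# Illustrative reconstruction of the documented CLI flags and their defaults.
parser = argparse.ArgumentParser(description="EmotionBench runner (illustrative)")
parser.add_argument("--model", required=True)
parser.add_argument("--questionnaire", required=True)
parser.add_argument("--emotion", required=True)
parser.add_argument("--select-count", type=int, default=999)
parser.add_argument("--default-shuffle-count", type=int, default=0)
parser.add_argument("--emotion-shuffle-count", type=int, default=0)
parser.add_argument("--test-count", type=int, default=1)
parser.add_argument("--name-exp", default=None)
parser.add_argument("--significance-level", type=float, default=0.01)
parser.add_argument("--openai-organization", default=None)
parser.add_argument("--openai-key", default=None)

# Mirrors the example run from the Usage section.
args = parser.parse_args([
    "--model", "gpt-3.5-turbo",
    "--questionnaire", "PANAS",
    "--emotion", "ALL",
    "--select-count", "5",
])
print(args.select_count, args.significance_level)  # 5 0.01
```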

🔨 Emotion Selection

Supported emotions: Anger, Anxiety, Depression, Frustration, Jealousy, Guilt, Fear, Embarrassment

To customize your situations (or add more), simply change the entries in situations.csv.

✨An example of situations.csv:

| Anger-0 | Anger-1 | $\cdots$ | Anxiety-0 | Anxiety-1 | $\cdots$ |
| --- | --- | --- | --- | --- | --- |
| Facing Self-Opinioned People | Blaming, Slandering, and Tattling | $\cdots$ | External Factors | Self-Imposed Pressure | $\cdots$ |
| When you ... | When your ... | $\cdots$ | You are ... | You have ... | $\cdots$ |
| $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ | $\vdots$ | $\ddots$ |
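Programmatically, adding a situation amounts to appending one row of texts, one per factor column. A minimal sketch with the stdlib csv module, assuming the layout shown above (the file path, column names, and situation texts here are illustrative, not the repository's real data):

```python
import csv
import os
import tempfile

# Build a tiny demo file matching the situations.csv layout above:
# header row of emotion-factor IDs, then factor names, then situation texts.
path = os.path.join(tempfile.gettempdir(), "situations_demo.csv")
rows = [
    ["Anger-0", "Anxiety-0"],                                 # factor IDs
    ["Facing Self-Opinioned People", "External Factors"],     # factor names
    ["When you ...", "You are ..."],                          # situation texts
]
with open(path, "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Customizing = appending another row of situation texts, one per factor.
with open(path, "a", newline="") as f:
    csv.writer(f).writerow(["When a colleague dismisses your idea ...",
                            "You are waiting for exam results ..."])

with open(path, newline="") as f:
    table = list(csv.reader(f))
print(len(table))  # 4 rows: IDs, factor names, two situation rows
```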

📃 Questionnaire List

  1. Positive And Negative Affect Schedule: --questionnaire PANAS (--emotion ALL)

  2. Aggression Questionnaire: --questionnaire AGQ (--emotion Anger)

  3. Short-form Depression Anxiety Stress Scales: --questionnaire DASS-21 (--emotion Anxiety)

  4. Beck Depression Inventory: --questionnaire BDI (--emotion Depression)

  5. Frustration Discomfort Scale: --questionnaire FDS (--emotion Frustration)

  6. Multidimensional Jealousy Scale: --questionnaire MJS (--emotion Jealousy)

  7. Guilt And Shame Proneness: --questionnaire GASP (--emotion Guilt)

  8. Fear Survey Schedule: --questionnaire FSS (--emotion Fear)

  9. Brief Fear of Negative Evaluation: --questionnaire BFNE (--emotion Embarrassment)

🚀 Benchmarking Your Own Model

It is easy! Just replace the function example_generator passed to run_psychobench(args, generator).

Your customized function your_generator() should do the following:

  1. Read questions from the file args.testing_file. The file is located under results/ (check run_psychobench() in utils.py) and has the following format:
| question-0 | order-0 | $\cdots$ | General_test-0_order-0 | $\cdots$ | Anger-0_scenario-0_test-0_order-0 | $\cdots$ | Anxiety-0_scenario-0_test-0_order-1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Prompt: ... | Prompt: ... | $\cdots$ |  | $\cdots$ | Imagine... | $\cdots$ | Imagine... |
| 1. Q1 | 1 | $\cdots$ | 4 | $\cdots$ | 3 | $\cdots$ | 3 |
| 2. Q2 | 2 | $\cdots$ | 2 | $\cdots$ | 4 | $\cdots$ | 3 |
| $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ | $\ddots$ | $\vdots$ | $\ddots$ | $\vdots$ |
| n. Qn | n | $\cdots$ | 3 | $\cdots$ | 3 | $\cdots$ | 1 |

For your input, read the column immediately preceding each column whose name starts with order-; it contains the questions in shuffled order.

  2. Call your own LLM and get the results.

  3. Fill in the blanks in the file args.testing_file. Remember: there is no need to map the responses back to their original order; our code takes care of that.
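The steps above can be sketched as follows. This is a hedged illustration under the file format shown earlier, not the repository's example_generator: the function signature, the ask_model stand-in for your LLM call, and the demo file are all assumptions.

```python
import csv
import os
import tempfile

def your_generator(testing_path, ask_model):
    """Illustrative custom generator: read shuffled questions, query a model,
    and fill the blank answer columns in place. `ask_model(prompt, questions)`
    is a placeholder for your own LLM call."""
    with open(testing_path, newline="") as f:
        rows = list(csv.reader(f))
    header, prompts, body = rows[0], rows[1], rows[2:]

    for col, name in enumerate(header):
        # Fill only answer columns such as "General_test-0_order-0";
        # skip the question columns and the "order-k" columns themselves.
        if not name or "order-" not in name or name.startswith("order-"):
            continue
        order = "order-" + name.rsplit("order-", 1)[-1]   # e.g. "order-0"
        q_col = header.index(order) - 1                   # shuffled questions sit just before
        questions = [r[q_col] for r in body]
        answers = ask_model(prompts[col], questions)      # call your LLM here
        for r, a in zip(body, answers):
            r[col] = str(a)                               # fill in the blank

    with open(testing_path, "w", newline="") as f:
        csv.writer(f).writerows([header, prompts] + body)

# Toy run with a dummy model that always answers 3.
path = os.path.join(tempfile.gettempdir(), "testing_demo.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerows([
        ["question-0", "order-0", "General_test-0_order-0"],
        ["Prompt: ...", "Prompt: ...", "Imagine..."],
        ["1. Q1", "1", ""],
        ["2. Q2", "2", ""],
    ])
your_generator(path, lambda prompt, qs: [3] * len(qs))
```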

Please check example_generator.py for detailed information.

👉 Paper and Citation

For more details, please refer to our paper <a href="https://arxiv.org/abs/2308.03656">here</a>.

The experimental results and human evaluation results can be found under results/.


If you find our paper & tool interesting and useful, please feel free to give us a star and cite us via:

@inproceedings{huang2024apathetic,
  author    = {Jen{-}tse Huang and
               Man Ho Lam and
               Eric John Li and
               Shujie Ren and
               Wenxuan Wang and
               Wenxiang Jiao and
               Zhaopeng Tu and
               Michael R. Lyu},
  title     = {Apathetic or Empathetic? Evaluating {LLM}s' Emotional Alignments with Humans},
  booktitle = {Advances in Neural Information Processing Systems 37},
  year      = {2024}
}