
Supplementary materials for the paper "Moral Stories: Situated Reasoning about Norms, Intents, Actions, and their Consequences" (Emelin et al., 2021)

The dataset is also available on Hugging Face: https://huggingface.co/datasets/demelin/moral_stories.
The full paper is available at https://aclanthology.org/2021.emnlp-main.54.pdf

Abstract: In social settings, much of human behavior is governed by unspoken rules of conduct. For artificial systems to be fully integrated into social environments, adherence to such norms is a central prerequisite. We investigate whether contemporary NLG models can function as behavioral priors for systems deployed in social settings by generating action hypotheses that achieve predefined goals under moral constraints. Moreover, we examine if models can anticipate likely consequences of (im)moral actions, or explain why certain actions are preferable by generating relevant norms. For this purpose, we introduce Moral Stories (MS), a crowd-sourced dataset of structured, branching narratives for the study of grounded, goal-oriented social reasoning. Finally, we propose decoding strategies that effectively combine multiple expert models to significantly improve the quality of generated actions, consequences, and norms compared to strong baselines, e.g. through abductive reasoning.

Dataset

Overview

The Moral Stories dataset is available at https://tinyurl.com/moral-stories-data. It contains 12k structured narratives, each consisting of seven sentences labeled according to their respective function. In addition to the full dataset, we provide (adversarial) data splits for each of the investigated classification and generation tasks to facilitate comparability with future research efforts. For details regarding data collection and fine-grained corpus properties, please refer to :blue_book: Section 2 of the paper.
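As a rough illustration of the seven-sentence structure, the sketch below models one story as a Python dataclass and parses it from a single JSON line. The field names mirror the sentence roles described in the paper (norm, situation, intention, and a moral and immoral action with their consequences), but the exact keys and the JSON-lines layout are assumptions; check them against the downloaded files before relying on them.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class MoralStory:
    # Field names are assumed for illustration; verify against the released data.
    norm: str
    situation: str
    intention: str
    moral_action: str
    moral_consequence: str
    immoral_action: str
    immoral_consequence: str


def parse_story(json_line: str) -> MoralStory:
    """Build a MoralStory from one JSON line, ignoring extra keys such as IDs."""
    record = json.loads(json_line)
    wanted = {f.name for f in fields(MoralStory)}
    return MoralStory(**{k: v for k, v in record.items() if k in wanted})


# Toy record written for this example, not taken from the dataset.
example_line = json.dumps({
    "ID": "toy-0",
    "norm": "It is good to return lost property.",
    "situation": "Sam finds a wallet on the sidewalk.",
    "intention": "Sam wants to decide what to do with the wallet.",
    "moral_action": "Sam hands the wallet in at the police station.",
    "moral_consequence": "The owner gets the wallet back.",
    "immoral_action": "Sam keeps the cash and throws the wallet away.",
    "immoral_consequence": "The owner never recovers the wallet.",
})

story = parse_story(example_line)
print(story.norm)
```

Keeping the seven roles as explicit fields makes it easy to assemble task-specific inputs, for example pairing a norm with either of the two actions.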

Story examples

Quickstart guide: Training and evaluating models on Moral Stories

To train your own models, start from the example scripts in <code>bash_scripts/</code>.
Classification and generation models are both trained and evaluated with <code>experiments/run_baseline_experiment.py</code>.
Classification models are evaluated with the same script. To calculate metrics for generation, run <code>experiments/compute_generation_metrics.py</code> on the model generations you want to score.

Codebase details

We provide code for replicating the data curation steps and the experiments discussed in our paper. <code>requirements.txt</code> lists the libraries used by the codebase. Example shell scripts for running each experiment can be found in <code>/bash_scripts</code>, and their Beaker analogues are provided in <code>/beaker_scripts</code>. The individual files included in the codebase are briefly described below:

Dataset collection

(:blue_book: See Section 2 of the paper.)

<code>collect_sc101_writing_prompts.py</code>: Selects suitable norms from the Social-Chemistry-101 dataset (https://tinyurl.com/y7t7g2rx) to be used as writing prompts for crowd-workers.
<code>show_human_validation_stats.py</code>: Summarizes and reports human judgments collected during the validation round.
<code>remove_low_scoring_stories.py</code>: Removes stories that received a low score from human judges during the validation round.
<code>show_dataset_stats.py</code>: Computes and reports various dataset statistics.
<code>identify_latent_topics.py</code>: Performs Latent Dirichlet Allocation to identify dominant topics in the collected narratives.

Split creation

(:blue_book: See Section 3 of the paper.)

<code>create_action_lexical_bias_splits.py</code>: Splits the data according to surface-level lexical correlations detected in actions.
<code>create_consequence_lexical_bias_splits.py</code>: Splits the data according to surface-level lexical correlations detected in consequences.
<code>create_minimal_action_pairs_splits.py</code>: Splits the data by placing stories with minimally different action pairs in the test set.
<code>create_minimal_consequence_pairs_splits.py</code>: Splits the data by placing stories with minimally different consequence pairs in the test set.
<code>create_norm_distance_splits.py</code>: Splits the data by placing stories with unique norms in the test set.
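To make the idea behind the minimal-pairs splits more concrete, here is a minimal sketch that ranks stories by how similar their two actions are and moves the most similar pairs into the test set. It is not the logic of <code>create_minimal_action_pairs_splits.py</code>: the similarity measure (token-level `difflib.SequenceMatcher` ratio), the `test_fraction` parameter, and the dictionary keys are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def pair_similarity(first: str, second: str) -> float:
    """Token-level similarity in [0, 1]; higher means a more minimal pair."""
    return SequenceMatcher(None, first.split(), second.split()).ratio()


def minimal_pairs_split(stories, test_fraction=0.2):
    """Place the stories whose action pairs differ least in the test set.

    `stories` is a list of dicts with the hypothetical keys
    `moral_action` and `immoral_action`.
    """
    ranked = sorted(
        stories,
        key=lambda s: pair_similarity(s["moral_action"], s["immoral_action"]),
        reverse=True,
    )
    n_test = max(1, round(len(ranked) * test_fraction))
    return ranked[n_test:], ranked[:n_test]  # (train, test)


# Toy stories written for this example, not taken from the dataset.
stories = [
    {"id": "a",
     "moral_action": "Sam returns the wallet to its owner",
     "immoral_action": "Sam keeps the wallet for himself"},
    {"id": "b",
     "moral_action": "Ana tells her boss the truth",
     "immoral_action": "Ana tells her boss a lie"},
    {"id": "c",
     "moral_action": "Lee waits patiently in line",
     "immoral_action": "Lee shoves past everyone to reach the front"},
]

train, test = minimal_pairs_split(stories)
print([s["id"] for s in test])
```

The same pattern would apply to consequence pairs by swapping the two keys. A split built this way is harder for models that rely on surface cues, because the test pairs share most of their tokens.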

Experiments

(:blue_book: See Sections 3 and 4 of the paper.)

<code>compute_generation_metrics.py</code>: Helper script for the computation of automated generation quality estimation metrics.
<code>compute_norm_diversity.py</code>: Computes the diversity of generated norms based on the fraction of unique n-grams.
<code>run_baseline_experiment.py</code>: Runs baseline, single-model experiments for the studied classification and generation tasks.
<code>run_coe_action_ranking_experiment.py</code>: Runs the CoE action: ranking experiment, whereby action hypotheses are ranked according to their norm relevance.
<code>run_coe_action_abductive_refinement_experiment.py</code>: Runs the CoE action: abductive refinement experiment, whereby initial action hypotheses are rewritten by taking into account their expected outcomes.
<code>run_coe_consequence_ranking_experiment.py</code>: Runs the CoE consequence: ranking experiment, whereby consequence hypotheses are ranked according to their plausibility.
<code>run_coe_consequence_iterative_refinement_experiment.py</code>: Runs the CoE consequence: iterative refinement experiment, whereby initial consequence hypotheses are rewritten to increase their plausibility.
<code>run_coe_norm_synthetic_consequences_experiment.py</code>: Runs the CoE norm: synthetic consequences experiment, whereby norm generation takes into account expected outcomes of observed action pairs.
<code>utils.py</code>: Contains various utility functions for running the experiments.
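For readers unfamiliar with n-gram diversity, the sketch below shows one common way to compute a fraction of unique n-grams over a set of generated norms (distinct n-gram types divided by all n-gram occurrences). Whether <code>compute_norm_diversity.py</code> uses exactly this definition, tokenization, or pooling across outputs is an assumption; the function name and its parameters are hypothetical.

```python
def unique_ngram_fraction(texts, n=2):
    """Fraction of n-grams that are unique across all `texts`.

    Uses lowercased whitespace tokenization (an assumption) and pools
    n-grams over every text before counting.
    """
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


# Toy norms written for this example, not model outputs.
generated_norms = ["it is good to help", "it is good to share"]
diversity = unique_ngram_fraction(generated_norms, n=2)
print(diversity)
```

Values close to 1 indicate that the generated norms rarely repeat phrasing, while values close to 0 indicate heavy reuse of the same n-grams.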

Human evaluation

(:blue_book: See Section 4 of the paper.)

<code>get_action_stats.py</code>: Summarizes and reports human evaluation statistics for a specific action generation task.
<code>get_consequence_stats.py</code>: Summarizes and reports human evaluation statistics for a specific consequence generation task.
<code>get_norm_stats.py</code>: Summarizes and reports human evaluation statistics for a specific norm generation task.

Citation

@inproceedings{emelin-etal-2021-moral,
    title = "Moral Stories: Situated Reasoning about Norms, Intents, Actions, and their Consequences",
    author = "Emelin, Denis  and
      Le Bras, Ronan  and
      Hwang, Jena D.  and
      Forbes, Maxwell  and
      Choi, Yejin",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.54",
    doi = "10.18653/v1/2021.emnlp-main.54",
    pages = "698--718",
    abstract = "In social settings, much of human behavior is governed by unspoken rules of conduct rooted in societal norms. For artificial systems to be fully integrated into social environments, adherence to such norms is a central prerequisite. To investigate whether language generation models can serve as behavioral priors for systems deployed in social settings, we evaluate their ability to generate action descriptions that achieve predefined goals under normative constraints. Moreover, we examine if models can anticipate likely consequences of actions that either observe or violate known norms, or explain why certain actions are preferable by generating relevant norm hypotheses. For this purpose, we introduce Moral Stories, a crowd-sourced dataset of structured, branching narratives for the study of grounded, goal-oriented social reasoning. Finally, we propose decoding strategies that combine multiple expert models to significantly improve the quality of generated actions, consequences, and norms compared to strong baselines.",
}