A Survey of Alignment Goals
A collection of papers and resources related to alignment for big models.
The organization of papers follows our survey "From Instructions to Basic Human Values: A Survey of Alignment Goals for Big Models".
Table of Contents
<details>
<ol>
  <li><a href="#definition-of-alignment-goals">Definition of Alignment Goals</a>
    <ul>
      <li><a href="#human-instructions">Human Instructions</a></li>
      <li><a href="#human-preferences">Human Preferences</a>
        <ul>
          <li><a href="#human-demonstrations">Human Demonstrations</a></li>
          <li><a href="#human-feedback">Human Feedback</a></li>
          <li><a href="#model-synthetic-feedback">Model Synthetic Feedback</a></li>
        </ul>
      </li>
      <li><a href="#human-values">Human Values</a>
        <ul>
          <li><a href="#value-principles">Value Principles</a></li>
          <li><a href="#target-representation">Target Representation</a></li>
        </ul>
      </li>
    </ul>
  </li>
  <li><a href="#evaluation-of-alignment">Evaluation of Alignment</a>
    <ul>
      <li><a href="#human-instructions-1">Human Instructions</a>
        <ul>
          <li><a href="#benchmarks">Benchmarks</a></li>
          <li><a href="#automatic-chatbot-arenas">Automatic Chatbot Arenas</a></li>
        </ul>
      </li>
      <li><a href="#human-preferences-1">Human Preferences</a>
        <ul>
          <li><a href="#benchmarks-1">Benchmarks</a></li>
          <li><a href="#human-evaluation">Human Evaluation</a></li>
          <li><a href="#reward-model">Reward Model</a></li>
        </ul>
      </li>
      <li><a href="#human-values-1">Human Values</a>
        <ul>
          <li><a href="#benchmarks-2">Benchmarks</a></li>
          <li><a href="#reward-model-1">Reward Model</a></li>
        </ul>
      </li>
    </ul>
  </li>
  <li><a href="#citation">Citation</a></li>
</ol>
</details>

Definition of Alignment Goals

Human Instructions
Motivation and Definition
Alignment Goal Achievement
- Multitask prompted training enables zero-shot task generalization. Sanh et al. arXiv 2021. [Paper][Checkpoint][Data]
- Cross-task generalization via natural language crowdsourcing instructions. Mishra et al. arXiv 2021. [Paper][Data][Project]
- Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. Wang et al. arXiv 2022. [Paper][Data][Project]
- Glm-130b: An open bilingual pre-trained model. Zeng et al. arXiv 2022. [Paper][Project]
- Crosslingual generalization through multitask finetuning. Muennighoff et al. arXiv 2022. [Paper][Project]
- Unnatural instructions: Tuning language models with (almost) no human labor. Honovich et al. arXiv 2022. [Paper][Data]
- Self-instruct: Aligning language models with self-generated instructions. Wang et al. arXiv 2022. [Paper][Data]
- Scaling instruction-finetuned language models. Chung et al. arXiv 2022. [Paper]
- The flan collection: Designing data and methods for effective instruction tuning. Longpre et al. arXiv 2023. [Paper][Data]
- Opt-IML: Scaling language model instruction meta learning through the lens of generalization. Iyer et al. arXiv 2022. [Paper]
- Stanford alpaca: An instruction-following llama model. Taori et al. 2023. [Blog][Data]
- Vicuna: An open-source chatbot impressing gpt-4 with 90% chatgpt quality. Chiang et al. 2023. [Paper][Project][Data]
- Baize: An open-source chat model with parameter-efficient tuning on self-chat data. Xu et al. arXiv 2023. [Paper][Data]
- Improving multimodal interactive agents with reinforcement learning from human feedback. Abramson et al. arXiv 2022. [Paper]
- Aligning text-to-image models using human feedback. Lee et al. arXiv 2023. [Paper]
- Visual instruction tuning. Liu et al. arXiv 2023. [Paper][Project]
- Llavar: Enhanced visual instruction tuning for text-rich image understanding. Zhang et al. arXiv 2023. [Paper][Project]
Human Preferences
Human Demonstrations
- Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
- Learning to summarize from human feedback. Stiennon et al. Neurips 2020. [Paper][Project][Data]
- Recursively summarizing books with human feedback. Wu et al. arXiv 2021. [Paper][Data]
- Webgpt: Browser-assisted question-answering with human feedback. Nakano et al. arXiv 2021. [Paper][Data]
- OpenAssistant Conversations--Democratizing Large Language Model Alignment. Kopf et al. arXiv 2023. [Paper][Project][Data][Checkpoint]
- Reward design with language models. Kwon et al. arXiv 2023. [Paper]
Human Feedback
- Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
- Learning to summarize from human feedback. Stiennon et al. Neurips 2020. [Paper][Project][Data]
- Recursively summarizing books with human feedback. Wu et al. arXiv 2021. [Paper][Data]
- Webgpt: Browser-assisted question-answering with human feedback. Nakano et al. arXiv 2021. [Paper][Data]
- OpenAssistant Conversations--Democratizing Large Language Model Alignment. Kopf et al. arXiv 2023. [Paper][Project][Data][Checkpoint]
- Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis. Wu et al. arXiv 2023. [Paper][Project][Data]
Model Synthetic Feedback
- Reward design with language models. Kwon et al. arXiv 2023. [Paper]
- Aligning Large Language Models through Synthetic Feedback. Kim et al. arXiv 2023. [Paper]
- Training Socially Aligned Language Models in Simulated Human Society. Liu et al. arXiv 2023. [Paper][Project]
- Training Language Models with Language Feedback at Scale. Scheurer et al. arXiv 2023. [Paper][Data][Project]
- Visual Instruction Tuning with Polite Flamingo. Chen et al. arXiv 2023. [Paper][Project][Data]
Human Values
Value Principles
HHH (Helpful & Honest & Harmless)
- Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
- A general language assistant as a laboratory for alignment. Askell et al. arXiv 2021. [Paper]
- Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Data]
- Improving alignment of dialogue agents via targeted human judgements. Glaese et al. arXiv 2022. [Paper][Data]
- Constitutional ai: Harmlessness from ai feedback. Bai et al. arXiv 2022. [Paper][Data]
- Principle-driven self-alignment of language models from scratch with minimal human supervision. Sun et al. arXiv 2023. [Paper][Project][Data]
- Process for adapting language models to society (palms) with values-targeted datasets. Solaiman et al. Neurips 2021. [Paper]
Social Norms & Ethics
- The moral integrity corpus: A benchmark for ethical dialogue systems. Ziems et al. arXiv 2022. [Paper][Data]
- Social chemistry 101: Learning to reason about social and moral norms. Forbes et al. arXiv 2020. [Paper][Data]
- Moral stories: Situated reasoning about norms, intents, actions, and their consequences. Emelin et al. arXiv 2020. [Paper][Data]
- Aligning ai with shared human values. Hendrycks et al. arXiv 2020. [Paper][Data]
- Scruples: A corpus of community ethical judgments on 32,000 real-life anecdotes. Lourie et al. AAAI 2021. [Paper][Data]
- MoralDial: A Framework to Train and Evaluate Moral Dialogue Systems via Moral Discussions. Sun et al. ACL 2023. [Paper][Project]
- Learning norms from stories: A prior for value aligned agents. Nahian et al. AIES 2020. [Paper]
Basic Value Theory
- An overview of the Schwartz theory of basic values. Schwartz et al. Online readings in Psychology and Culture 2012. [Paper]
- Rokeach value survey. Rokeach et al. The nature of human values. 1967. [Paper]
- Life values inventory: Facilitator's guide. Brown et al. Williamsburg, VA 2002. [Paper]
- Moral foundations theory: The pragmatic validity of moral pluralism. Graham et al. Advances in experimental social psychology, 2013. [Paper]
Target Representation
Desirable Behaviors
- Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Data]
- Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. Ganguli et al. arXiv 2022. [Paper][Data]
- Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
- Aligning ai with shared human values. Hendrycks et al. arXiv 2020. [Paper][Data]
- Social bias frames: Reasoning about social and power implications of language. Sap et al. arXiv 2019. [Paper][Data]
Intrinsic Values
- Improving alignment of dialogue agents via targeted human judgements. Glaese et al. arXiv 2022. [Paper][Data]
- Constitutional ai: Harmlessness from ai feedback. Bai et al. arXiv 2022. [Paper][Data]
- Principle-driven self-alignment of language models from scratch with minimal human supervision. Sun et al. arXiv 2023. [Paper][Data]
- Process for adapting language models to society (palms) with values-targeted datasets. Solaiman et al. Neurips 2021. [Paper]
- The moral integrity corpus: A benchmark for ethical dialogue systems. Ziems et al. arXiv 2022. [Paper][Data]
- Social chemistry 101: Learning to reason about social and moral norms. Forbes et al. arXiv 2020. [Paper][Data]
- Moral stories: Situated reasoning about norms, intents, actions, and their consequences. Emelin et al. arXiv 2020. [Paper][Data]
- Can machines learn morality? the delphi experiment. Jiang et al. arXiv 2021. [Paper][Project]
Evaluation of Alignment Targets
Human Instructions
Benchmarks
- Multitask prompted training enables zero-shot task generalization. Sanh et al. arXiv 2021. [Paper][Checkpoint][Data]
- Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. Wang et al. arXiv 2022. [Paper][Data][Project]
- The flan collection: Designing data and methods for effective instruction tuning. Longpre et al. arXiv 2023. [Paper][Data]
- Opt-IML: Scaling language model instruction meta learning through the lens of generalization. Iyer et al. arXiv 2022. [Paper]
- Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Srivastava et al. arXiv 2022. [Paper][Project]
- C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. Huang et al. arXiv 2023. [Paper][Project]
- Agieval: A human-centric benchmark for evaluating foundation models. Zhong et al. arXiv 2023. [Paper][Project]
- Discovering language model behaviors with model-written evaluations. Perez et al. arXiv 2022. [Paper][Project]
Automatic Chatbot Arenas
- Alpacaeval: An automatic evaluator of instruction-following models. Li et al. 2023. [Project]
- Alpacafarm: A simulation framework for methods that learn from human feedback. Dubois et al. arXiv 2023. [Paper][Project]
- Vicuna: An open-source chatbot impressing gpt-4 with 90% chatgpt quality. Chiang et al. 2023. [Paper][Project]
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Zheng et al. arXiv 2023. [Paper][Project]
Human Preferences
Benchmarks
- TruthfulQA: Measuring how models mimic human falsehoods. Lin et al. arXiv 2022. [Paper][Data]
- Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. Mihaylov et al. EMNLP 2018. [Paper][Data]
- CrowS-pairs: A challenge dataset for measuring social biases in masked language models. Nangia et al. arXiv 2020. [Paper][Data]
- Gender bias in coreference resolution. Rudinger et al. arXiv 2018. [Paper][Data]
- BBQ: A hand-built bias benchmark for question answering. Parrish et al. arXiv 2021. [Paper][Data]
- Bold: Dataset and metrics for measuring biases in open-ended language generation. Dhamala et al. FAccT 2021. [Paper][Data]
- Realtoxicityprompts: Evaluating neural toxic degeneration in language models. Gehman et al. arXiv 2020. [Paper][Data]
- Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. Hartvigsen et al. arXiv 2022. [Paper][Data]
- Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Srivastava et al. arXiv 2022. [Paper][Project]
- Holistic evaluation of language models. Liang et al. arXiv 2022. [Paper][Project]
- Discovering language model behaviors with model-written evaluations. Perez et al. arXiv 2022. [Paper][Project]
- Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity. Zhuo et al. arXiv 2023. [Paper]
Human Evaluation
- Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
- Llama 2: Open foundation and fine-tuned chat models. Touvron et al. arXiv 2023. [Paper]
- Rrhf: Rank responses to align language models with human feedback without tears. Yuan et al. arXiv 2023. [Paper][Data]
- Learning to summarize from human feedback. Stiennon et al. Neurips 2020. [Paper]
- Aligning Large Language Models through Synthetic Feedback. Kim et al. arXiv 2023. [Paper]
Reward Model
- Llama 2: Open foundation and fine-tuned chat models. Touvron et al. arXiv 2023. [Paper]
- Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Checkpoint]
- Direct preference optimization: Your language model is secretly a reward model. Rafailov et al. arXiv 2023. [Paper]
- Raft: Reward ranked finetuning for generative foundation model alignment. Dong et al. arXiv 2023. [Paper]
- Is reinforcement learning (not) for natural language processing?: Benchmarks, baselines, and building blocks for natural language policy optimization. Ramamurthy et al. arXiv 2022. [Paper][Project]
Human Values
Benchmarks
Safety and Risk
- Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Data]
- Safety Assessment of Chinese Large Language Models. Sun et al. arXiv 2023. [Paper][Data][Leaderboard]
- SafeText: A benchmark for exploring physical safety in language models. Levy et al. arXiv 2022. [Paper][Data]
- CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility. Xu et al. arXiv 2023. [Paper][Project][Data]
Social Norms
- The moral integrity corpus: A benchmark for ethical dialogue systems. Ziems et al. arXiv 2022. [Paper][Data]
- Social chemistry 101: Learning to reason about social and moral norms. Forbes et al. arXiv 2020. [Paper][Data]
- Moral stories: Situated reasoning about norms, intents, actions, and their consequences. Emelin et al. arXiv 2020. [Paper][Data]
- Aligning ai with shared human values. Hendrycks et al. arXiv 2020. [Paper][Data]
- Scruples: A corpus of community ethical judgments on 32,000 real-life anecdotes. Lourie et al. AAAI 2021. [Paper][Data]
- Moral mimicry: Large language models produce moral rationalizations tailored to political identity. Simmons et al. arXiv 2022. [Paper]
- When to make exceptions: Exploring language models as accounts of human moral judgment. Jin et al. Neurips 2022. [Paper][Project][Data]
- Towards Answering Open-ended Ethical Quandary Questions. Bang et al. arXiv 2022. [Paper]
Value Surveys
- Towards Measuring the Representation of Subjective Global Opinions in Language Models. Durmus et al. arXiv 2023. [Paper][Data]
- Culture's consequences: International differences in work-related values. Hofstede et al. 1984. [Paper]
- World Values Survey Wave 7 (2017-2022). [URL]
- European Values Study. [URL]
- Pew Research Center's Global Attitudes Surveys (GAS). [URL]
- An overview of the Schwartz theory of basic values. Schwartz et al. Online readings in Psychology and Culture 2012. [Paper]
- Probing pre-trained language models for cross-cultural differences in values. Arora et al. arXiv 2022. [Paper]
Reward Model
- Learning norms from stories: A prior for value aligned agents. Nahian et al. AIES 2020. [Paper]
- Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper]
- Can machines learn morality? the delphi experiment. Jiang et al. arXiv 2021. [Paper][Project]
- Valuenet: A new dataset for human value driven dialogue system. Qiu et al. AAAI 2022. [Paper][Project]
- Moral foundations twitter corpus: A collection of 35k tweets annotated for moral sentiment. Hoover et al. Social Psychological and Personality Science 2020. [Paper]
- Large pre-trained language models contain human-like biases of what is right and wrong to do. Schramowski et al. Nature Machine Intelligence 2022. [Paper]