
A Survey of Alignment Goals

A collection of papers and resources related to alignment for big models.

The organization of the papers follows our survey "From Instructions to Basic Human Values: A Survey of Alignment Goals for Big Models".

Table of Contents

<details> <ol> <li> <a href="#definition-of-alignment-goals">Definition of Alignment Goals</a> <ul> <li><a href="#human-instructions">Human Instructions</a></li> <li><a href="#human-preferences">Human Preferences</a></li> <ul> <li><a href="#human-demonstrations">Human Demonstrations</a></li> <li><a href="#human-feedback">Human Feedback</a></li> <li><a href="#model-synthetic-feedback">Model Synthetic Feedback</a></li> </ul> <li><a href="#human-values">Human Values</a></li> <ul> <li><a href="#value-principles">Value Principles</a></li> <li><a href="#target-representation">Target Representation</a></li> </ul> </ul> </li> <li> <a href="#evaluation-of-alignment">Evaluation of Alignment</a> <ul> <li><a href="#human-instructions-1">Human Instructions</a></li> <ul> <li><a href="#benchmarks">Benchmarks</a></li> <li><a href="#automatic-chatbot-arenas">Automatic Chatbot Arenas</a></li> </ul> <li><a href="#human-preferences-1">Human Preferences</a></li> <ul> <li><a href="#benchmarks-1">Benchmarks</a></li> <li><a href="#human-evaluation">Human Evaluation</a></li> <li><a href="#reward-model">Reward Model</a></li> </ul> <li><a href="#human-values-1">Human Values</a></li> <ul> <li><a href="#benchmarks-2">Benchmarks</a></li> <li><a href="#reward-model-1">Reward Model</a></li> </ul> </ul> <li><a href="#citation">Citation</a></li> </ol> </details>

Human Instructions

Motivation and Definition

Alignment Goal Achievement

  1. Multitask prompted training enables zero-shot task generalization. Sanh et al. arXiv 2021. [Paper][Checkpoint][Data]
  2. Cross-task generalization via natural language crowdsourcing instructions. Mishra et al. arXiv 2021. [Paper][Data][Project]
  3. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. Wang et al. arXiv 2022. [Paper][Data][Project]
  4. GLM-130B: An open bilingual pre-trained model. Zeng et al. arXiv 2022. [Paper][Project]
  5. Crosslingual generalization through multitask finetuning. Muennighoff et al. arXiv 2022. [Paper][Project]
  6. Unnatural instructions: Tuning language models with (almost) no human labor. Honovich et al. arXiv 2022. [Paper][Data]
  7. Self-Instruct: Aligning language models with self-generated instructions. Wang et al. arXiv 2022. [Paper][Data]
  8. Scaling instruction-finetuned language models. Chung et al. arXiv 2022. [Paper]
  9. The Flan Collection: Designing data and methods for effective instruction tuning. Longpre et al. arXiv 2023. [Paper][Data]
  10. OPT-IML: Scaling language model instruction meta learning through the lens of generalization. Iyer et al. arXiv 2022. [Paper]
  11. Stanford Alpaca: An instruction-following LLaMA model. Taori et al. 2023. [Blog][Data]
  12. Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality. Chiang et al. 2023. [Paper][Project][Data]
  13. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. Xu et al. arXiv 2023. [Paper][Data]
  14. Improving multimodal interactive agents with reinforcement learning from human feedback. Abramson et al. arXiv 2022. [Paper]
  15. Aligning text-to-image models using human feedback. Lee et al. arXiv 2023. [Paper]
  16. Visual instruction tuning. Liu et al. arXiv 2023. [Paper][Project]
  17. LLaVAR: Enhanced visual instruction tuning for text-rich image understanding. Zhang et al. arXiv 2023. [Paper][Project]

Human Preferences

Human Demonstrations

  1. Training language models to follow instructions with human feedback. Ouyang et al. NeurIPS 2022. [Paper]
  2. Learning to summarize with human feedback. Stiennon et al. NeurIPS 2020. [Paper][Project][Data]
  3. Recursively summarizing books with human feedback. Wu et al. arXiv 2021. [Paper][Data]
  4. WebGPT: Browser-assisted question-answering with human feedback. Nakano et al. arXiv 2021. [Paper][Data]
  5. OpenAssistant Conversations: Democratizing large language model alignment. Köpf et al. arXiv 2023. [Paper][Project][Data][Checkpoint]
  6. Reward design with language models. Kwon et al. arXiv 2023. [Paper]

Human Feedback

  1. Training language models to follow instructions with human feedback. Ouyang et al. NeurIPS 2022. [Paper]
  2. Learning to summarize with human feedback. Stiennon et al. NeurIPS 2020. [Paper][Project][Data]
  3. Recursively summarizing books with human feedback. Wu et al. arXiv 2021. [Paper][Data]
  4. WebGPT: Browser-assisted question-answering with human feedback. Nakano et al. arXiv 2021. [Paper][Data]
  5. OpenAssistant Conversations: Democratizing large language model alignment. Köpf et al. arXiv 2023. [Paper][Project][Data][Checkpoint]
  6. Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis. Wu et al. arXiv 2023. [Paper][Project][Data]

Model Synthetic Feedback

  1. Reward design with language models. Kwon et al. arXiv 2023. [Paper]
  2. Aligning Large Language Models through Synthetic Feedback. Kim et al. arXiv 2023. [Paper]
  3. Training Socially Aligned Language Models in Simulated Human Society. Liu et al. arXiv 2023. [Paper][Project]
  4. Training Language Models with Language Feedback at Scale. Scheurer et al. arXiv 2023. [Paper][Data][Project]
  5. Visual Instruction Tuning with Polite Flamingo. Chen et al. arXiv 2023. [Paper][Project][Data]

Human Values

Value Principles

HHH (Helpful & Honest & Harmless)
  1. Training language models to follow instructions with human feedback. Ouyang et al. NeurIPS 2022. [Paper]
  2. A general language assistant as a laboratory for alignment. Askell et al. arXiv 2021. [Paper]
  3. Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Data]
  4. Improving alignment of dialogue agents via targeted human judgements. Glaese et al. arXiv 2022. [Paper][Data]
  5. Constitutional ai: Harmlessness from ai feedback. Bai et al. arXiv 2022. [Paper][Data]
  6. Principle-driven self-alignment of language models from scratch with minimal human supervision. Sun et al. arXiv 2023. [Paper][Project][Data]
  7. Process for adapting language models to society (PALMS) with values-targeted datasets. Solaiman et al. NeurIPS 2021. [Paper]
Social Norms & Ethics
  1. The moral integrity corpus: A benchmark for ethical dialogue systems. Ziems et al. arXiv 2022. [Paper][Data]
  2. Social chemistry 101: Learning to reason about social and moral norms. Forbes et al. arXiv 2020. [Paper][Data]
  3. Moral stories: Situated reasoning about norms, intents, actions, and their consequences. Emelin et al. arXiv 2020. [Paper][Data]
  4. Aligning AI with shared human values. Hendrycks et al. arXiv 2020. [Paper][Data]
  5. Scruples: A corpus of community ethical judgments on 32,000 real-life anecdotes. Lourie et al. AAAI 2021. [Paper][Data]
  6. MoralDial: A Framework to Train and Evaluate Moral Dialogue Systems via Moral Discussions. Sun et al. ACL 2023. [Paper][Project]
  7. Learning norms from stories: A prior for value aligned agents. Nahian et al. AIES 2020. [Paper]
Basic Value Theory
  1. An overview of the Schwartz theory of basic values. Schwartz et al. Online readings in Psychology and Culture 2012. [Paper]
  2. Rokeach value survey. Rokeach et al. The nature of human values. 1967. [Paper]
  3. Life values inventory: Facilitator's guide. Brown et al. Williamsburg, VA 2002. [Paper]
  4. Moral foundations theory: The pragmatic validity of moral pluralism. Graham et al. Advances in experimental social psychology, 2013. [Paper]

Target Representation

Desirable Behaviors
  1. Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Data]
  2. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. Ganguli et al. arXiv 2022. [Paper][Data]
  3. Training language models to follow instructions with human feedback. Ouyang et al. NeurIPS 2022. [Paper]
  4. Aligning AI with shared human values. Hendrycks et al. arXiv 2020. [Paper][Data]
  5. Social bias frames: Reasoning about social and power implications of language. Sap et al. arXiv 2019. [Paper][Data]
Intrinsic Values
  1. Improving alignment of dialogue agents via targeted human judgements. Glaese et al. arXiv 2022. [Paper][Data]
  2. Constitutional ai: Harmlessness from ai feedback. Bai et al. arXiv 2022. [Paper][Data]
  3. Principle-driven self-alignment of language models from scratch with minimal human supervision. Sun et al. arXiv 2023. [Paper][Data]
  4. Process for adapting language models to society (PALMS) with values-targeted datasets. Solaiman et al. NeurIPS 2021. [Paper]
  5. The moral integrity corpus: A benchmark for ethical dialogue systems. Ziems et al. arXiv 2022. [Paper][Data]
  6. Social chemistry 101: Learning to reason about social and moral norms. Forbes et al. arXiv 2020. [Paper][Data]
  7. Moral stories: Situated reasoning about norms, intents, actions, and their consequences. Emelin et al. arXiv 2020. [Paper][Data]
  8. Can machines learn morality? The Delphi experiment. Jiang et al. arXiv 2021. [Paper][Project]

Evaluation of Alignment

Human Instructions

Benchmarks

  1. Multitask prompted training enables zero-shot task generalization. Sanh et al. arXiv 2021. [Paper][Checkpoint][Data]
  2. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. Wang et al. arXiv 2022. [Paper][Data][Project]
  3. The Flan Collection: Designing data and methods for effective instruction tuning. Longpre et al. arXiv 2023. [Paper][Data]
  4. OPT-IML: Scaling language model instruction meta learning through the lens of generalization. Iyer et al. arXiv 2022. [Paper]
  5. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Srivastava et al. arXiv 2022. [Paper][Project]
  6. C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models. Huang et al. arXiv 2023. [Paper][Project]
  7. AGIEval: A human-centric benchmark for evaluating foundation models. Zhong et al. arXiv 2023. [Paper][Project]
  8. Discovering language model behaviors with model-written evaluations. Perez et al. arXiv 2022. [Paper][Project]

Automatic Chatbot Arenas

  1. AlpacaEval: An automatic evaluator of instruction-following models. Li et al. 2023. [Project]
  2. AlpacaFarm: A simulation framework for methods that learn from human feedback. Dubois et al. arXiv 2023. [Paper][Project]
  3. Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality. Chiang et al. 2023. [Paper][Project]
  4. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Zheng et al. arXiv 2023. [Paper][Project]

Human Preferences

Benchmarks

  1. TruthfulQA: Measuring how models mimic human falsehoods. Lin et al. arXiv 2022. [Paper][Data]
  2. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. Mihaylov et al. EMNLP 2018. [Paper][Data]
  3. CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. Nangia et al. arXiv 2020. [Paper][Data]
  4. Gender bias in coreference resolution. Rudinger et al. arXiv 2018. [Paper][Data]
  5. BBQ: A hand-built bias benchmark for question answering. Parrish et al. arXiv 2021. [Paper][Data]
  6. BOLD: Dataset and metrics for measuring biases in open-ended language generation. Dhamala et al. FAccT 2021. [Paper][Data]
  7. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. Gehman et al. arXiv 2020. [Paper][Data]
  8. ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. Hartvigsen et al. arXiv 2022. [Paper][Data]
  9. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Srivastava et al. arXiv 2022. [Paper][Project]
  10. Holistic evaluation of language models. Liang et al. arXiv 2022. [Paper][Project]
  11. Discovering language model behaviors with model-written evaluations. Perez et al. arXiv 2022. [Paper][Project]
  12. Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity. Zhuo et al. arXiv 2023. [Paper]

Human Evaluation

  1. Training language models to follow instructions with human feedback. Ouyang et al. NeurIPS 2022. [Paper]
  2. Llama 2: Open foundation and fine-tuned chat models. Touvron et al. arXiv 2023. [Paper]
  3. RRHF: Rank responses to align language models with human feedback without tears. Yuan et al. arXiv 2023. [Paper][Data]
  4. Learning to summarize with human feedback. Stiennon et al. NeurIPS 2020. [Paper]
  5. Aligning Large Language Models through Synthetic Feedback. Kim et al. arXiv 2023. [Paper]

Reward Model

  1. Llama 2: Open foundation and fine-tuned chat models. Touvron et al. arXiv 2023. [Paper]
  2. Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Checkpoint]
  3. Direct preference optimization: Your language model is secretly a reward model. Rafailov et al. arXiv 2023. [Paper]
  4. RAFT: Reward ranked finetuning for generative foundation model alignment. Dong et al. arXiv 2023. [Paper]
  5. Is reinforcement learning (not) for natural language processing?: Benchmarks, baselines, and building blocks for natural language policy optimization. Ramamurthy et al. arXiv 2022. [Paper][Project]

Human Values

Benchmarks

Safety and Risk
  1. Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Data]
  2. Safety Assessment of Chinese Large Language Models. Sun et al. arXiv 2023. [Paper][Data][Leaderboard]
  3. SafeText: A benchmark for exploring physical safety in language models. Levy et al. arXiv 2022. [Paper][Data]
  4. CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility. Xu et al. arXiv 2023. [Paper][Project][Data]
Social Norms
  1. The moral integrity corpus: A benchmark for ethical dialogue systems. Ziems et al. arXiv 2022. [Paper][Data]
  2. Social chemistry 101: Learning to reason about social and moral norms. Forbes et al. arXiv 2020. [Paper][Data]
  3. Moral stories: Situated reasoning about norms, intents, actions, and their consequences. Emelin et al. arXiv 2020. [Paper][Data]
  4. Aligning AI with shared human values. Hendrycks et al. arXiv 2020. [Paper][Data]
  5. Scruples: A corpus of community ethical judgments on 32,000 real-life anecdotes. Lourie et al. AAAI 2021. [Paper][Data]
  6. Moral mimicry: Large language models produce moral rationalizations tailored to political identity. Simmons et al. arXiv 2022. [Paper]
  7. When to make exceptions: Exploring language models as accounts of human moral judgment. Jin et al. NeurIPS 2022. [Paper][Project][Data]
  8. Towards Answering Open-ended Ethical Quandary Questions. Bang et al. arXiv 2022. [Paper]
Value Surveys
  1. Towards Measuring the Representation of Subjective Global Opinions in Language Models. Durmus et al. arXiv 2023. [Paper][Data]
  2. Culture's consequences: International differences in work-related values. Hofstede et al. 1984. [Paper]
  3. World Values Survey Wave 7 (2017-2022). [URL]
  4. European Values Study. [URL]
  5. Pew Research Center's Global Attitudes Surveys (GAS). [URL]
  6. An overview of the Schwartz theory of basic values. Schwartz et al. Online readings in Psychology and Culture 2012. [Paper]
  7. Probing pre-trained language models for cross-cultural differences in values. Arora et al. arXiv 2022. [Paper]

Reward Model

  1. Learning norms from stories: A prior for value aligned agents. Nahian et al. AIES 2020. [Paper]
  2. Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper]
  3. Can machines learn morality? The Delphi experiment. Jiang et al. arXiv 2021. [Paper][Project]
  4. ValueNet: A new dataset for human value driven dialogue system. Qiu et al. AAAI 2022. [Paper][Project]
  5. Moral foundations twitter corpus: A collection of 35k tweets annotated for moral sentiment. Hoover et al. Social Psychological and Personality Science 2020. [Paper]
  6. Large pre-trained language models contain human-like biases of what is right and wrong to do. Schramowski et al. Nature Machine Intelligence 2022. [Paper]