A Survey of Alignment Goals
A collection of papers and resources related to alignment for big models.
The organization of papers follows our survey "From Instructions to Basic Human Values: A Survey of Alignment Goals for Big Models".
Table of Contents
<details>
<ol>
  <li><a href="#definition-of-alignment-goals">Definition of Alignment Goals</a>
    <ul>
      <li><a href="#human-instructions">Human Instructions</a></li>
      <li><a href="#human-preferences">Human Preferences</a>
        <ul>
          <li><a href="#human-demonstrations">Human Demonstrations</a></li>
          <li><a href="#human-feedback">Human Feedback</a></li>
          <li><a href="#model-synthetic-feedback">Model Synthetic Feedback</a></li>
        </ul>
      </li>
      <li><a href="#human-values">Human Values</a>
        <ul>
          <li><a href="#value-principles">Value Principles</a></li>
          <li><a href="#target-representation">Target Representation</a></li>
        </ul>
      </li>
    </ul>
  </li>
  <li><a href="#evaluation-of-alignment">Evaluation of Alignment</a>
    <ul>
      <li><a href="#human-instructions-1">Human Instructions</a>
        <ul>
          <li><a href="#benchmarks">Benchmarks</a></li>
          <li><a href="#automatic-chatbot-arenas">Automatic Chatbot Arenas</a></li>
        </ul>
      </li>
      <li><a href="#human-preferences-1">Human Preferences</a>
        <ul>
          <li><a href="#benchmarks-1">Benchmarks</a></li>
          <li><a href="#human-evaluation">Human Evaluation</a></li>
          <li><a href="#reward-model">Reward Model</a></li>
        </ul>
      </li>
      <li><a href="#human-values-1">Human Values</a>
        <ul>
          <li><a href="#benchmarks-2">Benchmarks</a></li>
          <li><a href="#reward-model-1">Reward Model</a></li>
        </ul>
      </li>
    </ul>
  </li>
  <li><a href="#citation">Citation</a></li>
</ol>
</details>

Definition of Alignment Goals

Human Instructions
Motivation and Definition
Alignment Goal Achievement
- Multitask prompted training enables zero-shot task generalization. Sanh et al. arXiv 2021. [Paper][Checkpoint][Data]
- Cross-task generalization via natural language crowdsourcing instructions. Mishra et al. arXiv 2021. [Paper][Data][Project]
- Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. Wang et al. arXiv 2022. [Paper][Data][Project]
- Glm-130b: An open bilingual pre-trained model. Zeng et al. arXiv 2022. [Paper][Project]
- Crosslingual generalization through multitask finetuning. Muennighoff et al. arXiv 2022. [Paper][Project]
- Unnatural instructions: Tuning language models with (almost) no human labor. Honovich et al. arXiv 2022. [Paper][Data]
- Self-instruct: Aligning language models with self-generated instructions. Wang et al. arXiv 2022. [Paper][Data]
- Scaling instruction-finetuned language models. Chung et al. arXiv 2022. [Paper]
- The flan collection: Designing data and methods for effective instruction tuning. Longpre et al. arXiv 2023. [Paper][Data]
- Opt-IML: Scaling language model instruction meta learning through the lens of generalization. Iyer et al. arXiv 2022. [Paper]
- Stanford alpaca: An instruction-following llama model. Taori et al. 2023. [Blog][Data]
- Vicuna: An open-source chatbot impressing gpt-4 with 90% chatgpt quality. Chiang et al. 2023. [Paper][Project][Data]
- Baize: An open-source chat model with parameter-efficient tuning on self-chat data. Xu et al. arXiv 2023. [Paper][Data]
- Improving multimodal interactive agents with reinforcement learning from human feedback. Abramson et al. arXiv 2022. [Paper]
- Aligning text-to-image models using human feedback. Lee et al. arXiv 2023. [Paper]
- Visual instruction tuning. Liu et al. arXiv 2023. [Paper][Project]
- Llavar: Enhanced visual instruction tuning for text-rich image understanding. Zhang et al. arXiv 2023. [Paper][Project]
Human Preferences
Human Demonstrations
- Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
- Learning to summarize from human feedback. Stiennon et al. Neurips 2020. [Paper][Project][Data]
- Recursively summarizing books with human feedback. Wu et al. arXiv 2021. [Paper][Data]
- Webgpt: Browser-assisted question-answering with human feedback. Nakano et al. arXiv 2021. [Paper][Data]
- OpenAssistant Conversations--Democratizing Large Language Model Alignment. Kopf et al. arXiv 2023. [Paper][Project][Data][Checkpoint]
- Reward design with language models. Kwon et al. arXiv 2023. [Paper]
Human Feedback
- Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
- Learning to summarize from human feedback. Stiennon et al. Neurips 2020. [Paper][Project][Data]
- Recursively summarizing books with human feedback. Wu et al. arXiv 2021. [Paper][Data]
- Webgpt: Browser-assisted question-answering with human feedback. Nakano et al. arXiv 2021. [Paper][Data]
- OpenAssistant Conversations--Democratizing Large Language Model Alignment. Kopf et al. arXiv 2023. [Paper][Project][Data][Checkpoint]
- Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis. Wu et al. arXiv 2023. [Paper][Project][Data]
Model Synthetic Feedback
- Reward design with language models. Kwon et al. arXiv 2023. [Paper]
- Aligning Large Language Models through Synthetic Feedback. Kim et al. arXiv 2023. [Paper]
- Training Socially Aligned Language Models in Simulated Human Society. Liu et al. arXiv 2023. [Paper][Project]
- Training Language Models with Language Feedback at Scale. Scheurer et al. arXiv 2023. [Paper][Data][Project]
- Visual Instruction Tuning with Polite Flamingo. Chen et al. arXiv 2023. [Paper][Project][Data]
Human Values
Value Principles
HHH (Helpful & Honest & Harmless)
- Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
- A general language assistant as a laboratory for alignment. Askell et al. arXiv 2021. [Paper]
- Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Data]
- Improving alignment of dialogue agents via targeted human judgements. Glaese et al. arXiv 2022. [Paper][Data]
- Constitutional ai: Harmlessness from ai feedback. Bai et al. arXiv 2022. [Paper][Data]
- Principle-driven self-alignment of language models from scratch with minimal human supervision. Sun et al. arXiv 2023. [Paper][Project][Data]
- Process for adapting language models to society (palms) with values-targeted datasets. Solaiman et al. Neurips 2021. [Paper]
Social Norms & Ethics
- The moral integrity corpus: A benchmark for ethical dialogue systems. Ziems et al. arXiv 2022. [Paper][Data]
- Social chemistry 101: Learning to reason about social and moral norms. Forbes et al. arXiv 2020. [Paper][Data]
- Moral stories: Situated reasoning about norms, intents, actions, and their consequences. Emelin et al. arXiv 2020. [Paper][Data]
- Aligning ai with shared human values. Hendrycks et al. arXiv 2020. [Paper][Data]
- Scruples: A corpus of community ethical judgments on 32,000 real-life anecdotes. Lourie et al. AAAI 2021. [Paper][Data]
- MoralDial: A Framework to Train and Evaluate Moral Dialogue Systems via Moral Discussions. Sun et al. ACL 2023. [Paper][Project]
- Learning norms from stories: A prior for value aligned agents. Nahian et al. AIES 2020. [Paper]
Basic Value Theory
- An overview of the Schwartz theory of basic values. Schwartz et al. Online readings in Psychology and Culture 2012. [Paper]
- Rokeach value survey. Rokeach et al. The nature of human values. 1967. [Paper]
- Life values inventory: Facilitator's guide. Brown et al. Williamsburg, VA 2002. [Paper]
- Moral foundations theory: The pragmatic validity of moral pluralism. Graham et al. Advances in experimental social psychology, 2013. [Paper]
Target Representation
Desirable Behaviors
- Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Data]
- Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. Ganguli et al. arXiv 2022. [Paper][Data]
- Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
- Aligning ai with shared human values. Hendrycks et al. arXiv 2020. [Paper][Data]
- Social bias frames: Reasoning about social and power implications of language. Sap et al. arXiv 2019. [Paper][Data]
Intrinsic Values
- Improving alignment of dialogue agents via targeted human judgements. Glaese et al. arXiv 2022. [Paper][Data]
- Constitutional ai: Harmlessness from ai feedback. Bai et al. arXiv 2022. [Paper][Data]
- Principle-driven self-alignment of language models from scratch with minimal human supervision. Sun et al. arXiv 2023. [Paper][Data]
- Process for adapting language models to society (palms) with values-targeted datasets. Solaiman et al. Neurips 2021. [Paper]
- The moral integrity corpus: A benchmark for ethical dialogue systems. Ziems et al. arXiv 2022. [Paper][Data]
- Social chemistry 101: Learning to reason about social and moral norms. Forbes et al. arXiv 2020. [Paper][Data]
- Moral stories: Situated reasoning about norms, intents, actions, and their consequences. Emelin et al. arXiv 2020. [Paper][Data]
- Can machines learn morality? the delphi experiment. Jiang et al. arXiv 2021. [Paper][Project]
Evaluation of Alignment Targets
Human Instructions
Benchmarks
- Multitask prompted training enables zero-shot task generalization. Sanh et al. arXiv 2021. [Paper][Checkpoint][Data]
- Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. Wang et al. arXiv 2022. [Paper][Data][Project]
- The flan collection: Designing data and methods for effective instruction tuning. Longpre et al. arXiv 2023. [Paper][Data]
- Opt-IML: Scaling language model instruction meta learning through the lens of generalization. Iyer et al. arXiv 2022. [Paper]
- Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Srivastava et al. arXiv 2022. [Paper][Project]
- C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. Huang et al. arXiv 2023. [Paper][Project]
- Agieval: A human-centric benchmark for evaluating foundation models. Zhong et al. arXiv 2023. [Paper][Project]
- Discovering language model behaviors with model-written evaluations. Perez et al. arXiv 2022. [Paper][Project]
Automatic Chatbot Arenas
- Alpacaeval: An automatic evaluator of instruction-following models. Li et al. 2023. [Project]
- Alpacafarm: A simulation framework for methods that learn from human feedback. Dubois et al. arXiv 2023. [Paper][Project]
- Vicuna: An open-source chatbot impressing gpt-4 with 90% chatgpt quality. Chiang et al. 2023. [Paper][Project]
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Zheng et al. arXiv 2023. [Paper][Project]
Human Preferences
Benchmarks
- TruthfulQA: Measuring how models mimic human falsehoods. Lin et al. arXiv 2022. [Paper][Data]
- Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. Mihaylov et al. EMNLP 2018. [Paper][Data]
- CrowS-pairs: A challenge dataset for measuring social biases in masked language models. Nangia et al. arXiv 2020. [Paper][Data]
- Gender bias in coreference resolution. Rudinger et al. arXiv 2018. [Paper][Data]
- BBQ: A hand-built bias benchmark for question answering. Parrish et al. arXiv 2021. [Paper][Data]
- Bold: Dataset and metrics for measuring biases in open-ended language generation. Dhamala et al. FAccT 2021. [Paper][Data]
- Realtoxicityprompts: Evaluating neural toxic degeneration in language models. Gehman et al. arXiv 2020. [Paper][Data]
- Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. Hartvigsen et al. arXiv 2022. [Paper][Data]
- Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Srivastava et al. arXiv 2022. [Paper][Project]
- Holistic evaluation of language models. Liang et al. arXiv 2022. [Paper][Project]
- Discovering language model behaviors with model-written evaluations. Perez et al. arXiv 2022. [Paper][Project]
- Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity. Zhuo et al. arXiv 2023. [Paper]
Human Evaluation
- Training language models to follow instructions with human feedback. Ouyang et al. Neurips 2022. [Paper]
- Llama 2: Open foundation and fine-tuned chat models. Touvron et al. arXiv 2023. [Paper]
- Rrhf: Rank responses to align language models with human feedback without tears. Yuan et al. arXiv 2023. [Paper][Data]
- Learning to summarize from human feedback. Stiennon et al. Neurips 2020. [Paper]
- Aligning Large Language Models through Synthetic Feedback. Kim et al. arXiv 2023. [Paper]
Reward Model
- Llama 2: Open foundation and fine-tuned chat models. Touvron et al. arXiv 2023. [Paper]
- Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Checkpoint]
- Direct preference optimization: Your language model is secretly a reward model. Rafailov et al. arXiv 2023. [Paper]
- Raft: Reward ranked finetuning for generative foundation model alignment. Dong et al. arXiv 2023. [Paper]
- Is reinforcement learning (not) for natural language processing?: Benchmarks, baselines, and building blocks for natural language policy optimization. Ramamurthy et al. arXiv 2022. [Paper][Project]
Human Values
Benchmarks
Safety and Risk
- Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper][Data]
- Safety Assessment of Chinese Large Language Models. Sun et al. arXiv 2023. [Paper][Data][Leaderboard]
- SafeText: A benchmark for exploring physical safety in language models. Levy et al. arXiv 2022. [Paper][Data]
- CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility. Xu et al. arXiv 2023. [Paper][Project][Data]
Social Norms
- The moral integrity corpus: A benchmark for ethical dialogue systems. Ziems et al. arXiv 2022. [Paper][Data]
- Social chemistry 101: Learning to reason about social and moral norms. Forbes et al. arXiv 2020. [Paper][Data]
- Moral stories: Situated reasoning about norms, intents, actions, and their consequences. Emelin et al. arXiv 2020. [Paper][Data]
- Aligning ai with shared human values. Hendrycks et al. arXiv 2020. [Paper][Data]
- Scruples: A corpus of community ethical judgments on 32,000 real-life anecdotes. Lourie et al. AAAI 2021. [Paper][Data]
- Moral mimicry: Large language models produce moral rationalizations tailored to political identity. Simmons et al. arXiv 2022. [Paper]
- When to make exceptions: Exploring language models as accounts of human moral judgment. Jin et al. Neurips 2022. [Paper][Project][Data]
- Towards Answering Open-ended Ethical Quandary Questions. Bang et al. arXiv 2022. [Paper]
Value Surveys
- Towards Measuring the Representation of Subjective Global Opinions in Language Models. Durmus et al. arXiv 2023. [Paper][Data]
- Culture's consequences: International differences in work-related values. Hofstede et al. 1984. [Paper]
- World Values Survey Wave 7 (2017-2022). [URL]
- European Values Study. [URL]
- Pew Research Center's Global Attitudes Surveys (GAS). [URL]
- An overview of the Schwartz theory of basic values. Schwartz et al. Online readings in Psychology and Culture 2012. [Paper]
- Probing pre-trained language models for cross-cultural differences in values. Arora et al. arXiv 2022. [Paper]
Reward Model
- Learning norms from stories: A prior for value aligned agents. Nahian et al. AIES 2020. [Paper]
- Training a helpful and harmless assistant with reinforcement learning from human feedback. Bai et al. arXiv 2022. [Paper]
- Can machines learn morality? the delphi experiment. Jiang et al. arXiv 2021. [Paper][Project]
- Valuenet: A new dataset for human value driven dialogue system. Qiu et al. AAAI 2022. [Paper][Project]
- Moral foundations twitter corpus: A collection of 35k tweets annotated for moral sentiment. Hoover et al. Social Psychological and Personality Science 2020. [Paper]
- Large pre-trained language models contain human-like biases of what is right and wrong to do. Schramowski et al. Nature Machine Intelligence 2022. [Paper]