Awesome LLM Reasoning OpenAI-o1 Survey
A curated list of related work and background techniques for OpenAI o1, including LLM reasoning, self-play reinforcement learning, complex logical reasoning, and scaling laws.
Introduction
Survey Papers
- A Survey on Self-play Methods in Reinforcement Learning [Paper] (2024)
- Ruize Zhang, Zelai Xu, Chengdong Ma, Chao Yu, Wei-Wei Tu, Shiyu Huang, Deheng Ye, Wenbo Ding, Yaodong Yang, Yu Wang
- Tencent, Tsinghua
Related Papers
Complex Logical Reasoning
- Generative Language Modeling for Automated Theorem Proving [Paper] (2020)
- Stanislas Polu, Ilya Sutskever
- OpenAI
- Hypothesis Search: Inductive Reasoning with Language Models [Paper] (ICLR 2024)
- Ruocheng Wang, Eric Zelikman, Gabriel Poesia, Yewen Pu, Nick Haber, Noah D. Goodman
- Stanford, Autodesk Research
- Phenomenal Yet Puzzling: Testing Inductive Reasoning Capabilities of Language Models with Hypothesis Refinement [Paper] (ICLR 2024)
- Linlu Qiu, Liwei Jiang, Ximing Lu, Melanie Sclar, Valentina Pyatkin, Chandra Bhagavatula, Bailin Wang, Yoon Kim, Yejin Choi, Nouha Dziri, Xiang Ren
- MIT, Allen AI, UW, USC
- Training Verifiers to Solve Math Word Problems [Paper] (2021)
- Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman
- OpenAI
- To CoT or not to CoT? Chain-of-thought Helps Mainly on Math and Symbolic Reasoning [Paper] (2024.9)
- Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, Greg Durrett
- The University of Texas at Austin, Johns Hopkins University, Princeton University
Reasoning Bootstrapping
- STaR: Self-Taught Reasoner Bootstrapping Reasoning With Reasoning [Paper] [Github] (NeurIPS 2022)
- Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah D. Goodman
- Stanford, Google
- Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking [Paper] [Github] (2024)
- Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, Noah D. Goodman
- Stanford, Notbad AI
- Training Chain-of-thought via Latent-variable Inference [Paper] (NeurIPS 2023)
- Du Phan, Matthew D. Hoffman, David Dohan, Sholto Douglas, Tuan Anh Le, Aaron Parisi, Pavel Sountsov, Charles Sutton, Sharad Vikram, Rif A. Saurous
- Chain-of-thought Reasoning without Prompting [Paper] (2024)
- Xuezhi Wang, Denny Zhou
- Google DeepMind
- Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers [Paper] [Github] (2024)
- Zhenting Qi, Mingyuan Ma, Jiahang Xu, Li Lyna Zhang, Fan Yang, Mao Yang
- MSRA, Harvard University
Reasoning Scaling Law
- Large Language Monkeys: Scaling Inference Compute with Repeated Sampling [Paper] (2024)
- Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, Azalia Mirhoseini
- Stanford, Oxford, Google DeepMind
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters [Paper] (2024)
- Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar
- UC Berkeley, Google DeepMind
- An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models [Paper] (2024)
- Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, Yiming Yang
- Tsinghua, CMU
- Training Language Models to Self-Correct via Reinforcement Learning [Paper] (2024)
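- Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, JD Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, Aleksandra Faust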
- Google DeepMind
- From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond [Paper](https://arxiv.org/abs/2411.03590) (2024)
- Harsha Nori, Naoto Usuyama, Nicholas King, Scott Mayer McKinney, Xavier Fernandes, Sheng Zhang, Eric Horvitz
- Microsoft, OpenAI
Self-play Learning
- Mastering Chess and Shogi by Self-play with a General Reinforcement Learning Algorithm [Paper] (2017)
- David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, Demis Hassabis
- Google DeepMind
- Language Models Can Teach Themselves to Program Better [Paper] [Github] (ICLR 2023)
- Patrick Haluptzok, Matthew Bowers, Adam Tauman Kalai
- Microsoft Research, MIT
- Large Language Models Can Self-Improve [Paper]
- Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, Jiawei Han
- University of Illinois at Urbana-Champaign, Google
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [Paper] [Github] (ICML 2024)
- Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, Quanquan Gu
- UCLA
- Self-Play Preference Optimization for Language Model Alignment [Paper] [Github] (2024)
- Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, Quanquan Gu
- UCLA
- Scalable Online Planning via Reinforcement Learning Fine-Tuning [Paper] (NeurIPS 2021)
- Arnaud Fickinger, Hengyuan Hu, Brandon Amos, Stuart Russell, Noam Brown
- Generative Verifiers: Reward Modeling as Next-Token Prediction [Paper] (2024)
- Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, Rishabh Agarwal
- Google DeepMind
- Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B [Paper] (2024)
- Di Zhang, Xiaoshui Huang, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang
- Fudan University, Shanghai AI Lab
- Interpretable Contrastive Monte Carlo Tree Search Reasoning [Paper] (2024)
- Zitian Gao, Boye Niu, Xuzheng He, Haotian Xu, Hongzhang Liu, Aiwei Liu, Xuming Hu, Lijie Wen
- The University of Sydney, Peking University, Xiaohongshu, Shanghai AI Lab, Tsinghua, HKUST
Step-wise and Process-based Optimization
- Solving Math Word Problems with Process- and Outcome-based Feedback [Paper] (2022)
- Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, Irina Higgins
- Google DeepMind
- Thinking Fast and Slow With Deep Learning and Tree Search [Paper] (NeurIPS 2017)
- Thomas Anthony, Zheng Tian, David Barber
- University College London, Alan Turing Institute
- Let’s Verify Step by Step [Paper] (2023)
- Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe
- OpenAI
- LLM Critics Help Catch LLM Bugs [Paper] (2024)
- Nat McAleese, Rai Michael Pokorny, Juan Felipe Ceron Uribe, Evgenia Nitishinskaya, Maja Trebacz, Jan Leike
- OpenAI
- Self-critiquing Models for Assisting Human Evaluators [Paper] (2022)
- William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, Jan Leike
- OpenAI
- Improve Mathematical Reasoning in Language Models by Automated Process Supervision [Paper] (2024)
- Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, Abhinav Rastogi
- Google DeepMind
- Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [Paper] (2024)
- Chaojie Wang, Yanchen Deng, Zhiyi Lyu, Liang Zeng, Jujie He, Shuicheng Yan, Bo An
- Skywork AI, NTU
- Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations [Paper] (ACL 2024)
- Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, Zhifang Sui
- Peking University, DeepSeek AI, HKU, Tsinghua University, The Ohio State University
Social News
Open-source Projects
Communication Groups
Contributions
We welcome every researcher who contributes to this repository.