Home

Awesome

<p align="center"> <img src="assets/logo.png" width="150" style="margin-bottom: 0.2;"/> <p>

πŸ“ Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions

<!-- Broader Real-World Applications --> <!-- # πŸ“ Marco-o1: An Open Large Reasoning Model for Real-World Solutions --> <!-- <h2 align="center"> <a href="https://github.com/AIDC-AI/Marco-o1/">Marco-o1</a></h2> --> <!-- <h5 align="center"> If you appreciate our project, please consider giving us a star ⭐ on GitHub to stay updated with the latest developments. </h2> --> <h4 align="center"> <!-- [![πŸ€—Hugging Face](https://img.shields.io/badge/πŸ€—Hugging_Face-Marco_o1-yellow)](https://huggingface.co/) [![Project Page](https://img.shields.io/badge/Project_Page-Marco_o1-blue)](https://github.com/AIDC-AI/Marco-o1/) --> <div align="center"> <img src="https://img.shields.io/badge/Version-1.0.0-blue.svg" alt="Version"> <img src="https://img.shields.io/badge/License-Apache%202.0-green.svg" alt="License"> <img src="https://img.shields.io/github/stars/AIDC-AI/Marco-o1?color=yellow" alt="Stars"> <img src="https://img.shields.io/github/issues/AIDC-AI/Marco-o1?color=red" alt="Issues"> <img src="https://img.shields.io/badge/python-3.8-purple.svg" alt="Python"> </h4> <div align="center"> <!-- **Affiliations:** -->

⭐ MarcoPolo Team ⭐

Alibaba International Digital Commerce

:octocat: Github πŸ€— Hugging Face πŸ“ Paper πŸ§‘β€πŸ’» Model πŸ—‚οΈ Data πŸ“½οΈ Demo

</div>

🎯 Marco-o1 not only focuses on disciplines with standard answers, such as mathematics, physics, and codingβ€”which are well-suited for reinforcement learning (RL)β€”but also places greater emphasis on open-ended resolutions. We aim to address the question: "Can the o1 model effectively generalize to broader domains where clear standards are absent and rewards are challenging to quantify?"

Currently, Marco-o1 Large Language Model (LLM) is powered by Chain-of-Thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS), reflection mechanisms, and innovative reasoning strategiesβ€”optimized for complex real-world problem-solving tasks.

⚠️ Limitations: <ins>We would like to emphasize that this research work is inspired by OpenAI's o1 (from which the name is also derived). This work aims to explore potential approaches to shed light on the currently unclear technical roadmap for large reasoning models. Besides, our focus is on open-ended questions, and we have observed interesting phenomena in multilingual applications. However, we must acknowledge that the current model primarily exhibits o1-like reasoning characteristics and its performance still fall short of a fully realized "o1" model. This is not a one-time effort, and we remain committed to continuous optimization and ongoing improvement.</ins>

<div align="center"> </div> <div align="center"> <img src="assets/strawberry_2.jpg" alt="Figure Description or Alt Text" width="80%"> <p><strong>Figure 1: </strong> A classic 'strawberry' question reasoned by our Marco-o1 model: "How many 'r' are in strawberry". Although the answer is correct, the final letter 'y' is overlooked during CoT. This is an interesting finding, which is discussed in issue https://github.com/AIDC-AI/Marco-o1/issues/3.</p> </div>

πŸš€ Highlights

Currently, our work is distinguished by the following highlights:

πŸ”₯ News

<!-- ## Coming Soon --> <!-- This is our initial version, and we will continue to update and enhance the model's reasoning capabilities. -->
  1. [Coming Soon] πŸƒ Reward Models: We are working on training reward models, including Outcome Reward Modeling (ORM) and Process Reward Modeling (PRM), to provide a more accurate reward signal for MCTS. A more precise reward function will help reduce randomness in tree search results and improve overall performance.

  2. [Coming Soon] πŸƒ Reinforcement Learning: We are conducting reinforcement learning training to develop better reasoning models. By utilizing RL techniques, we aim to refine the model's decision-making processes and further enhance its problem-solving abilities.

πŸ”” Introduction

OpenAI recently introduced the groundbreaking o1 model, renowned for its exceptional reasoning capabilities. This model has demonstrated outstanding performance on platforms such as AIME and CodeForces, surpassing other leading models. Inspired by this success, we aimed to push the boundaries of LLMs even further, enhancing their reasoning abilities to tackle complex, real-world challenges.

🌍 Marco-o1 leverages advanced techniques like CoT fine-tuning, MCTS, and Reasoning Action Strategies to enhance its reasoning power. As shown in Figure 2, by fine-tuning Qwen2-7B-Instruct with a combination of the filtered Open-O1 CoT dataset, Marco-o1 CoT dataset, and Marco-o1 Instruction dataset, Marco-o1 improved its handling of complex tasks. MCTS allows exploration of multiple reasoning paths using confidence scores derived from softmax-applied log probabilities of the top-k alternative tokens, guiding the model to optimal solutions. Moreover, our reasoning action strategy involves varying the granularity of actions within steps and mini-steps to optimize search efficiency and accuracy.

<div align="center"> <img src="assets/intro_2.jpg" alt="Figure Description or Alt Text" width="80%"> <p><strong>Figure 2: </strong>The overview of Marco-o1.</p> </div>

🌏 As shown in Figure 3, Marco-o1 achieved accuracy improvements of +6.17% on the MGSM (English) dataset and +5.60% on the MGSM (Chinese) dataset, showcasing enhanced reasoning capabilities.

<div align="center"> <img src="assets/results.jpg" alt="Figure Description or Alt Text" width="80%"> <p><strong>Figure 3: </strong>The main results of Marco-o1.</p> </div>

🌎 Additionally, in translation tasks, we demonstrate that Marco-o1 excels in translating slang expressions, such as translating "θΏ™δΈͺιž‹ζ‹₯ζœ‰θΈ©ε±Žζ„Ÿ" (literal translation: "This shoe offers a stepping-on-poop sensation.") to "This shoe has a comfortable sole," demonstrating its superior grasp of colloquial nuances.

<div align="center"> <img src="assets/translation.jpg" alt="Figure Description or Alt Text" width="80%"> <p><strong>Figure 4: </strong>The demonstration of translation task using Marco-o1.</p> </div> <!-- ## 🎨 Case Show This is a classic example from our Marco-o1 model, "How many 'r's are in the word 'strawberry'?" Through multiple search steps, the correct answer was obtained, as shown in Figure 2. Although we tested general reasoning capabilities, our primary focus is on tackling challenging translation problems. An example of this focus is shown in Figure 3, illustrating the translation of the sentence "This shoe has a comfortable sole and is highly recommended for purchase." --> <!-- <div align="center"> <img src="assets/strawberry_2.jpg" alt="Figure Description or Alt Text" width="80%"> <p><strong>Figure 2: </strong>Reasoning example of "How many 'r' are in strawberry"</p> </div> -->

πŸ“š Marco Reasoning Datasets

To enhance the reasoning capabilities of the Marco-o1 model, we employed a SFT strategy using a variety of datasets.

Dataset#Samples
Open-O1 CoT Dataset (Filtered)45,125
Marco-o1 CoT Dataset (Synthetic)10,000
Marco Instruction Dataset5,141
Total60,266

πŸ“₯ Marco Reasoning Dataset (Our Partial Dataset)

🌟 Solution Space Expansion via MCTS

We integrated LLMs with MCTS to enhance the reasoning capabilities of our Marco-o1 model:

Furthermore, we obtain the value of each state by computing a confidence score using the following formulas:

  1. Confidence Score ($c_i$):

    For each token $t_i$ generated during the rollout, we calculate its confidence score by applying the softmax function to its log probability and the log probabilities of the top 5 alternative tokens. This is given by:

    $$c_i = \frac{\exp(p(t_i))}{\sum_{k=1}^{5} \exp(p(t_k))}$$

    where:

    • $c_i$ is the confidence score for the $i^{th}$ token in the rollout.
    • $p(t_i)$ is the log probability of the $i^{th}$ token generated by the LLM.
    • $p(t_k)$ for $k = 1$ to $5$ are the log probabilities of the top 5 predicted tokens at the $i^{th}$ step.
    • $n$ is the total number of tokens in the rollout sequence.

    This equation ensures that the confidence score reflects the relative probability of the chosen token compared to the top alternatives, effectively normalizing the scores between 0 and 1.

  2. Reward Score ($v$):

    After obtaining the confidence scores for all tokens in the rollout sequence, we compute the average confidence score across all tokens to derive the overall reward score:

    $$v = \frac{1}{n} \sum_{i=1}^{n} c_i$$

    where: $v$ is the overall reward score for the rollout path.

    This average serves as the reward signal that evaluates the quality of the reasoning path taken during the rollout. A higher $v$ indicates a more confident and likely accurate reasoning path.

By employing this method, we effectively expand the solution space, allowing the model to explore a vast array of reasoning paths and select the most probable ones based on calculated confidence scores.

🌟 Reasoning Action Strategy

✨ Action Selection

We observed that using actions as the granularity for MCTS search is relatively coarse, often causing the model to overlook nuanced reasoning paths crucial for solving complex problems. To address this, we explored different levels of granularity in the MCTS search. Initially, we used steps as the unit of search. To further expand the model's search space and enhance its problem-solving capabilities, we experimented with dividing these steps into smaller units of 64 or 32 tokens, referred to as "mini-step." This finer granularity allowed the model to explore reasoning paths in greater detail. While token-level search offers theoretical maximum flexibility and granularity, it is currently impractical due to the significant computational resources required and the challenges associated with designing an effective reward model at this level.

In our experiments, we implemented the following strategies within the MCTS framework:

✨ Reflection after Thinking

We introduced a reflection mechanism by adding the phrase "Wait! Maybe I made some mistakes! I need to rethink from scratch." at the end of each thought process. This prompts the model to self-reflect and reevaluate its reasoning steps. Implementing this reflection has yielded significant improvements, especially on difficult problems that the original model initially solved incorrectly. With the addition of reflection, approximately half of these challenging problems were answered correctly.

From the self-critic perspective, this approach allows the model to act as its own critic, identifying potential errors in its reasoning. By explicitly prompting the model to question its initial conclusions, we encourage it to re-express and refine its thought process. This self-critical mechanism leverages the model's capacity to detect inconsistencies or mistakes in its own output, leading to more accurate and reliable problem-solving. The reflection step serves as an internal feedback loop, enhancing the model's ability to self-correct without external intervention.

🎯 Experimental Results

Based on πŸ’‘ Qwen2-7B-Instruct, we performed SFT using our training data to create πŸ’‘ Marco-o1-CoT. Besides, we employed Marco-o1-CoT within the framework of MCTS tree search, differentiating by actions:

During testing, each model utilized a CoT prompt to ensure consistency in reasoning processes. We then tested these configurations on the English (En) and Chinese (Zh) subsets of the MGSM dataset, obtaining the following results:

ModelMGSM-En (Acc.)MGSM-Zh (Acc.)
Qwen2-7B-Instruct84.00%76.80%
Marco-o1-CoT85.60%71.20%
Marco-o1-MCTS (step)90.40%80.00%
Marco-o1-MCTS (mini-step of 64 tokens)88.40%80.40%
Marco-o1-MCTS (mini-step of 32 tokens)87.60%82.40%

πŸ“₯ Marco-o1-CoT (Our Lastest Model)

πŸ”¬ These results demonstrate that:

  1. Performance of Marco-o1-CoT vs. Qwen2-7B-Instruct:

    • In the MGSM-en dataset, Marco-o1-CoT shows an advantage over Qwen2-7B-Instruct, as shown in Figure 4, which is expected due to the fine-tuning with English CoT data.
    • In the MGSM-zh dataset, however, Marco-o1-CoT exhibits a decrease in performance compared to Qwen2-7B-Instruct. This decline is attributed to the fact that the CoT data used for fine-tuning was in English, which may not transfer effectively to the Chinese dataset.
  2. Impact of MCTS Integration:

    • The three MCTS-enhanced models demonstrate improvements over Marco-o1-CoT, indicating that incorporating MCTS helps to expand the model's solution space and increase the probability of obtaining correct answers.
    • However, since we use the Confidence Score as the reward, the tree search results exhibit significant randomness. In MGSM-en, the "step as Action" strategy performs the best, while in MGSM-zh, the "mini-step as Action (32)" strategy yields the highest accuracy.
    • Currently, as shown in Figure 5-6, we cannot draw definitive conclusions about which action strategy is superior. We believe that as the reward becomes more accurate, the larger solution space provided by MCTS will demonstrate greater potential.
<div align="center"> <img src="assets/cot-step.jpg" alt="Figure Description or Alt Text" width="100%"> <p><strong>Figure 4: </strong>MCTS Expands the Solution Space for Correct Answers. Comparison between Marco-o1-CoT (left) and Marco-o1-MCTS (step) (right) on the MGSM dataset. While Marco-o1-CoT failed to provide the correct answer, integrating MCTS with step-level actions allowed the model to explore a broader solution space, increasing the likelihood of arriving at the correct solution.</p> </div> <div align="center"> <img src="assets/step-ministep32.jpg" alt="Figure Description or Alt Text" width="100%"> <p><strong>Figure 5: </strong>Finer Granularity with mini-steps Enhances Problem-Solving. Comparison between Marco-o1-MCTS (step) (left) and Marco-o1-MCTS (mini-step of 32 tokens) (right) on the MGSM dataset. The step-level action strategy did not yield the correct answer, but by using a finer-grained mini-step of 32 tokens, the model successfully navigated the solution space to find the correct answer, demonstrating the effectiveness of increased action granularity.</p> </div> <div align="center"> <img src="assets/ministep64-step.jpg" alt="Figure Description or Alt Text" width="100%"> <p><strong>Figure 6: </strong>Optimal Action Granularity Depends on Problem Complexity. Comparison between Marco-o1-MCTS (mini-step of 64 tokens) (left) and Marco-o1-MCTS (step) (right) on the MGSM dataset. The model with a mini-step of 64 tokens failed to find the correct answer, whereas using step-level actions enabled the model to correctly solve the problem. This highlights that we cannot draw definitive conclusions about which action strategy is superior. We believe that as the reward becomes more accurate, the larger solution space provided by MCTS will demonstrate greater potential. </p> </div>

Furthermore, we use Test@N to denote the percentage of problems solved correctly at least once when allowing the model to make N separate guesses for each problem. We evaluated solve rates at Test@1, Test@8, and Test@32. The results demonstrate that MCTS shows an advantage with a lower number of separate guesses (Test@1). This reveals the potential of MCTS. In future work, we plan to train the reward model (RM) in conjunction with MCTS to continue optimizing our approach.

<div align="center"> <img src="assets/test@k.jpg" alt="Figure Description or Alt Text" width="80%"> </div>

These results demonstrate the effectiveness of our approach in enhancing the reasoning capabilities of the model across different languages and configurations.

Case Study of Translation Task

We have also conducted some open-ended tests on our models, such as translation issues and achieved some positive results. In the future, we will continue to explore other areas and improve the model's related performance.

<div align="center"> <img src="assets/trans-case-1.jpg" alt="Figure Description or Alt Text" width="60%"> <p><strong>Figure 7: </strong>Translation comparison of a colloquial expression β€œIt’s so beautiful that it’s captivating, the upper part has a distinctly Korean style, the soft and fluffy material is perfectly thick, and it’s complemented by a base layer, creating a unique and everyday-wear outfit”.</p> </div> <div align="center"> <img src="assets/trans-case-2.jpg" alt="Figure Description or Alt Text" width="60%"> <p><strong>Figure 8: </strong> Translation comparison of a colloquial expression β€œIt’s so beautiful! And it’s so cheap, super straight and doesn’t curl. Buy it, buy it!”.</p> </div>

⚑️ Released Resources

Models and Datasets

πŸ“₯ Marco-o1-CoT (Our Lastest Model)

πŸ“₯ Marco Reasoning Dataset (Our Partial Dataset)

Installation

To install Marco-o1, follow these steps:

# Clone the repository
git clone https://github.com/AIDC-AI/Marco-o1

# Change to the Macaw-LLM directory
cd Marco-o1

# Install required packages
pip install -r requirements.txt

Usage

  1. Load Marco-o1-CoT model:

    # Load model directly
    from transformers import AutoTokenizer, AutoModelForCausalLM
    
    tokenizer = AutoTokenizer.from_pretrained("AIDC-AI/Marco-o1")
    model = AutoModelForCausalLM.from_pretrained("AIDC-AI/Marco-o1")
    
  2. Inference:

    Execute the inference script (you can give any customized inputs inside):

    ./src/talk_with_model.py
    
    # Use vLLM
    ./src/talk_with_model_vllm.py
    
  3. Deploy using FastAPI:

    Check the README.md file in examples folder.

πŸ‘¨πŸ»β€πŸ’» Acknowledgement

Main Contributors

From MarcoPolo Team, AI Business, Alibaba International Digital Commerce:

Citation

If you find Marco-o1 useful for your research and applications, please cite:

@misc{zhao2024marcoo1openreasoningmodels,
      title={Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions}, 
      author={Yu Zhao and Huifeng Yin and Bo Zeng and Hao Wang and Tianqi Shi and Chenyang Lyu and Longyue Wang and Weihua Luo and Kaifu Zhang},
      year={2024},
      eprint={2411.14405},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.14405}, 
}

LICENSE

This project is licensed under Apache License Version 2 (SPDX-License-identifier: Apache-2.0).

DISCLAIMER

We used compliance checking algorithms during the training process, to ensure the compliance of the trained model and dataset to the best of our ability. Due to complex data and the diversity of language model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe anything infringes on your rights or generates improper content, please contact us, and we will promptly address the matter.