InternLM-Math

<div align="center"> <img src="https://raw.githubusercontent.com/InternLM/InternLM/main/assets/logo.svg" width="200"/> <div> </div> <div align="center"> <b><font size="5">InternLM-Math</font></b> <sup> <a href="https://internlm.intern-ai.org.cn/"> <i><font size="4">HOT</font></i> </a> </sup> <div> </div> </div>

State-of-the-art bilingual open-source math reasoning LLMs: a solver, prover, verifier, and augmentor.

📑 Paper 💻 Github 🤗 Demo 🤗 Checkpoints OpenXLab <img src="./assets/modelscope_logo.png" width="20px" /> ModelScope

</div>

News

InternLM2-Math-Plus

Checkpoints

| Model | Model Type | Transformers (HF) | ModelScope | Release Date |
| --- | --- | --- | --- | --- |
| InternLM2-Math-Plus-1.8B | Chat | 🤗 internlm/internlm2-math-plus-1_8b | Shanghai_AI_Laboratory/internlm2-math-plus-1_8b | 2024-05-27 |
| InternLM2-Math-Plus-7B | Chat | 🤗 internlm/internlm2-math-plus-7b | Shanghai_AI_Laboratory/internlm2-math-plus-7b | 2024-05-27 |
| InternLM2-Math-Plus-20B | Chat | 🤗 internlm/internlm2-math-plus-20b | Shanghai_AI_Laboratory/internlm2-math-plus-20b | 2024-05-27 |
| InternLM2-Math-Plus-Mixtral8x22B | Chat | 🤗 internlm/internlm2-math-plus-mixtral8x22b | Shanghai_AI_Laboratory/internlm2-math-plus-mixtral8x22b | 2024-05-27 |

Formal Math Reasoning

We evaluate the performance of InternLM2-Math-Plus on the formal math reasoning benchmark MiniF2F-test. The evaluation setting is the same as Llemma's, using Lean 4.

This is how to reproduce our performance on MiniF2F.
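For orientation, each MiniF2F problem is a formal theorem statement that the model must complete with a proof. The snippet below is a toy Lean 4 statement and proof of our own (not from the benchmark), assuming a toolchain where the `omega` tactic is available (e.g. with Mathlib):

```lean
-- Toy competition-style statement (our own illustration, not from MiniF2F):
-- from a + b = 10 and a = 4 over the naturals, conclude b = 6.
theorem toy_example (a b : Nat) (h₁ : a + b = 10) (h₂ : a = 4) : b = 6 := by
  omega
```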

| Models | MiniF2F-test |
| --- | --- |
| ReProver | 26.5 |
| LLMStep | 27.9 |
| GPT-F | 36.6 |
| HTPS | 41.0 |
| Llemma-7B | 26.2 |
| Llemma-34B | 25.8 |
| InternLM2-Math-7B-Base | 30.3 |
| InternLM2-Math-20B-Base | 29.5 |
| InternLM2-Math-Plus-1.8B | 38.9 |
| InternLM2-Math-Plus-7B | 43.4 |
| InternLM2-Math-Plus-20B | 42.6 |
| InternLM2-Math-Plus-Mixtral8x22B | 37.3 |

Informal Math Reasoning

We evaluate the performance of InternLM2-Math-Plus on the informal math reasoning benchmarks MATH and GSM8K. InternLM2-Math-Plus-1.8B outperforms MiniCPM-2B in the smallest-size setting. InternLM2-Math-Plus-7B outperforms Deepseek-Math-7B-RL, the state-of-the-art open-source math reasoning model. InternLM2-Math-Plus-Mixtral8x22B achieves 68.5 on MATH (with Python) and 91.8 on GSM8K.

For tool-calling inference and evaluation, please see the agent section.

| Model | MATH | MATH-Python | GSM8K |
| --- | --- | --- | --- |
| MiniCPM-2B | 10.2 | - | 53.8 |
| InternLM2-Math-Plus-1.8B | 37.0 | 41.5 | 58.8 |
| InternLM2-Math-7B | 34.6 | 50.9 | 78.1 |
| Deepseek-Math-7B-RL | 51.7 | 58.8 | 88.2 |
| InternLM2-Math-Plus-7B | 53.0 | 59.7 | 85.8 |
| InternLM2-Math-20B | 37.7 | 54.3 | 82.6 |
| InternLM2-Math-Plus-20B | 53.8 | 61.8 | 87.7 |
| Mixtral8x22B-Instruct-v0.1 | 41.8 | - | 78.6 |
| Eurux-8x22B-NCA | 49.0 | - | - |
| InternLM2-Math-Plus-Mixtral8x22B | 58.1 | 68.5 | 91.8 |

We also evaluate models on MathBench-A. InternLM2-Math-Plus-Mixtral8x22B performs comparably to Claude 3 Opus.

| Model | Arithmetic | Primary | Middle | High | College | Average |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4o-0513 | 77.7 | 87.7 | 76.3 | 59.0 | 54.0 | 70.9 |
| Claude 3 Opus | 85.7 | 85.0 | 58.0 | 42.7 | 43.7 | 63.0 |
| Qwen-Max-0428 | 72.3 | 86.3 | 65.0 | 45.0 | 27.3 | 59.2 |
| Qwen-1.5-110B | 70.3 | 82.3 | 64.0 | 47.3 | 28.0 | 58.4 |
| Deepseek-V2 | 82.7 | 89.3 | 59.0 | 39.3 | 29.3 | 59.9 |
| Llama-3-70B-Instruct | 70.3 | 86.0 | 53.0 | 38.7 | 34.7 | 56.5 |
| InternLM2-Math-Plus-Mixtral8x22B | 77.5 | 82.0 | 63.6 | 50.3 | 36.8 | 62.0 |
| InternLM2-Math-20B | 58.7 | 70.0 | 43.7 | 24.7 | 12.7 | 42.0 |
| InternLM2-Math-Plus-20B | 65.8 | 79.7 | 59.5 | 47.6 | 24.8 | 55.5 |
| Llama3-8B-Instruct | 54.7 | 71.0 | 25.0 | 19.0 | 14.0 | 36.7 |
| InternLM2-Math-7B | 53.7 | 67.0 | 41.3 | 18.3 | 8.0 | 37.7 |
| Deepseek-Math-7B-RL | 68.0 | 83.3 | 44.3 | 33.0 | 23.0 | 50.3 |
| InternLM2-Math-Plus-7B | 61.4 | 78.3 | 52.5 | 40.5 | 21.7 | 50.9 |
| MiniCPM-2B | 49.3 | 51.7 | 18.0 | 8.7 | 3.7 | 26.3 |
| InternLM2-Math-Plus-1.8B | 43.0 | 43.3 | 25.4 | 18.9 | 4.7 | 27.1 |

Introduction (For InternLM2-Math)

(Figures: MATH sampling performance and Hungarian exam results.)

Models

InternLM2-Math-Base-7B and InternLM2-Math-Base-20B are pretrained checkpoints. InternLM2-Math-7B and InternLM2-Math-20B are SFT checkpoints.

| Model | Model Type | Transformers (HF) | OpenXLab | ModelScope | Release Date |
| --- | --- | --- | --- | --- | --- |
| InternLM2-Math-Base-7B | Base | 🤗 internlm/internlm2-math-base-7b | Open in OpenXLab | <img src="./assets/modelscope_logo.png" width="20px" /> internlm2-math-base-7b | 2024-01-23 |
| InternLM2-Math-Base-20B | Base | 🤗 internlm/internlm2-math-base-20b | Open in OpenXLab | <img src="./assets/modelscope_logo.png" width="20px" /> internlm2-math-base-20b | 2024-01-23 |
| InternLM2-Math-7B | Chat | 🤗 internlm/internlm2-math-7b | Open in OpenXLab | <img src="./assets/modelscope_logo.png" width="20px" /> internlm2-math-7b | 2024-01-23 |
| InternLM2-Math-20B | Chat | 🤗 internlm/internlm2-math-20b | Open in OpenXLab | <img src="./assets/modelscope_logo.png" width="20px" /> internlm2-math-20b | 2024-01-23 |

Performance

Pretrain Performance

We evaluate pretrained checkpoints with greedy decoding and few-shot chain-of-thought (CoT) prompting; the MAJ@K columns report majority voting over K sampled answers (a sketch of the metric follows the table below). Details of pretraining will be introduced in the tech report.

| Model | GSM8K MAJ@1 | GSM8K MAJ@100 | MATH MAJ@1 | MATH MAJ@256 |
| --- | --- | --- | --- | --- |
| Llama2-7B | 14.6 | - | 2.5 | - |
| Llemma-7B | 36.4 | 54.0 | 18.0 | 33.5 |
| InternLM2-Base-7B | 36.5 | - | 8.6 | - |
| InternLM2-Math-Base-7B | 49.2 | 75.7 | 21.5 | 35.6 |
| Minerva-8B | 16.2 | 28.4 | 14.1 | 25.4 |
| InternLM2-Base-20B | 54.6 | - | 13.7 | - |
| InternLM2-Math-Base-20B | 63.7 | 84.8 | 27.3 | 46.2 |
| Llemma-34B | 51.5 | 69.3 | 25.0 | 43.1 |
| Minerva-62B | 52.4 | 68.5 | 27.6 | 43.4 |
| Minerva-540B | 58.8 | 78.5 | 33.6 | 50.3 |
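For reference, the MAJ@K metric (majority voting over K sampled answers) can be computed roughly as in the sketch below. This is our own illustration, not the evaluation code we used; `answer_of` is a hypothetical stand-in for whatever final-answer extraction and normalization the benchmark requires.

```python
from collections import Counter

def maj_at_k(samples, reference, answer_of=lambda s: s):
    """Majority voting over K sampled completions for one problem.

    samples:   list of K model completions
    reference: the gold answer
    answer_of: extracts/normalizes the final answer from a completion
               (hypothetical helper; the real pipeline parses CoT output)
    """
    answers = [answer_of(s) for s in samples]
    majority, _ = Counter(answers).most_common(1)[0]
    return majority == reference

# MAJ@1 is single-sample accuracy; MAJ@K votes over K samples per problem.
problems = [(["6", "6", "5"], "6"), (["12", "3", "3"], "3")]
accuracy = sum(maj_at_k(s, ref) for s, ref in problems) / len(problems)
print(accuracy)  # 1.0 on this toy data
```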

We evaluate pretrained checkpoints few-shot on MiniF2F. Please see eval/pretrain/minif2f for the evaluation setup.

| Model | MiniF2F-test |
| --- | --- |
| ReProver | 26.5 |
| LLMStep | 27.9 |
| Code-Llama-7B | 26.2 |
| Code-Llama-34B | 25.8 |
| Llemma-7B | 26.2 |
| Llemma-34B | 25.8 |
| InternLM2-Math-7B-Base | 30.3 |
| InternLM2-Math-20B-Base | 29.5 |

SFT Performance

All results are based on greedy decoding with chain-of-thought prompting. We notice that the Hungarian exam score varies considerably across our checkpoints, while performance on the other benchmarks is very stable; this may be due to the small number of problems in the Hungarian exam.

| Model | Model Type | GSM8K | MATH | Hungarian Exam |
| --- | --- | --- | --- | --- |
| Qwen-7B-Chat | General | 51.7 | 11.6 | - |
| DeepSeek-7B-Chat | General | 63.0 | 15.8 | 28.5 |
| InternLM2-Chat-7B | General | 70.7 | 23.0 | - |
| ChatGLM3-6B | General | 53.8 | 20.4 | 32 |
| MetaMath-Mistral-7B | Mathematics | 77.7 | 28.2 | 29 |
| MetaMath-Llemma-7B | Mathematics | 69.2 | 30.0 | - |
| InternLM2-Math-7B | Mathematics | 78.1 | 34.6 | 55 |
| InternLM2-Chat-20B | General | 79.6 | 31.9 | - |
| MetaMath-Llemma-34B | Mathematics | 75.8 | 34.8 | - |
| InternLM2-Math-20B | Mathematics | 82.6 | 37.7 | 66 |
| Qwen-72B | General | 78.9 | 35.2 | 52 |
| DeepSeek-67B | General | 84.1 | 32.6 | 58 |
| ChatGPT (GPT-3.5) | General | 80.8 | 34.1 | 41 |
| GPT4 (First version) | General | 92.0 | 42.5 | 68 |

Code Interpreter Performance

All results are based on interacting with a Python code interpreter.

| Model | GSM8K | MATH |
| --- | --- | --- |
| DeepSeek-Coder-Instruct-7B | 62.8 | 28.6 |
| DeepSeek-Coder-Instruct-1.5-7B | 72.6 | 34.1 |
| ToRA-7B | 72.6 | 44.6 |
| MathCODER-CL-7B | 67.8 | 30.2 |
| InternLM2-Chat-7B | 77.9 | 45.1 |
| InternLM2-Math-7B | 79.4 | 50.9 |
| ToRA-13B | 75.8 | 48.1 |
| MathCODER-CL-13B | 74.1 | 35.9 |
| InternLM2-Chat-20B | 84.5 | 51.2 |
| InternLM2-Math-20B | 80.7 | 54.3 |
| MathCODER-CL-34B | 81.7 | 45.2 |
| ToRA-70B | 84.3 | 49.7 |
| GPT-4 Code Interpreter * | 97.0 | 69.7 |

Eval

You can easily evaluate InternLM2-Math on a wide range of mathematical datasets, such as MATH and GSM8K, using OpenCompass with a single command. To get started, run the following in your terminal after installing OpenCompass:

```bash
python run.py --models hf_internlm2_chat_math_7b --datasets gsm8k_gen math_gen_736506
```

Alternatively, for a streamlined experience, you can utilize a predefined configuration file. To do this, run the command below, making sure to adjust the arguments according to your requirements:

```bash
python run.py config/eval_internlm_math_chat.py
```
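For orientation, such a configuration file usually just gathers model and dataset configs via OpenCompass's `read_base` mechanism. The sketch below is an illustrative guess at what a file like this contains; the import paths are assumptions and may differ across OpenCompass versions, so treat it as a template rather than the actual contents of `config/eval_internlm_math_chat.py`.

```python
# Illustrative OpenCompass config (module paths are assumptions; adjust them to
# your local OpenCompass checkout).
from mmengine.config import read_base

with read_base():
    from .datasets.gsm8k.gsm8k_gen import gsm8k_datasets
    from .datasets.math.math_gen_736506 import math_datasets
    from .models.hf_internlm.hf_internlm2_chat_math_7b import models

datasets = [*gsm8k_datasets, *math_datasets]
```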

Inference

LMDeploy

We suggest using LMDeploy (>= 0.2.1) for inference.

```python
from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig

# Use the internlm2-chat-7b chat template with an empty system prompt.
backend_config = TurbomindEngineConfig(model_name='internlm2-chat-7b', tp=1, cache_max_entry_count=0.3)
chat_template = ChatTemplateConfig(model_name='internlm2-chat-7b', system='', eosys='', meta_instruction='')
pipe = pipeline(model_path='internlm/internlm2-math-7b', chat_template_config=chat_template, backend_config=backend_config)

problem = '1+1='
# Greedy decoding (top_k=1) with up to 1024 generated tokens.
result = pipe([problem], request_output_len=1024, top_k=1)
```
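The pipeline returns one response object per input prompt. Assuming the LMDeploy `Response` type exposes the generated text on a `.text` field (this may vary across versions), the answer can be read back as follows:

```python
# `result` holds one Response per prompt; `.text` is assumed to carry the
# generated answer (check the Response type of your LMDeploy version).
print(result[0].text)
```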

Huggingface

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-math-7b", trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load the model in float16; otherwise it is
# loaded in float32 and may cause an OOM error.
model = AutoModelForCausalLM.from_pretrained("internlm/internlm2-math-7b", trust_remote_code=True, torch_dtype=torch.float16).cuda()
model = model.eval()
response, history = model.chat(tokenizer, "1+1=", history=[], meta_instruction="")
print(response)
```
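If you prefer token-by-token output, recent InternLM checkpoints also ship a `stream_chat` helper in their remote code; this is an assumption about the checkpoint's modeling file, so check the version you downloaded. A minimal sketch:

```python
# `stream_chat` (provided by the model's remote code) yields the cumulative
# response text after each generation step.
length = 0
for response, history in model.stream_chat(tokenizer, "1+1=", history=[]):
    print(response[length:], end="", flush=True)
    length = len(response)
print()
```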

Special usages

We list some instructions used in our SFT; you can use them as prompts. You can prompt the model in other ways, but the following are recommended. InternLM2-Math may be able to combine several of these abilities, although this is not guaranteed. A worked example follows the query table below.

Translate proof problem to Lean: nl2lean3

Using Lean 3 to solve GSM8K problem: gsm8k_lean

Generate problem based on Lean 3 code: lean_problem

Play 24 point game: 24

Augment a harder math problem: augment_hard

| Description | Query |
| --- | --- |
| Solving question via chain-of-thought | {Question} |
| Solving question via Lean 3 | {Question}\nSolve this via Lean 3 |
| Outcome reward model | Given a question and an answer, check is it correct?\nQuestion:{Question}\nAnswer:{COT} |
| Process reward model | Given a question and an answer, check correctness of each step.\nQuestion:{Question}\nAnswer:{COT} |
| Reward model | Given a question and two answers, which one is better? \nQuestion:{Question}\nAnswer 1:{COT}\nAnswer 2:{COT} |
| Convert chain-of-thought to Lean 3 | Convert this answer into Lean3. Question:{Question}\nAnswer:{COT} |
| Convert Lean 3 to chain-of-thought | Convert this lean 3 code into a natural language problem with answers:\n{LEAN Code} |
| Translate question and chain-of-thought answer to a proof statement | Convert this question and answer into a proof format.\nQuestion:{Question}\nAnswer:{COT} |
| Translate proof problem to Lean 3 | Convert this natural langauge statement into a Lean 3 theorem statement:{Theorem} |
| Translate Lean 3 to proof problem | Convert this Lean 3 theorem statement into natural language:{STATEMENT} |
| Suggest a tactic based on Lean state | Given the Lean 3 tactic state, suggest a next tactic:\n{LEAN State} |
| Rephrase Problem | Describe this problem in another way. {Question} |
| Augment Problem | Please augment a new problem based on: {Question} |
| Augment a harder Problem | Increase the complexity of the problem: {Question} |
| Change specific numbers | Change specific numbers: {Question} |
| Introduce fractions or percentages | Introduce fractions or percentages: {Question} |
| Code Interpreter | lagent |
| In-context Learning | Question:{Question}\nAnswer:{COT}\n...Question:{Question}\nAnswer:{COT} |
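As a concrete illustration of these templates, here is a minimal sketch that reuses the Hugging Face `model.chat` interface from the Inference section; the Lean statement is a toy example of our own, and the query string follows the "Translate Lean 3 to proof problem" row above:

```python
# Reuses `model` and `tokenizer` from the Hugging Face inference example above.
# The query follows the "Translate Lean 3 to proof problem" template; the Lean
# statement itself is a toy example for illustration.
lean_statement = "theorem toy (a b : nat) : a + b = b + a"
query = f"Convert this Lean 3 theorem statement into natural language:{lean_statement}"
response, history = model.chat(tokenizer, query, history=[], meta_instruction="")
print(response)
```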

Fine-tune and others

Please refer to InternLM.

Known issues

Our model is still under development and will be upgraded. InternLM-Math has some known limitations; if you find that some abilities perform poorly, feel free to open an issue.

Citation and Tech Report

@misc{ying2024internlmmath,
      title={InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning}, 
      author={Huaiyuan Ying and Shuo Zhang and Linyang Li and Zhejian Zhou and Yunfan Shao and Zhaoye Fei and Yichuan Ma and Jiawei Hong and Kuikun Liu and Ziyi Wang and Yudong Wang and Zijian Wu and Shuaibin Li and Fengzhe Zhou and Hongwei Liu and Songyang Zhang and Wenwei Zhang and Hang Yan and Xipeng Qiu and Jiayu Wang and Kai Chen and Dahua Lin},
      year={2024},
      eprint={2402.06332},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
@misc{ying2024lean,
      title={Lean Workbook: A large-scale Lean problem set formalized from natural language math problems}, 
      author={Huaiyuan Ying and Zijian Wu and Yihan Geng and Jiayu Wang and Dahua Lin and Kai Chen},
      year={2024},
      eprint={2406.03847},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
@misc{wu2024leangithubcompilinggithublean,
      title={LEAN-GitHub: Compiling GitHub LEAN repositories for a versatile LEAN prover}, 
      author={Zijian Wu and Jiayu Wang and Dahua Lin and Kai Chen},
      year={2024},
      eprint={2407.17227},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2407.17227}, 
}
@misc{wu2024internlm25stepproveradvancingautomatedtheorem,
      title={InternLM2.5-StepProver: Advancing Automated Theorem Proving via Expert Iteration on Large-Scale LEAN Problems}, 
      author={Zijian Wu and Suozhi Huang and Zhejian Zhou and Huaiyuan Ying and Jiayu Wang and Dahua Lin and Kai Chen},
      year={2024},
      eprint={2410.15700},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2410.15700}, 
}