
<img src="./fig/abel.png" width="20" class="left"> Generative AI for Math: Abel

Model | Leaderboard | Methodology | Evaluation | Robustness Analysis | Limitation | Citation | Outlook

Ethan Chern*, Haoyang Zou*, Xuefeng Li*, Jiewen Hu*, Kehua Feng, Junlong Li, Pengfei Liu+


News

πŸ”₯ [2023/12/12] We released Abel-7B-002, a stronger (35% improvement on GSM8K, 126% improvement on MATH) and more generalizable model that achieves the best performance among all 7B models (80.44 on GSM8K, 29.46 on MATH).

Models and Performance

| Model Name | HF Checkpoints | GSM8k | MATH | License |
|---|---|---|---|---|
| Abel-7B-002 | πŸ€— <a href="https://huggingface.co/GAIR/Abel-7B-002" target="_blank">7B</a> | 80.44 | 29.46 | Apache License 2.0 |
| Abel-7B-001 | πŸ€— <a href="https://huggingface.co/GAIR/GAIRMath-Abel-7b" target="_blank">7B</a> | 59.74 | 13.00 | Llama 2 |
| Abel-13B-001 | πŸ€— <a href="https://huggingface.co/GAIR/GAIRMath-Abel-13b" target="_blank">13B</a> | 66.41 | 17.34 | Llama 2 |
| Abel-70B-001 | πŸ€— <a href="https://huggingface.co/GAIR/GAIRMath-Abel-70b" target="_blank">70B</a> | 83.62 | 28.26 | Llama 2 |
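
All checkpoints can be loaded with standard Hugging Face tooling. Below is a minimal sketch using `transformers`; the prompt template is our own illustration, not necessarily the exact format the models were fine-tuned on.

```python
# Minimal sketch: loading Abel-7B-002 from the Hugging Face Hub.
# The prompt template below is illustrative, not necessarily the official one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GAIR/Abel-7B-002"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = (
    "Question:\n"
    "Natalia sold clips to 48 of her friends in April, and then she sold half as many "
    "clips in May. How many clips did Natalia sell altogether in April and May?\n"
    "Answer:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Print only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```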

Generalization

| Model | GSM8k | MATH | MathQA | SVAMP | SCQ5K-EN | ARC-E | ARC-C | HellaSwag | MMLU |
|---|---|---|---|---|---|---|---|---|---|
| Abel-7B-002 | 80.44 | 29.46 | 69.78 | 77.67 | 55.95 | 77.67 | 55.05 | 77.72 | 61.19 |
| Abel-7B-001 | 59.74 | 13.00 | 1.21 | 57.67 | 9.30 | 53.32 | 38.97 | 63.51 | 40.59 |
| MetaMath-Mistral-7B | 77.70 | 28.20 | 33.94 | 79.33 | 37.60 | 78.48 | 51.93 | 76.44 | 61.93 |
| Qwen-7b | 47.84 | 9.34 | 27.44 | 53.00 | 40.05 | 74.97 | 53.05 | 86.85 | 57.98 |
| Mistral-7b | 37.83 | 9.06 | 25.73 | 63.00 | 39.60 | 76.83 | 53.22 | 76.31 | 64.05 |
| Yi-6b | 32.60 | 5.78 | 26.98 | 55.67 | 35.50 | 73.66 | 49.53 | 68.97 | 64.02 |
| LLaMA2-7b | 12.96 | 2.78 | 11.52 | 44.00 | 28.24 | 71.12 | 46.61 | 71.32 | 46.70 |

It can be found that Abel-7B-002 leads the listed 7B models on most mathematical benchmarks (GSM8k, MATH, MathQA, SCQ5K-EN) while remaining competitive on general-reasoning benchmarks (ARC, HellaSwag, MMLU), and that it generalizes far better than Abel-7B-001 across the board.

Evaluation details are provided in the Evaluation and Robustness Analysis sections below.

Introduction

πŸ“ Abel is created as a tribute to Niels Henrik Abel for his groundbreaking work in algebra and analysis, at which our model is relatively better as well. There is still a long way for us to go, though πŸƒβ€β™‚οΈπŸƒβ€β™€οΈπŸπŸƒβ€β™‚οΈπŸƒβ€β™€οΈ.

We show that we have established a new state-of-the-art performance among open-source LLMs (that do not use external tools) on the GSM8k (83.62) and MATH (28.26) benchmarks.

Leaderboard for Mathematical Reasoning

| Ranking | Model | Param. | Leading Organization | GSM8K | MATH |
|---|---|---|---|---|---|
| πŸ”’ 1 | GPT-4 | unknown | OpenAI | 92.0 | 42.5 |
| πŸ”’ 2 | Claude-2 | unknown | Anthropic | 88.0 | - |
| πŸ”’ 3 | PaLM-2-Flan | unknown | Google | 84.7 | 33.2 |
| 🌍 4 | GAIRMath-Abel | 70B | πŸŽ“ GAIR Lab at Shanghai Jiaotong University | 83.6 | 28.3 |
| 🌍 5 | WizardMath | 70B | Microsoft | 81.6 | 22.7 |
| πŸ”’ 6 | Claude-Instant | unknown | Anthropic | 80.9 | - |
| πŸ”’ 7 | ChatGPT | unknown | OpenAI | 80.8 | 34.1 |
| 🌍 8 | Abel-002 | 7B | πŸŽ“ GAIR Lab at Shanghai Jiaotong University | 80.4 | 29.5 |
| πŸ”’ 9 | ChatGPT-0301 | unknown | OpenAI | 74.9 | - |
| 🌍 10 | GAIRMath-Abel | 13B | πŸŽ“ GAIR Lab at Shanghai Jiaotong University | 66.4 | 17.3 |
| 🌍 11 | GAIRMath-Abel | 7B | πŸŽ“ GAIR Lab at Shanghai Jiaotong University | 59.7 | 13.0 |
| πŸ”’ 12 | Minerva | 540B | Google | 58.8 | 33.6 |
| πŸ”’ 13 | PaLM | 540B | Google | 56.9 | 8.8 |
| 🌍 14 | Llama-2 | 70B | Meta | 56.8 | 13.5 |
| 🌍 15 | RFT | 33B | OFA | 56.5 | 7.4 |
| 🌍 16 | Baichuan2-13B | 13B | Baichuan | 52.8 | 10.1 |
| πŸ”’ 17 | Minerva | 62B | Google | 52.4 | 27.6 |
| πŸ”’ 18 | PaLM | 64B | Google | 52.4 | 4.4 |
| 🌍 19 | RFT | 13B | OFA | 52.1 | 5.1 |
| 🌍 20 | LLaMA | 65B | Meta | 50.9 | 10.6 |
| 🌍 21 | Qwen | 7B | Alibaba | 44.9 | 8.5 |
| πŸ”’ 22 | Chinchilla | 70B | DeepMind | 43.7 | - |
| 🌍 23 | Llama-2 | 34B | Meta | 42.2 | 6.24 |
| πŸ”’ 24 | Galactica | 30B | Meta | 41.7 | 12.7 |
| 🌍 25 | ChatGLM2 | 12B | Zhipu | 40.9 | - |
| πŸ”’ 26 | Text-davinci-002 | 175B | OpenAI | 40.7 | 19.1 |
| 🌍 27 | Llama | 33B | Meta | 35.6 | 7.1 |
| πŸ”’ 28 | GPT-3 | 175B | OpenAI | 34.0 | 5.2 |
| 🌍 29 | InternLM | 7B | Shanghai AI Lab | 31.2 | - |
| 🌍 30 | Llama-2 | 13B | Meta | 28.7 | 3.9 |
| 🌍 31 | Vicuna v1.3 | 13B | LMSys | 27.6 | - |
| 🌍 32 | Falcon | 40B | Technology Innovation Institute | 19.6 | 2.5 |
| 🌍 33 | Llama | 13B | Meta | 17.8 | 3.9 |
| 🌍 34 | MPT | 30B | MosaicML | 15.2 | 3.1 |
| πŸ”’ 35 | Galactica | 6.7B | Meta | 10.2 | 2.2 |

Methodology

We propose Parental Oversight, a babysitting strategy for supervised fine-tuning.

Parental Oversight is not limited to any specific data processing method. Instead, it defines the data processing philosophy that should guide supervised fine-tuning in the era of Generative AI (GAI). We believe that in the era of GAI, data structure engineering has emerged as a new paradigm. Within this paradigm, the manner in which the fine-tuning data is processed significantly impacts the performance of the trained GAI. We expect a growing number of studies in the community to focus on this data processing philosophy.

The principle of Parental Oversight emphasizes treating supervised fine-tuning with care and prudence. This is analogous to the way parents are encouraged to educate their children. Different types of data, along with their presentation formats (e.g., step-by-step reasoning, iterative refinement), can be likened to varied educational methods. Just as parents cautiously select the most effective approach to instruct their children, GAI practitioners should cautiously select the most effective data processing approaches to better instruct their LLMs.

Furthermore, the "more data, the better" philosophy doesn't always hold true. The quality and relevance of annotated samples can often outweigh their quantity. Training samples used in SFT should not just present the right answer, but also instruct the model on how the correct answer is derived from the LLM's knowledge. Additionally, if the LLM's knowledge is not sufficient to answer a question, Parental Oversight should step in to address the knowledge gaps promptly.
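
As a concrete, purely illustrative example of this philosophy, the two hypothetical training samples below contrast an answer-only annotation with one that derives the answer step by step; the actual Abel training data format is not shown here.

```python
# Two hypothetical SFT samples (illustrative only; not the actual Abel training format).
# Under the Parental Oversight philosophy, the second sample is preferable: it teaches
# the model *how* the answer is derived, not just what the answer is.

answer_only = {
    "question": "Natalia sold clips to 48 of her friends in April, and then she sold "
                "half as many clips in May. How many clips did Natalia sell altogether?",
    "response": "72",
}

step_by_step = {
    "question": answer_only["question"],
    "response": (
        "In April, Natalia sold 48 clips.\n"
        "In May, she sold half as many: 48 / 2 = 24 clips.\n"
        "Altogether she sold 48 + 24 = 72 clips.\n"
        "The answer is 72."
    ),
}
```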

Evaluation
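
The repository's evaluation scripts are the authoritative reference; as a minimal illustrative sketch, GSM8k-style accuracy is typically computed by extracting the last number from a model's generation and comparing it with the gold answer, which in GSM8k follows a `#### <answer>` marker.

```python
# Minimal, illustrative GSM8k-style scorer (not the repository's actual evaluation code).
import re
from typing import List, Optional

def extract_final_number(text: str) -> Optional[str]:
    """Return the last number appearing in a generation, with commas stripped."""
    matches = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return matches[-1].replace(",", "") if matches else None

def gsm8k_accuracy(generations: List[str], references: List[str]) -> float:
    """Exact-match accuracy; GSM8k references end with a '#### <answer>' line."""
    correct = 0
    for generation, reference in zip(generations, references):
        gold = reference.split("####")[-1].strip().replace(",", "")
        correct += extract_final_number(generation) == gold
    return correct / len(references)

# Example: a generation ending in "... The answer is 72." scores against "#### 72".
```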

Robustness Analysis

Our robustness analysis consists of two parts: Adversarial Evaluation on the GSM8k_robust dataset and Supervised Transfer Learning on the TAL-SCQ5K-EN dataset. We perform a preliminary analysis to understand (1) whether Abel overfits the training dataset and is thus brittle to out-of-distribution testing samples and (2) whether our SFT approach can quickly transfer and generalize Abel to datasets from different distributions.

Adversarial Evaluation on the GSM8k_robust Dataset

GSM8k_robust is a dataset we constructed from GSM8k: using GPT-4, we randomly modified the numbers in GSM8k questions without altering any other information, and then asked GPT-4 to generate the "golden answers" for the modified questions. After manually reviewing a subset of these samples, we found that all the generated answers were accurate. We use GSM8k_robust to evaluate whether models overfit the training data and are therefore susceptible to out-of-distribution testing samples. Our analysis indicates that Abel is more robust to out-of-distribution testing samples than the other models.

| Model | GSM8k | GSM8k_robust | Delta |
|---|---|---|---|
| Abel-7B | 59.74 | 58.23 | -1.51 |
| Abel-13B | 66.41 | 66.57 | +0.16 |
| Abel-70B | 83.62 | 81.80 | -1.82 |
| WizardMath-70B | 81.60 | 74.91 | -6.69 |
| WizardMath-13B | 63.90 | 59.51 | -4.39 |
| RFT-7B | 41.70 | 37.98 | -3.72 |
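
The perturbation step described above could be implemented roughly as follows. This is a sketch assuming the OpenAI Python SDK; the prompt wording is our own illustration, not the actual prompt used to build the dataset.

```python
# Sketch of the GSM8k_robust construction step (illustrative; the actual prompts
# used to build the dataset are not published in this README).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PERTURB_PROMPT = (
    "Rewrite the following math word problem, randomly changing only the numbers. "
    "Keep every other detail identical, then solve the new problem step by step.\n\n"
    "Problem: {question}\n\n"
    "Return the rewritten problem followed by its solution."
)

def perturb(question: str) -> str:
    """Ask GPT-4 to perturb a question's numbers and produce a golden answer."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PERTURB_PROMPT.format(question=question)}],
    )
    return response.choices[0].message.content
```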

Supervised Transfer Learning on the TAL-SCQ5K-EN Dataset

We demonstrate that Abel-70B not only achieves SOTA on the GSM8k and MATH datasets but also generalizes well to TAL-SCQ5K-EN 2K, a dataset newly released by the Math LLM provider TAL (ε₯½ζœͺδΎ†). Our analysis indicates that our SFT approach can successfully generalize Abel to datasets from different distributions. We will conduct further analyses and experiments to explore and improve Abel's generalization capabilities.

| Model | TAL-SCQ5K-EN 2K Testing Benchmark |
|---|---|
| Abel-70B | 59.7 |
| MathGPT | 59.0 |
| GPT-4 | 51.0 |
| Llama-70B | 43.8 |
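
A transfer run of this kind could look roughly like the sketch below, which fine-tunes an Abel checkpoint on new data with the Hugging Face `Trainer`. This is a generic SFT sketch, not the authors' training recipe; the data file name and the `question`/`answer` column names are hypothetical stand-ins for the real TAL-SCQ5K-EN fields.

```python
# Generic supervised fine-tuning sketch (illustrative; not the authors' recipe).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "GAIR/GAIRMath-Abel-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:          # Llama-style tokenizers lack a pad token
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

def format_and_tokenize(example):
    # Hypothetical fields; substitute the real dataset's column names.
    text = f"Question:\n{example['question']}\nAnswer:\n{example['answer']}"
    return tokenizer(text, truncation=True, max_length=1024)

dataset = load_dataset("json", data_files="tal_scq5k_en_train.json")["train"]
tokenized = dataset.map(format_and_tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="abel-tal-sft", per_device_train_batch_size=1,
                           num_train_epochs=2, learning_rate=2e-5, bf16=True),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```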

Demo

<img src="./fig/gsm8k_comparison.png"> <img src="./fig/MATH_comparison.png">

Limitation

We maintain a list of issues to track known limitations and potential solutions. Your opinions and comments are always welcome.

Citation

Please cite this repository if its models, code, or conclusions are helpful to you.

```bibtex
@misc{abel,
  author = {Chern, Ethan and Zou, Haoyang and Li, Xuefeng and Hu, Jiewen and Feng, Kehua and Li, Junlong and Liu, Pengfei},
  title = {Generative AI for Math: Abel},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/GAIR-NLP/abel}},
}
```

Acknowledgement

Outlook

We are continuously refining our models and will be releasing updates. Stay tuned!

<img src="./fig/plan.png" width="600" class="left">