ChatGPT Math Word Problem Evaluation

Large language models (LLMs) have gained much popularity in recent years, with OpenAI's GPT-3 series models considered the state of the art. In particular, the GPT-3 variant tuned for natural dialogue, known as ChatGPT, has attracted widespread interest. However, LLMs have known performance issues, particularly on reasoning tasks. This project investigates aspects of math word problems (MWPs) that can indicate the success or failure of ChatGPT on such problems.

Dataset

The dataset used in this project is DRAW-1K, which consists of 1,000 MWPs, each annotated with its answers and the template algebraic equations one would use to solve it.

The complete DRAW-1K dataset is described in [S. Upadhyay and M.-W. Chang, "Annotating Derivations: A New Evaluation Strategy and Dataset for Algebra Word Problems"](http://arxiv.org/abs/1609.07197).
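
As an illustration, an entry can be loaded and inspected as follows. This is a minimal sketch: the file name and the field names (`sQuestion`, `lEquations`, `lSolutions`) are assumptions based on related MWP datasets, so consult the actual DRAW-1K files for the exact schema.

```python
import json

# Minimal sketch for inspecting one DRAW-1K entry. The file name and the
# field names below are assumptions; check the dataset files for the
# exact schema.
with open('draw.json') as f:
    problems = json.load(f)

first = problems[0]
print(first.get('sQuestion'))   # the word-problem text
print(first.get('lEquations'))  # template algebraic equations
print(first.get('lSolutions'))  # gold numeric answer(s)
```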

Test Results

ChatGPT was evaluated at various times over the course of this experiment; the results can be found in the accompanying JSON files.

Each file contains an array of objects with three keys: "question_No", "final_answer", and "result".

The "result" key records the evaluation outcome for each problem; partial and rounded solutions were counted as correct (see Results below).

The test1.json file can be found here: test1.json
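
For illustration, a single entry in such a file might look like the following Python literal; the concrete values shown here are hypothetical.

```python
# A hypothetical entry from test1.json; the values are illustrative only.
example_entry = {
    "question_No": 1,             # index of the DRAW-1K problem
    "final_answer": "12 and 20",  # the answer ChatGPT produced
    "result": "correct",          # evaluation outcome for this problem
}
```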

Usage

Here is an example of how you can read the JSON file in Python:

```python
import json

# Open the results file
with open('test1.json') as json_file:
    data = json.load(json_file)

# Iterate through the array of result objects
for question in data["Answers"]:
    question_number = question["question_No"]
    final_answer = question["final_answer"]
    result = question["result"]
    print(question_number, final_answer, result)
```
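
Building on the snippet above, the headline failure rates reported below can be reproduced by tallying the "result" values. This is a sketch: it assumes correct answers carry the label "correct", so substitute whatever label the result files actually use.

```python
import json
from collections import Counter

with open('test1.json') as json_file:
    data = json.load(json_file)

# Tally every outcome; assumes correct answers are labelled "correct".
counts = Counter(question["result"] for question in data["Answers"])
total = sum(counts.values())
failure_rate = 100 * (total - counts.get("correct", 0)) / total
print(f"Failed on {failure_rate:.1f}% of {total} problems: {dict(counts)}")
```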

Results

As noted above, ChatGPT was evaluated at several points during the experiment; the results are summarized below:
[Tested in January 2023] ChatGPT performance (no working)
The results of this project show that, in January, when asked not to show its working, ChatGPT failed on 84% of DRAW-1K problems (partial and rounded solutions were counted as correct).

[Tested in February 2023] ChatGPT performance (no working)
The results of this project show that, in February, when asked not to show its working, ChatGPT again failed on 84% of DRAW-1K problems (partial and rounded solutions were counted as correct).

[Tested in February 2023] ChatGPT Plus performance (with working)
The results of this project show that, in February, when allowed to show its working, ChatGPT Plus failed on only 20% of DRAW-1K problems (partial and rounded solutions were counted as correct).

For more detailed information on the project, please refer to the full paper linked here: ShakarianEtAl_ChatGPT_MWP