PrOntoQA and PrOntoQA-OOD

This repo contains PrOntoQA-OOD, as described in our papers:

  1. Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought
  2. Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples

PrOntoQA and PrOntoQA-OOD generate question-answering examples with chains-of-thought that describe the reasoning required to answer the questions correctly. The sentences in the examples are syntactically simple and amenable to semantic parsing, and so this code can be used to formally analyze the predicted chain-of-thought from large language models.

Note: The v1 branch contains the version of the repo corresponding to the original PrOntoQA paper.

Update: (Oct 17, 2024) The datasets in generated_ood_data.zip were regenerated to incorporate the latest bug fixes.

If you use our code in your work, please cite our papers:

@inproceedings{
  PrOntoQA,
  title={Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought},
  author={Abulhair Saparov and He He},
  booktitle={The Eleventh International Conference on Learning Representations},
  year={2023},
  url={https://openreview.net/forum?id=qFVVBzXxR2V}
}

@article{
  PrOntoQAOOD,
  title={Testing the General Deductive Reasoning Capacity of Large Language Models Using {OOD} Examples},
  author={Abulhair Saparov and
          Richard Yuanzhe Pang and
          Vishakh Padmakumar and
          Nitish Joshi and
          Seyed Mehran Kazemi and
          Najoung Kim and
          He He},
  journal={CoRR},
  volume={abs/2305.15269},
  year={2023},
  url={https://doi.org/10.48550/arXiv.2305.15269},
  doi={10.48550/arXiv.2305.15269},
  eprinttype={arXiv},
  eprint={2305.15269},
}

Running experiments

To generate the examples and evaluate models, use run_experiment.py, which is configured through a number of command-line flags.

The output of the experiments is written to a file whose name is automatically determined from the flag configuration.

The model outputs from our experiments are provided in model_outputs_v1.zip (for the original PrOntoQA) and model_outputs_ood.zip (for PrOntoQA-OOD).

Generating data without running experiments

To generate data in JSON format, use the run_experiment.py script with the flag --model-name json. See the above section for details on the other arguments.

The generated data for our experiments is available in generated_ood_data.zip.

Analyzing output

Once run_experiment.py has saved the model predictions to files, they can be analyzed with analyze_results.py. Without any arguments, this script reproduces all results figures in our paper. The script make_plots.py generates all the plots in the PrOntoQA-OOD paper. To analyze the output of a single file, run analyze_results.py <filename>. The script reads both JSON-formatted output files and the log files written by run_experiment.py. The expected JSON format is as follows:

{
  "example1": {
    ...
    "test_example": {
      ...
      "model_output": <model output as a string, including the predicted label>
    }
  },
  "example2": {
    ...
    "test_example": {
      ...
      "model_output": <model output as a string, including the predicted label>
    }
  },
  ...
}
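
As a rough illustration of the format above, the following Python sketch collects the model_output string of each test example from such a JSON file. It is not part of this repo: the function name load_model_outputs and the file name in the usage note are hypothetical, and the only structure it assumes is the nesting shown above (top-level example keys, each with a "test_example" object holding "model_output").

```python
import json
from typing import Dict


def load_model_outputs(path: str) -> Dict[str, str]:
    """Map each top-level example key to its test example's model output.

    Hypothetical helper: it assumes only the nesting shown in the format
    above and ignores every other field.
    """
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)

    outputs: Dict[str, str] = {}
    for key, example in data.items():
        # Skip entries that lack the expected nesting instead of failing.
        if not isinstance(example, dict):
            continue
        test_example = example.get("test_example")
        if isinstance(test_example, dict) and "model_output" in test_example:
            outputs[key] = test_example["model_output"]
    return outputs
```

For example, load_model_outputs("my_predictions.json") would return a dictionary such as {"example1": "...", "example2": "..."}, which can be handy for checking that a file is well-formed before passing it to analyze_results.py.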