Faithful-COT

Code and data accompanying our paper "Faithful Chain-of-Thought Reasoning" at IJCNLP-AACL 2023.

News 📣

| Model | GSM8K | SVAMP | MultiArith | ASDiv | AQUA | saycan | StrategyQA | date | sports | CLUTRR |
|---|---|---|---|---|---|---|---|---|---|---|
| Codex | 72.2 | 83.5 | 98.8 | 80.2 | 47.2 | 89.3 | 63.0 | 81.6 | 99.1 | 58.9 |
| ChatGPT | 75.8 | 83.0 | 95.3 | 81.7 | 53.5 | 80.6 | 51.5 | 73.5 | 52.3 | 12.1 |
| GPT-4 | 95.0 | 95.3 | 98.5 | 95.6 | 73.6 | 92.2 | 54.0 | 95.8 | 99.3 | 62.7 |

With GPT-4, Faithful CoT achieves ❗95.0+ few-shot accuracy❗ on almost all Math Word Problem datasets, Date Understanding, and Sports Understanding.

See output_dir/performance_summary.csv for detailed results and output_dir/{dataset_name} for model predictions.

Get started

We suggest using miniconda/conda to set up the environment. The environment.yml file specifies the minimal dependencies. You can create a virtual environment from it by following this guideline.

Essentially, you'll need to do something like:

cd /path/to/Faithful-COT
conda env create -f ./environment.yml --prefix ./envs
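
Then activate the newly created environment with the prefix path used above (standard conda usage):

```bash
# Activate the prefix-based environment created by the command above
conda activate ./envs
```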

Additionally, to run experiments on StrategyQA, you should install Soufflé (the Datalog interpreter we use) following these instructions. It's not a Python package, so it has to be installed separately. Note that under the "Installing Souffle" section you should use -DCMAKE_INSTALL_PREFIX="~/.local" so that it is installed to the right place.
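
For reference, the build roughly follows this shape (a sketch assuming the cmake-based source build described on the Soufflé site; defer to the official steps for prerequisites and details):

```bash
# Sketch only: build Soufflé from source and install it under ~/.local as noted above.
git clone https://github.com/souffle-lang/souffle.git
cd souffle
cmake -S . -B build -DCMAKE_INSTALL_PREFIX="~/.local"
cmake --build build -j8
cmake --build build --target install
# Make sure ~/.local/bin is on your PATH so the souffle binary is found.
```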

Repo Structure

Usage

Make predictions

  1. Provide your OpenAI API key(s) by creating a file called key.py under source/ in the following format:
API_KEYS = {
	"key1_nickname": "key1",
	"key2_nickname": "key2",
	...
}

Note that your keys should have access to the relevant LM (code-davinci-002, etc.) specified in the configuration you'd like to use.
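
As a concrete illustration, a minimal key.py with a single entry could be created like this (the nickname and placeholder value are purely illustrative; substitute your real key):

```bash
# Write source/key.py with one placeholder entry; replace the value with your actual API key.
cat > source/key.py <<'EOF'
API_KEYS = {
    "my_openai_key": "sk-xxxxxxxxxxxxxxxx",
}
EOF
```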

  2. Choose a model configuration you'd like to use. You can use an existing configuration under configuration/config_files/{dataset_name} or create a new one. See configuration/README.md for details.

  3. Run source/predict/predict.py:

$ python predict.py -h
usage: predict.py [-h]
                  [--dataset_name {GSM8K,ASDiv,MultiArith,SVAMP,AQUA,date,StrategyQA,sports,saycan,CLUTRR}]
                  [--split {train,dev,test}] [--model_name MODEL_NAME]
                  [--completion_only] [--debug]

optional arguments:
  -h, --help            show this help message and exit
  --dataset_name {GSM8K,ASDiv,MultiArith,SVAMP,AQUA,date,StrategyQA,sports,saycan,CLUTRR}
                        The name of the dataset.
  --split {train,dev,test}
                        The split of the dataset.
  --model_name MODEL_NAME
                        The name of the model (should have a corresponding
                        config file under `configuration/config_files/dataset_name` called
                        `{model_name}.json`.)
  --completion_only     Only query the LM to generate the completion
                        (reasoning chain), but not execute the solver to
                        derive the answer.
  --debug               If true, only run on the first 10 examples.

Example:

nohup python predict.py --model_name code002_NL+SL --dataset_name GSM8K --split test > logs/GSM8K/code002_NL+SL_test.log 2>&1 &

The model predictions will be saved under output_dir/{dataset_name}/{split}/{model_name}. See output_dir/README.md for details on the format.
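
To keep an eye on a run launched with nohup and inspect its outputs afterwards, standard shell commands are enough (paths match the example above; create logs/GSM8K before launching the run if it does not exist):

```bash
# Follow the run's log in real time (Ctrl-C to stop), then list the generated predictions.
tail -f logs/GSM8K/code002_NL+SL_test.log
ls output_dir/GSM8K/test/code002_NL+SL
```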


Evaluate the model predictions

Run source/evaluate/evaluate_answer_acc.py with the following arguments:

$ python evaluate_answer_acc.py -h
usage: evaluate_answer_acc.py [-h]
                              [--dataset_name {GSM8K,ASDiv,MultiArith,SVAMP,AQUA,date,StrategyQA,sports,saycan,CLUTRR}]
                              [--split {train,dev,test}]
                              [--model_name MODEL_NAME] [--non_empty_only]
                              [--valid_only] [--debug]

optional arguments:
  -h, --help            show this help message and exit
  --dataset_name {GSM8K,ASDiv,MultiArith,SVAMP,AQUA,date,StrategyQA,sports,saycan,CLUTRR}
                        The name of the dataset.
  --split {train,dev,test}
                        The split of the dataset.
  --model_name MODEL_NAME
                        The name of the model (should have a corresponding
                        config file under
                        `configuration/config_files/dataset_name` called
                        `{model_name}.json`.)
  --non_empty_only      If true, only evaluate on non-empty answers.
  --valid_only          If true, only evaluate on valid answers.
  --debug               If true, only run on the first 10 examples.

The accuracy will be printed to stdout.

Example:

python evaluate_answer_acc.py --model_name code002_NL+SL --dataset_name GSM8K --split test

Output:

Dataset: GSM8K
Split: test
Model: code002_NL+SL
Answer accuracy: 72.2
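
To evaluate the same configuration on several datasets in one go, a simple shell loop works (a sketch; pick any names from the --dataset_name choices above):

```bash
# Print answer accuracy for one model configuration across multiple datasets.
for ds in GSM8K SVAMP MultiArith; do
  python evaluate_answer_acc.py --model_name code002_NL+SL --dataset_name "$ds" --split test
done
```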

Get a performance summary table

To reproduce the numbers in our paper, you can generate a performance summary table for all model configurations on all datasets in output_dir/ by running source/evaluate/gen_perf_table.py (no arguments needed).

Example:

python gen_perf_table.py

The output will be saved to output_dir/performance_summary.csv. If no predictions exist for a given model configuration on a dataset, the corresponding cell will be left empty.
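
For a quick look at the summary in the terminal, you can pretty-print the CSV (standard column/less usage; some column implementations collapse empty cells, so open the file in a spreadsheet if exact alignment matters):

```bash
# Render the comma-separated summary as an aligned table in the terminal.
column -s, -t < output_dir/performance_summary.csv | less -S
```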

Citation

If you find this repository useful, please cite our paper:

@article{lyu2023faithful,
  title={Faithful chain-of-thought reasoning},
  author={Lyu, Qing and Havaldar, Shreya and Stein, Adam and Zhang, Li and Rao, Delip and Wong, Eric and Apidianaki, Marianna and Callison-Burch, Chris},
  journal={arXiv preprint arXiv:2301.13379},
  year={2023}
}

Funding Acknowledgements

This research is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the HIATUS Program contract #2022-22072200005. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.