Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters

Code, model input/output and cached evaluation results for our ACL-23 paper "Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters" by Boshi Wang, Sewon Min, Xiang Deng, Jiaming Shen, You Wu, Luke Zettlemoyer and Huan Sun.

Overview

While Chain-of-Thought (CoT) prompting can improve reasoning in large language models (LLMs), there is little understanding of what makes it effective. We perform a series of ablation studies on two representative benchmarks where CoT brings large improvements, which reveal the impact of different aspects of CoT demonstrations. We find that the validity of the reasoning in the demonstrations matters surprisingly little: prompting with invalid reasoning steps retains most of CoT's performance gains, while other aspects of the rationales, such as being relevant to the query and correctly ordering the reasoning steps, are much more important for effective CoT reasoning.

Overall, these findings open up new questions about LLMs' capability to learn to reason in context, and call for reflection on how few-shot reasoning is benchmarked.

Citation

If you find our code or paper useful, please cite the paper:

@inproceedings{wang2023towards,
  title={Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters},
  author={Wang, Boshi and Min, Sewon and Deng, Xiang and Shen, Jiaming and Wu, You and Zettlemoyer, Luke and Sun, Huan},
  booktitle={The 61st Annual Meeting of the Association for Computational Linguistics},
  year={2023}
}

Repo Tour

.
├── grade-school-math/                       # GSM8K dataset, from https://github.com/openai/grade-school-math
├── indices_800.json                         # Indices for the 800 GSM8K test examples used for evaluation 
├── Bamboogle Prerelease - Sheet1.csv        # Bamboogle dataset, from https://github.com/ofirpress/self-ask
├── Bamboogle Prerelease - Sheet1_inter.csv  # Annotated intermediate bridging entities for Bamboogle
├── utils.py                                 # Helper functions
├── prompts_*/                               # Full prompts for all settings in our experiments
├── main_*.py                                # Scripts for getting model predictions via OpenAI API
├── eval_*.ipynb                             # Evaluation scripts, including cached evaluation results
└── result_*/                                # Cached model prediction results 

Usage

First put your OpenAI API key in a file named api_key.txt.
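The scripts read this file when calling the OpenAI API. Below is a minimal sketch of that loading step, using the legacy openai Python package; the repo's actual code (e.g., in utils.py or main_*.py) may differ.

# Illustration only: load the API key the way the scripts presumably do.
import openai

with open("api_key.txt") as f:
    openai.api_key = f.read().strip()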

Run LLM generation

Details can be found in the parameter descriptions in main_*.py. For example, to run the invalid reasoning setting on GSM8K and Bamboogle:

python main_gsm8k.py --prompt_dir prompts_arithmetic/invalid_reasoning.txt --eng text-davinci-002 --num_test 800 --seed 1357 --temp 0.0 --test_ind indices_800.json
python main_bamboogle.py --prompt_dir prompts_bamboogle/invalid_reasoning.txt --eng text-davinci-002 --num_test -1 --seed 1357 --temp 0.0
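
Conceptually, each main_*.py script prepends the chosen prompt file to every test question and queries the OpenAI Completions API at temperature 0. The sketch below shows the shape of that call; the prompt format, max_tokens, and stop sequence are assumptions for illustration, not the repo's exact code.

# Hypothetical sketch of the per-example generation call (not the repo's exact code).
import openai

def generate(prompt_prefix, question, engine="text-davinci-002", temperature=0.0):
    prompt = prompt_prefix + "\n\nQ: " + question + "\nA:"  # assumed prompt format
    resp = openai.Completion.create(
        engine=engine,
        prompt=prompt,
        temperature=temperature,
        max_tokens=256,   # assumed generation budget
        stop=["\n\n"],    # assumed stop between few-shot examples
    )
    return resp["choices"][0]["text"].strip()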

Evaluation

eval_*.ipynb contains the scripts and cached evaluation results.
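
For reference, a common way to score GSM8K-style predictions is to take the final number in the model's output and compare it with the gold answer. The snippet below is a hedged sketch of that idea, not the notebooks' exact code.

# Sketch (assumption): extract the last number in the generation as the predicted answer.
import re

def extract_final_number(text):
    nums = re.findall(r"-?\d+\.?\d*", text.replace(",", ""))
    return nums[-1] if nums else None

def is_correct(prediction_text, gold_answer):
    pred = extract_final_number(prediction_text)
    try:
        return pred is not None and float(pred) == float(gold_answer)
    except ValueError:
        return False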