ChatGPT Causal Reasoning Evaluation
This repository contains the code for the paper:
Is ChatGPT a Good Causal Reasoner? A Comprehensive Evaluation. [Paper-ArXiv]
Accepted by the Findings of EMNLP 2023.
1. Install
The code relies primarily on Python and the OpenAI API.
Run the following commands:
# first, create the conda environment
conda env create -f conda_environment.yml
conda activate <name of the env you created>
# second, install the other Python packages
pip install -r pip_requirements.txt
# Finally, before starting the code, you will need to prepare an OpenAI API key.
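One common way to supply the key is through an environment variable rather than hard-coding it in a script. The sketch below assumes the variable is named `OPENAI_API_KEY` (the name the official OpenAI Python library reads by default); the helper `load_openai_api_key` is hypothetical and not part of this repository. The scripts here may expect the key in a different place, so check the usage instructions inside each code file.

```python
import os


def load_openai_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    Hypothetical helper for illustration only; it is not part of this repository.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of a late authentication error.
        raise RuntimeError(
            f"Set the {env_var} environment variable to your OpenAI API key."
        )
    return key
```

With the variable exported in your shell, `load_openai_api_key()` returns the key as a string that you can pass to whatever client setup the scripts use.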
2. Prediction and Evaluation for ChatGPT
<font color=red>For the following code files, we provide detailed usage instructions within each file. They are easy to understand and can be reused to explore other experimental settings that interest you, <u>but they require you to provide your own OpenAI API key</u>.</font>
2.1 Zero-shot ChatGPT
ECI:
predict: ECI.py
evaluate: ECI_compute_score.py
multi-choice CD:
predict: CD_multi_choice.py
evaluate: CD_and_CEG_compute_score.py
binary-classification CD:
predict: CD_binary_classification.py
evaluate: CD_and_CEG_compute_score.py
CEG:
predict: CEG.py
automatic evaluation: CD_and_CEG_compute_score.py
human evaluation: CEG_human_evaluation.xlsx
2.2 ChatGPT with ICL or CoT
ECI:
predict: ECI.py
evaluate: ECI_compute_score.py
binary-classification CD:
predict: CD_binary_classification.py
evaluate: CD_and_CEG_compute_score.py
2.3 ChatGPT Using Prompts That Express the Causality in Different Ways
predict: ECI_differ_causal_prompts.py
evaluate: ECI_differ_causal_prompts_compute_score.py
2.4 ChatGPT Using Prompts in the Form of Open-Ended Generation
# For convenience, we ran this experiment in an independent directory; all of its code and data are located in the "ECI_open_ended_generaton_prompts" folder.
predict:
ECI_open_ended_generaton_prompts/chatgpt_ECI_openA123.py
ECI_open_ended_generaton_prompts/chatgpt_ECI_openB.py
evaluate:
ECI_open_ended_generaton_prompts/cal_prf_zero_shot_prompt_A1.py
ECI_open_ended_generaton_prompts/cal_prf_zero_shot_prompt_A2.py
ECI_open_ended_generaton_prompts/cal_prf_zero_shot_prompt_A3.py
ECI_open_ended_generaton_prompts/cal_prf_zero_shot_prompt_B.py
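The `cal_prf_*` script names suggest that evaluation reports precision, recall, and F1. As a rough illustration of how these metrics are computed for binary causal / non-causal labels, here is a minimal sketch. It is not the repository's own code: the function name `precision_recall_f1` and the label convention (1 = causal, 0 = non-causal) are assumptions made for this example.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    gold: Sequence[int], pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for the positive (causal = 1) class.

    Illustrative helper only; the label convention is an assumption.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # Count true positives, false positives, and false negatives.
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1
```

For example, with `gold = [1, 0, 1, 1, 0]` and `pred = [1, 1, 0, 1, 0]` there are two true positives, one false positive, and one false negative, so precision, recall, and F1 are all 2/3.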
3. Other Directories
-- data # five datasets used in our experiments
-- output_by_ChatGPT # outputs of four versions of ChatGPT on the ECI, CD, and CEG tasks
-- ECI_open_ended_generaton_prompts # code and ChatGPT's output with the open-ended generation prompts in the ECI task
-- utils # utility code
Citation
If you find our work beneficial to your research, please cite the following paper:
@inproceedings{gao-etal-2023-chatgpt,
title = "Is {C}hat{GPT} a Good Causal Reasoner? A Comprehensive Evaluation",
author = "Gao, Jinglong and
Ding, Xiao and
Qin, Bing and
Liu, Ting",
editor = "Bouamor, Houda and
Pino, Juan and
Bali, Kalika",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-emnlp.743",
doi = "10.18653/v1/2023.findings-emnlp.743",
pages = "11111--11126",
abstract = "Causal reasoning ability is crucial for numerous NLP applications. Despite the impressive emerging ability of ChatGPT in various NLP tasks, it is unclear how well ChatGPT performs in causal reasoning. In this paper, we conduct the first comprehensive evaluation of the ChatGPT{'}s causal reasoning capabilities. Experiments show that ChatGPT is not a good causal reasoner, but a good causal interpreter. Besides, ChatGPT has a serious hallucination on causal reasoning, possibly due to the reporting biases between causal and non-causal relationships in natural language, as well as ChatGPT{'}s upgrading processes, such as RLHF. The In-Context Learning (ICL) and Chain-of-Thought (CoT) techniques can further exacerbate such causal hallucination. Additionally, the causal reasoning ability of ChatGPT is sensitive to the words used to express the causal concept in prompts, and close-ended prompts perform better than open-ended prompts. For events in sentences, ChatGPT excels at capturing explicit causality rather than implicit causality, and performs better in sentences with lower event density and smaller lexical distance between events.",
}