[EMNLP 2024] This is the official implementation of the paper "A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners" in PyTorch.

🚀News! The full paper has been accepted to the EMNLP 2024 main conference.

🚀News! A short version of this work has been accepted to the ICML 2024 Workshop on LLMs and Cognition.

🐦News! We released a short, easy-to-watch video on Twitter. Enjoy!

Introduction

Large language models (LLMs) have achieved remarkable progress in understanding and generating human-like text, but there is ongoing debate about whether LLMs possess genuine reasoning capabilities. This work reconceptualizes the evaluation of LLMs' reasoning capabilities into a general and rigorous testing framework with statistical guarantees.

We say that an LLM is subject to token bias in a reasoning task if systematic changes to some or all tokens in the task descriptions, made while keeping the underlying logic intact, allow us to predict the direction of the shift in the model's output. A strong token bias suggests that the LLM is relying on superficial patterns in the input rather than truly understanding the underlying reasoning task, leading to brittle performance that fails to generalize well. Let us look at the following classic "twenty-five horses" problem in graph theory:

You want to find the fastest 3 horses in a group of 25 horses. You can only race 5 horses at a time. You don’t have a stopwatch, so you can only know the ranking of each horse within each race. How many races do you need?

<p align="center"> <img src=figures/horses.png /> </p>

GPT-4 and Claude-3-opus achieve accuracies of nearly 98.5% and 40.5%, respectively, in answering this question. However, simply changing "horses" to "bunnies", a change that should not affect the logical essence, systematically decreases their accuracy to 85.0% and 30.0%, respectively. Further changing "25" to other values decreases their accuracy to 46.0% and 24.0%. These observations indicate strong token biases toward the frequently used tokens "horses" and "25" in such problems, and suggest that the LLMs do not genuinely understand how to solve them.

You want to find the fastest 3 bunnies in a group of 25 bunnies. You can only race 5 bunnies at a time. You don’t have a stopwatch, so you can only know the ranking of each bunny within each race. How many races do you need?

<p align="center"> <img src=figures/bunnies.png /> </p>

We take the classic Linda Problem in Psychology as another example. Below is the original problem statement.

Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations. Which is more probable?

(a) Linda is a bank teller.

(b) Linda is a bank teller and is active in the feminist movement.

<p align="center"> <img src=figures/linda_persona.png /> </p>

Experiments in behavioral psychology reveal that people typically believe the second option is more likely than the first, but this contradicts the conjunction rule of probability. Advanced LLMs like GPT-4 can typically recognize this fallacy because it is a classical problem that appears frequently in the cognitive science literature. However, altering seemingly irrelevant tokens, such as the name 🙆‍♀️ "Linda" -> 🙆 "Luna" in the problem statement, while maintaining the same logical structure surprisingly confuses most LLMs. In one-shot learning, the accuracy of GPT-4 and Claude-3-opus decreases from 100.0% to 72.0% and from 95.0% to 32.0%, respectively (see the paper for detailed experiment setups).

Luna is 29 years old, married, deeply passionate about environmental conservation and transgender rights, and volunteers their weekends at local park clean-ups. They studied physics and applied math in college, and held several campaigns to reduce the campus’s carbon footprint. Which is more probable?

(a) Luna is an assistant professor in aerospace engineering and is an active member of an environmental advocacy group.

(b) Luna is an assistant professor in aerospace engineering.

<p align="center"> <img src=figures/luna_persona.png /> </p>
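For reference, the conjunction rule behind both the Linda and Luna problems can be written as follows. The symbols A and B are introduced here only for illustration: A is the single statement (for example, "Linda is a bank teller") and B is the extra statement joined to it (for example, "Linda is active in the feminist movement").

```latex
\[
A \cap B \subseteq A \quad\Longrightarrow\quad P(A \cap B) \le P(A)
\]
```

The single-statement option is therefore always at least as probable as the conjunction: option (a) in the Linda problem and option (b) in the Luna problem.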

In our paper, we explore many other token biases in logical reasoning, set theory, and mathematical reasoning problems. We reconceptualize the evaluation of reasoning capabilities into a general and rigorous statistical testing framework, moving beyond accuracy. We conclude, with statistical guarantees, that LLMs do not consistently apply genuine reasoning in their decision-making process, but primarily rely on token bias for response generation. Therefore, we raise concerns about the extent to which LLMs truly engage in reasoning. Any robust evaluation of an LLM's generalization should account for the fundamental impact of token bias hidden in the current benchmark problems.
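As a rough illustration of what testing for a directional shift can look like, the sketch below runs a one-sided exact McNemar (sign) test on paired correctness outcomes. This is an assumption made for illustration, not the paper's exact procedure; the function name `one_sided_mcnemar` is hypothetical and the toy outcomes are made up.

```python
from math import comb


def one_sided_mcnemar(original_correct, perturbed_correct):
    """Exact p-value for H1: the perturbation lowers accuracy on paired questions."""
    if len(original_correct) != len(perturbed_correct):
        raise ValueError("both lists must cover the same questions")
    pairs = list(zip(original_correct, perturbed_correct))
    # Discordant pairs: the answer changed between the two versions.
    b = sum(1 for o, p in pairs if o and not p)  # correct -> wrong
    c = sum(1 for o, p in pairs if p and not o)  # wrong -> correct
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a shift
    # Under H0 each discordant pair flips either way with probability 1/2,
    # so b follows Binomial(n, 0.5); sum the upper tail P(X >= b).
    return sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n


# Made-up toy outcomes for 10 paired questions (not results from the paper).
original = [True] * 10
perturbed = [True] * 4 + [False] * 6
p_value = one_sided_mcnemar(original, perturbed)
print(p_value)  # 0.015625
```

A small p-value here means the drop after perturbation is unlikely to be chance, which is the kind of evidence that plain accuracy numbers alone do not give.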

<p> <em> All images are generated by OpenAI GPT-4o. When we requested 'lop-eared bunnies', the model even displayed a visual token bias by generating bunnies with four ears (both lop and erect), suggesting it associated the term 'bunnies' with the presence of two erect ears without genuine logical understanding. </em> </p>

Citation

All the twenty-five bunnies above 🐰 will be happy if you could cite our work. Thank you!

@article{jiang2024peek,
  title={A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners},
  author={Jiang, Bowen and Xie, Yangxinyu and Hao, Zhuoqun and Wang, Xiaomeng and Mallick, Tanwi and Su, Weijie J and Taylor, Camillo J and Roth, Dan},
  journal={arXiv preprint arXiv:2406.11050},
  year={2024}
}

Dependencies

Please check requirements.txt. You can run the following commands to create a virtual environment and install all the requirements:

python -m venv myenv
source myenv/bin/activate
pip install -r requirements.txt

Dataset

We provide our synthetic dataset under data/, which contains a comprehensive set of logical-fallacy problems. The dataset file is in JSON format, and each item is a dictionary containing question_id, question, target_answer, and incorrect_answer. You can also follow the instructions below to generate more synthetic data on the fly.
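A minimal sketch of reading one dataset file and checking the four documented fields is shown below. It assumes the top level of the JSON file is a list of dictionaries, which is not confirmed here; `load_items` is a hypothetical helper, not part of the repository.

```python
import json

# Field names taken from the dataset description above.
REQUIRED_KEYS = {"question_id", "question", "target_answer", "incorrect_answer"}


def load_items(path):
    """Load a dataset file and verify every item has the documented fields.

    Assumption: the JSON top level is a list of dictionaries.
    """
    with open(path, encoding="utf-8") as f:
        items = json.load(f)
    for i, item in enumerate(items):
        missing = REQUIRED_KEYS - set(item)
        if missing:
            raise KeyError(f"item {i} is missing fields: {sorted(missing)}")
    return items
```

For example, `load_items("data/synthetic_dataset_linda_original_gold.json")` would return the items of that file, provided it exists under data/ and matches the assumed layout.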

LLM Setups

❤️ Always set up OpenAI ChatGPT models. Please follow the OpenAI Developer quickstart to set up your OpenAI API, create a new api_tokens/openai_key.txt file, and copy and paste your API key into it.

🧡 To use Google Gemini models with an API for inference, follow the instructions in the Try Gemini 1.0 Pro (Python) section of the Google Vertex AI documentation. Note that your school's Gmail account may not allow you to make payments.

💛 To use Meta Llama models with an API for inference, follow the instructions in the Running Llama 3 with Python section of Replicate's Run Llama 3 with an API guide to set up your API tokens, create a new api_tokens/llama_key.txt file, and copy and paste your tokens into it.

💚 To use Anthropic Claude models with an API for inference, follow its Quickstart Guide to install the Anthropic Python SDK, set up an account with API access, get your API key, create a new api_tokens/claude_key.txt file, and copy and paste your key into it. You don't need to set the environment variable ANTHROPIC_API_KEY.

💙 To use Mistral models with an API for inference, follow its Quickstart to install the mistralai library, set up an account with API access, get your API key, create a new api_tokens/mistral_key.txt file, and copy and paste your key into it. You don't need to set the environment variable MISTRAL_API_KEY.
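The key files described above are plain text files holding a single key. A minimal sketch of reading one is shown below; `read_api_key` is a hypothetical helper for illustration, and the repository's own loading code may differ.

```python
from pathlib import Path


def read_api_key(path):
    """Return the key stored in a text file, with surrounding whitespace removed."""
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"{path} is empty; paste your API key into it")
    return key
```

For example, `read_api_key("api_tokens/openai_key.txt")` returns the OpenAI key once that file has been created. Stripping whitespace guards against a trailing newline that editors often add when the key is pasted.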

Quick Start

main.py takes its options as command-line arguments, parsed with argparse.

For example, to generate synthetic data, you can run

python main.py --model gpt3.5 --task data --fallacy linda --gen_mode control --variant original --n 100 --verbose

in the command line and adjust model, fallacy, gen_mode, variant, and n accordingly. All the other hyper-parameters can be set in config.yaml. Generated files will be saved to the data/ directory.

To start inference, run

python main.py --model gpt3.5 --task inference --fallacy linda --eval_mode os_cot --data_file synthetic_dataset_linda_original_gold.json --verbose

in the command line and adjust model, eval_mode, and data_file accordingly.
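The flags used in the two commands above could be declared with argparse roughly as follows. This is a hypothetical reconstruction, not the repository's actual parser: the flag names come from the commands, while the types, defaults, and help strings are assumptions.

```python
import argparse


def build_parser():
    # Hypothetical sketch: flag names are taken from the example commands;
    # types, defaults, and help text are assumptions.
    parser = argparse.ArgumentParser(description="Sketch of a main.py-style CLI")
    parser.add_argument("--model", type=str, help="model name, e.g. gpt3.5")
    parser.add_argument("--task", type=str, help="e.g. data or inference")
    parser.add_argument("--fallacy", type=str, help="fallacy type, e.g. linda")
    parser.add_argument("--gen_mode", type=str, help="generation mode, e.g. control")
    parser.add_argument("--variant", type=str, help="problem variant, e.g. original")
    parser.add_argument("--eval_mode", type=str, help="prompting method, e.g. os_cot")
    parser.add_argument("--data_file", type=str, help="dataset file name")
    parser.add_argument("--n", type=int, help="number of problems to generate")
    parser.add_argument("--verbose", action="store_true", help="print progress")
    return parser


# Parse the data-generation command from above without touching sys.argv.
args = build_parser().parse_args(
    "--model gpt3.5 --task data --fallacy linda --gen_mode control "
    "--variant original --n 100 --verbose".split()
)
print(args.model, args.task, args.n, args.verbose)
```

Flags that are not passed, such as `--eval_mode` in this call, default to `None` in this sketch.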

To efficiently run the evaluation with multiple prompting methods, models, and/or data files in parallel, please set the number of available GPU devices and adjust the code in run.sh. Then run

bash run.sh

All results and final accuracies will be automatically saved to the outputs/ directory.