🪞Reflection-Bench: probing AI intelligence with reflection

<p align="center"> <img src="./figs/Las_Meninas.jpg" width="500"> <br> <em>Las Meninas</em>, Diego Velázquez, 1656 </p>

Motivation: to what extent do LLMs possess intelligence?

Geoffrey Hinton endorsed SB 1047, the strict AI safety bill, because he believes that LLMs are actually reasoning and understanding. Yann LeCun, by contrast, argued that Hinton's worries are premature and stem from overestimating LLMs' intelligence in prediction, planning, common sense, and so on. This long-running debate on AI intelligence directly shapes our understanding of AI trustworthiness and regulation.

Many studies explore this question from various angles, such as reasoning, planning, cognitive flexibility, and self-cognition. These angles, however, appear to be interconnected in an under-explained way that relates to the epistemology of AI systems. We therefore aim to clarify this puzzle through the lens of cognitive science and to evaluate the general process of intelligence underlying these aspects.

Reflection: the general process of intelligent systems

<p align="center"> <img src="./figs/reflection.png" width="700"> <br> <em>Reflection & Meta-reflection</em> </p>

Intelligent systems that exist in an uncertain world must interact with, learn about, and adapt to their environment. One emerging school in cognitive science describes these systems, from first principles, as predictive machines that continually predict what will happen next using their internal models. This energy-saving strategy lets a system adapt to the environment flexibly by focusing only on minimizing the unexpected, which it does by updating its thoughts or actions. We define the general process of such intelligence in everyday life as reflection: predicting based on priors, making decisions that lead to a desired state, perceiving the mismatch between observation and prediction, and updating prior beliefs accordingly. Reflection is a complex capability that requires cognitive elements including perception, memory, belief updating, decision making, prediction, counterfactual thinking, and meta-reflection.

Therefore, we can evaluate the intelligence of AI systems by focusing on this general process of intelligence, which is the purpose of Reflection-Bench.
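The predict, decide, observe, update cycle described above can be illustrated with a toy agent. The following is a minimal sketch only, not the benchmark's implementation: the two-armed reversal environment, the function name `run_reflection_loop`, the learning rate `lr`, and the reversal point `reversal_at` are all assumptions introduced for illustration.

```python
import random


def run_reflection_loop(n_trials=40, p=0.9, reversal_at=20, lr=0.3, seed=0):
    """Toy predict -> decide -> observe -> update loop on a two-armed reversal task.

    All parameters are hypothetical illustration values, not benchmark settings.
    """
    rng = random.Random(seed)
    belief = [0.5, 0.5]  # prior: estimated reward probability of each arm
    good_arm = 0         # arm that currently pays off with probability p
    total_reward = 0
    history = []
    for t in range(n_trials):
        if t == reversal_at:
            good_arm = 1 - good_arm  # hidden reversal of the reward contingencies
        # 1. Predict and decide: choose the arm believed to be more rewarding
        #    (ties go to arm 0).
        choice = 0 if belief[0] >= belief[1] else 1
        predicted = belief[choice]
        # 2. Observe the outcome produced by the environment.
        p_reward = p if choice == good_arm else 1 - p
        reward = 1 if rng.random() < p_reward else 0
        # 3. Perceive the mismatch between observation and prediction.
        error = reward - predicted
        # 4. Update the prior belief in proportion to the mismatch.
        belief[choice] += lr * error
        total_reward += reward
        history.append((t, choice, reward, round(error, 3)))
    return total_reward, belief, history


if __name__ == "__main__":
    total, belief, _ = run_reflection_loop()
    print(f"total reward: {total}, final beliefs: {belief}")
```

In this sketch the agent never sees the reversal directly. It can only notice that its predictions start failing and shift its belief accordingly, which is the kind of adaptation the reflection process is meant to capture.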

Reflection-Bench

Reflection-Bench involves 7 tasks corresponding to different cognitive elements required in reflection.

<p align="center"> <img src="./figs/reflectionbench.png" width="900"> <br> <em>Reflection-Bench: assessment architecture</em> </p>

Experiment

| Cognition focus | Task | Trials | Sessions |
| --- | --- | --- | --- |
| Perception | Oddball Paradigm | 50 | 3 |
| Working memory | N-back task, n = 2 | 52 | 2 |
| Belief updating | Probability Reversal Task (PRT), p = 0.9 | 40 | 2 |
| Decision making | Wisconsin Card Sorting Task | 108 | 2 |
| Prediction | Weather Prediction Task | 100 | 2 |
| Counterfactual | Double-Choice Iowa Gambling Task | 100 | 2 |
| Meta-reflection | Meta-PRT, interval = 3, p = 1 | 60 | 2 |

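
| Model | $ / 1M input tokens | $ / 1M output tokens | Actual cost ($) |
| --- | --- | --- | --- |
| o1-preview | 15 | 60 | 281 |
| o1-mini | 3 | 12 | 57 |
| gpt-4 | 10 | 30 | 45 |
| gpt-4o | 5 | 15 | 20.1 |
| gpt-4o-mini | 0.15 | 0.6 | 0.6 |
| claude-3.5-sonnet | 3 | 15 | 16.5 |
| gemini-1.5-pro | 3.5 | 10.5 | 12 |
| llama-3.1-405b | 6 | 6 | 18 |
| llama-3.1-70b | 0.35 | 0.4 | 1.48 |
| llama-3.1-8b | 0.05 | 0.05 | 0.27 |
| qwen-2.5-72b | 0.57 | 1.71 | 0 |
| qwen-2.5-32b | 0.5 | 1 | 0 |
| qwen-2.5-14b | 0.28 | 0.56 | 0 |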
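
The actual cost of a run depends on token volume, which is not reported here. As a rough sketch of how the per-million-token prices in the pricing table translate into a bill, the helper below applies the standard formula. The function name `estimate_cost_usd` and the token counts in the example are hypothetical and are not measurements from the experiments.

```python
def estimate_cost_usd(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Return the API cost in USD for given token counts and per-million-token prices."""
    input_cost = (input_tokens / 1_000_000) * usd_per_m_input
    output_cost = (output_tokens / 1_000_000) * usd_per_m_output
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical token volumes, chosen only for illustration.
    cost = estimate_cost_usd(
        input_tokens=2_000_000,
        output_tokens=500_000,
        usd_per_m_input=5,    # gpt-4o input price from the table
        usd_per_m_output=15,  # gpt-4o output price from the table
    )
    print(f"estimated cost: ${cost:.2f}")
```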

Results

<p align="center"> <img src="./figs/outcomes.png" width="700"> </p> <p align="center"> <img src="./figs/overall_scores.png" width="700"> <br> <em>Performance of 13 models on Reflection-Bench</em> </p>

Paper

For detailed information about Reflection-Bench, please read our paper!

You can cite Reflection-Bench as:

@misc{li2024reflectionbenchprobingaiintelligence,
      title={Reflection-Bench: probing AI intelligence with reflection}, 
      author={Lingyu Li and Yixu Wang and Haiquan Zhao and Shuqi Kong and Yan Teng and Chunbo Li and Yingchun Wang},
      year={2024},
      eprint={2410.16270},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2410.16270}, 
}