Awesome

This is the official repository of the paper "Towards Rationality in Language and Multimodal Agents: A Survey"

arXiv Google Scholar

Unlike reasoning, which aims to draw conclusions from premises, rationality ensures that those conclusions are reliably consistent, exhibit an orderability of preference, and are aligned with evidence from various sources and with logical principles. Rationality becomes increasingly important as human users apply these agents in critical sectors such as healthcare and finance, which expect consistent and reliable decision-making. This survey is the first to comprehensively explore the notion of rationality 🧠 in language and multimodal agents 🤖.

<p align="center"> <img src=header.png /> </p>

The fields of language and multimodal agents are rapidly evolving, so we highly encourage researchers to submit a pull request and promote their amazing work on this dynamic repository. 💜

We have a concurrent work, A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners Paper Code, at EMNLP 2024 (main conference), which recasts the evaluation of reasoning capabilities in LLMs as a general, statistically rigorous framework. Its findings reveal that LLMs are susceptible to superficial token perturbations, relying primarily on token biases rather than genuine reasoning.
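To give a flavor of that framework, here is a minimal sketch of the kind of token-perturbation probe the paper motivates: two syllogisms with identical logical form, one using unfamiliar tokens and one using familiar ones, should receive the same verdict from a genuine reasoner. `query_llm` is a hypothetical stand-in for any chat-completion call, not part of the paper's codebase.

```python
# Minimal token-bias probe: the two prompts share one logical form, so a
# genuine reasoner should answer them identically. `query_llm` is a
# hypothetical stand-in for any chat-completion API call.

ABSTRACT = ("All bloops are razzies. All razzies are lazzies. "
            "Are all bloops lazzies? Answer yes or no.")
FAMILIAR = ("All dogs are mammals. All mammals are animals. "
            "Are all dogs animals? Answer yes or no.")

def disagreement_rate(query_llm, n_trials: int = 20) -> float:
    """Fraction of trials where the logically equivalent prompts disagree."""
    flips = 0
    for _ in range(n_trials):
        a = query_llm(ABSTRACT).strip().lower()
        b = query_llm(FAMILIAR).strip().lower()
        flips += a.startswith("yes") != b.startswith("yes")
    return flips / n_trials
```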

Citations

This bunny 🐰 will be happy if you cite our work. (Google Scholar is still indexing our old title.)

@misc{jiang2024multimodal,
      title={Multi-Modal and Multi-Agent Systems Meet Rationality: A Survey}, 
      author={Bowen Jiang and Yangxinyu Xie and Xiaomeng Wang and Weijie J. Su and Camillo J. Taylor and Tanwi Mallick},
      year={2024},
      eprint={2406.00252},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}

Define Rationality

Rationality is the quality of being guided by reason, characterized by logical thinking and decision-making that align with evidence and logical rules. Drawing on foundational work in cognitive science on rational decision-making, we present four necessary, though not sufficient, axioms that we expect a rational agent or agent system to fulfill:

Information grounding: decisions are grounded in the physical and factual evidence available across modalities.
Logical consistency: logically equivalent formulations of a problem lead to the same conclusion, reliably across runs.
Invariance from irrelevant information: conclusions do not change when irrelevant context is added or surface forms are perturbed.
Orderability of preference: alternatives can be compared and ranked consistently, without preference cycles.
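To make the orderability axiom concrete, here is a toy, self-contained check (not from the paper): strict pairwise preferences elicited from an agent should contain no cycles.

```python
from itertools import permutations

def is_orderable(items, prefers) -> bool:
    """True if strict pairwise preferences are transitive (no A>B>C>A cycles)."""
    for a, b, c in permutations(items, 3):
        if prefers(a, b) and prefers(b, c) and not prefers(a, c):
            return False
    return True

# Example with a hand-written preference induced by expected payoffs;
# in practice, `prefers` would wrap the agent under test.
payoff = {"bond": 1, "stock": 2, "lottery": 3}
print(is_orderable(payoff, lambda x, y: payoff[x] > payoff[y]))  # True
```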

Towards Rationality in Agents

<p align="center"> <img src=tree.png /> </p>

Bold fonts mark works that involve multiple modalities. In their original writing, most existing studies do not explicitly base their frameworks on rationality. Our analysis reinterprets these works through the lens of our four axioms of rationality, offering a novel perspective that bridges existing methodologies with rational principles.

1. Advancing Information Grounding

<p align="center"> <img src=grounding.png /> </p>

1.1. Grounding on multimodal information

CLIP: Learning transferable visual models from natural language supervision Paper Code
iNLG: Imagination-guided open-ended text generation Paper Code
BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models Paper Code
Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning Paper Code
MiniGPT-4: Enhancing vision-language understanding with advanced large language models Paper Code
Flamingo: a visual language model for few-shot learning Paper
OpenFlamingo: An open-source framework for training large autoregressive vision-language models Paper Code
LLaVA: Visual Instruction Tuning Paper Code
LLaVA 1.5: Improved Baselines with Visual Instruction Tuning Paper Code
CogVLM: Visual expert for pretrained language models Paper Code
GPT-4V(ision) System Card Paper
Gemini: A Family of Highly Capable Multimodal Models Paper
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context Paper
GPT-4o Website
JEPA: A Path Towards Autonomous Machine Intelligence Paper
Voyager: An open-ended embodied agent with large language models Paper Code
Ghost in the Minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory Paper Code
Objective-Driven AI Slides
LWM: World Model on Million-Length Video And Language With RingAttention Paper Code
Sora: Video generation models as world simulators Website
IWM: Learning and Leveraging World Models in Visual Representation Learning Paper
CubeLLM: Language-Image Models with 3D Understanding Paper Code
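As a concrete example of grounding language in visual evidence, here is a minimal zero-shot matching sketch built on CLIP (the first entry above) via the Hugging Face transformers API; the image path and candidate captions are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Score candidate captions against an image in CLIP's shared embedding
# space, grounding free-form text in visual evidence.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder path
captions = ["a photo of a dog", "a photo of a cat", "a circuit diagram"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, num_captions)
for caption, p in zip(captions, logits.softmax(dim=-1).squeeze().tolist()):
    print(f"{p:.3f}  {caption}")
```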

1.2. Expanding working memory from external knowledge retrieval and tool utilization

RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks Paper
Minedojo: Building open-ended embodied agents with internet-scale knowledge Paper Code
ReAct: Synergizing reasoning and acting in language models Paper Code
RA-CM3: Retrieval-Augmented Multimodal Language Modeling Paper
Chameleon: Plug-and-play compositional reasoning with large language models Paper Code
Chain of knowledge: A framework for grounding large language models with structured knowledge bases Paper Code
SIRI: Towards Top-Down Reasoning: An Explainable Multi-Agent Approach for Visual Question Answering Paper
CooperKGC: Multi-Agent Synergy for Improving Knowledge Graph Construction Paper Code
DoraemonGPT: Toward understanding dynamic scenes with large language models Paper Code
WildfireGPT: Tailored Large Language Model for Wildfire Analysis Paper Code
Chain-of-Action: Faithful and Multimodal Question Answering through Large Language Models Paper
CuriousLLM: Elevating Multi-Document QA with Reasoning-Infused Knowledge Graph Prompting Paper Code
Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents Paper
Visual Programming: Compositional visual reasoning without training Paper Code
Parsel: Algorithmic Reasoning with Language Models by Composing Decompositions Paper Code
Toolformer: Language Models Can Teach Themselves to Use Tools Paper Code
BabyAGI Code
ViperGPT: Visual inference via python execution for reasoning Paper Code
HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face Paper Code
AutoGPT: build & use AI agents Code
ToolAlpaca: Generalized tool learning for language models with 3000 simulated cases Paper Code
AssistGPT: A general multi-modal assistant that can plan, execute, inspect, and learn Paper Code
Avis: Autonomous visual information seeking with large language model agent Paper
BuboGPT: Enabling visual grounding in multi-modal llms Paper Code
MemGPT: Towards llms as operating systems Paper Code
MetaGPT: Meta programming for multi-agent collaborative framework Paper Code
Agent LUMOS: Learning agents with unified data, modular design, and open-source llms Paper Code
AutoAct: Automatic Agent Learning from Scratch for QA via Self-Planning Paper Code
Small LLMs Are Weak Tool Learners: A Multi-LLM Agent Paper Code
DeLLMa: A Framework for Decision Making Under Uncertainty with Large Language Models Paper Code
ConAgents: Learning to Use Tools via Cooperative and Interactive Agents Paper Code
Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering Paper Code
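The common thread of this subsection is the retrieve-then-generate loop. A minimal sketch, assuming a plain TF-IDF retriever and a hypothetical `query_llm` chat call; real systems substitute dense retrievers, knowledge graphs, or tool APIs.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [  # placeholder knowledge base
    "Passage about wildfire risk under climate change.",
    "Passage about retrieval-augmented generation.",
    "Passage about multimodal agent architectures.",
]

def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k passages most similar to the question under TF-IDF."""
    vec = TfidfVectorizer().fit(corpus + [question])
    scores = cosine_similarity(vec.transform([question]), vec.transform(corpus)).ravel()
    return [corpus[i] for i in scores.argsort()[::-1][:k]]

def grounded_answer(question: str, query_llm) -> str:
    """Condition the model on retrieved evidence instead of parametric memory."""
    evidence = "\n".join(retrieve(question))
    return query_llm(f"Answer using only this evidence:\n{evidence}\n\nQ: {question}\nA:")
```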

2. Advancing Logical Consistency

<p align="center"> <img src=consistency.png /> </p>

2.1. Consensus from reflection and multi-agent collaboration

CoT: Chain-of-thought prompting elicits reasoning in large language models Paper
Self-Refine: Iterative refinement with self-feedback Paper Code
Reflexion: Language agents with verbal reinforcement learning Paper Code
FORD: Examining Inter-Consistency of Large Language Models Collaboration: An In-depth Analysis via Debate Paper Code
Memorybank: Enhancing large language models with long-term memory Paper Code
LM vs LM: Detecting factual errors via cross examination Paper
Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents Paper
Improving factuality and reasoning in language models through multiagent debate Paper Code
MAD: Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate Paper Code
S3: Social-network Simulation System with Large Language Model-Empowered Agents Paper
ChatDev: Communicative agents for software development Paper Code
ChatEval: Towards better llm-based evaluators through multi-agent debate Paper Code
AutoGen: Enabling next-gen llm applications via multi-agent conversation framework Paper Code
Corex: Pushing the boundaries of complex reasoning through multi-model collaboration Paper Code
DyLAN: Dynamic llm-agent network: An llm-agent collaboration framework with agent team optimization Paper Code
AgentCF: Collaborative learning with autonomous language agents for recommender systems Paper
MetaAgents: Simulating interactions of human behaviors for llm-based task-oriented coordination via collaborative generative agents Paper
Social Learning: Towards Collaborative Learning with Large Language Models Paper
Enhancing Diagnostic Accuracy through Multi-Agent Conversations: Using Large Language Models to Mitigate Cognitive Bias Paper
Combating Adversarial Attacks with Multi-Agent Debate Paper
Debating with More Persuasive LLMs Leads to More Truthful Answers Paper Code
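Most debate frameworks above share a simple skeleton: independent answers, rounds of mutual critique, then aggregation. A minimal sketch with a hypothetical `query_llm` call:

```python
def debate(question: str, query_llm, rounds: int = 2) -> str:
    """Two agents answer, critique each other, then a judge aggregates."""
    answers = [query_llm(f"Answer concisely: {question}") for _ in range(2)]
    for _ in range(rounds):
        answers = [
            query_llm(
                f"Question: {question}\n"
                f"Your answer: {answers[i]}\n"
                f"The other agent's answer: {answers[1 - i]}\n"
                "Critique the other answer, then give your revised answer."
            )
            for i in range(2)
        ]
    # A judge call aggregates the final positions into one consensus answer.
    return query_llm(
        f"Question: {question}\nFinal agent answers: {answers}\n"
        "Return the single best-supported answer."
    )
```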

2.2. Consistent execution from symbolic reasoning and tool utilization

Refer to Section 1.2 for tool utilization.

Binder: Binding language models in symbolic languages Paper Code
Parsel: Algorithmic Reasoning with Language Models by Composing Decompositions Paper Code
Sparks of artificial general intelligence: Early experiments with gpt-4 Paper
Logic-LM: Empowering large language models with symbolic solvers for faithful logical reasoning Paper Code
Minding Language Models' (Lack of) Theory of Mind: A Plug-and-Play Multi-Character Belief Tracker Paper Code
Towards formal verification of neuro-symbolic multi-agent systems Paper
What's Left? Concept Grounding with Logic-Enhanced Foundation Models Paper Code
Ada: Learning adaptive planning representations with natural language guidance Paper
Large language models are neurosymbolic reasoners Paper Code
DoraemonGPT: Toward understanding dynamic scenes with large language models Paper Code
A Neuro-Symbolic Approach to Multi-Agent RL for Interpretability and Probabilistic Decision Making Paper
Conceptual and Unbiased Reasoning in Language Models Paper
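The shared recipe here is to let the LLM translate language into a symbolic form and delegate the actual inference to a deterministic solver, which makes execution consistent across runs. A minimal sketch with a brute-force propositional entailment check standing in for the solver; the premise strings are what a (hypothetical) LLM translation step would emit.

```python
from itertools import product

def entails(premises: list[str], conclusion: str, variables: list[str]) -> bool:
    """True if every truth assignment satisfying the premises satisfies the conclusion."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        # Formulas come from a constrained translation step, evaluated as
        # plain boolean expressions over the listed variables.
        if all(eval(p, {}, env) for p in premises) and not eval(conclusion, {}, env):
            return False
    return True

premises = ["rain or sprinkler", "not sprinkler"]  # LLM-translated formulas
print(entails(premises, "rain", ["rain", "sprinkler"]))  # True
```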

3. Advancing Invariance from Irrelevant Information

<p align="center"> <img src=invariance.png /> </p>

3.1. Representation invariance across modalities

3.2. Abstraction from symbolic reasoning and tool utilization

4. Advancing Orderability of Preference

<p align="center"> <img src=preference.png /> </p>

4.1. Learning preference from reinforcement learning

4.2. Maximizing utility functions and controlling conformal risks
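On the conformal-risk side, the core recipe is split conformal prediction: calibrate a score threshold so that prediction sets cover the truth with probability at least 1 - alpha. A generic sketch of that standard recipe, not any specific paper's method:

```python
import numpy as np

def conformal_threshold(cal_scores: np.ndarray, alpha: float = 0.1) -> float:
    """Threshold from nonconformity scores of true labels on a calibration set."""
    n = len(cal_scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return float(np.quantile(cal_scores, level, method="higher"))

# At test time, include every candidate whose score clears the bar:
# prediction_set = [y for y in labels if score(x, y) <= threshold]
```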

Evaluating Rationality in Agents

0. General Benchmarks or Evaluation Metrics

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge Paper Data
LogiQA: a challenge dataset for machine reading comprehension with logical reasoning Paper Data
Logiqa 2.0: an improved dataset for logical reasoning in natural language understanding Paper Data
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering Paper Data
Measuring mathematical problem solving with the math dataset Paper Data
HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data Paper Data
Conceptual and Unbiased Reasoning in Language Models Paper
Large language model evaluation via multi AI agents: Preliminary results Paper
AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents Paper Code
AgentBench: Evaluating LLMs as Agents Paper Code
Benchmark self-evolving: A multi-agent framework for dynamic llm evaluation Paper Code
Llm-deliberation: Evaluating llms with interactive multi-agent negotiation games Paper Code
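A typical evaluation loop over these benchmarks is only a few lines; here is a sketch for CommonsenseQA using the Hugging Face datasets library (the dataset id may vary by hub version), with `query_llm` as a hypothetical stand-in for the agent under test.

```python
from datasets import load_dataset

data = load_dataset("commonsense_qa", split="validation")

def accuracy(query_llm, n: int = 100) -> float:
    """Exact-match accuracy on the first n multiple-choice questions."""
    correct = 0
    for ex in data.select(range(n)):
        options = "\n".join(
            f"{label}. {text}"
            for label, text in zip(ex["choices"]["label"], ex["choices"]["text"])
        )
        reply = query_llm(f"{ex['question']}\n{options}\nAnswer with a single letter.")
        correct += reply.strip()[:1].upper() == ex["answerKey"]
    return correct / n
```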

1. Evaluating Information Grounding

A multi-task, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity Paper Data
Hallucinations in large multilingual translation models Paper Data
Evaluating attribution in dialogue systems: The BEGIN benchmark Paper Data
HaluEval: A large-scale hallucination evaluation benchmark for large language models Paper Data
DialFact: A benchmark for fact-checking in dialogue Paper Data
FaithDial: A faithful benchmark for information-seeking dialogue Paper Data
AIS: Measuring attribution in natural language generation models Paper Data
Why does ChatGPT fall short in providing truthful answers Paper
FADE: Diving deep into modes of fact hallucinations in dialogue systems Paper Code
Hallucinated but factual! inspecting the factuality of hallucinations in abstractive summarization Paper
Exploring and evaluating hallucinations in llm-powered code generation Paper
EureQA: Deceiving semantic shortcuts on reasoning chains: How far can models go without hallucination Paper Code
TofuEval: Evaluating hallucinations of llms on topic-focused dialogue summarization Paper Code
Object hallucination in image captioning Paper Code
Let there be a clock on the beach: Reducing object hallucination in image captioning Paper Code
Evaluating object hallucination in large vision-language models Paper Code
LLaVA-RLHF: Aligning large multimodal models with factually augmented RLHF Paper Code
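A crude but common way to automate grounding checks, in the spirit of the attribution benchmarks above, is to require that the retrieved evidence entail the generated answer under an off-the-shelf NLI model. A sketch using roberta-large-mnli; the entailment threshold is an arbitrary choice:

```python
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def is_grounded(evidence: str, answer: str, threshold: float = 0.8) -> bool:
    """Flag an answer as grounded only if the evidence entails it."""
    result = nli([{"text": evidence, "text_pair": answer}])[0]
    return result["label"] == "ENTAILMENT" and result["score"] >= threshold
```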

2. Evaluating Logical Consistency

Mind the instructions: a holistic evaluation of consistency and interactions in prompt-based learning Paper
Rethinking benchmark and contamination for language models with rephrased samples Paper Code
From Form(s) to Meaning: Probing the semantic depths of language models using multisense consistency Paper Code
Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity Paper Code
On sensitivity of learning with limited labelled data to the effects of randomness: Impact of interactions and systematic choices Paper
Benchmark self-evolving: A multi-agent framework for dynamic llm evaluation Paper Code
Exploring multilingual human value concepts in large language models: Is value alignment consistent, transferable and controllable across languages? Paper Code
Fool your (vision and) language model with embarrassingly simple permutations Paper Code
Large language models are not robust multiple choice selectors Paper Code
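The last two entries suggest a consistency probe you can run on any model: permute the answer options and measure how often the selected content changes. A sketch with `query_llm` as a hypothetical stand-in; it assumes the model replies with a single option letter.

```python
import random

def permutation_flip_rate(question, options, query_llm, trials: int = 10) -> float:
    """Fraction of shuffles where the chosen *content* differs from the first trial."""
    flips, baseline = 0, None
    for _ in range(trials):
        order = random.sample(options, len(options))
        prompt = question + "\n" + "\n".join(
            f"{chr(65 + i)}. {opt}" for i, opt in enumerate(order)
        )
        letter = query_llm(prompt + "\nAnswer with a single letter.").strip()[:1]
        chosen = order[ord(letter.upper()) - 65]  # assumes a well-formed reply
        baseline = baseline or chosen
        flips += chosen != baseline
    return flips / trials
```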

3. Evaluating Invariance from Irrelevant Information

Large language models can be easily distracted by irrelevant context Paper
How easily do irrelevant inputs skew the responses of large language models? Paper Code
Lost in the middle: How language models use long context Paper Code
Making retrieval-augmented language models robust to irrelevant context Paper Code
Towards AI-complete question answering: A set of prerequisite toy tasks Paper Code
CLUTRR: A diagnostic benchmark for inductive reasoning from text Paper Code
Transformers as soft reasoners over language Paper Code
Do prompt-based models really understand the meaning of their prompts? Paper Code
MileBench: Benchmarking MLLMs in long context Paper Code
Mementos: A comprehensive benchmark for multimodal large language model reasoning over image sequences Paper Code
SEED-Bench-2: Benchmarking multimodal large language models Paper Code
DEMON: Finetuning multimodal llms to follow zero-shot demonstrative instructions Paper Code
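Distraction tests like those above reduce to a paired comparison: inject an irrelevant sentence and check whether the answer survives. A minimal sketch with a hypothetical `query_llm` call; the distractor string is arbitrary.

```python
DISTRACTOR = "Note: the weather in Paris was sunny last Tuesday."

def distracted(question: str, query_llm) -> bool:
    """True if an irrelevant sentence changes the model's answer."""
    clean = query_llm(question).strip()
    noisy = query_llm(f"{DISTRACTOR}\n{question}").strip()
    return clean != noisy
```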