# Awesome Large Multimodal Agents
Last update: 09/25/2024
<img src="./img/time.png" width="96%" height="96%"><font size=5><center><b> Table of Contents </b> </center></font>
Papers
Taxonomy
<img src="./img/table.png" width="96%" height="96%">Type Ⅰ
-
CLOVA - CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update
-
CRAFT - CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets
-
ViperGPT - ViperGPT: Visual Inference via Python Execution for Reasoning Github
-
HuggingGPT - HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face Github
-
Chameleon - Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models Github
-
Visual ChatGPT - Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models Github
-
AssistGPT - AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn Github
-
M3 - Towards Robust Multi-Modal Reasoning via Model Selection Github
-
VisProgram - Visual Programming: Compositional visual reasoning without training
-
DDCoT - DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models Github
-
ASSISTGUI - ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation Github
-
GPT-Driver - GPT-Driver: Learning to Drive with GPT Github
-
LLaVA-Interactive - LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing Github
-
MusicAgent - MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models Github
-
AudioGPT - AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head Github
-
DroidBot-GPT - DroidBot-GPT: GPT-powered UI Automation for Android Github
-
GRID - GRID: A Platform for General Robot Intelligence Development Github
-
DEPS - Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents Github
-
MM-REACT - MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action Github
-
MuLan - MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion Github
-
Mobile-Agent - Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception Github
-
SeeAct - GPT-4V(ision) is a Generalist Web Agent, if Grounded Github
#### Type Ⅱ

- **STEVE** - See and Think: Embodied Agent in Virtual Environment [Github]
- **EMMA** - Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld [Github]
- **MLLM-Tool** - MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning [Github]
- **LLaVA-Plus** - LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills [Github]
- **GPT4Tools** - GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction [Github]
- **WebWISE** - WebWISE: Web Interface Control and Sequential Exploration with Large Language Models
- **Auto-UI** - You Only Look at Screens: Multimodal Chain-of-Action Agents [Github]
#### Type Ⅲ

- **DoraemonGPT** - DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models [Github]
- **ChatVideo** - ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System [Github]
- **VideoAgent** - VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding [Project page]
#### Type Ⅳ

- **JARVIS-1** - JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models [Github]
- **AppAgent** - AppAgent: Multimodal Agents as Smartphone Users [Github]
- **MM-Navigator** - GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation [Github]
- **Loop Copilot** - Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative Editing [Github]
- **WavJourney** - WavJourney: Compositional Audio Creation with Large Language Models [Github]
- **DLAH** - Drive Like a Human: Rethinking Autonomous Driving with Large Language Models [Github]
- **Cradle** - Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study [Github]
- **VideoAgent** - VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding [Project page]
#### Multi-Agent

- **MP5** - MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception [Github]
- **MemoDroid** - Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation
- **AVIS** - AVIS: Autonomous Visual Information Seeking with Large Language Model Agent
- **Agent Smith** - Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast [Github]
- **GenAI** - The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal Narratives [Github]
- **P2H** - Propaganda to Hate: A Multimodal Analysis of Arabic Memes with Multi-Agent LLMs
## Application

<img src="./img/app.png" width="96%" height="96%">

### 💡 Complex Visual Reasoning Tasks

- **ViperGPT** - ViperGPT: Visual Inference via Python Execution for Reasoning [Github]
- **HuggingGPT** - HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face [Github]
- **Chameleon** - Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models [Github]
- **Visual ChatGPT** - Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models [Github]
- **AssistGPT** - AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn [Github]
- **LLaVA-Plus** - LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills [Github]
- **GPT4Tools** - GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction [Github]
- **MLLM-Tool** - MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning [Github]
- **M3** - Towards Robust Multi-Modal Reasoning via Model Selection [Github]
- **VisProgram** - Visual Programming: Compositional visual reasoning without training
- **DDCoT** - DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models [Github]
- **AVIS** - AVIS: Autonomous Visual Information Seeking with Large Language Model Agent
- **CLOVA** - CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update
- **CRAFT** - CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets
- **MuLan** - MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion [Github]
🎵 Audio Editing & Generation
-
Copilot - Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative Editing Github
-
MusicAgent - MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models Github
-
AudioGPT - AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head Github
-
WavJourney - WavJourney: Compositional Audio Creation with Large Language Models Github
-
OpenOmni - OpenOmni: A Collaborative Open Source Tool for Building Future-Ready Multimodal Conversational Agents Github
🤖 Embodied AI & Robotics
-
JARV IS-1 - JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models Github
-
DEPS - Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents Github
-
Octopus - Octopus: Embodied Vision-Language Programmer from Environmental Feedback Github
-
GRID - GRID: A Platform for General Robot Intelligence Development Github
-
MP5 - MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception Github
-
STEVE - See and Think: Embodied Agent in Virtual Environment Github
-
EMMA - Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld Github
-
MEIA - Multimodal Embodied Interactive Agent for Cafe Scene
### 🖱️💻 UI Assistants

- **AppAgent** - AppAgent: Multimodal Agents as Smartphone Users [Github]
- **DroidBot-GPT** - DroidBot-GPT: GPT-powered UI Automation for Android [Github]
- **WebWISE** - WebWISE: Web Interface Control and Sequential Exploration with Large Language Models
- **Auto-UI** - You Only Look at Screens: Multimodal Chain-of-Action Agents [Github]
- **MemoDroid** - Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation
- **ASSISTGUI** - ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation [Github]
- **MM-Navigator** - GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation [Github]
- **AutoDroid** - Empowering LLM to use Smartphone for Intelligent Task Automation [Github]
- **GPT-4V-Act** - GPT-4V-Act: Chromium Copilot [Github]
- **Mobile-Agent** - Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception [Github]
- **OpenAdapt** - OpenAdapt: AI-First Process Automation with Large Multimodal Models [Github]
- **EnvDistraction** - Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions [Github]
🎨 Visual Generation & Editing
-
LLaVA-Interactive - LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing Github
-
MM-REACT - MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action Github
-
SeeAct - GPT-4V(ision) is a Generalist Web Agent, if Grounded Github
-
GenAI - The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal Narratives Github
-
GenArtist - GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing Github
### 🎥 Video Understanding

- **DoraemonGPT** - DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models [Github]
- **ChatVideo** - ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System [Github]
- **AssistGPT** - AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn [Github]
- **VideoAgent-M** - VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding [Project page]
- **VideoAgent-L** - VideoAgent: Long-form Video Understanding with Large Language Model as Agent [Project page]
- **Kubrick** - Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation [Github]
- **Anim-Director** - Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation [Github]
### 🚗 Autonomous Driving

- **GPT-Driver** - GPT-Driver: Learning to Drive with GPT [Github]
- **DLAH** - Drive Like a Human: Rethinking Autonomous Driving with Large Language Models [Github]
### 🎮 Game Agents

- **SmartPlay** - SmartPlay: A Benchmark for LLMs as Intelligent Agents [Github]
- **VisualWebArena** - VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks [Github]
- **Cradle** - Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study [Github]
- Can AI Prompt Humans? Multimodal Agents Prompt Players’ Game Actions and Show Consequences to Raise Sustainability Awareness [Github]
### Other

- **FinAgent** - A Multimodal Foundation Agent for Financial Trading: Tool-Augmented, Diversified, and Generalist
- **VisionGPT** - VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework
- **WirelessAgent** - WirelessAgent: Large Language Model Agents for Intelligent Wireless Networks
- **PhishAgent** - PhishAgent: A Robust Multimodal Agent for Phishing Webpage Detection
- **MMRole** - MMRole: A Comprehensive Framework for Developing and Evaluating Multimodal Role-Playing Agents [Github]
## Benchmark

- **SmartPlay** - SmartPlay: A Benchmark for LLMs as Intelligent Agents [Github]
- **VisualWebArena** - VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks [Github]
- **Mind2Web** - Mind2Web: Towards a Generalist Agent for the Web [Github]
- **OmniACT** - OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
- **DSBench** - DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? [Github]