VisualAgentBench (VAB)

<p align="center"> <a href="" target="_blank">🌐 Website</a> | <a href="https://arxiv.org/abs/2408.06327" target="_blank">📃 Paper</a> | <a href="" target="_blank">🗂️ VAB Training (Under Construction)</a> </p>

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

VisualAgentBench (VAB) is the first benchmark designed to systematically evaluate and develop large multimodal models (LMMs) as visual foundation agents. It comprises 5 distinct environments across 3 types of representative visual agent tasks: Embodied, GUI, and Visual Design.

https://github.com/user-attachments/assets/4a1a5980-48f9-4a70-a900-e5f58ded69b4

Compared to its predecessor AgentBench, VAB highlights visual inputs and enables the development of foundation agent capabilities by training open LLMs/LMMs on trajectories.

Table of Contents

- Dataset Summary
- Leaderboard
- Quick Start
- Acknowledgement
- Citation

Dataset Summary

We offer two splits for each dataset: Testing and Training. Unlike its predecessor AgentBench, VAB is accompanied by a trajectory training set for behavior cloning (BC), which allows the development of more capable visual foundation agents with emerging open LMMs.
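To make the behavior cloning setup concrete, below is a minimal sketch of how per-step trajectory records could be turned into a supervised fine-tuning dataset. The file name, field names, and tokenizer interface are illustrative assumptions, not the actual VAB training code.

```python
# Minimal behavior-cloning sketch. File/field names are hypothetical,
# NOT the official VAB data format or training pipeline.
import json
import torch
from torch.utils.data import Dataset


class TrajectoryDataset(Dataset):
    """Assumes each JSONL line holds one expert step: {"observation": str, "action": str}."""

    def __init__(self, path, tokenizer, max_len=1024):
        with open(path) as f:
            self.samples = [json.loads(line) for line in f]
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        s = self.samples[idx]
        # Concatenate observation and expert action; training on next-token
        # prediction over this sequence is plain behavior cloning via SFT.
        text = s["observation"] + "\n" + s["action"] + self.tokenizer.eos_token
        enc = self.tokenizer(
            text, truncation=True, max_length=self.max_len, return_tensors="pt"
        )
        input_ids = enc["input_ids"].squeeze(0)
        return {"input_ids": input_ids, "labels": input_ids.clone()}
```

A dataset like this could then be fed to any standard causal-LM fine-tuning loop; refer to the VAB Training link above for the official pipeline once it is released.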

Leaderboard

Below are the test-set results of VAB. All metrics are task Success Rate (SR). Note that proprietary LMMs are tested with prompting only, while open LMMs are tested after multitask fine-tuning on the VAB training set, as they usually fail to follow complicated agent task instructions.
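Since every leaderboard metric is task Success Rate, here is a minimal sketch of how SR could be aggregated from per-task results. The record format and the environment name in the docstring are illustrative assumptions, not actual VAB outputs.

```python
# Minimal sketch of the Success Rate (SR) metric: the fraction of test tasks
# an agent completes. Result records here are hypothetical examples.
from collections import defaultdict


def success_rate(results):
    """results: list of dicts like {"env": "VAB-OmniGibson", "success": True}."""
    per_env = defaultdict(list)
    for r in results:
        per_env[r["env"]].append(1.0 if r["success"] else 0.0)
    # SR per environment, plus the unweighted average across environments.
    env_sr = {env: sum(v) / len(v) for env, v in per_env.items()}
    overall = sum(env_sr.values()) / len(env_sr)
    return env_sr, overall
```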

Quick Start

TODO

Acknowledgement

This project is heavily built upon the following repositories (to be updated):

Citation

@article{liu2024visualagentbench,
  title={VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents},
  author={Liu, Xiao and Zhang, Tianjie and Gu, Yu and Iong, Iat Long and Xu, Yifan and Song, Xixuan and Zhang, Shudan and Lai, Hanyu and Liu, Xinyi and Zhao, Hanlin and others},
  journal={arXiv preprint arXiv:2408.06327},
  year={2024}
}