Awesome Computer Use Agents
This is a curated collection of resources on computer-use agents, including videos, blogs, papers, and projects. The repository is still under construction and will be updated continuously. Contributions and feedback are welcome as we expand the collection!
Table of Contents
- Videos
- Blogs
- Papers
  - Modeling/Framework
  - Grounding
  - Agent Data
  - Evaluation
  - Safety
Videos
- Claude | Computer use for automating operations
- Claude | Computer use for coding
- Claude | Computer use for orchestrating tasks
Blogs
- Bill Gates | AI is about to completely change how you use computers
- Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku
- Simon Willison | Initial explorations of Anthropic’s new Computer Use capability
- Ethan Mollick | When you give a Claude a mouse
- Nathan Lambert | Claude's agentic future and the current state of the frontier models
- Computer Use by Anthropic: A 5-Minute Setup Guide and Demo
- Automating macOS using Claude Computer Use
- Mind-Blowing Experience with Claude Computer Use
- Instant Claude Computer Use Demo
- Notes on Anthropic’s Computer Use Ability
- Anthropic Computer Use: Automate Your Desktop With Claude 3.5
Papers
Many of the papers here were identified and organized using the visualization tool ranpox/openreview-visualization.
Modeling/Framework
- [ICLR2025 Submission] Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
- [ICLR2025 Submission] Agent Workflow Memory
- [ICLR2025 Submission] Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation
- [ICLR2025 Submission] SpiritSight Agent: Advanced GUI Agent with One Look
- [ICLR2025 Submission] Agent S: An Open Agentic Framework that Uses Computers Like a Human
- [ICLR2025 Submission] Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
- [ICLR2025 Submission] OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning
- [ICLR2025 Submission] AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant
- [ICLR2025 Submission] Cradle: Empowering Foundation Agents towards General Computer Control
- [ICLR2025 Submission] Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation
- [ICLR2025 Submission] Simulate Before Act: Model-Based Planning for Web Agents
- [ICLR2025 Submission] Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents
- [ICLR2025 Submission] NNetscape Navigator: Complex Demonstrations for Web Agents Without a Demonstrator
- [ICLR2025 Submission] Learning to Contextualize Web Pages for Enhanced Decision Making by LLM Agents
- [ICLR2025 Submission] AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents
- [ICLR2025 Submission] Digi-Q: Transforming VLMs to Device-Control Agents via Value-Based Offline RL
- [ICLR2025 Submission] The Impact of Element Ordering on LM Agent Performance
- [ICLR2025 Submission] Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems
- [ICLR2025 Submission] Tree Search for Language Model Agents
- [NeurIPS 2024] ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights
- [LLM Agents Workshop@ICLR 2024] OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
Grounding
- [ICLR2025 Submission] Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
- [ICLR2025 Submission] OS-ATLAS: Foundation Action Model for Generalist GUI Agents
- [ICLR2025 Submission] UI-Pro: A Hidden Recipe for Building Vision-Language Models for GUI Grounding
- [ICLR2025 Submission] Grounding Multimodal Large Language Model in GUI World
- [ICLR2025 Submission] OmniParser for Pure Vision Based GUI Agent
- [ICLR2025 Submission] Ferret-UI One: Mastering Universal User Interface Understanding Across Platforms
- [ACL 2024] SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
Agent Data
- [ICLR2025 Submission] AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials
- [ICLR2025 Submission] AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs
- [ICLR2025 Submission] GUI-World: A GUI-oriented Dataset for Multimodal LLM-based Agents
- [NeurIPS 2024] Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale
Evaluation
- [ICLR2025 Submission] AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
- [ICLR2025 Submission] CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents
- [ICLR2025 Submission] Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
- [ICLR2025 Submission] AgentStudio: A Toolkit for Building General Virtual Agents
- [ICLR2025 Submission] MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents
- [NeurIPS 2024] Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
- [NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Safety
- Attacking Vision-Language Computer Agents via Pop-ups
- [ICLR2025 Submission] GuardAgent: Safeguard LLM Agent by a Guard Agent via Knowledge-Enabled Reasoning
- [ICLR2025 Submission] EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage
- Adversarial Attacks on Multimodal Agents