
<div align="center">


🛠️ Setup  |  🤖 Assistant  |  🚀 Launch Experiments  |  🔍 Analyse Results  |  <br> 🏆 Leaderboard  |  🤖 Build Your Agent  |  ↻ Reproducibility  |  💪 BrowserGym

<img src="https://github.com/user-attachments/assets/47a7c425-9763-46e5-be54-adac363be850" alt="agentlab-diagram" width="700"/>

Demo solving tasks:

</div>

> [!WARNING]
> AgentLab is meant to provide an open, easy-to-use and extensible framework to accelerate the field of web agent research. It is not meant to be a consumer product. Use with caution!

AgentLab is a framework for developing and evaluating agents on a variety of benchmarks supported by BrowserGym. It is presented in more detail in our BrowserGym ecosystem paper.

AgentLab Features:

🎯 Supported Benchmarks

| Benchmark | Setup <br> Link | # Task <br> Template | Seed <br> Diversity | Max <br> Step | Multi-tab | Hosted Method | BrowserGym <br> Leaderboard |
|---|---|---|---|---|---|---|---|
| WebArena | setup | 812 | None | 30 | yes | self hosted (docker) | soon |
| WorkArena L1 | setup | 33 | High | 30 | no | demo instance | soon |
| WorkArena L2 | setup | 341 | High | 50 | no | demo instance | soon |
| WorkArena L3 | setup | 341 | High | 50 | no | demo instance | soon |
| WebLinx | - | 31586 | None | 1 | no | self hosted (dataset) | soon |
| VisualWebArena | setup | 910 | None | 30 | yes | self hosted (docker) | soon |
| AssistantBench | setup | 214 | None | 30 | yes | live web | soon |
| GAIA (soon) | - | - | None | - | - | live web | soon |
| Mind2Web-live (soon) | - | - | None | - | - | live web | soon |
| MiniWoB | setup | 125 | Medium | 10 | no | self hosted (static files) | soon |

🛠️ Setup AgentLab

pip install agentlab

If not done already, install Playwright:

playwright install

Make sure to prepare the required benchmark according to the instructions provided in the setup column.

export AGENTLAB_EXP_ROOT=<root directory of experiment results>  # defaults to $HOME/agentlab_results
export OPENAI_API_KEY=<your openai api key> # if openai models are used
<details> <summary>Setup OpenRouter API</summary>
export OPENROUTER_API_KEY=<your openrouter api key> # if openrouter models are used
</details> <details> <summary>Setup Azure API</summary>
export AZURE_OPENAI_API_KEY=<your azure api key> # if using azure models
export AZURE_OPENAI_ENDPOINT=<your endpoint> # if using azure models
</details>
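As a quick sanity check that these variables are visible to Python (names taken from the exports above), you can run:

```python
import os

# Report which AgentLab-related environment variables are currently set.
for var in (
    "AGENTLAB_EXP_ROOT",
    "OPENAI_API_KEY",
    "OPENROUTER_API_KEY",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
):
    print(f"{var}: {'set' if os.environ.get(var) else 'not set'}")
```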

🤖 UI-Assistant

Use an assistant to work for you (at your own cost and risk).

agentlab-assistant --start_url https://www.google.com

Try your own agent:

agentlab-assistant --agent_config="module.path.to.your.AgentArgs"
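The value passed to `--agent_config` is an import path resolved at runtime. A minimal sketch, assuming a hypothetical module `my_agents.py` placed on your `PYTHONPATH` that simply reuses a built-in configuration:

```python
# my_agents.py -- hypothetical module on your PYTHONPATH
from agentlab.agents.generic_agent import AGENT_4o_MINI

# Expose your configuration under a name the CLI can import:
#   agentlab-assistant --agent_config="my_agents.MY_AGENT"
MY_AGENT = AGENT_4o_MINI
```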

🚀 Launch experiments

# Import your agent configuration, which extends the bgym.AgentArgs class
# Make sure this object is imported from a module accessible on PYTHONPATH so it can be properly unpickled
from agentlab.agents.generic_agent import AGENT_4o_MINI 

from agentlab.experiments.study import make_study

study = make_study(
    benchmark="miniwob",  # or "webarena", "workarnea_l1" ...
    agent_args=[AGENT_4o_MINI],
    comment="My first study",
)

study.run(n_jobs=5)
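To compare several agents in the same study, pass them all in `agent_args`. A minimal sketch, assuming `AGENT_4o` is exported by `agentlab.agents.generic_agent` alongside `AGENT_4o_MINI` (adjust to the configurations your version ships):

```python
from agentlab.agents.generic_agent import AGENT_4o, AGENT_4o_MINI  # AGENT_4o assumed available
from agentlab.experiments.study import make_study

# One study, two agent configurations: every task is run with both agents,
# which is the basic setup for head-to-head comparisons or ablations.
study = make_study(
    benchmark="miniwob",
    agent_args=[AGENT_4o_MINI, AGENT_4o],
    comment="4o vs 4o-mini on MiniWoB",
)
study.run(n_jobs=5)
```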

Relaunching incomplete or errored tasks

from agentlab.experiments.study import Study
study = Study.load("/path/to/your/study/dir")
study.find_incomplete(include_errors=True)
study.run()

See main.py to launch experiments with a variety of options. It works like a lazy CLI that is often more convenient: comment and uncomment the lines you need, or modify them at will (just don't push your changes to the repo).

Job Timeouts

The complexity of the wild web, Playwright, and asyncio can sometimes cause jobs to hang, which ties up a worker until the study is terminated and relaunched. If you are running jobs sequentially or with a small number of workers, this can halt your entire study until you manually kill and relaunch it. For the Ray parallel backend, we've implemented a system that automatically terminates jobs exceeding a specified timeout. This feature is particularly useful when hanging tasks would otherwise stall your experiments.
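As a sketch, selecting the Ray backend when running a study might look like the following; the `parallel_backend` argument is an assumption about `Study.run`'s signature, so check the `Study` class in your version before relying on it:

```python
from agentlab.agents.generic_agent import AGENT_4o_MINI
from agentlab.experiments.study import make_study

study = make_study(benchmark="miniwob", agent_args=[AGENT_4o_MINI])

# `parallel_backend="ray"` is assumed here; with the Ray backend, jobs that
# exceed the timeout are terminated and their workers are freed.
study.run(n_jobs=20, parallel_backend="ray")
```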

Debugging

For debugging, run experiments with n_jobs=1 and use VSCode's debug mode. This allows you to pause execution at breakpoints.
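For example, reusing the `study` object from the launch example above, a debugging run is simply:

```python
# A single job runs tasks one after the other, so breakpoints set in your
# agent code are hit in a predictable order under VSCode's debugger.
study.run(n_jobs=1)
```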

About Parallel Jobs

Running one agent on one task corresponds to a single job. Conducting ablation studies or random searches across hundreds of tasks with multiple seeds can generate more than 10,000 jobs. Efficient parallel execution is therefore critical. Agents typically wait for responses from the LLM server or updates from the web server. As a result, you can run 10–50 jobs in parallel on a single computer, depending on available RAM.

⚠️ Note for (Visual)WebArena: These benchmarks have task dependencies designed to minimize "corrupting" the instance between tasks. For example, an agent on task 323 could alter the instance state, making task 201 impossible. To address this, the Ray backend accounts for task dependencies, enabling some degree of parallelism. On WebArena, you can disable dependencies to increase parallelism, but this might reduce performance by 1–2%.

⚠️ Instance Reset for (Visual)WebArena: Before evaluating an agent, the instance is automatically reset, a process that takes about 5 minutes. When evaluating multiple agents, the make_study function returns a SequentialStudies object to ensure proper sequential evaluation of each agent. AgentLab currently does not support evaluations across multiple instances, but you could either create a quick script to handle this or submit a PR to AgentLab. For a smoother parallel experience, consider using benchmarks like WorkArena instead.

🔍 Analyse Results

Loading Results

The class ExpResult provides a lazy loader for all the information of a specific experiment. You can use yield_all_exp_results to recursively find all results in a directory. Finally, load_result_df gathers all the summary information into a single dataframe. See inspect_results.ipynb for example usage.

import bgym  # provides ExpResult used below

from agentlab.analyze import inspect_results

# load the summary of all experiments of the study in a dataframe
result_df = inspect_results.load_result_df("path/to/your/study")

# load the detailed results of the 1st experiment
exp_result = bgym.ExpResult(result_df["exp_dir"][0])
step_0_screenshot = exp_result.screenshots[0]
step_0_action = exp_result.steps_info[0].action
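To walk over every experiment of a study instead of the summary dataframe, `yield_all_exp_results` (mentioned above) can be used roughly as follows; the import path is an assumption, so check `inspect_results.ipynb` for where it lives in your version:

```python
# Assumed import path -- verify against inspect_results.ipynb.
from agentlab.analyze.inspect_results import yield_all_exp_results

# Recursively discover every experiment directory under the study and lazily
# load per-step information, using the same attributes as the snippet above.
for exp_result in yield_all_exp_results("path/to/your/study"):
    first_action = exp_result.steps_info[0].action
    print(exp_result.exp_dir, first_action)
```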

AgentXray

https://github.com/user-attachments/assets/06c4dac0-b78f-45b7-9405-003da4af6b37

In a terminal, execute:

agentlab-xray

You can load previous or ongoing experiments in the directory AGENTLAB_EXP_ROOT and visualize the results in a gradio interface.

Select, in the following order: the experiment you want to visualize, the agent, the task, and the seed. Once selected, you can see the trace of your agent on the given task. Click on the profiling image to select a step and observe the action taken by the agent.

⚠️ Note: Gradio is still evolving, and unexpected behavior is frequently observed. Version 5.5 seems to work properly so far. If you're not sure the proper information is displayed, refresh the page and select your experiment again.

🏆 Leaderboard

Official unified leaderboard across all benchmarks.

Experiments using GenericAgent are underway to provide more reference points. We are also working on code to automatically push a study to the leaderboard.

🤖 Implement a new Agent

Get inspiration from the MostBasicAgent in agentlab/agents/most_basic_agent/most_basic_agent.py. For better integration with the tools, make sure to implement most functions of the AgentArgs API and the extended bgym.AbstractAgentArgs.
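For orientation, an agent skeleton typically looks something like the sketch below. It is a hypothetical reduction of most_basic_agent.py: the class names, the `send_msg_to_user` action, and the plain info dict returned by `get_action` are illustrative, so treat that file as the authoritative reference.

```python
from dataclasses import dataclass

import bgym


class EchoAgent(bgym.Agent):
    """Toy agent that ignores the observation and just messages the user."""

    def get_action(self, obs):
        # Return an action string from the benchmark's action set, plus extra
        # info (prompts, reasoning, ...) that is saved for later analysis.
        action = 'send_msg_to_user("Hello from EchoAgent!")'
        return action, {}


@dataclass
class EchoAgentArgs(bgym.AbstractAgentArgs):
    agent_name: str = "EchoAgent"

    def make_agent(self) -> bgym.Agent:
        return EchoAgent()
```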

If you think your agent should be included directly in AgentLab, let us know and it can be added in agentlab/agents/ with the name of your agent.

↻ Reproducibility

Several factors can influence reproducibility of results in the context of evaluating agents on dynamic benchmarks.

Factors affecting reproducibility

Reproducibility Features

Misc

If you want to download HF models more quickly:

pip install hf-transfer
pip install torch
export HF_HUB_ENABLE_HF_TRANSFER=1