Windows Agent Arena (WAA) 🪟 is a scalable Windows AI agent platform for testing and benchmarking multi-modal, desktop AI agents. WAA provides researchers and developers with a reproducible and realistic Windows OS environment for AI research, where agentic AI workflows can be tested across a diverse range of tasks.

WAA supports the deployment of agents at scale using the Azure ML cloud infrastructure, allowing for the parallel running of multiple agents and delivering quick benchmark results for hundreds of tasks in minutes, not days.

<div align="center"> <video src="https://github.com/user-attachments/assets/e0a8d88d-d28a-493d-b74f-2455f36c21f1" alt="waa_intro"> </div>

📢 Updates

📚 Citation

Our technical report can be found here. If you find this environment useful, please consider citing our work:

@article{bonatti2024windows,
author = {Bonatti, Rogerio and Zhao, Dan and Bonacci, Francesco and Dupont, Dillon and Abdali, Sara and Li, Yinheng and Wagle, Justin and Koishida, Kazuhito and Bucker, Arthur and Jang, Lawrence and Hui, Zack},
title = {Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale},
institution = {Microsoft},
year = {2024},
month = {September},
}

☝️ Pre-requisites:

<div align="center"> <img src="img/main.png" alt="main" height="200"/> </div>

Clone the repository and install dependencies:

git clone https://github.com/microsoft/WindowsAgentArena.git
cd WindowsAgentArena
# Install the required dependencies in your python environment
# conda activate winarena
pip install -r requirements.txt

💻 Local deployment (WSL or Linux)

1. Configuration file

Create a new config.json at the root of the project with the keys for your endpoint: OPENAI_API_KEY if you are using an OpenAI endpoint, or AZURE_API_KEY and AZURE_ENDPOINT if you are using an Azure endpoint:

{
    "OPENAI_API_KEY": "<OPENAI_API_KEY>",
    "AZURE_API_KEY": "<AZURE_API_KEY>",
    "AZURE_ENDPOINT": "https://yourendpoint.openai.azure.com/"
}
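
Before launching anything, it can help to check which endpoint the file is actually configured for. The following is a minimal sketch, not part of the repository: the helper names `detect_endpoint` and `load_config` are hypothetical, and it assumes config.json is strict JSON (Python's `json` module rejects comments and trailing commas).

```python
import json
from pathlib import Path

# Key groups taken from the config.json example above.
OPENAI_KEYS = ("OPENAI_API_KEY",)
AZURE_KEYS = ("AZURE_API_KEY", "AZURE_ENDPOINT")


def _is_set(config: dict, key: str) -> bool:
    """A key counts as set if it is non-empty and not a <PLACEHOLDER>."""
    value = str(config.get(key, "")).strip()
    return bool(value) and not (value.startswith("<") and value.endswith(">"))


def detect_endpoint(config: dict) -> str:
    """Return "openai" or "azure" depending on which keys are filled in."""
    if all(_is_set(config, key) for key in OPENAI_KEYS):
        return "openai"
    if all(_is_set(config, key) for key in AZURE_KEYS):
        return "azure"
    raise ValueError(
        "config.json needs OPENAI_API_KEY, or AZURE_API_KEY plus AZURE_ENDPOINT"
    )


def load_config(path: str = "config.json") -> dict:
    """Parse config.json as strict JSON (hypothetical helper)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Calling `detect_endpoint(load_config())` from the project root raises a `ValueError` if the placeholders were left untouched.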

2. Prepare the Windows Arena Docker Image

2.1 Pull the WinArena-Base Image from Docker Hub

To get started, pull the base image from Docker Hub:

docker pull windowsarena/winarena-base:latest

This image includes all the necessary dependencies (such as packages and models) required to run the code in the src directory.

2.2 Build the WinArena Image Locally

Next, build the WinArena image locally:

cd scripts
./build-container-image.sh

# If there are any changes in 'Dockerfile-WinArena-Base', use the --build-base-image flag to also build the base image locally
# ./build-container-image.sh --build-base-image true

# For other build options:
# ./build-container-image.sh --help

This will create the windowsarena/winarena:latest image with the latest code from the src directory.

3. Prepare the Windows 11 VM

<div align="center"> <video src="https://github.com/user-attachments/assets/6d55b9b5-3242-49af-be20-64f2086108b9" height="500" alt="local_prepare_golden_image"> </div>

3.1 Download Windows 11 Evaluation .iso file:

  1. Visit Microsoft Evaluation Center, accept the Terms of Service, and download a Windows 11 Enterprise Evaluation (90-day trial, English, United States) ISO file [~6GB]
  2. After downloading, rename the file to setup.iso and copy it to the directory WindowsAgentArena/src/win-arena-container/vm/image
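
The rename-and-copy step above can also be scripted. This is a small sketch under stated assumptions: `stage_iso` is a hypothetical helper, and the default destination assumes you run it from the directory that contains your WindowsAgentArena clone.

```python
import shutil
from pathlib import Path

# Destination from the step above, relative to the parent of the clone.
ISO_DIR = Path("WindowsAgentArena/src/win-arena-container/vm/image")


def stage_iso(downloaded_iso: Path, image_dir: Path = ISO_DIR) -> Path:
    """Copy the downloaded ISO into image_dir under the name setup.iso."""
    if not downloaded_iso.is_file():
        raise FileNotFoundError(downloaded_iso)
    image_dir.mkdir(parents=True, exist_ok=True)
    target = image_dir / "setup.iso"
    shutil.copyfile(downloaded_iso, target)
    return target
```

For example, `stage_iso(Path.home() / "Downloads" / "<your_downloaded_file>.iso")`, where the file name is whatever the Evaluation Center gave you.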

3.2 Automatic Setup of the Windows 11 golden image:

Before running the arena, you need to prepare a new WAA snapshot (also referred to as the WAA golden image). This 30GB snapshot represents a fully functional Windows 11 VM with all the programs needed to run the benchmark. This VM additionally hosts a Python server which receives and executes agent commands. To learn more about the components at play, see our local and cloud components diagrams.

To prepare the golden image, run the following once:

cd ./scripts
./run-local.sh --prepare-image true

You can monitor progress at http://localhost:8006. The preparation process is fully automated and will take ~20 minutes.

Please do not interfere with the VM while it is being prepared. It will automatically shut down when the provisioning process is complete.

<div align="center"> <img src="img/local_prepare_screen_unattend.png" alt="local_prepare_screen_unattend" height="500"/> </div> <div align="center"> <img src="img/local_prepare_screen_setup.png" alt="local_prepare_screen_setup" height="500"/> </div>

At the end, you should expect the Docker container named winarena to terminate gracefully, as shown in the logs below.

<div align="center"> <img src="img/local_prepare_logs_successful.png" alt="local_prepare_logs_successful" height="200"/> </div> <br/>

You will find the 30GB WAA golden image in WindowsAgentArena/src/win-arena-container/vm/storage, consisting of the following files:

<div align="center"> <img src="img/local_prepare_storage_successful.png" alt="run_local_prepare_storage_successful" height="200"/> </div> <br/>
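
To confirm that the storage folder holds roughly the expected 30GB, you can list the files and sum their sizes. This is only a sketch: `summarize_storage` is a hypothetical helper and makes no assumption about the file names inside the folder.

```python
from pathlib import Path


def summarize_storage(storage_dir: Path) -> tuple[list[tuple[str, int]], int]:
    """Return (name, size_in_bytes) for each file, plus the total size."""
    files = sorted(
        (path.name, path.stat().st_size)
        for path in storage_dir.iterdir()
        if path.is_file()
    )
    total = sum(size for _, size in files)
    return files, total


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB for display."""
    return num_bytes / 1024**3
```

Point it at WindowsAgentArena/src/win-arena-container/vm/storage and compare `to_gib(total)` against the size mentioned above.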
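
Additional Notes

If the scripts fail to run because of Windows-style (CRLF) line endings, convert them to Unix line endings with dos2unix:

cd ./scripts
find . -maxdepth 1 -type f -exec dos2unix {} +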
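
The find command above runs dos2unix on every regular file directly inside scripts (it does not descend into subfolders). If dos2unix is not installed, a rough Python equivalent is sketched below. `crlf_to_lf` is a hypothetical helper, and unlike dos2unix it does not skip binary files, so only use it on a folder of text scripts.

```python
from pathlib import Path


def crlf_to_lf(directory: Path) -> list[str]:
    """Rewrite files directly inside `directory` with LF line endings.

    Returns the names of the files that were changed.
    """
    converted = []
    for path in sorted(directory.iterdir()):
        if not path.is_file():
            continue  # same effect as -maxdepth 1 -type f
        data = path.read_bytes()
        fixed = data.replace(b"\r\n", b"\n")
        if fixed != data:
            path.write_bytes(fixed)
            converted.append(path.name)
    return converted
```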

4. Deploying the agent in the arena

4.1 Running the base benchmark

You're now ready to launch the evaluation. To run the baseline agent on all benchmark tasks, run:

cd scripts
./run-local.sh
# For client/agent options:
# ./run-local.sh --help

Open http://localhost:8006 to see the Windows VM with the agent running. If your machine has enough resources, including a GPU, you can instead run the strongest agent configuration from our paper:

./run-local.sh --gpu-enabled true --som-origin mixed-omni --a11y-backend uia

At the end of the run you can display the results using the command:

cd src/win-arena-container/client
python show_results.py --result_dir <path_to_results_folder>

Available Configurations

Below is a comparison of various combinations of hyperparameters used by the Navi agent in our study, which can be overridden by specifying --som-origin <som_origin> --a11y-backend <a11y_backend> when running the run-local.sh script:

| Command | Description | Notes |
| --- | --- | --- |
| ./run-local.sh --som-origin mixed-omni --a11y-backend uia | Combines Omniparser with accessibility tree information | ⭐ Recommended for best results |
| ./run-local.sh --som-origin omni | Uses Omniparser for screen understanding | |
| ./run-local.sh --som-origin oss | Uses webparse, groundingdino, and OCR (TesseractOCR) | 🌲 Baseline |
| ./run-local.sh --som-origin a11y --a11y-backend uia | Uses slower, more accurate accessibility tree | |
| ./run-local.sh --som-origin a11y --a11y-backend win32 | Uses faster, less accurate accessibility tree | 🐇 Fastest |
| ./run-local.sh --som-origin mixed-oss --a11y-backend uia | Combines oss detections with accessibility tree | |
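
If you switch between these configurations often, a small lookup can build the command for you. The sketch below copies the flag values from the table; the dictionary keys (such as `a11y-win32`) and the function names are hypothetical labels, not options understood by run-local.sh.

```python
import shlex

# Flag combinations copied from the table above; the keys are made-up labels.
CONFIGURATIONS = {
    "mixed-omni": {"som-origin": "mixed-omni", "a11y-backend": "uia"},
    "omni": {"som-origin": "omni"},
    "oss": {"som-origin": "oss"},
    "a11y-uia": {"som-origin": "a11y", "a11y-backend": "uia"},
    "a11y-win32": {"som-origin": "a11y", "a11y-backend": "win32"},
    "mixed-oss": {"som-origin": "mixed-oss", "a11y-backend": "uia"},
}


def build_command(name: str) -> list[str]:
    """Return the run-local.sh argument list for a named configuration."""
    command = ["./run-local.sh"]
    for flag, value in CONFIGURATIONS[name].items():
        command += [f"--{flag}", value]
    return command


def as_shell(name: str) -> str:
    """Render the command as a single shell-safe string."""
    return shlex.join(build_command(name))
```

The list form can be passed to `subprocess.run(..., cwd="scripts")`, while `as_shell` is convenient for logging.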

4.2 Local development tips

At first sight, it might seem challenging to develop and debug code running inside the Docker container. However, we provide a few tips to make this process easier. Check the Development-Tips Doc for more details.

🌐 Azure Deployment -> Parallelizing the benchmark

We offer a seamless way to run the Windows Agent Arena on Azure ML Compute VMs. This option will significantly reduce the time needed to test your agent in all benchmark tasks from hours/days to minutes.

1. Set up the Azure resource group:

<div align="center"> <img src="img/azure_create_ml.png" alt="azure_create_ml" height="300"/> </div> <div align="center"> <img src="img/azure_ml_portal.png" alt="azure_ml_portal" height="200"/> </div> <div align="center"> <img src="img/azure_notebook.png" alt="azure_notebook" height="200"/> </div> <div align="center"> <img src="img/azure_quota.png" alt="azure_quota" height="200"/> </div>

2. Uploading Windows 11 and Docker images to Azure

3. Environment configurations and deployment

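Add your Azure details to the config.json file at the root of the project:

{
    ... // Your previous configs

    "AZURE_SUBSCRIPTION_ID": "<YOUR_AZURE_SUBSCRIPTION_ID>",
    "AZURE_ML_RESOURCE_GROUP": "<YOUR_AZURE_ML_RESOURCE_GROUP>",
    "AZURE_ML_WORKSPACE_NAME": "<YOUR_AZURE_ML_WORKSPACE_NAME>"
}

Describe each experiment in experiments.json:

{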
  "experiment_1": {
    "ci_startup_script_path": "Users/<YOUR_USER>/compute-instance-startup.sh", // As seen in Section 1
    "agent": "navi",
    "datastore_input_path": "storage",
    "docker_img_name": "windowsarena/winarena:latest",
    "exp_name": "experiment_1",
    "num_workers": 4,
    "use_managed_identity": false,
    "json_name": "evaluation_examples_windows/test_all.json",
    "model_name": "gpt-4-1106-vision-preview",
    "som_origin": "oss", // or a11y, or mixed-oss
    "a11y_backend": "win32" // or uia
  }
  // ...
}

The same entry can also be written to experiments.json from the command line by passing --update_json to run_azure.py:

cd scripts
python run_azure.py --experiments_json "experiments.json" --update_json --exp_name "experiment_1" --ci_startup_script_path "Users/<YOUR_USER>/compute-instance-startup.sh" --agent "navi" --json_name "evaluation_examples_windows/test_all.json" --num_workers 4 --som_origin oss --a11y_backend win32

Then log in to Azure and launch the experiments:

az login --use-device-code # https://learn.microsoft.com/en-us/cli/azure/install-azure-cli
# If multiple tenants or subscriptions, make sure to select the right ones with:
# az login --use-device-code --tenant "<YOUR_AZURE_AD_TENANT_ID>"
# az account set --subscription "<YOUR_AZURE_SUBSCRIPTION_ID>"

# Make sure you have installed the python requirements in your conda environment
# conda activate winarena
# pip install -r requirements.txt

# From your activated conda environment:
cd scripts
python run_azure.py --experiments_json "experiments.json"
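
To compare several agent settings in one Azure run, the entries of experiments.json can be generated instead of written by hand. This is a sketch rather than a repository script: the field names and values are copied from the experiment_1 example above, while `make_experiments` and the `<som_origin>_<a11y_backend>` naming scheme are assumptions made for illustration.

```python
import json
from itertools import product

# Fields shared by all experiments, copied from the experiment_1 example.
BASE = {
    "ci_startup_script_path": "Users/<YOUR_USER>/compute-instance-startup.sh",
    "agent": "navi",
    "datastore_input_path": "storage",
    "docker_img_name": "windowsarena/winarena:latest",
    "num_workers": 4,
    "use_managed_identity": False,
    "json_name": "evaluation_examples_windows/test_all.json",
    "model_name": "gpt-4-1106-vision-preview",
}


def make_experiments(som_origins, a11y_backends) -> dict:
    """Build one experiment entry per (som_origin, a11y_backend) pair."""
    experiments = {}
    for som, backend in product(som_origins, a11y_backends):
        name = f"{som}_{backend}"  # hypothetical naming scheme
        experiments[name] = {
            **BASE,
            "exp_name": name,
            "som_origin": som,
            "a11y_backend": backend,
        }
    return experiments


if __name__ == "__main__":
    sweep = make_experiments(["oss", "a11y", "mixed-oss"], ["win32", "uia"])
    print(json.dumps(sweep, indent=2))
```

Review the printed JSON before saving it as experiments.json, since every entry starts `num_workers` Compute Instance VMs.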

For any unfinished experiments in experiments.json, the script will:

  1. Create <num_workers> Azure Compute Instance VMs.
  2. Run one ML Training Job named <exp_name> per VM.
  3. Dispose of the VMs once the jobs are completed.

The logs from the run will be saved in an agent_outputs folder in the same blob container where you uploaded the Windows 11 image. You can download the agent_outputs folder to your local machine and run the show_azure.py script to see the results from every experiment as a markdown table.

cd scripts
python show_azure.py --json_config "experiments.json" --result_dir <path_to_downloaded_agent_outputs_folder>

🤖 BYOA: Bring Your Own Agent

Want to test your own agents in Windows Agent Arena? You can use our default agent as a template and create your own folder under src/win-arena-container/client/mm_agents. You just need to make sure that your agent.py file features predict() and reset() functions. For more information on agent development check out the BYOA Doc.
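
As a starting point, an agent.py could look like the sketch below. Only the two method names, predict() and reset(), come from the text above. The class name, the argument names, the shape of `obs`, and the `"WAIT"` action string are all assumptions made for illustration, so check the default agent and the BYOA Doc for the real interface.

```python
from typing import Any


class EchoAgent:
    """Hypothetical minimal agent exposing predict() and reset()."""

    def __init__(self) -> None:
        # Per-task memory of the instructions seen so far.
        self.history: list[str] = []

    def reset(self) -> None:
        """Clear per-task state before a new task starts."""
        self.history.clear()

    def predict(self, instruction: str, obs: dict[str, Any]) -> list[str]:
        """Return a list of actions for the current step (assumed signature)."""
        self.history.append(instruction)
        # A real agent would inspect obs (for example a screenshot) and
        # query a model here; this placeholder always returns one action.
        return ["WAIT"]
```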

πŸ‘©β€πŸ’» Open-source contributions

We welcome contributions to the Windows Agent Arena project. In particular, we welcome:

If you are interested in contributing, please check out our Task Development Guidelines.

❓ FAQ

What are approximate running times and costs for the benchmark?

| Component | Cost | Time |
| --- | --- | --- |
| Azure Standard_D8_v3 VM | ~$8 ($0.38/h * 40 * 0.5h) | |
| GPT-4V | $100 | ~35 min with 40 VMs |
| GPT-4o | $100 | ~35 min with 40 VMs |
| GPT-4o-mini | $15 | ~30 min with 40 VMs |
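
The VM figure in the table is the hourly rate times the number of VMs times the hours each VM runs. The sketch below reproduces that arithmetic and adds the model cost from the same table; the function and variable names are hypothetical, and actual Azure and model prices may differ from these numbers.

```python
# Approximate model costs per full benchmark run, from the table above (USD).
MODEL_COST_USD = {"GPT-4V": 100, "GPT-4o": 100, "GPT-4o-mini": 15}


def vm_cost(hourly_rate_usd: float, num_vms: int, hours_per_vm: float) -> float:
    """Compute cost: rate per VM-hour x number of VMs x hours per VM."""
    return hourly_rate_usd * num_vms * hours_per_vm


def total_cost(model: str, hourly_rate_usd: float = 0.38,
               num_vms: int = 40, hours_per_vm: float = 0.5) -> float:
    """VM cost plus the approximate model cost for one full run."""
    return vm_cost(hourly_rate_usd, num_vms, hours_per_vm) + MODEL_COST_USD[model]


if __name__ == "__main__":
    print(f"VMs only: ${vm_cost(0.38, 40, 0.5):.2f}")
    for model in MODEL_COST_USD:
        print(f"{model}: ${total_cost(model):.2f}")
```

With the table's values the VM part comes to about $7.60, which the table rounds to ~$8.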

How can I customize resource allocation for local runs?

By default, the run-local.sh script attempts to create a QEMU VM with 8 GB of RAM and 8 CPU cores. If your system has limited resources, you can override these defaults by specifying the desired RAM and CPU allocation:

./run-local.sh --ram-size 4G --cpu-cores 4
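
One way to pick these values is to derive them from what the host has. The sketch below uses an arbitrary policy, giving the VM at most half of the host's RAM and cores, and only emits a flag when the result is below the 8 GB / 8 core defaults mentioned above. `resource_flags` and the halving rule are assumptions, not behavior of run-local.sh.

```python
def resource_flags(total_ram_gb: int, cpu_count: int,
                   default_ram_gb: int = 8, default_cores: int = 8) -> list[str]:
    """Return run-local.sh flags that scale the VM down on small hosts."""
    # Hypothetical policy: leave at least half of the host to the OS.
    ram_gb = min(default_ram_gb, max(1, total_ram_gb // 2))
    cores = min(default_cores, max(1, cpu_count // 2))

    flags = []
    if ram_gb < default_ram_gb:
        flags += ["--ram-size", f"{ram_gb}G"]
    if cores < default_cores:
        flags += ["--cpu-cores", str(cores)]
    return flags
```

On Linux or WSL, `os.cpu_count()` gives the core count to pass in; the RAM figure can be read from your system monitor.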

How can I toggle support for KVM acceleration?

If your system does not support KVM acceleration, you can disable it by specifying the --use-kvm false flag:

./run-local.sh --use-kvm false

Note that running the benchmark locally without KVM acceleration is not recommended due to performance issues. In this case, we recommend only preparing the golden image locally and then running the benchmark on Azure.
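
On Linux, KVM is exposed through the /dev/kvm device, so a quick check of that path tells you whether you need the flag. This is a minimal sketch: `kvm_flags` is a hypothetical helper, and it only tests that the current user can read and write the device.

```python
import os


def kvm_flags(device: str = "/dev/kvm") -> list[str]:
    """Return ["--use-kvm", "false"] when the KVM device is not usable."""
    usable = os.access(device, os.R_OK | os.W_OK)
    # When KVM is usable, emit nothing and keep the script's default.
    return [] if usable else ["--use-kvm", "false"]
```

The returned list can be appended to the run-local.sh arguments.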

πŸ‘ Acknowledgements

🀝 Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

πŸ›‘οΈ Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.