Awesome

OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent

📢 [Project Page] [Blog Post] [Models] huggingface space

OmniParser is a comprehensive method for parsing user interface screenshots into structured and easy-to-understand elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface.

News

[2024/11/26] We release an updated version, OmniParser V1.5 which features 1) more fine grained/small icon detection, 2) prediction of whether each screen element is interactable or not. Examples in the demo.ipynb.
[2024/10] OmniParser was the #1 trending model on huggingface model hub (starting 10/29/2024).
[2024/10] Feel free to checkout our demo on huggingface space! (stay tuned for OmniParser + Claude Computer Use)
[2024/10] Both Interactive Region Detection Model and Icon functional description model are released! Hugginface models
[2024/09] OmniParser achieves the best performance on Windows Agent Arena!

Install

Install environment:

conda create -n "omni" python==3.12
conda activate omni
pip install -r requirements.txt

Then download the model ckpts files in: https://huggingface.co/microsoft/OmniParser, and put them under weights/, default folder structure is: weights/icon_detect, weights/icon_caption_florence, weights/icon_caption_blip2.

For v1: convert the safetensor to .pt file.

python weights/convert_safetensor_to_pt.py

For v1.5: 
download 'model_v1_5.pt' from https://huggingface.co/microsoft/OmniParser/tree/main/icon_detect_v1_5, make a new dir: weights/icon_detect_v1_5, and put it inside the folder. No weight conversion is needed.

Examples:

We put together a few simple examples in the demo.ipynb.

Gradio Demo

To run gradio demo, simply run:

# For v1
python gradio_demo.py --icon_detect_model weights/icon_detect/best.pt --icon_caption_model florence2
# For v1.5
python gradio_demo.py --icon_detect_model weights/icon_detect_v1_5/model_v1_5.pt --icon_caption_model florence2

Model Weights License

For the model checkpoints on huggingface model hub, please note that icon_detect model is under AGPL license since it is a license inherited from the original yolo model. And icon_caption_blip2 & icon_caption_florence is under MIT license. Please refer to the LICENSE file in the folder of each model: https://huggingface.co/microsoft/OmniParser.

📚 Citation

Our technical report can be found here. If you find our work useful, please consider citing our work:

@misc{lu2024omniparserpurevisionbased,
      title={OmniParser for Pure Vision Based GUI Agent}, 
      author={Yadong Lu and Jianwei Yang and Yelong Shen and Ahmed Awadallah},
      year={2024},
      eprint={2408.00203},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.00203}, 
}