OmniParser

What is this project

OmniParser is a screen parsing tool for pure vision-based GUI agents. It parses user interface screenshots into structured, easy-to-understand elements, which significantly improves the ability of vision models such as GPT-4V to generate actions that are accurately grounded in the corresponding regions of the interface.

Main features

Screen element detection and parsing into structured data
Prediction of whether each screen element is interactable
Icon functional description capabilities
Fine-grained, small icon detection
Local trajectory logging for building training data pipelines
Integration with OmniTool for Windows 11 VM control
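
To make "structured data" concrete, here is a minimal sketch of how parsed elements could be represented and handed to a vision model as a numbered list. The `ParsedElement` class, its field names, and the `to_prompt_lines` helper are illustrative assumptions, not OmniParser's actual output schema; check demo.ipynb for the real format.

```python
from dataclasses import dataclass


@dataclass
class ParsedElement:
    """One UI element from a parsed screenshot (hypothetical schema)."""
    kind: str                                 # e.g. "icon" or "text"
    bbox: tuple[float, float, float, float]   # (x1, y1, x2, y2), assumed normalized to [0, 1]
    interactable: bool                        # predicted interactability
    description: str                          # functional description or recognized text


def to_prompt_lines(elements: list[ParsedElement]) -> list[str]:
    """Render interactable elements as numbered lines.

    The number is the element's index in the full list, so an ID chosen
    by the model maps straight back to the element and its bounding box.
    """
    lines = []
    for idx, el in enumerate(elements):
        if el.interactable:
            lines.append(f"[{idx}] {el.kind}: {el.description}")
    return lines


# Made-up example data, for illustration only.
elements = [
    ParsedElement("text", (0.10, 0.05, 0.40, 0.09), False, "Settings"),
    ParsedElement("icon", (0.90, 0.02, 0.95, 0.07), True, "Close window"),
    ParsedElement("icon", (0.02, 0.02, 0.06, 0.07), True, "Back"),
]
prompt_lines = to_prompt_lines(elements)
```

Keeping the original index as the ID is a deliberate choice in this sketch: non-interactable elements are hidden from the prompt without renumbering the rest.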

Target audience

AI researchers working on vision-based agents
Developers building GUI automation tools
Teams creating training data pipelines for GUI interaction agents

How to use it

Installation

git clone https://github.com/microsoft/OmniParser.git
cd OmniParser
conda create -n "omni" python==3.12
conda activate omni
pip install -r requirements.txt

Download the model weights:

# download the model checkpoints to local directory OmniParser/weights/
for f in icon_detect/{train_args.yaml,model.pt,model.yaml} icon_caption/{config.json,generation_config.json,model.safetensors}; do huggingface-cli download microsoft/OmniParser-v2.0 "$f" --local-dir weights; done
mv weights/icon_caption weights/icon_caption_florence
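
After the download and rename, it can help to confirm that every checkpoint file landed where expected. The following is a small sketch that checks for the six files named in the commands above; the `missing_weights` helper is a hypothetical name, not part of the repository.

```python
from pathlib import Path

# File layout taken from the download and rename commands above.
EXPECTED_WEIGHTS = {
    "icon_detect": ["train_args.yaml", "model.pt", "model.yaml"],
    "icon_caption_florence": [
        "config.json",
        "generation_config.json",
        "model.safetensors",
    ],
}


def missing_weights(root: Path) -> list[Path]:
    """Return the expected weight files that are not present under root."""
    return [
        root / folder / name
        for folder, names in EXPECTED_WEIGHTS.items()
        for name in names
        if not (root / folder / name).is_file()
    ]


if __name__ == "__main__":
    # Run from the OmniParser directory so that "weights" resolves correctly.
    missing = missing_weights(Path("weights"))
    if missing:
        print("Missing weight files:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All expected weight files are present.")
```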

Running the demo

Explore examples in demo.ipynb or run the Gradio demo:

python gradio_demo.py

Project URL/repository

GitHub Repository: https://github.com/microsoft/OmniParser
HuggingFace Demo: HuggingFace Space Demo
Model Weights: Models V2 (microsoft/OmniParser-v2.0 on Hugging Face), Models V1.5
Technical Report: arXiv Paper

Use cases/application scenarios

Enhancing vision model capabilities for UI interaction
Automating GUI testing and interaction
Building training data pipelines for domain-specific agents
Multi-agent orchestration for complex UI tasks
Integration with models from OpenAI (4o/o1/o3-mini), DeepSeek (R1), and Qwen (2.5VL), or with Anthropic Computer Use
GUI navigation and task automation
Element detection and grounding for UI accessibility
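
Several of the scenarios above come down to grounding: turning a detected element into a concrete screen action. As a rough sketch, the function below converts a bounding box into a pixel click point. It assumes the box is given as (x1, y1, x2, y2) ratios of the screenshot size, which is an assumption made for illustration; `click_point` is a hypothetical helper, not an OmniParser API.

```python
def click_point(
    bbox: tuple[float, float, float, float],
    width: int,
    height: int,
) -> tuple[int, int]:
    """Return the pixel at the center of a normalized bounding box.

    bbox is (x1, y1, x2, y2) with every value in [0, 1] (assumed format).
    The result is clamped so it always lies inside the screenshot.
    """
    x1, y1, x2, y2 = bbox
    if not (0.0 <= x1 <= x2 <= 1.0 and 0.0 <= y1 <= y2 <= 1.0):
        raise ValueError(f"bbox is not a valid normalized box: {bbox}")
    cx = (x1 + x2) / 2 * width
    cy = (y1 + y2) / 2 * height
    # Valid pixel indices run from 0 to width - 1 and 0 to height - 1.
    return min(int(cx), width - 1), min(int(cy), height - 1)


# Example: an element covering the lower-middle of a 1920x1080 screenshot.
point = click_point((0.25, 0.5, 0.75, 1.0), 1920, 1080)
```

Clicking the box center is the simplest choice; an agent that needs to hit a specific part of an element (a dropdown arrow, for example) would need a more specific target.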