OmniParser: Revolutionizing Screen Understanding for Vision-Based GUI Agents

OmniParser parses interface screenshots into structured data so that vision-based GUI agents can ground their actions in specific regions of the screen. It is aimed at AI researchers and developers building GUI automation solutions.

What is this project

OmniParser is a screen parsing tool for pure vision-based GUI agents. It parses user interface screenshots into structured, easy-to-understand elements, which improves the ability of vision models such as GPT-4V to generate actions that are accurately grounded in the corresponding regions of the interface.
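
To make "grounding" concrete, here is a minimal sketch of how an agent might turn a parsed element's bounding box into a click position. The `bbox_center_px` helper and the normalized `(x1, y1, x2, y2)` box format are assumptions made for illustration; they are not taken from OmniParser's documented output.

```python
from typing import Tuple

# Hypothetical box format: (x1, y1, x2, y2), each normalized to [0, 1].
BBox = Tuple[float, float, float, float]


def bbox_center_px(bbox: BBox, width: int, height: int) -> Tuple[int, int]:
    """Return the pixel at the center of a normalized bounding box."""
    x1, y1, x2, y2 = bbox
    if not (0.0 <= x1 <= x2 <= 1.0 and 0.0 <= y1 <= y2 <= 1.0):
        raise ValueError(f"bbox is not a valid normalized box: {bbox}")
    # Scale the normalized center to pixels and keep it inside the screen.
    cx = min(round((x1 + x2) / 2 * width), width - 1)
    cy = min(round((y1 + y2) / 2 * height), height - 1)
    return cx, cy


# An element covering the lower-middle of a 1920x1080 screenshot.
print(bbox_center_px((0.25, 0.5, 0.75, 1.0), 1920, 1080))  # → (960, 810)
```

Clicking the box center is only one possible policy; an agent could equally pick another point inside the box.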

Main features

  • Screen element detection and parsing into structured data
  • Prediction of whether each screen element is interactable
  • Functional descriptions of icons
  • Fine-grained detection of small icons
  • Local trajectory logging for building training data pipelines
  • Integration with OmniTool for Windows 11 VM control
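
As a rough illustration of what "structured data" could look like downstream, the sketch below models a parsed element and renders the interactable ones as numbered lines for a vision model's prompt. The `ScreenElement` fields and the `describe_interactable` helper are hypothetical and chosen for this example; OmniParser's actual output schema may differ.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class ScreenElement:
    """Hypothetical record for one parsed screen element."""
    element_id: int
    kind: str                                  # e.g. "icon" or "text"
    bbox: Tuple[float, float, float, float]    # assumed normalized x1, y1, x2, y2
    interactable: bool
    description: str


def describe_interactable(elements: Iterable[ScreenElement]) -> str:
    """Render interactable elements as one '[id] kind: description' line each."""
    lines = [
        f"[{el.element_id}] {el.kind}: {el.description}"
        for el in elements
        if el.interactable
    ]
    return "\n".join(lines)


elements = [
    ScreenElement(0, "icon", (0.90, 0.02, 0.95, 0.08), True, "Settings gear"),
    ScreenElement(1, "text", (0.10, 0.10, 0.60, 0.15), False, "Welcome back"),
    ScreenElement(2, "text", (0.40, 0.80, 0.60, 0.86), True, "Sign in"),
]
print(describe_interactable(elements))
```

The model can then answer with an element ID rather than raw coordinates, and the agent looks up the matching bounding box.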

Target audience

  • AI researchers working on vision-based agents
  • Developers building GUI automation tools
  • Teams creating training data pipelines for GUI interaction agents

How to use it

Installation

After cloning the repository, create a conda environment and install the dependencies:

cd OmniParser
conda create -n "omni" python==3.12
conda activate omni
pip install -r requirements.txt

Download the model weights:

# download the model checkpoints to local directory OmniParser/weights/
for f in icon_detect/{train_args.yaml,model.pt,model.yaml} icon_caption/{config.json,generation_config.json,model.safetensors}; do huggingface-cli download microsoft/OmniParser-v2.0 "$f" --local-dir weights; done
mv weights/icon_caption weights/icon_caption_florence
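
Before starting the demo, it can help to confirm that the download and rename left every file where it is expected. The following is a minimal sketch that checks only the paths named in the commands above; the `missing_weight_files` helper is a hypothetical name and is not part of the OmniParser codebase.

```python
from pathlib import Path
from typing import List

# Paths relative to the weights directory, after icon_caption has been
# renamed to icon_caption_florence.
EXPECTED_WEIGHT_FILES = (
    "icon_detect/train_args.yaml",
    "icon_detect/model.pt",
    "icon_detect/model.yaml",
    "icon_caption_florence/config.json",
    "icon_caption_florence/generation_config.json",
    "icon_caption_florence/model.safetensors",
)


def missing_weight_files(weights_dir: str = "weights") -> List[str]:
    """Return the expected weight files that are absent under weights_dir."""
    root = Path(weights_dir)
    return [rel for rel in EXPECTED_WEIGHT_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_weight_files()
    if missing:
        print("Missing weight files:")
        for rel in missing:
            print(f"  weights/{rel}")
    else:
        print("All expected weight files are present.")
```

Run it from the OmniParser directory so that the relative `weights` path resolves correctly.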

Running the demo

Explore examples in demo.ipynb or run the Gradio demo:

python gradio_demo.py

Project URL/repository

Use cases/application scenarios

  • Enhancing vision model capabilities for UI interaction
  • Automating GUI testing and interaction
  • Building training data pipelines for domain-specific agents
  • Multi-agent orchestration for complex UI tasks
  • Integration with models from providers such as OpenAI (4o/o1/o3-mini), DeepSeek (R1), and Qwen (2.5VL), or with Anthropic Computer Use
  • GUI navigation and task automation
  • Element detection and grounding for UI accessibility
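
For the training-data use case, a local trajectory log can be as simple as one JSON object per agent step, appended to a file. The sketch below is a hypothetical illustration: the record fields and the `append_step` and `read_steps` helpers are assumptions made for this example and do not reflect the logging format used by OmniTool.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def append_step(log_path: str, step: Dict[str, Any]) -> None:
    """Append one agent step to a JSON Lines trajectory file."""
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(step, ensure_ascii=False) + "\n")


def read_steps(log_path: str) -> List[Dict[str, Any]]:
    """Load every logged step, skipping blank lines."""
    with Path(log_path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    # Illustrative record; the field names are made up for this sketch.
    append_step(
        "trajectories/session.jsonl",
        {"step": 0, "screenshot": "step_000.png", "action": "click", "element_id": 2},
    )
    print(read_steps("trajectories/session.jsonl"))
```

An append-only, line-per-step file keeps earlier steps readable even if a session ends abruptly, which is convenient when the log later feeds a training pipeline.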