OmniParser: Revolutionizing Screen Understanding for Vision-Based GUI Agents

What is this project

OmniParser is a screen parsing tool for pure vision-based GUI agents. It converts user interface screenshots into structured, easy-to-understand elements, which helps vision models such as GPT-4V generate actions that are accurately grounded in the corresponding regions of the interface.

Main features

  • Screen element detection and parsing into structured data
  • Prediction of whether each screen element is interactable
  • Functional descriptions of icons
  • Fine-grained detection of small icons
  • Local trajectory logging for building training data pipelines
  • Integration with OmniTool for Windows 11 VM control
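
To make "structured data" concrete, the following is a minimal sketch of how a consumer might represent and use parsed elements. The ParsedElement record, its field names, and the normalized (x1, y1, x2, y2) box convention are assumptions made for illustration, not OmniParser's actual output schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ParsedElement:
    """Hypothetical record for one detected screen element (illustrative only)."""
    kind: str                                # e.g. "icon" or "text"
    bbox: Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2), normalized to [0, 1]
    interactable: bool                       # predicted interactability
    description: str                         # functional description or recognized text


def interactable_elements(elements: List[ParsedElement]) -> List[ParsedElement]:
    """Keep only the elements predicted to be interactable."""
    return [e for e in elements if e.interactable]


def to_pixels(bbox, width: int, height: int) -> Tuple[int, int, int, int]:
    """Convert a normalized box to pixel coordinates for a given screenshot size."""
    x1, y1, x2, y2 = bbox
    return (round(x1 * width), round(y1 * height),
            round(x2 * width), round(y2 * height))


def click_point(bbox, width: int, height: int) -> Tuple[int, int]:
    """Return the pixel center of a box, a natural target for a click action."""
    x1, y1, x2, y2 = to_pixels(bbox, width, height)
    return ((x1 + x2) // 2, (y1 + y2) // 2)
```

An agent built this way would filter to interactable elements first, then convert the chosen element's box to a click point at the screenshot's resolution.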

Target audience

  • AI researchers working on vision-based agents
  • Developers building GUI automation tools
  • Teams creating training data pipelines for GUI interaction agents

How to use it

Installation

After cloning the repository, create a conda environment and install the dependencies:

cd OmniParser
conda create -n "omni" python==3.12
conda activate omni
pip install -r requirements.txt

Download the model weights:

# Download the model checkpoints to the local directory OmniParser/weights/
for f in icon_detect/{train_args.yaml,model.pt,model.yaml} \
         icon_caption/{config.json,generation_config.json,model.safetensors}; do
  huggingface-cli download microsoft/OmniParser-v2.0 "$f" --local-dir weights
done
mv weights/icon_caption weights/icon_caption_florence
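
Before running the demo, it can help to confirm that the download and rename steps produced the expected layout. The sketch below is a small check, not part of the project: the file list is taken from the commands above, while the function name missing_weight_files is hypothetical.

```python
from pathlib import Path

# Files the download commands above are expected to leave under weights/,
# after icon_caption has been renamed to icon_caption_florence.
EXPECTED_FILES = {
    "icon_detect": ["train_args.yaml", "model.pt", "model.yaml"],
    "icon_caption_florence": ["config.json", "generation_config.json", "model.safetensors"],
}


def missing_weight_files(weights_dir="weights"):
    """Return the relative paths of expected weight files that are not present."""
    root = Path(weights_dir)
    return [
        f"{sub}/{name}"
        for sub, names in EXPECTED_FILES.items()
        for name in names
        if not (root / sub / name).is_file()
    ]
```

Calling missing_weight_files() from the OmniParser directory returns an empty list when every file is in place, and otherwise lists what still needs to be downloaded.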

Running the demo

Explore examples in demo.ipynb or run the Gradio demo:

python gradio_demo.py

Use cases/application scenarios

  • Enhancing vision model capabilities for UI interaction
  • Automating GUI testing and interaction
  • Building training data pipelines for domain-specific agents
  • Multi-agent orchestration for complex UI tasks
  • Integration with LLMs from OpenAI (4o/o1/o3-mini), DeepSeek (R1), and Qwen (2.5VL), or with Anthropic Computer Use
  • GUI navigation and task automation
  • Element detection and grounding for UI accessibility
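
One way the LLM integration scenarios above could work is to list the parsed elements by index in a text prompt, let the model pick an index, and map that choice back to a screen region. The sketch below illustrates this idea under stated assumptions: the element dictionaries, the "CLICK <id>" reply format, and both function names are hypothetical, and no real LLM API is called.

```python
import re


def build_prompt(task, elements):
    """Render a task and an indexed element list as plain text for an LLM."""
    lines = [f"Task: {task}", "Screen elements:"]
    for idx, el in enumerate(elements):
        lines.append(
            f"[{idx}] {el['kind']}: {el['description']} "
            f"(interactable={el['interactable']})"
        )
    lines.append("Reply with: CLICK <id>")
    return "\n".join(lines)


def resolve_action(reply, elements):
    """Map a 'CLICK <id>' reply back to the chosen element's bounding box.

    Returns None when the reply has no valid id, so the caller can retry.
    """
    match = re.search(r"CLICK\s+(\d+)", reply)
    if match is None:
        return None
    idx = int(match.group(1))
    if idx >= len(elements):
        return None
    return elements[idx]["bbox"]
```

Keeping the model's output to an element index, rather than raw coordinates, means the grounding step relies on the parser's boxes instead of the model's ability to predict pixel positions.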
