OmniParser: Revolutionizing Screen Understanding for Vision-Based GUI Agents
June 03, 2025
What is this project
OmniParser is a comprehensive screen parsing tool for pure vision-based GUI agents. It parses user interface screenshots into structured, easy-to-understand elements, which significantly improves the ability of vision models such as GPT-4V to generate actions accurately grounded in the corresponding regions of the interface.
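To make "structured elements" concrete, here is a minimal sketch of the kind of per-element record such a parse can yield (a bounding box, an interactability flag, and OCR text or a generated caption), plus how an agent might ground a click to a region. The field names are illustrative assumptions, not OmniParser's verbatim output schema.

# Sketch: the kind of per-element record a screen parse yields.
# Field names are illustrative assumptions, not the exact schema.
from dataclasses import dataclass

@dataclass
class ScreenElement:
    type: str            # e.g. "text" or "icon"
    bbox: tuple          # (x1, y1, x2, y2) in normalized [0, 1] coordinates
    interactable: bool   # predicted interactability
    content: str         # OCR text or a generated icon description

def click_target(el: ScreenElement, width: int, height: int) -> tuple:
    # Ground an action by converting the bbox into its pixel-space center.
    x1, y1, x2, y2 = el.bbox
    return (int((x1 + x2) / 2 * width), int((y1 + y2) / 2 * height))

button = ScreenElement("icon", (0.42, 0.88, 0.58, 0.94), True, "Submit button")
print(click_target(button, 1920, 1080))  # -> (960, 982)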
Main features
- Screen element detection and parsing into structured data
- Prediction of whether each detected element is interactable (see the sketch after this list)
- Icon functional description capabilities
- Fine-grained, small icon detection
- Local trajectory logging for building training data pipelines
- Integration with OmniTool for Windows 11 VM control
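Putting the interactability prediction and icon descriptions together, one illustrative pattern (not the project's exact prompt format) is to number the interactable elements and hand the resulting list to an LLM, which can then refer to a target by index:

# Sketch: turn parsed elements into a numbered list for an LLM prompt.
# Reuses the illustrative ScreenElement records from the sketch above.
def elements_to_prompt(elements) -> str:
    lines = [
        f"[{i}] {el.type}: {el.content} @ {el.bbox}"
        for i, el in enumerate(e for e in elements if e.interactable)
    ]
    return "Interactable elements:\n" + "\n".join(lines)

The model's reply (an element index) maps straight back to a bounding box, which is what makes the grounding step reliable.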
Target audience
- AI researchers working on vision-based agents
- Developers building GUI automation tools
- Teams creating training data pipelines for GUI interaction agents
How to use it
Installation
cd OmniParser
conda create -n "omni" python==3.12
conda activate omni
pip install -r requirements.txt
Download the model weights:
# download the model checkpoints to local directory OmniParser/weights/
for f in icon_detect/{train_args.yaml,model.pt,model.yaml} \
         icon_caption/{config.json,generation_config.json,model.safetensors}; do
  huggingface-cli download microsoft/OmniParser-v2.0 "$f" --local-dir weights
done
mv weights/icon_caption weights/icon_caption_florence
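To sanity-check the download, you can try loading both checkpoints. The sketch below assumes the icon detector is an Ultralytics YOLO model and the captioner a Florence-2 model loaded through transformers, which matches the file layout above; treat the exact calls as an assumption rather than the repo's canonical loading code.

# Sketch: verify the downloaded weights load (assumed YOLO + Florence-2).
# ultralytics and transformers come in via requirements.txt.
from ultralytics import YOLO
from transformers import AutoModelForCausalLM, AutoProcessor

detector = YOLO("weights/icon_detect/model.pt")
captioner = AutoModelForCausalLM.from_pretrained(
    "weights/icon_caption_florence", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-base", trust_remote_code=True
)
print(type(detector).__name__, type(captioner).__name__)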
Running the demo
Explore the examples in demo.ipynb, or run the Gradio demo:
python gradio_demo.py
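Once the demo is up (Gradio serves on http://127.0.0.1:7860 by default), you can also drive it from Python with gradio_client. The endpoint name and argument order depend on how gradio_demo.py defines its interface, so the predict call below is a hypothetical placeholder; inspect view_api() first.

# Sketch: call the local Gradio demo programmatically.
from gradio_client import Client, handle_file

client = Client("http://127.0.0.1:7860")
client.view_api()  # prints the app's real endpoint names and parameters
# Hypothetical call; replace api_name/arguments with what view_api() shows:
# result = client.predict(handle_file("screenshot.png"), api_name="/process")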
Project URL/repository
- GitHub Repository: https://github.com/microsoft/OmniParser
- HuggingFace Demo: https://huggingface.co/spaces/microsoft/OmniParser-v2
- Model Weights: https://huggingface.co/microsoft/OmniParser-v2.0 (V2), https://huggingface.co/microsoft/OmniParser (V1.5)
- Technical Report: https://arxiv.org/abs/2408.00203
Use cases/application scenarios
- Enhancing vision model capabilities for UI interaction
- Automating GUI testing and interaction
- Building training data pipelines for domain-specific agents
- Multi-agent orchestration for complex UI tasks
- Integration with LLMs like OpenAI (4o/o1/o3-mini), DeepSeek (R1), Qwen (2.5VL), or Anthropic Computer Use
- GUI navigation and task automation (see the agent-step sketch after this list)
- Element detection and grounding for UI accessibility
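As a capstone for the navigation and automation scenarios above, a single agent step can be sketched as: capture a screenshot, parse it into elements, let an LLM pick a target, and click it. parse_screen and choose_element below are hypothetical placeholders for OmniParser's pipeline and your LLM of choice; pyautogui is just one common way to dispatch the click.

# Sketch of one agent step; parse_screen and choose_element are
# hypothetical placeholders, not functions shipped by OmniParser.
import pyautogui  # pip install pyautogui

def parse_screen(path: str):
    raise NotImplementedError("wire up OmniParser's parsing pipeline here")

def choose_element(goal: str, elements):
    raise NotImplementedError("wire up your LLM's element selection here")

def agent_step(goal: str):
    pyautogui.screenshot().save("screen.png")  # capture the screen
    elements = parse_screen("screen.png")      # -> structured elements
    el = choose_element(goal, elements)        # LLM selects a target
    x1, y1, x2, y2 = el.bbox                   # normalized coordinates
    w, h = pyautogui.size()
    pyautogui.click(int((x1 + x2) / 2 * w), int((y1 + y2) / 2 * h))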