VisionClaw: Real-Time Gemini AI Assistant for Smart Glasses

The VisionClaw project demonstrates how to turn the Meta Ray‑Ban glasses (or any phone camera) into a hands‑free, voice‑and‑vision assistant. Powered by Google’s Gemini Live API for multimodal conversation and optionally the OpenClaw gateway for agentic tool‑calling, the app lets users:

  • Ask "What am I looking at?" and get a spoken description of the scene.
  • Add grocery items, create reminders, or send instant messages via WhatsApp, Telegram or iMessage.
  • Search the web, control smart‑home devices or manage notes without touching a screen.
  • Stream the glasses view live to a browser for remote viewing or collaboration.

Why VisionClaw? VisionClaw is not just a code sample—it’s a fully functional, end‑to‑end pipeline that blends iOS/Android development with real‑world AI services. It serves as a template for developers who want to build AR applications that combine visual perception, natural‑language interaction, and automation.


Project Overview

Feature                    iOS (Swift)             Android (Java/Kotlin)
Real‑time voice + vision   Yes                     Yes
Gemini Live WebSocket      Yes                     Yes
OpenClaw tool‑calling      Optional                Optional
Phone‑mode testing         Yes                     Yes
WebRTC streaming           Yes                     Yes
SDK dependencies           Meta DAT SDK, OpenClaw  Meta DAT SDK, OpenClaw

The repo structure:

  • samples/ – Separate camera‑access projects for iOS and Android.
  • assets/ – Screenshots, architecture diagram, teaser image.
  • README.md – Full documentation, quick start, architecture notes.
  • CHANGELOG.md – Release history.
  • LICENSE – MIT license.


Quick Start

1️⃣ Clone the repository

git clone https://github.com/sseanliu/VisionClaw.git

2️⃣ iOS Setup

  1. Open samples/CameraAccess/CameraAccess.xcodeproj in Xcode 15+.
  2. Copy the example secrets file: cp CameraAccess/Secrets.swift.example CameraAccess/Secrets.swift.
  3. Edit Secrets.swift – insert your Gemini API key and, if desired, OpenClaw settings.
  4. Choose an iPhone target and press Run (⌘R).
  5. In the app, tap Start on iPhone (camera mode) or Start Streaming (glasses mode). Then press the AI button to converse.

3️⃣ Android Setup

  1. Open samples/CameraAccessAndroid in Android Studio.
  2. Configure GitHub Packages: add a github_token with read:packages scope to local.properties.
  3. Copy the secrets example: cp secrets.kt.example secrets.kt and fill in your Gemini key.
  4. Sync Gradle and Run on a device (Shift+F10).
  5. Tap Start on Phone or Start Streaming then use the AI button.
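
For reference, the GitHub Packages entry from step 2 might look like the following in local.properties. The property name follows the step above; the token value is a placeholder you generate yourself:

```properties
# local.properties (kept out of version control)
# Personal access token with the read:packages scope
github_token=YOUR_GITHUB_TOKEN
```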

4️⃣ (Optional) Hook into OpenClaw

OpenClaw brings agentic actions like posting to Slack, adding calendar events, or controlling Philips Hue lights.

  1. Install and run the OpenClaw gateway on your Mac.
  2. Configure the host, port, and token in Secrets.swift or Secrets.kt.
  3. In the app’s settings, enable the OpenClaw section.
  4. Test a task such as “Add milk to my shopping list” – the gateway executes it.
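
To make the tool‑calling handshake concrete, here is a minimal Python sketch of how a Gemini tool‑call might be forwarded to the gateway as an HTTP request. The endpoint path, payload shape, and tool name are illustrative assumptions, not OpenClaw’s actual wire format:

```python
import json

def build_gateway_request(host: str, port: int, token: str,
                          tool: str, args: dict) -> tuple[str, dict, str]:
    """Build an HTTP request forwarding a model tool-call to a gateway.

    The /tools/<name> path and the JSON body shape are hypothetical,
    used only to illustrate the app-to-gateway hop.
    """
    url = f"http://{host}:{port}/tools/{tool}"  # hypothetical endpoint
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"tool": tool, "arguments": args})
    return url, headers, body

# Example: the "Add milk to my shopping list" task from step 4
url, headers, body = build_gateway_request(
    "mac.local", 8765, "secret-token",
    "shopping_list.add", {"item": "milk"},
)
```

On device, the same construction lives in the Swift/Kotlin app layer; only the host, port, and token come from the Secrets file.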


Architecture Snapshot

How It Works

  1. Camera / Mic – Captures video frames (~1 fps) and audio (16 kHz PCM).
  2. App Layer – Sends frames & audio via Gemini Live WebSocket (binary).
  3. Gemini Live – Processes multimodal input; returns spoken audio, text, and tool‑calls.
  4. OpenClaw (optional) – Receives tool‑calls, performs actions via its 56+ skill APIs, returns results.
  5. Audio Pipeline – Streams Gemini’s 24 kHz PCM back to the device’s speaker.
  6. WebRTC – Optional live streaming of the glasses view to a browser.
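
The audio figures above translate into small, predictable WebSocket payloads. A back‑of‑the‑envelope sketch in Python (the 100 ms chunk duration is an assumption; the 16 kHz input and 24 kHz output rates come from the pipeline description):

```python
BYTES_PER_SAMPLE = 2  # 16-bit PCM
CHANNELS = 1          # mono

def pcm_chunk_bytes(sample_rate_hz: int, chunk_ms: int) -> int:
    """Size in bytes of one PCM chunk of the given duration."""
    samples = sample_rate_hz * chunk_ms // 1000
    return samples * BYTES_PER_SAMPLE * CHANNELS

mic_chunk = pcm_chunk_bytes(16_000, 100)      # uplink audio per 100 ms
speaker_chunk = pcm_chunk_bytes(24_000, 100)  # downlink audio per 100 ms
```

At these rates the uplink audio is only ~32 KB/s, so the ~1 fps video frames dominate the session’s bandwidth.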

Troubleshooting & Tips

  • Gemini not hearing me – Verify the mic permission; adjust the voice‑activity settings in the app.
  • OpenClaw connection timeout – Ensure the phone and Mac share the same Wi‑Fi network; confirm the gateway is running; use the correct Bonjour hostname.
  • Gradle sync 401 error – The token in local.properties must include the read:packages scope; use gh auth token or a manually generated GitHub token.
  • No audio playback – Check the RECORD_AUDIO permission and the device’s audio output; on Android 13+, grant permissions manually via Settings.
  • Camera not starting – Ensure the CAMERA permission is granted and the camera lifecycle is handled properly; test on a fresh device.
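
For the connection‑timeout case, a quick TCP reachability check against the configured gateway host and port can rule out network problems before you debug the app itself. A minimal Python sketch (host and port are whatever you put in your Secrets file):

```python
import socket

def gateway_reachable(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

Run it from any machine on the same Wi‑Fi; if it returns False, fix the network or Bonjour hostname before touching the app settings.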

Real‑World Use Cases

  • Field Research – A scientist wearing Ray‑Ban glasses can ask about specimens on a hike and get an annotated description without pulling out a phone.
  • Retail Assistants – Shop‑floor staff can add items to a cart or check stock information hands‑free.
  • Remote Assistance – Engineers can stream their view to a remote expert while the AI handles voice commands.
  • Accessibility – Visually impaired users can get real‑time scene descriptions coupled with action prompts.

Closing Thoughts

VisionClaw is a practical showcase of how multimodal large language models can be brought into everyday wearable devices. It blends cutting‑edge AI with reliable open‑source tool‑calling, all in a single GitHub repository with clear documentation. If you’re building the next generation of hands‑free assistants, VisionClaw is a solid foundation to start from and a springboard to even more ambitious projects.

Next Steps: Fork the repo, experiment with custom Gemini prompts, add new skills to OpenClaw, or integrate your own wearable SDK. Happy hacking!
