VisionClaw: Real-Time Gemini AI Assistant for Smart Glasses
The VisionClaw project demonstrates how to turn the Meta Ray‑Ban glasses (or any phone camera) into a hands‑free, voice‑and‑vision assistant. Powered by Google’s Gemini Live API for multimodal conversation and optionally the OpenClaw gateway for agentic tool‑calling, the app lets users:
- Ask "What am I looking at?" and get a spoken description of the scene.
- Add grocery items, create reminders, or send instant messages via WhatsApp, Telegram or iMessage.
- Search the web, control smart‑home devices or manage notes without touching a screen.
- Stream the glasses view live to a browser for remote viewing or collaboration.
Why VisionClaw? VisionClaw is not just a code sample—it’s a fully functional, end‑to‑end pipeline that blends iOS/Android development with real‑world AI services. It serves as a template for developers who want to build AR applications that combine visual perception, natural‑language interaction, and automation.
Project Overview
| Feature | iOS (Swift) | Android (Java/Kotlin) |
|---|---|---|
| Real‑time voice + vision | Yes | Yes |
| Gemini Live WebSocket | Yes | Yes |
| OpenClaw tool‑calling | Optional | Optional |
| Phone‑mode testing | Yes | Yes |
| WebRTC streaming | Yes | Yes |
| SDK dependencies | Meta DAT SDK, OpenClaw | Meta DAT SDK, OpenClaw |
The repo structure:
- samples/ – Separate camera‑access projects for iOS and Android.
- assets/ – Screenshots, architecture diagram, teaser image.
- README.md – Full documentation, quick start, architecture notes.
- CHANGELOG.md – Release history.
- LICENSE – MIT license.
Quick Start
1️⃣ Clone the repository
git clone https://github.com/sseanliu/VisionClaw.git
2️⃣ iOS Setup
- Open samples/CameraAccess/CameraAccess.xcodeproj in Xcode 15+.
- Copy the example secrets file: cp CameraAccess/Secrets.swift.example CameraAccess/Secrets.swift
- Edit Secrets.swift and insert your Gemini API key and, if desired, your OpenClaw settings.
- Choose an iPhone target and press Run (⌘R).
- In the app, tap Start on iPhone (camera mode) or Start Streaming (glasses mode), then press the AI button to converse.
3️⃣ Android Setup
- Open samples/CameraAccessAndroid in Android Studio.
- Configure GitHub Packages: add a github_token with read:packages scope to local.properties.
- Copy the secrets example: cp secrets.kt.example secrets.kt and fill in your Gemini key.
- Sync Gradle and run on a device (Shift+F10).
- Tap Start on Phone or Start Streaming, then use the AI button.
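Once copied, a filled-in secrets.kt might look like the sketch below. The property names here are illustrative assumptions; mirror whatever secrets.kt.example actually declares in the repo.

```kotlin
// Hypothetical secrets.kt sketch. The real file may use different
// identifiers, so copy and edit secrets.kt.example rather than this.
object Secrets {
    // Gemini API key for the Live WebSocket session.
    const val GEMINI_API_KEY = "YOUR_GEMINI_API_KEY"

    // Optional OpenClaw gateway settings (only needed for tool-calling).
    const val OPENCLAW_HOST = "my-mac.local" // Bonjour hostname of the Mac
    const val OPENCLAW_PORT = 18789          // placeholder port
    const val OPENCLAW_TOKEN = "YOUR_OPENCLAW_TOKEN"
}
```

Keep this file out of version control; the example file exists precisely so real keys never land in the repo.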
4️⃣ (Optional) Hook into OpenClaw
OpenClaw brings agentic actions such as posting to Slack, adding calendar events, or controlling Philips Hue lights.
1. Install and run the OpenClaw gateway on your Mac.
2. Configure the host, port and token in Secrets.swift or Secrets.kt.
3. In the app’s settings, enable the OpenClaw section.
4. Test a task such as “Add milk to my shopping list” – the gateway executes it!
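Conceptually, the gateway side of step 4 is a lookup from a tool-call name to a skill handler. The minimal sketch below illustrates that dispatch pattern; the registry class, skill names, and string-based arguments are all assumptions for illustration, not the gateway's actual protocol.

```kotlin
// Illustrative skill dispatch: route a tool-call (name + arguments)
// coming back from Gemini Live to a registered handler, the way an
// OpenClaw-style gateway routes calls to its skills.
typealias Skill = (Map<String, String>) -> String

class SkillRegistry {
    private val skills = mutableMapOf<String, Skill>()

    fun register(name: String, skill: Skill) {
        skills[name] = skill
    }

    // Invoke the matching skill, or report an unknown tool name.
    fun dispatch(name: String, args: Map<String, String>): String =
        skills[name]?.invoke(args) ?: "error: unknown tool '$name'"
}

fun main() {
    val registry = SkillRegistry()
    registry.register("add_to_shopping_list") { args ->
        "added '${args["item"]}' to the shopping list"
    }
    // "Add milk to my shopping list" arrives as a structured tool-call:
    println(registry.dispatch("add_to_shopping_list", mapOf("item" to "milk")))
    // prints: added 'milk' to the shopping list
}
```

The same shape extends to any new skill: register a handler, and the model's tool-calls reach it by name.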
Architecture Snapshot

- Camera / Mic – Captures video frames (~1 fps) and audio (16 kHz PCM).
- App Layer – Sends frames & audio via Gemini Live WebSocket (binary).
- Gemini Live – Processes multimodal input; returns spoken audio, text, and tool‑calls.
- OpenClaw (optional) – Receives tool‑calls, performs actions via its 56+ skill APIs, returns results.
- Audio Pipeline – Streams Gemini’s 24 kHz PCM back to the device’s speaker.
- WebRTC – Optional live streaming of the glasses view to a browser.
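The ~1 fps video rate in the capture stage implies some throttling before frames hit the WebSocket. A minimal sketch of that pacing logic (class name and millisecond-based API are assumptions, not the app's actual pipeline code):

```kotlin
// Minimal frame throttle: forward at most one frame per intervalMs.
// Illustrative sketch of the ~1 fps pacing, not VisionClaw's real code.
class FrameThrottle(private val intervalMs: Long = 1000) {
    // Initialized so the very first frame always passes.
    private var lastSentMs = -intervalMs

    // Returns true if a frame captured at nowMs should be sent upstream.
    fun shouldSend(nowMs: Long): Boolean {
        if (nowMs - lastSentMs >= intervalMs) {
            lastSentMs = nowMs
            return true
        }
        return false
    }
}

fun main() {
    val throttle = FrameThrottle(1000)
    println(throttle.shouldSend(0))    // true  (first frame)
    println(throttle.shouldSend(500))  // false (too soon)
    println(throttle.shouldSend(1000)) // true  (interval elapsed)
}
```

Dropping frames on the capture side keeps bandwidth and Gemini token usage predictable while the 16 kHz audio stream runs continuously.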
Troubleshooting & Tips
| Issue | Fix |
|---|---|
| Gemini not hearing me | Verify mic permission; adjust voice‑activity settings in the app. |
| OpenClaw connection timeout | Ensure phone & Mac share the same Wi‑Fi; confirm the gateway is running; use the correct Bonjour hostname. |
| Gradle sync 401 error | The token in local.properties must include the read:packages scope. Use gh auth token or create a token manually on GitHub. |
| No audio playback | Check the RECORD_AUDIO permission and audio output routing; on Android 13+ you may need to grant the mic permission manually in Settings. |
| Camera not starting | Ensure CAMERA permission and proper lifecycle handling; test on a fresh device. |
Real‑World Use Cases
- Field Research – A scientist wearing Ray‑Ban glasses can ask about specimens on a hike and get an annotated description without pulling out a phone.
- Retail Assistants – Shop‑floor staff can add items to a cart or check stock information hands‑free.
- Remote Assistance – Engineers can stream their view to a remote expert while the AI handles voice commands.
- Accessibility – Visually impaired users can get real‑time scene descriptions coupled with action prompts.
Closing Thoughts
VisionClaw is a practical showcase of how multimodal large language models can be brought into everyday wearable devices. It blends cutting‑edge AI with reliable open‑source tool‑calling, all in a single GitHub repository with clear documentation. If you’re building the next generation of hands‑free assistants, VisionClaw is a solid foundation to start from and a springboard to even more ambitious projects.
Next Steps: Fork the repo, experiment with custom Gemini prompts, add new skills to OpenClaw, or integrate your own wearable SDK. Happy hacking!