oMLX: Mac Menu Bar LLM Server with SSD Cache

Discover oMLX, the ultimate local LLM server for Apple Silicon Macs. Run LLMs, VLMs, and embeddings from your menu bar with continuous batching, tiered KV caching (RAM + SSD), and multi-model serving. Features admin dashboard, OpenAI API compatibility, Claude Code optimization, and one-click model downloads from Hugging Face. Install via DMG, Homebrew, or source – perfect for developers wanting production-grade local AI without cloud costs.

oMLX: Revolutionize Local AI on Your Mac with Menu Bar Control

Local LLMs on Apple Silicon just got a major upgrade. oMLX is an open-source inference server that combines production-grade features with dead-simple Mac integration. Forget terminal juggling – manage your LLMs directly from your menu bar.

Why oMLX Stands Out

Built on Apple's MLX framework, oMLX delivers:

  • Tiered KV Caching: Hot RAM tier + Cold SSD tier with prefix sharing and Copy-on-Write
  • Continuous Batching: Handle concurrent requests like vLLM
  • Multi-Model Serving: LLMs, VLMs, embeddings, and rerankers in one server
  • Native macOS App: Menu bar stats, auto-restart, in-app updates
  • Admin Dashboard: Real-time monitoring, model downloader, benchmarks, per-model settings

Killer Features for Developers

# Pin your daily models, auto-swap heavy ones
Pin: Qwen3-Coder-8bit, Step-3.5-Flash
Auto-load: gpt-oss-120b on demand

# SSD cache survives restarts
/hot-cache: 20GB RAM
/cold-cache: ~/.omlx/cache (SSD)

Claude Code Ready: Context scaling + SSE keep-alive prevents timeouts during long compilations.

Vision Superpowers: Qwen3.5-VL, GLM-4V, Pixtral with multi-image tool calling and OCR auto-detection.

Dead Simple Setup

# Homebrew (recommended)
brew tap jundot/omlx
git clone https://github.com/jundot/omlx
pip install -e .

# Launch and forget
brew services start omlx

Or grab the DMG from Releases – three clicks to first tokens.

OpenAI API Drop-In

POST http://localhost:8000/v1/chat/completions
curl -X POST http://localhost:8000/v1/embeddings

Full streaming usage stats, Anthropic Messages API, tool calling, and vision inputs (base64/URL).

Production Ready

  • Memory Enforcement: Total RAM limit prevents OOM
  • LRU + Manual + TTL: Sophisticated model eviction
  • Offline Admin: All CDN assets vendored
  • Structured Logging: Service + application logs

Benchmarks Speak Louder

Run from admin panel: Prefill tokens/sec, generation tokens/sec, cache hit rates. Real-world numbers, not synthetic fluff.

Get Started Today

2.4k GitHub stars and growing. Apache 2.0 licensed.

omlx serve --model-dir ~/models --max-model-memory 32GB

Your Mac's unified memory + oMLX = local AI that rivals cloud services. Install now and experience the future of on-device inference.