Apple’s MobileCLIP – The Open‑Source Mobile Vision Model

Apple released the MobileCLIP and MobileCLIP2 libraries in 2024‑2025, providing high‑accuracy image‑text models that run comfortably on iPhone hardware. The GitHub repository ml‑mobileclip ships the full implementation – from training pipelines to an end‑to‑end iOS app – making it an ideal case study for developers who want to build vision‑language features on mobile.


What is MobileCLIP?

MobileCLIP is a family of lightweight multi‑modal models that adapt the CLIP architecture with efficient hybrid encoders built from re‑parameterisable MobileOne‑style blocks. It achieves state‑of‑the‑art accuracy‑latency trade‑offs for zero‑shot classification while keeping parameter counts and on‑device latency small.

  • Variants: the original MobileCLIP family spans S0, S1, S2, and B; MobileCLIP2 improves the training recipe and extends the line‑up with S3, S4, and an L‑14 model (the snippet after this list shows how to check which variants your open_clip install recognises).
  • Training data: DataCompDR‑1B (DataComp‑1B "reinforced" with synthetic CoCa captions and teacher embeddings) for MobileCLIP; the larger DFNDR‑2B for MobileCLIP2.
  • Performance: On ImageNet‑1k zero‑shot top‑1, MobileCLIP‑S0 matches OpenAI’s CLIP ViT‑B/16 while being 4.8× faster and 2.8× smaller. MobileCLIP2‑S4 reaches 81.9% accuracy with 19.6 ms latency on an iPhone 12 Pro Max.
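
To see which of these variants your local tooling already knows about, you can query open_clip's model registry once the library is installed (see the quick‑start below). This is a small sketch; MobileCLIP2 configs registered by the ml‑mobileclip package itself may or may not appear, depending on your open_clip version:

import open_clip

# (architecture, pretrained tag) pairs this open_clip build can load out of the box
mobileclip_checkpoints = [
    (arch, tag)
    for arch, tag in open_clip.list_pretrained()
    if "mobileclip" in arch.lower()
]
for arch, tag in mobileclip_checkpoints:
    print(arch, tag)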

Repo Highlights

  • mobileclip/ – Core model code (MobileOne blocks, transformer heads, re‑parameterisation for efficient inference).
  • training/ – Pipelines to fine‑tune on custom datasets, command‑line scripts, distributed training utilities.
  • eval/ – Zero‑shot evaluation scripts (ImageNet, 38‑dataset benchmark).
  • ios_app/ – Swift project demonstrating real‑time zero‑shot classification on an iPhone.
  • docs/ – Accuracy‑vs‑latency plots, architecture diagrams, licensing info.

All modules are pip‑installable (pip install -e .) and integrate smoothly with the popular OpenCLIP framework.


Quick‑Start: Inference with OpenCLIP

# 1️⃣ Create a virtual environment and install the library
conda create -n clipenv python=3.10 -y
conda activate clipenv
pip install -e .

# 2️⃣ Zero-shot classification in Python
import torch
import open_clip
from PIL import Image
from mobileclip.modules.common.mobileone import reparameterize_model

model_name = "MobileCLIP2-S0"
# If you want the pretrained checkpoint from HuggingFace:
#   hf download apple/MobileCLIP2-S0 --> local checkpoint
model, _, preprocess = open_clip.create_model_and_transforms(
    model_name,
    pretrained="/path/to/mobileclip2_s0.pt",
    image_mean=(0, 0, 0), image_std=(1, 1, 1),  # MobileCLIP expects unnormalised [0, 1] inputs
)
model.eval()
model = reparameterize_model(model)  # fuse re-parameterisable branches for faster inference

tokenizer = open_clip.get_tokenizer(model_name)
image = preprocess(Image.open("path/to/image.jpg").convert("RGB")).unsqueeze(0)
text = tokenizer(["a dog", "a cat", "a tree"])  # candidate labels as prompts

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = torch.softmax(100.0 * image_features @ text_features.T, dim=-1)
    print("label probs:", probs)

The helper reparameterize_model() fuses the multi‑branch MobileOne‑style blocks and their batch‑norm layers into single convolutions for inference, cutting latency on CPU and GPU without changing the model’s outputs.
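
To sanity‑check that speed‑up yourself, you can time the image tower before and after re‑parameterisation. This is a rough wall‑clock sketch, not the Core ML benchmarking setup behind the reported latencies: bench_image_ms is an illustrative helper, the 256×256 input resolution is assumed, and the checkpoint path is the placeholder from the quick‑start above.

import time

import open_clip
import torch
from mobileclip.modules.common.mobileone import reparameterize_model

def bench_image_ms(clip_model, image_batch, runs=20):
    """Average wall-clock latency (ms) of the image encoder over `runs` forward passes."""
    clip_model.eval()
    with torch.no_grad():
        clip_model.encode_image(image_batch)   # warm-up pass
        start = time.perf_counter()
        for _ in range(runs):
            clip_model.encode_image(image_batch)
    return (time.perf_counter() - start) / runs * 1e3

dummy = torch.rand(1, 3, 256, 256)             # assumed MobileCLIP2-S0 input resolution
fresh, _, _ = open_clip.create_model_and_transforms(
    "MobileCLIP2-S0", pretrained="/path/to/mobileclip2_s0.pt",
    image_mean=(0, 0, 0), image_std=(1, 1, 1),
)
fresh.eval()
print("before:", round(bench_image_ms(fresh, dummy), 2), "ms")
print("after: ", round(bench_image_ms(reparameterize_model(fresh), dummy), 2), "ms")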


Zero‑Shot Evaluation

# Run the pre‑built script for ImageNet-1k
python eval/zeroshot_imagenet.py \
  --model-arch mobileclip_s0 \
  --model-path /path/to/mobileclip_s0.pt

The script reports zero‑shot top‑1 accuracy on ImageNet‑1k; the eval/ directory also covers the broader 38‑dataset benchmark used in the papers, which is where the aggregated accuracy figures quoted below come from.
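
For intuition about what such an evaluation does under the hood, here is a minimal sketch of zero‑shot classification over a labelled dataset. It assumes you already have a model, tokenizer, and a DataLoader of preprocessed images, and it uses a single prompt template with top‑1 accuracy only, whereas the repo's scripts use prompt ensembles and more careful data handling:

import torch

@torch.no_grad()
def zero_shot_top1(model, tokenizer, loader, class_names, device="cpu"):
    """Score every image against one text prompt per class and report top-1 accuracy."""
    prompts = tokenizer([f"a photo of a {name}" for name in class_names]).to(device)
    text_features = model.encode_text(prompts)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    correct = total = 0
    for images, labels in loader:                      # loader yields preprocessed image batches
        image_features = model.encode_image(images.to(device))
        image_features /= image_features.norm(dim=-1, keepdim=True)
        preds = (image_features @ text_features.T).argmax(dim=-1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total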


Running the iOS Demo

The repo ships an iOS app that can be built with Xcode 15+:

  1. Clone the repo and locate the ios_app folder.
  2. Open AppleMobileCLIP.xcodeproj.
  3. Pull the pretrained mobileclip_s0.pt from HuggingFace or your local path.
  4. Add the file to the Xcode project under Resources.
  5. Hit Run – the app will load the model, capture from the camera, and classify frames in real time.

The demo showcases MobileCLIP2‑S0 running on an iPhone 12 Pro Max, achieving ~2 ms image‑text inference time.
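
The steps above feed the PyTorch checkpoint to the app directly. If you would rather package the image tower as a Core ML model, a conversion sketch with coremltools could look like the following, reusing the model object from the quick‑start. Everything here is illustrative rather than the repo's own packaging: the .visual attribute access, the 256×256 input resolution, and the deployment target are assumptions.

import coremltools as ct
import torch

# Export only the image tower; text embeddings for a fixed label set can be
# pre-computed offline and shipped with the app as plain tensors.
image_encoder = model.visual            # open_clip models expose the image encoder as `.visual` (assumed)
image_encoder.eval()

example = torch.rand(1, 3, 256, 256)    # assumed MobileCLIP2-S0 input resolution
traced = torch.jit.trace(image_encoder, example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="image", shape=example.shape)],
    minimum_deployment_target=ct.target.iOS16,
)
mlmodel.save("MobileCLIP2_S0_ImageEncoder.mlpackage")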


Extending & Finetuning

  1. Custom Dataset – Drop a HuggingFace‑style dataset into training/ and modify train.py.
  2. New Architecture – Fork MobileOne, add your own head and re‑parameterise.
  3. Quantisation – Use PyTorch’s quantisation utilities (or coremltools’ model‑compression APIs when targeting iOS) to shrink latency further; see the sketch after this list.
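
As a concrete starting point, dynamic int8 quantisation of the linear layers is a one‑liner in PyTorch. This is a minimal sketch that reuses the model and tokenizer from the quick‑start; it mainly benefits the transformer/text side on CPU backends, while the conv‑heavy image tower generally needs static quantisation or Core ML compression instead:

import torch
from torch.ao.quantization import quantize_dynamic

# Replace nn.Linear layers with int8-weight versions; other module types are untouched.
quantized = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    text_features = quantized.encode_text(tokenizer(["a dog"]))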

The community is encouraged to open issues for feature requests such as TensorRT export or Android support.


Performance in Context

Model              Params (M)      Latency (ms)   ImageNet-1k Top-1   Avg. Acc. (38 datasets)
MobileCLIP-S0      11.4 + 63.4     1.5 + 3.3      67.8 %              58.1 %
MobileCLIP2-S0     11.4 + 63.4     1.5 + 3.3      71.5 %              59.7 %
MobileCLIP2-S2     35.7 + 63.4     3.6 + 3.3      77.2 %              64.1 %
MobileCLIP2-L-14   304.3 + 123.6   57.9 + 6.6     81.9 %              67.8 %

Params and latency are listed as image encoder + text encoder; latencies are measured on an iPhone 12 Pro Max.

These numbers illustrate that MobileCLIP2 can rival much larger ViT‑based CLIP models while running at roughly half the latency, or less, on the same hardware.


Final Thoughts

Apple’s MobileCLIP libraries provide a complete, production‑ready stack for vision‑language tasks that run on a phone. By shipping source code, pretrained checkpoints, evaluation scripts, and an iOS demo, the repo empowers both researchers and app developers to experiment, finetune, and release new services quickly. Whether you’re building an AR filter that detects objects via text prompts or pushing a zero‑shot classification backend to the edge, MobileCLIP gives you the tooling and performance needed to get there.

For more information, check out the official GitHub repo, read the accompanying research papers, and explore the pretrained models on HuggingFace.
