The Ultimate Guide to WebLLM: In‑Browser LLMs with WebGPU
Large Language Models (LLMs) have become a staple of AI development, yet most of the ecosystem still relies on heavy cloud servers. WebLLM changes that paradigm by running modern LLMs entirely in the user’s browser, leveraging WebGPU for real‑time inference and preserving privacy.
In this guide, we will:
- Show how to install and set up WebLLM.
- Load and run popular models (Llama‑3, Phi‑3, Mistral, etc.).
- Use the OpenAI‑compatible API for chat, streaming, and JSON mode.
- Extend WebLLM with Web Workers, Service Workers, and Chrome extensions.
- Deploy a lightweight, local chatbot demo.
Let’s dive in.
1. What is WebLLM?
WebLLM (short for Web Large Language Model) is:
- In‑browser – No server calls.
- High‑performance – Uses WebGPU for hardware acceleration.
- OpenAI‑compatible – Same API you use in OpenAI’s SDK.
- Extensible – Plug in custom models or run on Web Workers.
- Production‑ready – Offers streaming, JSON mode, and planned function calling.
Quick Link: https://webllm.mlc.ai
2. Installation
# npm
npm install @mlc-ai/web-llm
# yarn
yarn add @mlc-ai/web-llm
# pnpm
pnpm add @mlc-ai/web-llm
CDN
If you prefer a script tag, WebLLM is available through jsDelivr:
<script type="module">
import * as webllm from "https://esm.run/@mlc-ai/web-llm";
// …
</script>
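The CDN build exposes the same exports as the npm package, so a complete script-tag setup might look like the sketch below (the model ID is the one used later in this guide; any ID from the prebuilt list works):
<script type="module">
  import { CreateMLCEngine } from "https://esm.run/@mlc-ai/web-llm";

  // Download, compile, and chat entirely from a script tag.
  const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC", {
    initProgressCallback: (p) => console.log(p.text)
  });
  const out = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Hello!" }]
  });
  console.log(out.choices[0].message.content);
</script>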
3. Building a Minimal Chat App
Below is a vanilla‑JS version. The same logic works in React, Vue, or any framework.
import { CreateMLCEngine } from '@mlc-ai/web-llm';

// The progress report exposes a 0–1 progress fraction and a human-readable text field.
const initProgress = (p) => console.log(`Loading: ${Math.round(p.progress * 100)}%`);

const modelId = 'Llama-3.1-8B-Instruct-q4f32_1-MLC';

async function init() {
  // Constructs the engine and downloads/compiles the model in one call.
  const engine = await CreateMLCEngine(modelId, { initProgressCallback: initProgress });

  const chat = async (messages) => {
    const result = await engine.chat.completions.create({ messages });
    return result.choices[0].message.content;
  };

  // Demo call
  const reply = await chat([
    { role: 'system', content: 'You are helpful.' },
    { role: 'user', content: 'Tell me a joke.' }
  ]);
  console.log('AI:', reply);
}

init();
Key points:
- CreateMLCEngine both constructs the engine and loads the model.
- initProgressCallback reports download progress.
- The returned API mirrors OpenAI’s: chat.completions.create.
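If you're not sure which model_id values are available, one way to find out (a sketch, assuming the package's exported prebuiltAppConfig, which holds the default model catalog) is to inspect it at runtime:
import { prebuiltAppConfig } from '@mlc-ai/web-llm';

// Logs every prebuilt model ID, e.g. "Llama-3.1-8B-Instruct-q4f32_1-MLC"
for (const m of prebuiltAppConfig.model_list) {
  console.log(m.model_id);
}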
4. Streaming Chat
WebLLM supports stream:true, yielding an AsyncGenerator.
const chunks = await engine.chat.completions.create({
  messages,
  stream: true,
  stream_options: { include_usage: true }
});

let text = '';
let usage;
for await (const chunk of chunks) {
  text += chunk.choices[0]?.delta?.content ?? '';
  console.log(text);
  if (chunk.usage) usage = chunk.usage; // usage arrives with the final chunk
}
console.log('Total tokens:', usage?.total_tokens);
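In a real UI you would usually append each delta to the page rather than logging it; a minimal sketch, assuming an element with id "output" exists on the page:
const output = document.getElementById('output');
for await (const chunk of chunks) {
  // Each chunk carries only the newly generated tokens.
  output.textContent += chunk.choices[0]?.delta?.content ?? '';
}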
5. JSON Mode and Structured Output
const reply = await engine.chat.completions.create({
  messages,
  temperature: 0.2,
  response_format: { type: 'json_object' }
});
const json = JSON.parse(reply.choices[0].message.content);
With response_format set to json_object, the engine constrains decoding so the output is always valid JSON; if you additionally supply a JSON schema, the output is constrained to match it.
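Beyond plain json_object mode, recent WebLLM releases also accept a stringified JSON Schema via response_format.schema; a hedged sketch (the schema itself is hypothetical):
const schema = {
  type: 'object',
  properties: {
    setup: { type: 'string' },
    punchline: { type: 'string' },
    rating: { type: 'integer' }
  },
  required: ['setup', 'punchline', 'rating']
};

const structured = await engine.chat.completions.create({
  messages: [{ role: 'user', content: 'Tell me a joke and rate it 1-10, as JSON.' }],
  response_format: { type: 'json_object', schema: JSON.stringify(schema) }
});
console.log(JSON.parse(structured.choices[0].message.content));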
6. Running on Web Workers
Using a dedicated Web Worker keeps the UI responsive while the model loads and generates.

worker.ts

import { WebWorkerMLCEngineHandler } from '@mlc-ai/web-llm';

const handler = new WebWorkerMLCEngineHandler();
self.onmessage = (msg) => handler.onmessage(msg);

main.ts

import { CreateWebWorkerMLCEngine } from '@mlc-ai/web-llm';

const worker = new Worker(new URL('./worker.ts', import.meta.url), { type: 'module' });
const engine = await CreateWebWorkerMLCEngine(worker, modelId, { initProgressCallback });
This pattern allows heavy compute to run off the main thread.
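The worker-backed engine exposes the exact same chat.completions surface, so the code from Section 3 carries over unchanged; for example:
const reply = await engine.chat.completions.create({
  messages: [{ role: 'user', content: 'Summarize WebGPU in one sentence.' }]
});
console.log(reply.choices[0].message.content);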
7. Service Workers for Offline Caching
With a Service Worker, the engine and its loaded model persist across page reloads, which is useful for offline or low‑latency experiences.
import { ServiceWorkerMLCEngineHandler } from '@mlc-ai/web-llm';

let handler;
self.addEventListener('activate', () => {
  handler = new ServiceWorkerMLCEngineHandler();
});
Then register in the main app:
import { CreateServiceWorkerMLCEngine } from '@mlc-ai/web-llm';

await navigator.serviceWorker.register(new URL('./sw.ts', import.meta.url), { type: 'module' });
const engine = await CreateServiceWorkerMLCEngine(modelId, { initProgressCallback });
8. Custom Models
WebLLM can ingest any model compiled to MLC format. Provide model and model_lib URLs in an app config:
const appConfig = {
  model_list: [{
    model: 'https://mydomain.com/my-llama',
    model_id: 'MyLlama-3B',
    model_lib: 'https://mydomain.com/myllama3b.wasm'
  }]
};
const engine = await CreateMLCEngine('MyLlama-3B', { appConfig });
This makes it trivial to bootstrap local or private models.
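If you want your private model to sit alongside the stock catalog, one sketch (assuming the exported prebuiltAppConfig carries the default model_list) is to spread it into your own config:
import { prebuiltAppConfig, CreateMLCEngine } from '@mlc-ai/web-llm';

const appConfig = {
  ...prebuiltAppConfig,
  model_list: [
    ...prebuiltAppConfig.model_list,
    {
      model: 'https://mydomain.com/my-llama',
      model_id: 'MyLlama-3B',
      model_lib: 'https://mydomain.com/myllama3b.wasm'
    }
  ]
};
const engine = await CreateMLCEngine('MyLlama-3B', { appConfig });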
9. Chrome Extension Example
WebLLM ships with extension scaffolds – either a lightweight UI or a persistent background worker. Refer to the examples/chrome-extension folder in the repo for a full setup that injects a local assistant into any webpage.
10. Performance & Compatibility
| Feature | Supported | Notes |
|---|---|---|
| WebGPU | ✅ | Requires a browser that exposes WebGPU (Chrome 113+, Edge 113+; Safari and Firefox in recent releases or behind a flag). |
| WebAssembly | ✅ | Model libraries ship as WebAssembly, but WebGPU is still required for inference. |
| OpenAI API | ✅ | Same request/response shapes as the OpenAI SDK, but inference runs locally and offline. |
| Streaming | ✅ | Real‑time responses via AsyncGenerator. |
| JSON mode | ✅ | Output constrained to valid JSON, with optional schema support. |
| Function calling | WIP | Experimental tool/function‑calling support. |
| Web Workers | ✅ | Keeps UI thread free. |
| Service Workers | ✅ | Offline model caching. |
Hardware Acceleration
The primary speed boost comes from WebGPU, which runs the inference kernels on the GPU rather than the CPU; benchmarks show roughly a 3x–5x latency reduction over CPU-bound JavaScript for 8B-parameter models. Note that WebLLM does not fall back to CPU-only inference: without a working WebGPU adapter the engine cannot load a model, so feature-detect WebGPU before initializing (see the sketch below).
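A minimal feature-detection sketch using the standard navigator.gpu API (the adapter check is optional but catches machines where WebGPU exists without a usable GPU):
if (!('gpu' in navigator)) {
  console.warn('WebGPU is not available in this browser – WebLLM will not be able to run.');
} else {
  // requestAdapter() resolves to null when no suitable GPU is found.
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    console.warn('No suitable GPU adapter found.');
  }
}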
11. Getting Help & Contributing
- Documentation & Blog – https://webllm.mlc.ai/docs
- Discord Community – Join for quick help.
- GitHub Issues – Open problems or feature requests.
- Pull Requests – Contribute to code, examples, or documentation.
12. Wrap‑Up
WebLLM brings the power of LLMs to the browser without server costs or privacy trade‑offs. Whether you’re prototyping an AI chatbot, building a Chrome extension, or experimenting with custom models, the engine’s modular API and WebGPU acceleration make it a compelling choice.
Next steps?
- Try the playground at https://webllm.mlc.ai.
- Clone the repo and play with examples/get-started.
- Experiment with building your own Chrome extension.
Happy coding and see you in the browser!