Build Real‑Time Speech Recognition in Rust with Voxtral Mini
Introduction
In 2026 the AI landscape is still dominated by large, opaque cloud models, but a handful of community‑driven projects have started to bridge the gap with fully open‑source, real‑time inference that can live on a laptop or even in a browser tab. The Voxtral Mini 4B Realtime project is the most recent example: it implements Mistral’s Voxtral Mini model entirely in Rust, using the Burn ML framework, and exposes both a native command‑line interface (CLI) and a WebAssembly (WASM) package that runs on WebGPU.
This article walks you through the key concepts, architecture, benchmarks, and steps to run the model locally or in the browser.
The Engine: Voxtral Mini 4B Realtime
- Model type: Voice‑to‑text, causal encoder / decoder architecture.
- Weights: 4‑B parameters (~9 GB SafeTensors) or a 2.5 GB Q4‑GGUF quantized shard.
- Implementation: Pure Rust + Burn ML. The Burn crate provides JIT‑style tensor operations on top of CubeCL, which can target Vulkan, Metal or WebGPU (see the short backend sketch after this list).
- Features:
  - GPU accelerated with wgpu (default)
  - Native tokenizer (Tekken) or WASM‑compatible
  - CLI with clap / indicatif progress bars
  - HuggingFace Hub integration for downloading weights
  - WASM compilation with wasm‑pack, including a live demo
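To make the Burn/CubeCL layer concrete, here is a minimal sketch of selecting the wgpu backend and allocating a mel‑spectrogram‑shaped tensor. It assumes a recent Burn release with the `wgpu` feature enabled; the module paths, version number, and tensor shape are illustrative and not taken from the project's source.

```rust
// Minimal Burn + wgpu backend sketch (an assumption of a recent Burn release).
// Cargo.toml: burn = { version = "0.14", features = ["wgpu"] }
use burn::backend::wgpu::{Wgpu, WgpuDevice};
use burn::tensor::Tensor;

type B = Wgpu;

fn main() {
    let device = WgpuDevice::default();
    // Dummy tensor shaped like a mel-spectrogram batch: [batch, mels, frames].
    let mel = Tensor::<B, 3>::zeros([1, 128, 1600], &device);
    println!("mel dims: {:?}", mel.dims());
}
```

Because the same `Wgpu` backend compiles against Vulkan, Metal, or WebGPU depending on the target, one code path can serve both the native CLI and the WASM build.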
Architecture in a Nutshell
```
Audio 16 kHz mono → Mel‑spec (B, 128, T)
          ↓
Causal encoder (32 layers, 1280 dim, 750‑token window)
          ↓
Conv‑downsample → Reshape [B, T/16, 5120]
          ↓
Adapter (3072 dim)
          ↓
Autoregressive decoder (26 layers, 3072 dim, GQA)
          ↓
Token IDs → Text
```
The encoder emits the latent representation in 64‑token chunks, while the decoder keeps a KV‑cache and produces one token per step. A custom WGSL compute shader fuses dequantisation with matrix multiplication, yielding 4× faster decoding on the Q4‑GGUF path compared to a naïve f32 pass.
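The shape bookkeeping above is easy to lose track of, so here is a tiny, self‑contained trace of the tensor dimensions as the diagram describes them. Only the dimensions (128 mels, 1280, 5120, 3072) and the 16× time reduction come from the diagram; the 1600‑frame input length is an arbitrary illustration.

```rust
// Trace the tensor shapes from the architecture diagram above.
// The input length (1600 frames) is an arbitrary example, not a project value.
fn main() {
    let (batch, mels, frames) = (1usize, 128usize, 1600usize);
    println!("mel spectrogram: [{batch}, {mels}, {frames}]");

    // Causal encoder: 32 layers, model dimension 1280, time axis unchanged.
    let enc_dim = 1280usize;
    println!("encoder output:  [{batch}, {frames}, {enc_dim}]");

    // Conv downsample + reshape: 16x fewer time steps, features widen to 5120.
    let ds_frames = frames / 16;
    let ds_dim = 5120usize; // note: 5120 = 4 * 1280
    println!("downsampled:     [{batch}, {ds_frames}, {ds_dim}]");

    // Adapter projects into the decoder's 3072-dim embedding space.
    let dec_dim = 3072usize;
    println!("adapter output:  [{batch}, {ds_frames}, {dec_dim}]");
}
```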
Performance Snapshot
The repo ships an in‑tree benchmark harness, reporting the following on a 16‑core NVIDIA DGX‑Spark:
| Path | Encode (ms) | Decode (ms) | RTF | Tokens/s | Memory |
|---|---|---|---|---|---|
| Q4 GGUF native | 1 021 | 5 578 | 0.416 | 19.4 | 703 MB |
| F32 native | 887 | 23 689 | 1.543 | 4.6 | 9.2 GB |
| Q4 GGUF WASM (browser) | – | – | ~14 | ~0.5 | (browser) |
An RTF of 0.416 means transcription completes in under half the audio's duration, fast enough for live chat or call‑center use. The Q4 path also stays under 3 GB, making it feasible to ship models client‑side.
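For readers who want to sanity‑check the table: the real‑time factor is just processing time divided by audio duration. The snippet below back‑solves the clip length implied by the Q4 GGUF row; the clip length itself is not stated in the benchmark, so treat it as derived rather than measured.

```rust
// RTF = processing time / audio duration; values below 1.0 are faster than
// real time. Back-solve the clip length implied by the Q4 GGUF row above.
fn main() {
    let encode_s = 1.021_f64;
    let decode_s = 5.578_f64;
    let rtf = 0.416_f64;

    let implied_audio_s = (encode_s + decode_s) / rtf;
    println!("implied clip length: {implied_audio_s:.1} s"); // ~15.9 s

    // Conversely, a 60 s clip at the same RTF would take about:
    println!("60 s clip -> {:.1} s of compute", 60.0 * rtf); // ~25 s
}
```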
Quick Start – Native CLI
- Install Rust, Cargo and optional dependencies:
  ```bash
  curl https://sh.rustup.rs -sSf | sh                    # if Rust is missing
  sudo apt-get install libgl1-mesa-dev libvulkan-dev     # for wgpu
  ```
- Download the weights (≈9 GB) via the HuggingFace Hub:
  ```bash
  uv run --with huggingface_hub \
    hf download mistralai/Voxtral-Mini-4B-Realtime-2602 \
    --local-dir models/voxtral
  ```
- Run a transcription:
  ```bash
  cargo run --release --features "wgpu,cli,hub" --bin voxtral-transcribe -- \
    --audio audio.wav --model models/voxtral
  ```
- For the Q4 quantised path (≈2.5 GB):
  ```bash
  cargo run --release --features "wgpu,cli,hub" --bin voxtral-transcribe -- \
    --audio audio.wav --gguf models/voxtral-q4.gguf --tokenizer models/voxtral/tekken.json
  ```
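If you would rather fetch files from Rust than shell out to `hf`, the `hf-hub` crate can resolve individual files from the same repository. This is a sketch under the assumption that the repo exposes `tekken.json` at its top level (the CLI flags above suggest it does); it is not part of the project's own tooling.

```rust
// Optional: resolve a single file from the Hub programmatically with hf-hub.
// Cargo.toml: hf-hub = "0.3"
use hf_hub::api::sync::Api;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let api = Api::new()?;
    let repo = api.model("mistralai/Voxtral-Mini-4B-Realtime-2602".to_string());
    // Downloads into the local Hugging Face cache and returns the path.
    let tokenizer = repo.get("tekken.json")?; // file name is an assumption
    println!("tokenizer at {}", tokenizer.display());
    Ok(())
}
```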
Quick Start – Browser Demo (WASM + WebGPU)
- Install the build tools:
  ```bash
  rustup target add wasm32-unknown-unknown
  npm i -g wasm-pack bun
  ```
- Compile the WASM package:
  ```bash
  wasm-pack build --target web --no-default-features --features wasm
  ```
- Generate a temporary HTTPS certificate (WebGPU requires a secure context):
  ```bash
  openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:prime256v1 \
    -keyout /tmp/voxtral-key.pem -out /tmp/voxtral-cert.pem \
    -days 7 -nodes -subj "/CN=localhost"
  ```
- Serve locally:
  ```bash
  bun serve.mjs
  ```
- Browse to `https://localhost:8443`, accept the certificate, and click Load from Server. The page streams the 2.5 GB model shards into memory, after which you can record from the microphone or upload a `.wav` file.
- Optionally, point a browser at the hosted HuggingFace Space (https://huggingface.co/spaces/TrevorJS/voxtral-mini-realtime) to skip the manual setup.
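Under the hood, wasm-pack generates JavaScript glue for whatever functions the crate exports through wasm-bindgen. The project's actual exported interface is not shown here, so the following is a purely hypothetical export, included only to illustrate how such a binding surfaces to the browser.

```rust
// Hypothetical wasm-bindgen export, NOT the project's real interface.
use wasm_bindgen::prelude::*;

#[wasm_bindgen]
pub fn transcribe_pcm(samples: &[f32], sample_rate: u32) -> String {
    // Placeholder: resample to 16 kHz mono, run the model, return the text.
    format!("received {} samples at {} Hz", samples.len(), sample_rate)
}
```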
Common Gotchas & Fixes
- Left‑padding issue: the original `mistral-common` preprocessing pads with 32 silence tokens, which is not enough for the Q4 decoder. The project increases the padding to 76 tokens (38 decoder positions); the patch lives in `src/audio/pad.rs`.
- Memory budgets: browsers cap a single ArrayBuffer at 2 GB, so the GGUF file has to be sharded into 512 MB chunks. A simple `split -b 512m` generates them (see the standard‑library sketch after this list).
- Work‑group limit: WebGPU limits a dispatch to 256 workgroups, so the repo patches `cubecl-wgpu` to cap the reduce kernel size.
- GPU support: your system must expose Vulkan, Metal or a WebGPU adapter. Without one, CI skips the GPU‑heavy tests.
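As promised in the memory‑budgets item, here is a small standard‑library‑only equivalent of `split -b 512m` for producing the 512 MB shards. The input path and shard naming are illustrative; match them to whatever your serving script expects.

```rust
// Shard a GGUF file into 512 MB chunks (equivalent to `split -b 512m`).
use std::fs::File;
use std::io::{Read, Write};

fn main() -> std::io::Result<()> {
    const CHUNK: usize = 512 * 1024 * 1024; // 512 MB per shard
    let mut input = File::open("models/voxtral-q4.gguf")?;
    let mut buf = vec![0u8; CHUNK];
    let mut idx = 0;
    loop {
        // Fill the buffer as far as possible (read() may return short counts).
        let mut filled = 0;
        while filled < CHUNK {
            let n = input.read(&mut buf[filled..])?;
            if n == 0 {
                break;
            }
            filled += n;
        }
        if filled == 0 {
            break; // end of file, nothing left to write
        }
        let name = format!("models/voxtral-q4.gguf.{idx:03}");
        File::create(&name)?.write_all(&buf[..filled])?;
        println!("wrote {name} ({filled} bytes)");
        idx += 1;
        if filled < CHUNK {
            break; // last (partial) shard written
        }
    }
    Ok(())
}
```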
Extending the Project
The repo is designed for experimentation:
| Area | How to start |
|---|---|
| Add a new quantisation scheme | Fork the `gguf` module, implement a new `Dequant` trait |
| Replace the encoder | Modify `src/models/…/encoder.rs`, re‑benchmark |
| Deploy as a server | Wrap the CLI logic in Actix‑web or Axum, add HTTP endpoints (see the sketch below) |
| Integrate more tokenisers | Replace the C `tekken` extension with a pure‑Rust tokenizer |
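For the "deploy as a server" row, a minimal Axum skeleton might look like the sketch below. It assumes Axum 0.7, Tokio and serde_json; the transcription call itself is a placeholder where the CLI's pipeline would be wired in.

```rust
// Rough Axum server skeleton (a sketch, not the project's code).
// Cargo.toml: axum = "0.7", serde_json = "1",
//             tokio = { version = "1", features = ["full"] }
use axum::{body::Bytes, routing::post, Json, Router};
use serde_json::json;

async fn transcribe(audio: Bytes) -> Json<serde_json::Value> {
    // Placeholder: decode the WAV bytes and run the Voxtral pipeline here.
    let n = audio.len();
    Json(json!({ "bytes_received": n, "text": "<transcript goes here>" }))
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/transcribe", post(transcribe));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```

A streaming deployment would more likely use a WebSocket route and feed audio chunks into the decoder as they arrive, but the request/response version above is the simplest place to start.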
Conclusion
Voxtral Mini 4B Realtime in Rust demonstrates that low‑latency, high‑accuracy speech recognition can be delivered entirely client‑side, even in a browser. With a modest 2.5 GB quantised model, you get a real‑time factor of 0.4, outpacing most commercial APIs and staying 100 % open source. Whether you’re building a video‑chat assistant, a hands‑free transcription tool, or an educational demo, this project gives you a solid, well‑documented foundation to start from.
Happy hacking, and may your tokens stay fast and your audio clean!