Build Real‑Time Speech Recognition in Rust with Voxtral Mini

Introduction

In 2026 the AI landscape is still dominated by large, opaque cloud models, but a handful of community‑driven projects are bridging the gap with fully open‑source, real‑time inference that runs on a laptop or even in a browser tab. The Voxtral Mini 4B Realtime project is the most recent example: it implements Mistral’s Voxtral Mini model entirely in Rust on the Burn ML framework, and exposes both a native command‑line interface (CLI) and a WebAssembly (WASM) package that runs on WebGPU.

This article walks you through the key concepts, architecture, benchmarks, and steps to run the model locally or in the browser.

The Engine: Voxtral Mini 4B Realtime

  • Model type: Voice‑to‑text, causal encoder / decoder architecture.
  • Weights: 4 B parameters (~9 GB SafeTensors) or a 2.5 GB Q4‑GGUF quantised shard.
  • Implementation: Pure Rust + Burn ML. The Burn crate provides JIT‑style tensor operations on top of CubeCL, which can target Vulkan, Metal or WebGPU.
  • Features:
      • GPU acceleration with wgpu (default)
      • Native tokenizer (Tekken) or WASM‑compatible
      • CLI with clap and indicatif progress bars
      • HuggingFace Hub integration for downloading weights (see the sketch below)
      • WASM compilation with wasm‑pack, including a live demo
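
To give a feel for what Hub integration looks like on the Rust side, here is a minimal sketch using the hf-hub crate. The repository name matches the weights used later in this article, but the exact crate and call sites this project uses are an assumption:

    // Minimal sketch: fetch one file from the HuggingFace Hub with the
    // hf-hub crate (add `hf-hub = "0.3"` to Cargo.toml). How this project
    // actually wires up its `hub` feature may differ.
    use hf_hub::api::sync::Api;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let api = Api::new()?;
        let repo = api.model("mistralai/Voxtral-Mini-4B-Realtime-2602".to_string());
        // Downloads into the local HuggingFace cache and returns the path.
        let tokenizer_path = repo.get("tekken.json")?;
        println!("tokenizer at {}", tokenizer_path.display());
        Ok(())
    }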

Architecture in a Nutshell

Audio 16kHz mono → Mel‑spec (B,128,T)
    ↓
Causal encoder (32 layers, 1280 dim, 750‑token window)
    ↓
Conv‑downsample → Reshape [B,T/16,5120]
    ↓
Adapter (3072 dim)
    ↓
Autoregressive decoder (26 layers, 3072 dim, GQA)
    ↓
Token IDs → Text

The encoder produces 64‑token chunks of latent representation, while the decoder keeps a KV‑cache and emits a token per time step. A custom WGSL compute shader fuses dequantisation + matrix multiplication, yielding 4× faster decoding on Q4‑GGUF compared to a naïve f32 pass.
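
To make the decoder side concrete, here is a minimal sketch of an autoregressive decode loop with a KV cache. All of the type and method names (Decoder, KvCache, forward_step) are placeholders for illustration, not the project's actual API:

    // Illustrative greedy decode loop with a KV cache; placeholder types,
    // not the project's real implementation.
    struct KvCache; // would hold per-layer key/value tensors, grown each step

    struct Decoder;

    impl Decoder {
        /// One decoder step: consume a single token, reuse the cache,
        /// and return logits over the vocabulary.
        fn forward_step(&self, _token: u32, _cache: &mut KvCache) -> Vec<f32> {
            // A real step would run 26 transformer layers (3072 dim, GQA)
            // against the cached keys/values.
            vec![0.0; 32_000]
        }
    }

    fn argmax(logits: &[f32]) -> u32 {
        logits
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(i, _)| i as u32)
            .unwrap_or(0)
    }

    fn decode(decoder: &Decoder, mut token: u32, eos: u32, max_steps: usize) -> Vec<u32> {
        let mut cache = KvCache;
        let mut output = Vec::new();
        for _ in 0..max_steps {
            let logits = decoder.forward_step(token, &mut cache);
            token = argmax(&logits); // greedy; sampling would also work
            if token == eos {
                break;
            }
            output.push(token);
        }
        output
    }

    fn main() {
        let tokens = decode(&Decoder, 1, 2, 8);
        println!("decoded {} tokens", tokens.len());
    }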

Performance Snapshot

The repo ships an in‑tree benchmark harness, reporting the following on a 16‑core NVIDIA DGX‑Spark:

Path                   | Encode (ms) | Decode (ms) | RTF   | Tokens/s | Memory
Q4 GGUF native         | 1 021       | 5 578       | 0.416 | 19.4     | 703 MB
F32 native             | 887         | 23 689      | 1.543 | 4.6      | 9.2 GB
Q4 GGUF WASM (browser) |             |             | ~0.5  | ~14      |

An RTF of 0.416 means transcription finishes in well under half the audio duration; a 60‑second clip, for example, is done in roughly 25 seconds, fast enough for live chat or call centers. The Q4 model also comes in under 3 GB, which makes shipping it client‑side feasible.
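
The metric itself is just processing time divided by audio duration. A tiny check against the Q4 native figures from the table above (the ~15.9 s clip length is back‑calculated from RTF 0.416, not a reported value):

    // Real-time factor = processing time / audio duration.
    fn rtf(encode_ms: f64, decode_ms: f64, audio_secs: f64) -> f64 {
        (encode_ms + decode_ms) / 1000.0 / audio_secs
    }

    fn main() {
        // Q4 GGUF native numbers from the benchmark table; the clip length
        // (~15.9 s) is inferred from RTF = 0.416, not stated explicitly.
        println!("RTF ≈ {:.3}", rtf(1_021.0, 5_578.0, 15.9)); // ≈ 0.415
    }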

Quick Start – Native CLI

  1. Install Rust, Cargo and optional dependencies:
    curl https://sh.rustup.rs -sSf | sh  # if Rust missing
    sudo apt-get install libgl1-mesa-dev libvulkan-dev  # for wgpu
    
  2. Download the weights (≈9 GB) via the HuggingFace Hub:
    uv run --with huggingface_hub \
       hf download mistralai/Voxtral-Mini-4B-Realtime-2602 \
       --local-dir models/voxtral
    
  3. Run a transcription (a quick check of the expected input format is sketched after these steps):
    cargo run --release --features "wgpu,cli,hub" --bin voxtral-transcribe -- \
      --audio audio.wav --model models/voxtral
    
  4. For the Q4 quantised path (≈2.5 GB):
    cargo run --release --features "wgpu,cli,hub" --bin voxtral-transcribe -- \
      --audio audio.wav --gguf models/voxtral-q4.gguf --tokenizer models/voxtral/tekken.json
    
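
The pipeline operates on 16 kHz mono audio (see the architecture diagram above). If you are unsure whether a WAV file already matches, a pre‑flight check with the hound crate looks like the sketch below; whether the CLI resamples on its own is not covered here:

    // Report a WAV file's format with the hound crate
    // (add `hound = "3"` to Cargo.toml). This is only a pre-flight check.
    fn main() -> Result<(), hound::Error> {
        let spec = hound::WavReader::open("audio.wav")?.spec();
        println!(
            "{} Hz, {} channel(s), {} bit",
            spec.sample_rate, spec.channels, spec.bits_per_sample
        );
        if spec.sample_rate != 16_000 || spec.channels != 1 {
            eprintln!("expected 16 kHz mono; convert the file first (e.g. with ffmpeg)");
        }
        Ok(())
    }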

Quick Start – Browser Demo (WASM + WebGPU)

  1. Install build tools (rustup target add wasm32-unknown-unknown, npm i -g wasm-pack bun).
  2. Compile the WASM package (a sketch of a typical wasm‑bindgen export follows these steps):
    wasm-pack build --target web --no-default-features --features wasm
    
  3. Generate a temporary HTTPS cert (WebGPU requires a secure context):
    openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:prime256v1 \
      -keyout /tmp/voxtral-key.pem -out /tmp/voxtral-cert.pem \
      -days 7 -nodes -subj "/CN=localhost"
    
  4. Serve locally:
    bun serve.mjs
    
  5. Browse to https://localhost:8443, accept the certificate, and click Load from Server. It will stream the 2.5 GB model shards into the page, after which you can record via the microphone or upload a .wav file.
  6. Optionally, point a browser at the hosted HuggingFace Space (https://huggingface.co/spaces/TrevorJS/voxtral-mini-realtime) to skip the manual setup.
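
If you want to experiment with the WASM boundary yourself, the usual pattern is a wasm‑bindgen export that the generated JavaScript glue can call. The function below is purely illustrative and is not this package's actual interface:

    // Illustrative wasm-bindgen export; not the voxtral package's real API.
    // `wasm-pack build` generates the JS glue that makes this callable.
    use wasm_bindgen::prelude::*;

    #[wasm_bindgen]
    pub fn transcribe_pcm(samples: &[f32]) -> String {
        // A real implementation would feed 16 kHz mono samples through the
        // encoder/decoder on WebGPU; here we only report what arrived.
        format!("received {} samples", samples.len())
    }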

Common Gotchas & Fixes

  • Left‑padding issue: The upstream mistral-common pipeline left‑pads the audio with 32 silence tokens, which is not enough for the Q4 decoder. The project increases the padding to 76 tokens (38 decoder positions); the patch lives in src/audio/pad.rs.
  • Memory budgets: In browsers, the 2 GB ArrayBuffer limit means the GGUF file has to be sharded into 512 MB chunks. A simple split -b 512m invocation generates them (a Rust equivalent is sketched after this list).
  • Work‑group limit: WebGPU caps compute workgroup sizes (256 invocations per workgroup by default). The repo patches cubecl-wgpu to keep the reduce kernel within that limit.
  • GPU support: Your system must expose Vulkan, Metal or a WebGPU adapter. Without it, CI will skip GPU‑heavy tests.
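
For the sharding gotcha above, a small Rust program does the same job as split -b 512m. The .partNN naming here is only an example; match whatever layout the browser demo actually loads:

    // Split a GGUF file into 512 MB shards (equivalent to `split -b 512m`).
    use std::fs::File;
    use std::io::{Read, Write};

    fn main() -> std::io::Result<()> {
        const SHARD: usize = 512 * 1024 * 1024;
        let mut input = File::open("models/voxtral-q4.gguf")?;
        let mut buf = vec![0u8; SHARD];
        let mut idx = 0;
        loop {
            // Fill the buffer as far as possible for this shard.
            let mut filled = 0;
            while filled < SHARD {
                let n = input.read(&mut buf[filled..])?;
                if n == 0 {
                    break;
                }
                filled += n;
            }
            if filled == 0 {
                break; // nothing left to write
            }
            let mut out = File::create(format!("models/voxtral-q4.gguf.part{idx:02}"))?;
            out.write_all(&buf[..filled])?;
            idx += 1;
            if filled < SHARD {
                break; // last, short shard
            }
        }
        Ok(())
    }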

Extending the Project

The repo is designed for experimentation:

Area                          | How to start
Add a new quantisation scheme | Fork the gguf module, implement a new Dequant trait
Replace the encoder           | Modify src/models/…/encoder.rs, re‑benchmark
Deploy as a server            | Wrap the CLI logic in Actix‑web or Axum, add HTTP endpoints (see the sketch below)
Integrate more tokenisers     | Replace the C tekken extension with a pure‑Rust tokenizer
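
As a concrete example of the server direction, a minimal Axum wrapper could look like the sketch below (assuming axum 0.7 and tokio). The transcribe_wav handler is a stand‑in for whatever the CLI currently does internally; here it only echoes the upload size:

    // Minimal Axum sketch for exposing transcription over HTTP.
    // `transcribe_wav` is a placeholder for the project's actual pipeline.
    use axum::{body::Bytes, routing::post, Router};

    async fn transcribe_wav(wav_bytes: Bytes) -> String {
        // Placeholder: decode the WAV, run the model, return the text.
        format!("got {} bytes of audio", wav_bytes.len())
    }

    #[tokio::main]
    async fn main() {
        let app = Router::new().route("/transcribe", post(transcribe_wav));
        let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
        axum::serve(listener, app).await.unwrap();
    }

You could then POST a WAV file to http://localhost:3000/transcribe, for example with curl --data-binary @audio.wav.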

Conclusion

Voxtral Mini 4B Realtime in Rust demonstrates that low‑latency, high‑accuracy speech recognition can be delivered entirely client‑side, even in a browser. With a modest 2.5 GB quantised model, you get a real‑time factor of 0.4, outpacing most commercial APIs and staying 100 % open source. Whether you’re building a video‑chat assistant, a hands‑free transcription tool, or an educational demo, this project gives you a solid, well‑documented foundation to start from.

Happy hacking, and may your tokens stay fast and your audio clean!
