Build Real‑Time Speech Recognition in Rust with Voxtral Mini

Introduction

In 2026 the AI landscape is still dominated by large, opaque cloud models, but a handful of community‑driven projects are bridging the gap with fully open‑source, real‑time inference that runs on a laptop or even in a browser tab. The Voxtral Mini 4B Realtime project is the most recent example: it implements Mistral’s Voxtral Mini model entirely in Rust on the Burn ML framework, and exposes both a native command‑line interface (CLI) and a WebAssembly (WASM) package that runs on WebGPU.

This article walks you through the key concepts, architecture, benchmarks, and steps to run the model locally or in the browser.

The Engine: Voxtral Mini 4B Realtime

  • Model type: Voice‑to‑text, causal encoder / decoder architecture.
  • Weights: 4 B parameters (~9 GB SafeTensors) or a 2.5 GB Q4‑GGUF quantised shard.
  • Implementation: Pure Rust + Burn ML. The Burn crate provides JIT‑style tensor operations on top of CubeCL, which can target Vulkan, Metal or WebGPU.
  • Features:
      • GPU acceleration with wgpu (default)
      • Native tokenizer (Tekken) or WASM‑compatible
      • CLI with clap and indicatif progress bars
      • HuggingFace Hub integration for downloading weights (see the sketch below)
      • WASM compilation with wasm‑pack, including a live demo
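
To give a feel for what Hub integration looks like on the Rust side, here is a minimal sketch using the hf-hub crate. The repository name matches the weights used later in this article, but the exact crate and call sites this project uses are an assumption:

    // Minimal sketch: fetch one file from the HuggingFace Hub with the
    // hf-hub crate (add `hf-hub = "0.3"` to Cargo.toml). How this project
    // actually wires up its `hub` feature may differ.
    use hf_hub::api::sync::Api;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let api = Api::new()?;
        let repo = api.model("mistralai/Voxtral-Mini-4B-Realtime-2602".to_string());
        // Downloads into the local HuggingFace cache and returns the path.
        let tokenizer_path = repo.get("tekken.json")?;
        println!("tokenizer at {}", tokenizer_path.display());
        Ok(())
    }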

Architecture in a Nutshell

Audio 16kHz mono → Mel‑spec (B,128,T)
    ↓
Causal encoder (32 layers, 1280 dim, 750‑token window)
    ↓
Conv‑downsample → Reshape [B,T/16,5120]
    ↓
Adapter (3072 dim)
    ↓
Autoregressive decoder (26 layers, 3072 dim, GQA)
    ↓
Token IDs → Text

The encoder produces 64‑token chunks of latent representation, while the decoder keeps a KV‑cache and emits a token per time step. A custom WGSL compute shader fuses dequantisation + matrix multiplication, yielding 4× faster decoding on Q4‑GGUF compared to a naïve f32 pass.
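
To make the decoder side concrete, here is a minimal sketch of an autoregressive decode loop with a KV cache. All of the type and method names (Decoder, KvCache, forward_step) are placeholders for illustration, not the project's actual API:

    // Illustrative greedy decode loop with a KV cache; placeholder types,
    // not the project's real implementation.
    struct KvCache; // would hold per-layer key/value tensors, grown each step

    struct Decoder;

    impl Decoder {
        /// One decoder step: consume a single token, reuse the cache,
        /// and return logits over the vocabulary.
        fn forward_step(&self, _token: u32, _cache: &mut KvCache) -> Vec<f32> {
            // A real step would run 26 transformer layers (3072 dim, GQA)
            // against the cached keys/values.
            vec![0.0; 32_000]
        }
    }

    fn argmax(logits: &[f32]) -> u32 {
        logits
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(i, _)| i as u32)
            .unwrap_or(0)
    }

    fn decode(decoder: &Decoder, mut token: u32, eos: u32, max_steps: usize) -> Vec<u32> {
        let mut cache = KvCache;
        let mut output = Vec::new();
        for _ in 0..max_steps {
            let logits = decoder.forward_step(token, &mut cache);
            token = argmax(&logits); // greedy; sampling would also work
            if token == eos {
                break;
            }
            output.push(token);
        }
        output
    }

    fn main() {
        let tokens = decode(&Decoder, 1, 2, 8);
        println!("decoded {} tokens", tokens.len());
    }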

Performance Snapshot

The repo ships an in‑tree benchmark harness, reporting the following on a 16‑core NVIDIA DGX‑Spark:

Path                   | Encode (ms) | Decode (ms) | RTF   | Tokens/s | Memory
Q4 GGUF native         | 1 021       | 5 578       | 0.416 | 19.4     | 703 MB
F32 native             | 887         | 23 689      | 1.543 | 4.6      | 9.2 GB
Q4 GGUF WASM (browser) |             |             | ~0.5  | ~14      |

An RTF of 0.416 means transcription finishes in well under half the audio duration; a 60‑second clip, for example, is done in roughly 25 seconds, fast enough for live chat or call centers. The Q4 model also comes in under 3 GB, which makes shipping it client‑side feasible.
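
The metric itself is just processing time divided by audio duration. A tiny check against the Q4 native figures from the table above (the ~15.9 s clip length is back‑calculated from RTF 0.416, not a reported value):

    // Real-time factor = processing time / audio duration.
    fn rtf(encode_ms: f64, decode_ms: f64, audio_secs: f64) -> f64 {
        (encode_ms + decode_ms) / 1000.0 / audio_secs
    }

    fn main() {
        // Q4 GGUF native numbers from the benchmark table; the clip length
        // (~15.9 s) is inferred from RTF = 0.416, not stated explicitly.
        println!("RTF ≈ {:.3}", rtf(1_021.0, 5_578.0, 15.9)); // ≈ 0.415
    }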

Quick Start – Native CLI

  1. Install Rust, Cargo and optional dependencies:
    curl https://sh.rustup.rs -sSf | sh  # if Rust missing
    sudo apt-get install libgl1-mesa-dev libvulkan-dev  # for wgpu
    
  2. Download the weights (≈9 GB) via the HuggingFace Hub:
    uv run --with huggingface_hub \
       hf download mistralai/Voxtral-Mini-4B-Realtime-2602 \
       --local-dir models/voxtral
    
  3. Run a transcription (a quick check of the expected input format is sketched after these steps):
    cargo run --release --features "wgpu,cli,hub" --bin voxtral-transcribe -- \
      --audio audio.wav --model models/voxtral
    
  4. For the Q4 quantised path (≈2.5 GB):
    cargo run --release --features "wgpu,cli,hub" --bin voxtral-transcribe -- \
      --audio audio.wav --gguf models/voxtral-q4.gguf --tokenizer models/voxtral/tekken.json
    
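
The pipeline operates on 16 kHz mono audio (see the architecture diagram above). If you are unsure whether a WAV file already matches, a pre‑flight check with the hound crate looks like the sketch below; whether the CLI resamples on its own is not covered here:

    // Report a WAV file's format with the hound crate
    // (add `hound = "3"` to Cargo.toml). This is only a pre-flight check.
    fn main() -> Result<(), hound::Error> {
        let spec = hound::WavReader::open("audio.wav")?.spec();
        println!(
            "{} Hz, {} channel(s), {} bit",
            spec.sample_rate, spec.channels, spec.bits_per_sample
        );
        if spec.sample_rate != 16_000 || spec.channels != 1 {
            eprintln!("expected 16 kHz mono; convert the file first (e.g. with ffmpeg)");
        }
        Ok(())
    }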

Quick Start – Browser Demo (WASM + WebGPU)

  1. Install build tools (rustup target add wasm32-unknown-unknown, npm i -g wasm-pack bun).
  2. Compile the WASM package (a sketch of a typical wasm‑bindgen export follows these steps):
    wasm-pack build --target web --no-default-features --features wasm
    
  3. Generate a temporary HTTPS cert (WebGPU requires a secure context):
    openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:prime256v1 \
      -keyout /tmp/voxtral-key.pem -out /tmp/voxtral-cert.pem \
      -days 7 -nodes -subj "/CN=localhost"
    
  4. Serve locally:
    bun serve.mjs
    
  5. Browse to https://localhost:8443, accept the certificate, and click Load from Server. It will stream the 2.5 GB model shards into the page, after which you can record via the microphone or upload a .wav file.
  6. Optionally, point a browser at the hosted HuggingFace Space (https://huggingface.co/spaces/TrevorJS/voxtral-mini-realtime) to skip the manual setup.
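
If you want to experiment with the WASM boundary yourself, the usual pattern is a wasm‑bindgen export that the generated JavaScript glue can call. The function below is purely illustrative and is not this package's actual interface:

    // Illustrative wasm-bindgen export; not the voxtral package's real API.
    // `wasm-pack build` generates the JS glue that makes this callable.
    use wasm_bindgen::prelude::*;

    #[wasm_bindgen]
    pub fn transcribe_pcm(samples: &[f32]) -> String {
        // A real implementation would feed 16 kHz mono samples through the
        // encoder/decoder on WebGPU; here we only report what arrived.
        format!("received {} samples", samples.len())
    }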

Common Gotchas & Fixes

  • Left‑padding issue: The upstream mistral-common pipeline left‑pads the audio with 32 silence tokens, which is not enough for the Q4 decoder. The project increases the padding to 76 tokens (38 decoder positions); the patch lives in src/audio/pad.rs.
  • Memory budgets: In browsers, the 2 GB ArrayBuffer limit means the GGUF file has to be sharded into 512 MB chunks. A simple split -b 512m invocation generates them (a Rust equivalent is sketched after this list).
  • Work‑group limit: WebGPU caps compute workgroup sizes (256 invocations per workgroup by default). The repo patches cubecl-wgpu to keep the reduce kernel within that limit.
  • GPU support: Your system must expose Vulkan, Metal or a WebGPU adapter. Without it, CI will skip GPU‑heavy tests.
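
For the sharding gotcha above, a small Rust program does the same job as split -b 512m. The .partNN naming here is only an example; match whatever layout the browser demo actually loads:

    // Split a GGUF file into 512 MB shards (equivalent to `split -b 512m`).
    use std::fs::File;
    use std::io::{Read, Write};

    fn main() -> std::io::Result<()> {
        const SHARD: usize = 512 * 1024 * 1024;
        let mut input = File::open("models/voxtral-q4.gguf")?;
        let mut buf = vec![0u8; SHARD];
        let mut idx = 0;
        loop {
            // Fill the buffer as far as possible for this shard.
            let mut filled = 0;
            while filled < SHARD {
                let n = input.read(&mut buf[filled..])?;
                if n == 0 {
                    break;
                }
                filled += n;
            }
            if filled == 0 {
                break; // nothing left to write
            }
            let mut out = File::create(format!("models/voxtral-q4.gguf.part{idx:02}"))?;
            out.write_all(&buf[..filled])?;
            idx += 1;
            if filled < SHARD {
                break; // last, short shard
            }
        }
        Ok(())
    }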

Extending the Project

The repo is designed for experimentation:

Area                          | How to start
Add a new quantisation scheme | Fork the gguf module, implement a new Dequant trait
Replace the encoder           | Modify src/models/…/encoder.rs, re‑benchmark
Deploy as a server            | Wrap the CLI logic in Actix‑web or Axum, add HTTP endpoints (see the sketch below)
Integrate more tokenisers     | Replace the C tekken extension with a pure‑Rust tokenizer
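
As a concrete example of the server direction, a minimal Axum wrapper could look like the sketch below (assuming axum 0.7 and tokio). The transcribe_wav handler is a stand‑in for whatever the CLI currently does internally; here it only echoes the upload size:

    // Minimal Axum sketch for exposing transcription over HTTP.
    // `transcribe_wav` is a placeholder for the project's actual pipeline.
    use axum::{body::Bytes, routing::post, Router};

    async fn transcribe_wav(wav_bytes: Bytes) -> String {
        // Placeholder: decode the WAV, run the model, return the text.
        format!("got {} bytes of audio", wav_bytes.len())
    }

    #[tokio::main]
    async fn main() {
        let app = Router::new().route("/transcribe", post(transcribe_wav));
        let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
        axum::serve(listener, app).await.unwrap();
    }

You could then POST a WAV file to http://localhost:3000/transcribe, for example with curl --data-binary @audio.wav.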

Conclusion

Voxtral Mini 4B Realtime in Rust demonstrates that low‑latency, high‑accuracy speech recognition can be delivered entirely client‑side, even in a browser. With a modest 2.5 GB quantised model, you get a real‑time factor of 0.4, outpacing most commercial APIs and staying 100 % open source. Whether you’re building a video‑chat assistant, a hands‑free transcription tool, or an educational demo, this project gives you a solid, well‑documented foundation to start from.

Happy hacking, and may your tokens stay fast and your audio clean!
