Voicebox: Open‑Source Voice Studio Powered by Qwen3‑TTS

February 04, 2026

Category: Practical Open Source Projects

Tags: Open Source, Local AI, voice synthesis, Qwen3‑TTS, audio editing

Voice synthesis is no longer the domain of a handful of cloud‑based services. With Voicebox, a free, local‑first application built on Qwen3‑TTS, developers and creators can own their voice data, edit multi‑track audio as they would in a digital audio workstation, and generate natural speech quickly on Apple Silicon.

What is Voicebox?

Local‑first: All inference, cloning, and editing run on your hardware, with no internet connection or subscription required.
Open source: MIT licensed and fully community‑driven.
Multi‑track editing: Think DAW meets text‑to‑speech.
Built with a modern stack: Tauri (Rust) for the desktop shell, React/TypeScript for the UI, FastAPI for the API, and MLX/Metal for GPU acceleration.
Powered by Qwen3‑TTS: Alibaba’s model can clone a voice from just a few seconds of audio and produce high‑fidelity, expressive speech.

Core Features at a Glance

Feature	Description
Voice Cloning	Upload a short audio clip or record in‑app; the model outputs a reusable voice profile in a few seconds.
Timeline Editor	Arrange multiple voice tracks on a timeline, trim or split clips, and add markers—all with zero‑latency preview.
Multi‑Language Support	Currently English and Chinese, with more languages coming soon as Qwen3‑TTS expands.
Fast Inference on Apple Silicon	MLX backend with native Metal acceleration gives 4–5× speed gains on M1/M2 devices.
REST API	Exposes endpoints such as `/generate` and `/profiles`, with automatically generated OpenAPI docs.
Batch Generation	Create dozens of audio files in one request—perfect for long‑form content.
Transcription	Integrated Whisper model for on‑device transcription of recorded sessions.
Export Options	Export audio in WAV, MP3, or OGG, and export project files in JSON for backup or sharing.
Privacy & Security	No data leaves your machine unless you explicitly export a profile or project.
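
To make the REST API row more concrete, here is a minimal Python sketch of a client call to `/generate`. It assumes the backend listens on `http://localhost:8000` (the port the article cites for the API docs). The JSON field names `text` and `profile_id` and both helper functions are hypothetical, so check the generated OpenAPI docs for the real request schema before relying on them.

```python
import json
import urllib.request

# Assumption: the FastAPI backend listens here (the article cites port 8000 for its docs).
BASE_URL = "http://localhost:8000"


def build_generate_request(
    text: str, profile_id: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build a POST request for the /generate endpoint.

    The payload fields ("text", "profile_id") are hypothetical placeholders;
    the real schema is whatever the backend's OpenAPI docs describe.
    """
    payload = json.dumps({"text": text, "profile_id": profile_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(text: str, profile_id: str) -> bytes:
    """Send the request and return the raw response body.

    Requires a running backend; the response format is not specified here.
    """
    request = build_generate_request(text, profile_id)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read()
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test without a running backend.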

Architecture Snapshot

graph TD
  A[React‑TS Frontend] -->|REST| B[FastAPI Backend]
  B -->|PyTorch/MLX| C[Qwen3‑TTS Engine]
  B -->|Whisper| D[Transcription]
  B -->|SQLite| E[Database]
  subgraph Desktop
    F["Tauri (Rust)"] --> A
  end
  subgraph Web
    G[React‑TS app] --> A
  end

Frontend: React with TypeScript, Tailwind CSS, Zustand & React Query for state and data fetching.
Backend: FastAPI providing a typed API, automatic docs, and async performance.
Models: Qwen3‑TTS and Whisper are available with both PyTorch and MLX backends, giving platform flexibility.
Persistence: SQLite stores voice profiles, project metadata, and generation history.
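
As an illustration of the persistence layer, the sketch below stores voice profiles and generation history in SQLite using Python's standard `sqlite3` module. The table names, column names, and helper functions are all hypothetical; Voicebox's actual schema may look quite different.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS profiles (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS generations (
    id INTEGER PRIMARY KEY,
    profile_id INTEGER NOT NULL REFERENCES profiles(id),
    text TEXT NOT NULL,
    audio_path TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the profile_id reference
    conn.executescript(SCHEMA)
    return conn


def add_profile(conn: sqlite3.Connection, name: str) -> int:
    """Insert a voice profile and return its row id."""
    cursor = conn.execute("INSERT INTO profiles (name) VALUES (?)", (name,))
    conn.commit()
    return cursor.lastrowid


def record_generation(
    conn: sqlite3.Connection, profile_id: int, text: str, audio_path: str
) -> int:
    """Log one synthesized clip against a profile and return its row id."""
    cursor = conn.execute(
        "INSERT INTO generations (profile_id, text, audio_path) VALUES (?, ?, ?)",
        (profile_id, text, audio_path),
    )
    conn.commit()
    return cursor.lastrowid


def history_for(conn: sqlite3.Connection, profile_id: int) -> list[tuple[str, str]]:
    """Return (text, audio_path) pairs for a profile, oldest first."""
    rows = conn.execute(
        "SELECT text, audio_path FROM generations WHERE profile_id = ? ORDER BY id",
        (profile_id,),
    )
    return rows.fetchall()
```

A single-file database like this fits the local-first design: backing up a studio amounts to copying one file alongside the exported audio.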

How to Get Started

1. Install

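# On macOS (Apple Silicon), from the repository root
brew install qt@5  # for Tauri dependencies
bun install        # install JavaScript dependencies
(cd backend && pip install -r requirements.txt)  # subshell keeps the working directory at the root
bun run dev   # Launch the desktop app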

For Windows or Intel‑based macOS, download the MSI or ZIP from the releases page.

2. Clone a Voice

Open the app and click Create Profile.
Record or upload 5–10 seconds of clear speech.
The model will generate a profile called My Voice.
Export the profile if you want to share it.
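
Before uploading a reference clip, it can help to confirm that it falls in the recommended 5–10 second range. The following sketch does this for WAV files with Python's standard `wave` module. The helper names are hypothetical, and this is a convenience pre-check, not something Voicebox is known to perform itself.

```python
import wave

# The 5-10 second range comes from the step above; treat it as a guideline.
MIN_SECONDS = 5.0
MAX_SECONDS = 10.0


def clip_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_suitable_reference(path: str) -> bool:
    """Check that a WAV clip falls inside the recommended duration range."""
    return MIN_SECONDS <= clip_duration_seconds(path) <= MAX_SECONDS
```

This only checks length. Clarity still matters, so a quiet room and a steady speaking pace remain the more important factors.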

3. Build a Story

Drag the new profile onto the timeline.
Type your script or paste from a document.
Use Batch Generation to synthesize the whole script.
Arrange clips, trim, and mix using the timeline tools.
Export the final mix.
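
When scripting the same workflow against the API, a long script usually needs to be split into clip-sized pieces before batch generation. The sketch below splits on sentence boundaries and packs sentences into chunks up to a character limit. The payload shape (`profile_id`, `items`, `index`, `text`) and the 200-character default are assumptions for illustration, not Voicebox's documented batch format.

```python
import re


def split_script(script: str, max_chars: int = 200) -> list[str]:
    """Split a script into chunks of whole sentences.

    Sentences are packed together until adding another would exceed
    max_chars. A single sentence longer than max_chars is kept whole.
    """
    sentences = [
        s.strip() for s in re.split(r"(?<=[.!?])\s+", script.strip()) if s.strip()
    ]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def build_batch_payload(script: str, profile_id: str, max_chars: int = 200) -> dict:
    """Build a hypothetical batch request body, one item per chunk."""
    return {
        "profile_id": profile_id,
        "items": [
            {"index": i, "text": chunk}
            for i, chunk in enumerate(split_script(script, max_chars))
        ],
    }
```

Keeping an explicit `index` on each item makes it straightforward to place the returned clips back on the timeline in script order.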

Use Cases Where Voicebox Shines

Use Case	Why Voicebox Works	Example Application
Podcast Production	Full timeline editing, auto‑mixing, local privacy	Record host with voice cloning, auto‑mix guests
Game Dialogue	Batch generate dialogue lines for many characters	Give NPCs unique voices, with instant re‑generation
Accessibility Tools	Offline speech synthesis for visually impaired users	On‑device screen reader or audiobooks
Voice Assistant	Integrate local API with low latency	Build a custom assistant that never leaks data
Content Automation	Auto‑generate narrations for videos	Produce voice‑over for explainer videos at scale

Extending Voicebox

Plugin System: Add new voice models (e.g., XTTS, Bark) or audio effects as separate Tauri packages.
Mobile Companion: Future plans include a phone app to control a desktop Voicebox instance over a LAN.
Real‑Time Synthesis: A coming feature will stream generated audio as it is produced, enabling live performances.

Community & Contribution

Voicebox is built to be welcoming and open:

Contributing: Pull requests are encouraged; see CONTRIBUTING.md for setup.
Security: Follow SECURITY.md to report issues responsibly.
Releases: New stable builds are published on GitHub Releases.
Docs: Comprehensive API docs are available at http://localhost:8000/docs while the backend is running.
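
FastAPI applications serve their machine-readable schema at `/openapi.json` by default, alongside the interactive `/docs` page. Assuming Voicebox keeps that default, a short Python sketch like the one below can list the available endpoints from a script. The function names are hypothetical.

```python
import json
import urllib.request

# HTTP methods that can appear as operation keys in an OpenAPI path item.
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def list_endpoints(schema: dict) -> list[tuple[str, str]]:
    """Return sorted (METHOD, path) pairs from an OpenAPI schema dict."""
    endpoints = []
    for path, path_item in schema.get("paths", {}).items():
        for key in path_item:
            # Skip non-operation keys such as "parameters" or "summary".
            if key.lower() in HTTP_METHODS:
                endpoints.append((key.upper(), path))
    return sorted(endpoints)


def fetch_schema(base_url: str = "http://localhost:8000") -> dict:
    """Download the schema; assumes FastAPI's default /openapi.json route."""
    with urllib.request.urlopen(f"{base_url}/openapi.json", timeout=10) as response:
        return json.load(response)
```

With the backend running, `list_endpoints(fetch_schema())` gives a quick overview of what the API exposes before writing any client code.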

Bottom Line

Voicebox turns a laptop into a professional, privacy‑preserving voice studio. Whether you’re prototyping a speech‑based game, drafting a podcast, or building a personal accessibility tool, you no longer have to rely on costly cloud APIs. Jump in today, fork the GitHub repo, and start building voice experiences that stay under your control.
