Voicebox: Open‑Source Voice Studio Powered by Qwen3‑TTS
Voice synthesis is no longer limited to a handful of cloud‑based services. With Voicebox, a free, local‑first application built on Qwen3‑TTS, developers and creators can keep their voice data on their own hardware, edit multi‑track audio as they would in a digital audio workstation, and generate natural‑sounding speech with Metal‑accelerated inference on Apple Silicon.
What is Voicebox?
- Local‑first: All inference, cloning, and editing run on your hardware—no internet needed, no subscription fees.
- Open source: MIT licensed, fully community‑driven.
- Multi‑track editing: Think DAW meets text‑to‑speech.
- Built on a modern stack: Tauri (Rust) for the desktop shell, React/TypeScript for the UI, FastAPI for the API, and MLX/Metal for GPU acceleration.
- Powered by Qwen3‑TTS: Alibaba’s model, which can clone a voice from a few seconds of reference audio and produce high‑fidelity, expressive speech.
Core Features at a Glance
| Feature | Description |
|---|---|
| Voice Cloning | Upload a short audio clip or record in‑app; the model outputs a reusable voice profile in a few seconds. |
| Timeline Editor | Arrange multiple voice tracks on a timeline, trim or split clips, and add markers—all with zero‑latency preview. |
| Multi‑Language Support | Currently English and Chinese, with more languages coming soon as Qwen3‑TTS expands. |
| Fast Inference on Apple Silicon | MLX backend with native Metal acceleration gives 4–5× speed gains on M1/M2 devices. |
| REST API | Exposes endpoints such as /generate and /profiles, with automatically generated OpenAPI docs. |
| Batch Generation | Create dozens of audio files in one request—perfect for long‑form content. |
| Transcription | Integrated Whisper model for on‑device transcription of recorded sessions. |
| Export Options | Export audio in WAV, MP3, or OGG, and export project files in JSON for backup or sharing. |
| Privacy & Security | No data leaves your machine unless you explicitly export a profile or project. |
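
To make the REST API row more concrete, here is a minimal sketch of a client for the /generate endpoint, using only the Python standard library. The port comes from the docs URL mentioned later in this article; the JSON field names (`profile_id`, `text`, `language`) are assumptions for illustration, not Voicebox's documented schema, so check the generated OpenAPI docs for the real request shape.

```python
import json
from urllib import request

# Port taken from the docs URL in this article; adjust if your backend differs.
BASE_URL = "http://localhost:8000"


def build_generate_request(profile_id: str, text: str, language: str = "en") -> request.Request:
    """Build (but do not send) a POST request for the /generate endpoint.

    The payload field names are hypothetical, not Voicebox's documented schema.
    """
    payload = {"profile_id": profile_id, "text": text, "language": language}
    return request.Request(
        url=f"{BASE_URL}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: request.Request, timeout: float = 60.0) -> bytes:
    """Send the request and return the raw response body."""
    with request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


if __name__ == "__main__":
    req = build_generate_request("my-voice", "Hello from Voicebox.")
    print(req.get_method(), req.full_url)
    # With the backend running, uncomment to perform the call:
    # body = send(req)
```

Separating request construction from sending keeps the payload easy to inspect and test without a running backend.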
Architecture Snapshot
graph TD
A[React‑TS Frontend] -->|REST| B[FastAPI Backend]
B -->|PyTorch/MLX| C[Qwen3‑TTS Engine]
B -->|Whisper| D[Transcription]
B -->|SQLite| E[Database]
subgraph Desktop
F["Tauri (Rust)"] --> A
end
subgraph Web
G[React‑TS app] --> A
end
- Frontend: React with TypeScript, Tailwind CSS, Zustand & React Query for state and data fetching.
- Backend: FastAPI providing a typed API, automatic docs, and async performance.
- Models: Qwen3‑TTS and Whisper are available both as PyTorch and MLX backends, giving platform flexibility.
- Persistence: SQLite stores voice profiles, project metadata, and generation history.
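
The persistence layer can be pictured with a small sketch. The schema below is purely illustrative: the table and column names are assumptions, not Voicebox's actual database layout. It only shows how voice profiles and generation history might relate to each other in SQLite.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS profiles (
    id       INTEGER PRIMARY KEY,
    name     TEXT NOT NULL UNIQUE,
    language TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS generations (
    id         INTEGER PRIMARY KEY,
    profile_id INTEGER NOT NULL REFERENCES profiles(id),
    text       TEXT NOT NULL,
    audio_path TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database, enforce foreign keys, and create the tables."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_profile(conn: sqlite3.Connection, name: str, language: str) -> int:
    """Insert a voice profile and return its row id."""
    cur = conn.execute(
        "INSERT INTO profiles (name, language) VALUES (?, ?)", (name, language)
    )
    conn.commit()
    return cur.lastrowid


def log_generation(conn: sqlite3.Connection, profile_id: int, text: str, audio_path: str) -> int:
    """Record one generated clip against an existing profile."""
    cur = conn.execute(
        "INSERT INTO generations (profile_id, text, audio_path) VALUES (?, ?, ?)",
        (profile_id, text, audio_path),
    )
    conn.commit()
    return cur.lastrowid


def history(conn: sqlite3.Connection, profile_id: int) -> list[tuple[str, str]]:
    """Return (text, audio_path) pairs for a profile, oldest first."""
    rows = conn.execute(
        "SELECT text, audio_path FROM generations WHERE profile_id = ? ORDER BY id",
        (profile_id,),
    )
    return rows.fetchall()
```

Keeping the audio on disk and storing only its path in SQLite is one common design choice for this kind of history table; the real project may do it differently.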
How to Get Started
1. Install
# On macOS (Apple Silicon)
brew install qt@5 # for Tauri dependencies
bun install
(cd backend && pip install -r requirements.txt) # subshell, so the next command still runs from the project root
bun run dev # Launch the desktop app
For Windows or Intel‑based macOS, download the MSI or ZIP from the releases page.
2. Clone a Voice
- Open the app and click Create Profile.
- Record or upload 5–10 seconds of clear speech.
- The model will generate a profile called My Voice.
- Export the profile if you want to share it.
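
Before uploading a reference clip, it can help to confirm that it falls in the 5–10 second range suggested above. The following is a small sketch using Python's standard `wave` module; it handles uncompressed WAV files only, and the helper names are made up for this example rather than taken from Voicebox.

```python
import io
import wave

MIN_SECONDS = 5.0   # lower bound suggested in the steps above
MAX_SECONDS = 10.0  # upper bound suggested in the steps above


def wav_duration_seconds(source) -> float:
    """Return the duration of a WAV file, given a path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_suitable_reference(source) -> bool:
    """True if the clip length falls inside the suggested range."""
    return MIN_SECONDS <= wav_duration_seconds(source) <= MAX_SECONDS


def make_silent_wav(seconds: float, rate: int = 16000) -> io.BytesIO:
    """Create an in-memory mono 16-bit WAV of silence, for demonstration only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(is_suitable_reference(make_silent_wav(7.0)))
    print(is_suitable_reference(make_silent_wav(2.0)))
```

A duration check says nothing about clarity or background noise, so it complements the "clear speech" advice rather than replacing it.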
3. Build a Story
- Drag the new profile onto the timeline.
- Type your script or paste from a document.
- Use Batch Generation to synthesize the whole script.
- Arrange clips, trim, and mix using the timeline tools.
- Export the final mix.
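
The script-to-timeline workflow above can be sketched in a few lines. This is an illustration of the idea, not Voicebox's implementation: the `Clip` type, the one-item-per-line splitting rule, and the fixed gap between clips are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    text: str
    start: float     # seconds from the start of the timeline
    duration: float  # seconds


def split_script(script: str) -> list[str]:
    """Turn a pasted script into batch items, one per non-empty line."""
    return [line.strip() for line in script.splitlines() if line.strip()]


def lay_out(lines: list[str], durations: list[float], gap: float = 0.25) -> list[Clip]:
    """Place clips back to back on one track, separated by a fixed gap."""
    if len(lines) != len(durations):
        raise ValueError("each line needs exactly one duration")
    clips = []
    cursor = 0.0
    for text, duration in zip(lines, durations):
        clips.append(Clip(text=text, start=cursor, duration=duration))
        cursor += duration + gap
    return clips


if __name__ == "__main__":
    lines = split_script("Hello.\n\n  Welcome back. \n")
    # Durations would come from the generated audio; these are placeholders.
    for clip in lay_out(lines, [1.0, 2.0], gap=0.5):
        print(clip)
```

In practice the durations would be read from the synthesized audio files, and the timeline editor would let you adjust each clip's position afterwards.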
Use Cases Where Voicebox Shines
| Use Case | Why Voicebox Works | Example Application |
|---|---|---|
| Podcast Production | Full timeline editing, auto‑mixing, local privacy | Record host with voice cloning, auto‑mix guests |
| Game Dialogue | Batch generate dialogue lines for many characters | NPC dialogue with unique voices and instant re‑generation |
| Accessibility Tools | Offline speech synthesis for visually impaired users | On‑device screen readers or audiobooks |
| Voice Assistant | Integrate local API with low latency | Build a custom assistant that never leaks data |
| Content Automation | Auto‑generate narrations for videos | Produce voice‑over for explainer videos at scale |
Extending Voicebox
- Plugin System: Add new voice models (e.g., XTTS, Bark) or audio effects as separate Tauri packages.
- Mobile Companion: Future plans include a phone app to control a desktop Voicebox instance over a LAN.
- Real‑Time Synthesis: A coming feature will stream generated audio as it is produced, enabling live performances.
Community & Contribution
Voicebox is built to be welcoming and open:
- Contributing: Pull requests are encouraged; see CONTRIBUTING.md for setup.
- Security: Follow SECURITY.md to report issues responsibly.
- Releases: New stable builds are published on GitHub Releases.
- Docs: Comprehensive API docs are available at http://localhost:8000/docs when running.
Bottom Line
Voicebox turns a laptop into a professional, privacy‑preserving voice studio. Whether you’re prototyping a speech‑based game, drafting a podcast, or building a personal accessibility tool, you no longer have to rely on costly cloud APIs. Jump in today, fork the GitHub repo, and start building voice experiences that stay under your control.