Qwen3‑TTS: Fast, Open‑Source Streaming TTS
Alibaba’s Qwen3‑TTS is a cutting‑edge, open‑source text‑to‑speech (TTS) suite that combines high‑fidelity, low‑latency synthesis with flexible voice control. Built on a lightweight Discrete Multi‑Codebook LM architecture, Qwen3‑TTS delivers expressive and streaming speech generation in 10 languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian) while supporting custom voice cloning, voice design, and natural‑language instruction.
What Makes Qwen3‑TTS Stand Out?
| Feature | Description |
|---|---|
| Ultra‑low latency | Dual‑track streaming lets the model emit the first audio packet after receiving as little as a single input character; end‑to‑end latency can be as low as 97 ms. |
| Free‑form voice design | Use textual instructions (e.g., "Speak in a nervous tone") to generate voices that match a desired persona without additional training data. |
| Efficient cloning | Clone a target voice in 3 seconds with a short audio clip, producing high‑quality synthetic speech that preserves speaker identity. |
| Multi‑language coverage | 10 languages and many dialects with robust contextual understanding. |
| Open‑source & Hugging Face integration | Released on GitHub with a public PyPI package, checkpoints on the Hugging Face Hub, and a ready‑to‑run Gradio demo. |
| Lightweight deployment | Works on a single NVIDIA GPU with FlashAttention 2; no special hardware is required. |
These capabilities make Qwen3‑TTS ideal for real‑time applications such as chatbots, virtual assistants, audiobooks, and language learning tools.
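For chatbot-style responsiveness, a simple pattern is to synthesize a reply sentence by sentence and start playing the first chunk while later ones are still being generated. The sketch below reuses the generate_custom_voice call shown in the Quick‑Start further down; the only assumptions beyond that are that the call works without an instruct argument and that sounddevice is an acceptable playback backend. The native dual‑track streaming interface documented in the repository will give lower latency than this approximation.

```python
import queue
import threading

import sounddevice as sd  # pip install sounddevice

def speak_sentences(model, sentences, speaker="Vivian", language="English"):
    """Synthesize each sentence in a worker thread and play finished chunks
    immediately, so playback of sentence 1 overlaps synthesis of sentence 2."""
    audio_q = queue.Queue()

    def producer():
        for sentence in sentences:
            wav, sr = model.generate_custom_voice(
                text=sentence, language=language, speaker=speaker
            )
            audio_q.put((wav[0], sr))
        audio_q.put(None)  # sentinel: no more audio coming

    threading.Thread(target=producer, daemon=True).start()

    while (item := audio_q.get()) is not None:
        chunk, sr = item
        sd.play(chunk, sr)  # start playback of this sentence
        sd.wait()           # block until it finishes, then fetch the next one
```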
Repository Highlights
- Models – 0.6 B and 1.7 B variants for base, custom‑voice, and voice‑design; each is a self‑contained PyTorch model.
- Tokenizer – Qwen3‑TTS‑Tokenizer‑12Hz provides efficient acoustic compression (12 Hz codebooks) and high‑dimensional semantic mapping.
- Documentation – Comprehensive README with architecture diagrams, evaluation tables, and extensive code samples.
- Demo – Gradio local UI (qwen-tts-demo) for rapid prototyping.
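To get a feel for what the 12 Hz codec rate means in practice, here is a tiny back‑of‑envelope helper; the only assumption is the 12 Hz figure taken from the tokenizer's name.

```python
# Back-of-envelope view of the 12 Hz tokenizer: each codebook emits roughly
# 12 frames per second of audio, i.e. about 83 ms of audio per frame.
CODEC_RATE_HZ = 12

def frames_for(duration_s: float, rate_hz: int = CODEC_RATE_HZ) -> int:
    return round(duration_s * rate_hz)

print(frames_for(1.0))  # 12 frames for one second of speech
print(frames_for(3.0))  # 36 frames for a 3-second reference clip
```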
Quick‑Start Guide
Below is a minimal example that installs the qwen-tts package, loads a custom‑voice model, and generates a Chinese sentence with a vivid voice instruction.
```bash
# 1. Create a clean environment
conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts

# 2. Install the library and optional FlashAttention
pip install -U qwen-tts
pip install -U flash-attn --no-build-isolation

# 3. Run a simple generation script
python - <<'PY'
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
wav, sr = model.generate_custom_voice(
    text="其实我真的有发现,我是一个特别善于观察别人情绪的人。",  # "I've really noticed that I'm especially good at reading other people's emotions."
    language="Chinese",
    speaker="Vivian",
    instruct="用特别愤怒的语气说",  # "say it in a particularly angry tone"
)
sf.write("output.wav", wav[0], sr)
print("Saved to output.wav")
PY
```
The output.wav file contains a high‑quality rendering of the sentence in a distinctly angry tone, spoken by the preset speaker Vivian, demonstrating the power of instruction‑driven voice control.
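The voice‑design checkpoint takes the same idea further: instead of selecting a preset speaker, you describe the voice itself in plain language. The sketch below is hypothetical in one respect: the method name generate_voice_design is an assumption, so check the repository README for the exact call exposed by the VoiceDesign model.

```python
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
wav, sr = model.generate_voice_design(  # assumed method name; verify in the README
    text="Welcome aboard! Please keep your seatbelt fastened.",
    language="English",
    instruct="A warm, middle-aged female flight attendant, calm and reassuring.",
)
sf.write("design_output.wav", wav[0], sr)
```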
Voice Clone in Action
Clone a voice from a short clip and generate new content in a few seconds:
```python
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

ref_audio = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav"
ref_text = "Okay. Yeah. I resent you. I love you."

wav, sr = model.generate_voice_clone(
    text="We will test the quality of this cloned voice.",
    language="English",
    ref_audio=ref_audio,
    ref_text=ref_text,
)
sf.write("clone_output.wav", wav[0], sr)
```
The result is a seamless synthetic voice that retains the target speaker’s timbre and prosody.
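If you clone from your own recording rather than the hosted URL above, a quick sanity check on the reference clip can save a debugging round trip. The snippet below only reports duration, sample rate, and channel count; the exact length and format requirements (and whether ref_audio accepts a local path) should be verified against the repository README, and the file name and 1‑second threshold are illustrative.

```python
import soundfile as sf

def describe_clip(path: str) -> None:
    """Print basic facts about a reference clip before using it for cloning."""
    info = sf.info(path)
    duration = info.frames / info.samplerate
    print(f"{path}: {duration:.2f} s, {info.samplerate} Hz, {info.channels} ch")
    if duration < 1.0:  # illustrative threshold, not an official requirement
        print("Warning: very short clips may hurt cloning quality.")

describe_clip("my_reference.wav")  # hypothetical local file
```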
Model Selection Cheat‑Sheet
| Model | Size | Base / Custom / Design | Stream | Instruction Control |
|---|---|---|---|---|
| Qwen3-TTS-12Hz-0.6B-Base | 0.6 B | Base (clone) | ✅ | ✅ |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | 1.7 B | Custom | ✅ | ✅ |
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | 1.7 B | Design | ✅ | ✅ |
All models are publicly available on the Hugging Face Hub and can be downloaded via the qwen-tts PyPI package.
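Because the checkpoints live on the Hugging Face Hub, you can also pre‑download one with the standard huggingface_hub client and point from_pretrained at the resulting directory, which is handy for offline or air‑gapped machines.

```python
from huggingface_hub import snapshot_download

# Cache the CustomVoice checkpoint locally; the returned path is the snapshot
# directory, which Qwen3TTSModel.from_pretrained should accept in place of
# the Hub repo id.
local_dir = snapshot_download("Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice")
print("Model files cached at:", local_dir)
```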
Fine‑Tuning & Evaluation
Qwen3‑TTS supports supervised fine‑tuning with custom datasets. The finetuning/prepare_data.py script demonstrates how to format your data, and the Qwen3TTSModel can be re‑trained with a standard PyTorch training loop. Evaluation metrics include Word Error Rate (WER), Cosine Similarity for speaker similarity, and Mixed Error Rate for cross‑lingual tests. The repository’s eval.py script reproduces the benchmarks from the Qwen3‑TTS technical report.
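As a concrete illustration of the speaker‑similarity metric, the snippet below computes cosine similarity between two speaker embeddings. How those embeddings are extracted, and the full evaluation protocol, follow eval.py in the repository; the random vectors here are placeholders only.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity in [-1, 1]; values near 1 mean very similar speakers."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
ref_emb = rng.standard_normal(256)                    # placeholder reference-speaker embedding
syn_emb = ref_emb + 0.1 * rng.standard_normal(256)    # placeholder embedding of synthesized audio
print(f"Speaker similarity: {cosine_similarity(ref_emb, syn_emb):.3f}")
```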
Deployment Options
| Platform | How to Deploy |
|---|---|
| Local GPU | Launch the bundled Gradio UI, e.g. `qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base` |
| Cloud (DashScope) | Use Alibaba Cloud’s DashScope real‑time API for both custom voice and voice‑clone endpoints |
| Edge | Offline, single‑model inference with vLLM‑Omni and a minimal memory footprint |
For secure deployments of the Base model, enable HTTPS in the Gradio demo with self‑signed certificates or a trusted CA.
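If you wrap the model in your own Gradio app rather than the stock qwen-tts-demo, HTTPS can be enabled directly through launch(), which accepts ssl_certfile and ssl_keyfile. In the sketch below, cert.pem and key.pem are assumed to point at your self‑signed certificate (or one issued by a trusted CA), and the synthesize stub stands in for a real call into Qwen3TTSModel.

```python
import gradio as gr

def synthesize(text: str) -> str:
    # Stub: call Qwen3TTSModel here and return the path of the generated wav,
    # e.g. the output.wav produced in the Quick-Start above.
    return "output.wav"

demo = gr.Interface(fn=synthesize, inputs="text", outputs="audio")
demo.launch(
    server_name="0.0.0.0",
    ssl_certfile="cert.pem",   # assumed self-signed certificate
    ssl_keyfile="key.pem",
    ssl_verify=False,          # required when the certificate is self-signed
)
```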
Real‑World Use Cases
- Conversational Agents – Integrate Qwen3‑TTS with your chatbot back‑end to produce engaging, speaker‑adaptive responses.
- Audiobook Generation – Clone a narrator’s voice for consistent narration across millions of pages.
- Accessibility – Generate multilingual spoken explanations, preserving tone and emotion for users with visual impairments.
- Multilingual Voice Assistants – Use the 10‑language model to support global coverage with a single backbone.
Get Involved
The Qwen3‑TTS community welcomes contributions:
- Bug reports – GitHub Issues
- Feature requests – GitHub Discussions
- Pull requests – Add new speaker profiles, languages, or performance improvements
- Dataset sharing – Provide custom audio‑text pairs for fine‑tuning
The model is released under the Apache‑2.0 license, enabling commercial and academic use.
Summary
Alibaba’s Qwen3‑TTS delivers a feature‑rich, low‑latency, open‑source TTS stack that supports advanced voice cloning, instruction‑driven voice design, and multilingual synthesis. With straightforward installation, real‑time streaming, and robust evaluation results, it’s ready for developers to prototype, iterate, and deploy high‑fidelity speech solutions. Try the demo or grab the models from Hugging Face and start building tomorrow’s voice technology today.