Qwen3‑TTS: Fast, Open‑Source Streaming TTS
Alibaba’s Qwen3‑TTS is a cutting‑edge, open‑source text‑to‑speech (TTS) suite that combines high‑fidelity, low‑latency synthesis with flexible voice control. Built on a lightweight Discrete Multi‑Codebook LM architecture, Qwen3‑TTS delivers expressive and streaming speech generation in 10 languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian) while supporting custom voice cloning, voice design, and natural‑language instruction.
What Makes Qwen3‑TTS Stand Out?
| Feature | Description |
|---|---|
| Ultra‑low latency | Dual‑track streaming lets the model emit the first audio packet after receiving as little as a single input character; end‑to‑end latency can be as low as 97 ms. |
| Free‑form voice design | Use textual instructions (e.g., "Speak in a nervous tone") to generate voices that match a desired persona without additional training data. |
| Efficient cloning | Clone a target voice in 3 seconds with a short audio clip, producing high‑quality synthetic speech that preserves speaker identity. |
| Multi‑language coverage | 10 languages and many dialects with robust contextual understanding. |
| Open‑source & Hugging Face integration | Released on GitHub with a public PyPI package, checkpoints on the Hugging Face Hub, and a ready‑to‑run Gradio demo. |
| Lightweight deployment | Works on a single NVIDIA GPU with FlashAttention 2; no special hardware is required. |
These capabilities make Qwen3‑TTS ideal for real‑time applications such as chatbots, virtual assistants, audiobooks, and language learning tools.
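For chatbot-style responsiveness, a simple pattern is to synthesize a reply sentence by sentence and start playing the first chunk while later ones are still being generated. The sketch below reuses the generate_custom_voice call shown in the Quick‑Start further down; the only assumptions beyond that are that the call works without an instruct argument and that sounddevice is an acceptable playback backend. The native dual‑track streaming interface documented in the repository will give lower latency than this approximation.

```python
import queue
import threading

import sounddevice as sd  # pip install sounddevice

def speak_sentences(model, sentences, speaker="Vivian", language="English"):
    """Synthesize each sentence in a worker thread and play finished chunks
    immediately, so playback of sentence 1 overlaps synthesis of sentence 2."""
    audio_q = queue.Queue()

    def producer():
        for sentence in sentences:
            wav, sr = model.generate_custom_voice(
                text=sentence, language=language, speaker=speaker
            )
            audio_q.put((wav[0], sr))
        audio_q.put(None)  # sentinel: no more audio coming

    threading.Thread(target=producer, daemon=True).start()

    while (item := audio_q.get()) is not None:
        chunk, sr = item
        sd.play(chunk, sr)  # start playback of this sentence
        sd.wait()           # block until it finishes, then fetch the next one
```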
Repository Highlights
- Models – 0.6 B and 1.7 B variants for base, custom‑voice, and voice‑design; each is a self‑contained PyTorch model.
- Tokenizer – Qwen3‑TTS‑Tokenizer‑12Hz provides efficient acoustic compression (12 Hz codebooks) and high‑dimensional semantic mapping.
- Documentation – Comprehensive README with architecture diagrams, evaluation tables, and extensive code samples.
- Demo – Gradio local UI (qwen-tts-demo) for rapid prototyping.
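To get a feel for what the 12 Hz codec rate means in practice, here is a tiny back‑of‑envelope helper; the only assumption is the 12 Hz figure taken from the tokenizer's name.

```python
# Back-of-envelope view of the 12 Hz tokenizer: each codebook emits roughly
# 12 frames per second of audio, i.e. about 83 ms of audio per frame.
CODEC_RATE_HZ = 12

def frames_for(duration_s: float, rate_hz: int = CODEC_RATE_HZ) -> int:
    return round(duration_s * rate_hz)

print(frames_for(1.0))  # 12 frames for one second of speech
print(frames_for(3.0))  # 36 frames for a 3-second reference clip
```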
Quick‑Start Guide
Below is a minimal example that installs the qwen-tts package, loads a custom‑voice model, and generates a Chinese sentence with a vivid voice instruction.
```bash
# 1. Create a clean environment
conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts

# 2. Install the library and optional FlashAttention
pip install -U qwen-tts
pip install -U flash-attn --no-build-isolation

# 3. Run a simple generation script
python - <<'PY'
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
wav, sr = model.generate_custom_voice(
    text="其实我真的有发现,我是一个特别善于观察别人情绪的人。",  # "I've really noticed that I'm especially good at reading other people's emotions."
    language="Chinese",
    speaker="Vivian",
    instruct="用特别愤怒的语气说",  # "say it in a particularly angry tone"
)
sf.write("output.wav", wav[0], sr)
print("Saved to output.wav")
PY
```
The output.wav file contains a high‑quality rendering of the sentence in a distinctly angry tone, spoken by the preset speaker Vivian, demonstrating the power of instruction‑driven voice control.
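The voice‑design checkpoint takes the same idea further: instead of selecting a preset speaker, you describe the voice itself in plain language. The sketch below is hypothetical in one respect: the method name generate_voice_design is an assumption, so check the repository README for the exact call exposed by the VoiceDesign model.

```python
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
wav, sr = model.generate_voice_design(  # assumed method name; verify in the README
    text="Welcome aboard! Please keep your seatbelt fastened.",
    language="English",
    instruct="A warm, middle-aged female flight attendant, calm and reassuring.",
)
sf.write("design_output.wav", wav[0], sr)
```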
Voice Clone in Action
Clone a voice from a short clip and generate new content in a few seconds:
```python
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

ref_audio = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav"
ref_text = "Okay. Yeah. I resent you. I love you."

wav, sr = model.generate_voice_clone(
    text="We will test the quality of this cloned voice.",
    language="English",
    ref_audio=ref_audio,
    ref_text=ref_text,
)
sf.write("clone_output.wav", wav[0], sr)
```
The result is a seamless synthetic voice that retains the target speaker’s timbre and prosody.
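If you clone from your own recording rather than the hosted URL above, a quick sanity check on the reference clip can save a debugging round trip. The snippet below only reports duration, sample rate, and channel count; the exact length and format requirements (and whether ref_audio accepts a local path) should be verified against the repository README, and the file name and 1‑second threshold are illustrative.

```python
import soundfile as sf

def describe_clip(path: str) -> None:
    """Print basic facts about a reference clip before using it for cloning."""
    info = sf.info(path)
    duration = info.frames / info.samplerate
    print(f"{path}: {duration:.2f} s, {info.samplerate} Hz, {info.channels} ch")
    if duration < 1.0:  # illustrative threshold, not an official requirement
        print("Warning: very short clips may hurt cloning quality.")

describe_clip("my_reference.wav")  # hypothetical local file
```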
Model Selection Cheat‑Sheet
| Model | Size | Base / Custom / Design | Stream | Instruction Control |
|---|---|---|---|---|
| Qwen3-TTS-12Hz-0.6B-Base | 0.6 B | Base (clone) | ✅ | ✅ |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | 1.7 B | Custom | ✅ | ✅ |
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | 1.7 B | Design | ✅ | ✅ |
All models are publicly available on the Hugging Face Hub and can be downloaded via the qwen-tts PyPI package.
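Because the checkpoints live on the Hugging Face Hub, you can also pre‑download one with the standard huggingface_hub client and point from_pretrained at the resulting directory, which is handy for offline or air‑gapped machines.

```python
from huggingface_hub import snapshot_download

# Cache the CustomVoice checkpoint locally; the returned path is the snapshot
# directory, which Qwen3TTSModel.from_pretrained should accept in place of
# the Hub repo id.
local_dir = snapshot_download("Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice")
print("Model files cached at:", local_dir)
```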
Fine‑Tuning & Evaluation
Qwen3‑TTS supports supervised fine‑tuning with custom datasets. The finetuning/prepare_data.py script demonstrates how to format your data, and the Qwen3TTSModel can be re‑trained with a standard PyTorch training loop. Evaluation metrics include Word Error Rate (WER), Cosine Similarity for speaker similarity, and Mixed Error Rate for cross‑lingual tests. The repository’s eval.py script reproduces the benchmarks from the Qwen3‑TTS technical report.
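As a concrete illustration of the speaker‑similarity metric, the snippet below computes cosine similarity between two speaker embeddings. How those embeddings are extracted, and the full evaluation protocol, follow eval.py in the repository; the random vectors here are placeholders only.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity in [-1, 1]; values near 1 mean very similar speakers."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
ref_emb = rng.standard_normal(256)                    # placeholder reference-speaker embedding
syn_emb = ref_emb + 0.1 * rng.standard_normal(256)    # placeholder embedding of synthesized audio
print(f"Speaker similarity: {cosine_similarity(ref_emb, syn_emb):.3f}")
```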
Deployment Options
| Platform | How to Deploy |
|---|---|
| Local GPU | Launch the bundled Gradio UI, e.g. `qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base` |
| Cloud (DashScope) | Use Alibaba Cloud’s DashScope real‑time API for both custom voice and voice‑clone endpoints |
| Edge | Offline, single‑model inference with vLLM‑Omni and a minimal memory footprint |
For secure deployments of the Base model, enable HTTPS in the Gradio demo with self‑signed certificates or a trusted CA.
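If you wrap the model in your own Gradio app rather than the stock qwen-tts-demo, HTTPS can be enabled directly through launch(), which accepts ssl_certfile and ssl_keyfile. In the sketch below, cert.pem and key.pem are assumed to point at your self‑signed certificate (or one issued by a trusted CA), and the synthesize stub stands in for a real call into Qwen3TTSModel.

```python
import gradio as gr

def synthesize(text: str) -> str:
    # Stub: call Qwen3TTSModel here and return the path of the generated wav,
    # e.g. the output.wav produced in the Quick-Start above.
    return "output.wav"

demo = gr.Interface(fn=synthesize, inputs="text", outputs="audio")
demo.launch(
    server_name="0.0.0.0",
    ssl_certfile="cert.pem",   # assumed self-signed certificate
    ssl_keyfile="key.pem",
    ssl_verify=False,          # required when the certificate is self-signed
)
```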
Real‑World Use Cases
- Conversational Agents – Integrate Qwen3‑TTS with your chatbot back‑end to produce engaging, speaker‑adaptive responses.
- Audiobook Generation – Clone a narrator’s voice for consistent narration across millions of pages.
- Accessibility – Generate multilingual spoken explanations, preserving tone and emotion for users with visual impairments.
- Multilingual Voice Assistants – Use the 10‑language model to support global coverage with a single backbone.
Get Involved
The Qwen3‑TTS community welcomes contributions:
- Bug reports – GitHub Issues
- Feature requests – GitHub Discussions
- Pull requests – Add new speaker profiles, languages, or performance improvements
- Dataset sharing – Provide custom audio‑text pairs for fine‑tuning
The model is released under the Apache‑2.0 license, enabling commercial and academic use.
Summary
Alibaba’s Qwen3‑TTS delivers a feature‑rich, low‑latency, open‑source TTS stack that supports advanced voice cloning, instruction‑driven voice design, and multilingual synthesis. With straightforward installation, real‑time streaming, and robust evaluation results, it’s ready for developers to prototype, iterate, and deploy high‑fidelity speech solutions. Try the demo or grab the models from Hugging Face and start building tomorrow’s voice technology today.