VoxCPM2: Revolutionizing TTS with Tokenizer-Free Architecture

The Next Generation of Speech Synthesis

VoxCPM2 is a 2B-parameter text-to-speech model built on the MiniCPM-4 backbone. Its tokenizer-free, diffusion-autoregressive architecture avoids the bottleneck that a traditional tokenizer introduces. Trained on more than 2 million hours of multilingual speech, it produces studio-quality 48 kHz audio in 30 languages without requiring language tags.

✨ Key Innovations

🎨 Voice Design from Text Alone

Create entirely new voices using natural language. A description such as (Young female, warm gentle tone, slight smile) generates a unique voice without any reference audio.
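As a rough sketch of how such a description could be attached to the text to be spoken, the helper below prefixes a parenthesized voice description to a sentence. The helper name `with_voice_description` and the assumption that the description travels inline as a parenthesized prefix are illustrative only; check the VoxCPM2 documentation for the actual input format.

```python
def with_voice_description(description: str, text: str) -> str:
    """Prefix text with a parenthesized voice description (assumed format)."""
    # Accept descriptions written with or without surrounding parentheses.
    description = description.strip().strip("()").strip()
    if not description:
        return text  # nothing to add; fall back to the plain text
    return f"({description}) {text}"


prompt = with_voice_description(
    "Young female, warm gentle tone, slight smile",
    "Hello from VoxCPM2!",
)
print(prompt)
```

The same helper would work for style instructions such as (slightly faster, cheerful), since it only builds a string and makes no call into the model.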

🎛️ Controllable Voice Cloning

Clone a voice from a short clip while controlling emotion, pace, and style. An instruction such as (slightly faster, cheerful) adjusts the expression while preserving the speaker's timbre.

🎙️ Ultimate Cloning Fidelity

Provide reference audio together with its transcript for the most faithful reproduction, capturing the nuances of timbre, rhythm, and emotion.
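The three modes above differ mainly in which inputs are supplied. A minimal sketch of that decision follows; the `VoiceRequest` class, its field names, and the mode labels are hypothetical and are not part of the voxcpm package.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoiceRequest:
    """Hypothetical container for the inputs described above."""
    text: str
    style: Optional[str] = None            # e.g. "slightly faster, cheerful"
    reference_audio: Optional[str] = None  # path to a short clip of the target voice
    reference_text: Optional[str] = None   # transcript of that clip

    def mode(self) -> str:
        """Name the mode implied by the inputs that were provided."""
        if self.reference_text and not self.reference_audio:
            raise ValueError("a transcript needs matching reference audio")
        if self.reference_audio and self.reference_text:
            return "high-fidelity cloning"
        if self.reference_audio:
            return "controllable cloning"
        if self.style:
            return "voice design"
        return "default voice"


print(VoiceRequest("Hi", style="Young female, warm gentle tone").mode())
print(VoiceRequest("Hi", reference_audio="voice.wav").mode())
```

Keeping the mode check in one place makes it easy to reject inconsistent input, such as a transcript without audio, before any model call is made.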

🚀 Lightning-Fast Implementation

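from voxcpm import VoxCPM
import soundfile as sf

# Load the pretrained weights (downloaded on first use)
model = VoxCPM.from_pretrained("openbmb/VoxCPM2")
# Synthesize speech; cfg_value is the guidance scale
wav = model.generate("Hello from VoxCPM2!", cfg_value=2.0)
# Write the waveform at 48 kHz, the output rate quoted above
sf.write("output.wav", wav, 48000)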

Performance: real-time factor (RTF) of about 0.13 on an RTX 4090 with Nano-vLLM batched serving, using about 8 GB of VRAM.
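RTF is generation time divided by audio duration, so values below 1 are faster than real time. The arithmetic below applies the quoted RTF of 0.13 to a few clip lengths. The function name is illustrative, and actual timings depend on hardware, batch size, and text length.

```python
def estimated_generation_seconds(audio_seconds: float, rtf: float = 0.13) -> float:
    """Estimate wall-clock time from RTF = generation_time / audio_duration."""
    if audio_seconds < 0 or rtf <= 0:
        raise ValueError("audio_seconds must be >= 0 and rtf must be > 0")
    return audio_seconds * rtf


for seconds in (10, 60, 600):
    estimate = estimated_generation_seconds(seconds)
    print(f"{seconds:>4} s of audio -> ~{estimate:.1f} s to generate")
```

At this RTF, a minute of speech corresponds to roughly eight seconds of compute.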

🌍 30-Language Coverage

Arabic, Chinese (including 8+ dialects), English, French, German, Hindi, Japanese, Korean, Spanish, Thai, Vietnamese, and more.

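📊 Benchmark Results

Of the three models below, VoxCPM2 has the highest speaker-similarity (SIM) score, while FishAudio S2 has the lowest error rates. Lower is better for WER and CER; higher is better for SIM.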

Model	Params	EN WER	ZH CER	SIM Score
VoxCPM2	2B	1.84%	0.97%	85.4% (EN)
Qwen3-TTS	1.7B	1.23%	1.22%	77.5%
FishAudio S2	4B	0.99%	0.54%	79.7%
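To make the comparison easier to read, the snippet below loads the figures from the table and picks the best model per metric. The numbers are copied from the table as-is; the data layout is just one convenient choice.

```python
# (EN WER %, ZH CER %, SIM %) copied from the table above.
# Note: the VoxCPM2 SIM figure is labeled as English-only in the table.
results = {
    "VoxCPM2": (1.84, 0.97, 85.4),
    "Qwen3-TTS": (1.23, 1.22, 77.5),
    "FishAudio S2": (0.99, 0.54, 79.7),
}

# Lower is better for error rates, higher is better for similarity.
best_en_wer = min(results, key=lambda name: results[name][0])
best_zh_cer = min(results, key=lambda name: results[name][1])
best_sim = max(results, key=lambda name: results[name][2])

print("Lowest EN WER:", best_en_wer)
print("Lowest ZH CER:", best_zh_cer)
print("Highest SIM:  ", best_sim)
```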

🔧 Production Ready

CLI: voxcpm clone --reference-audio voice.wav
Web Demo: python app.py
LoRA Fine-tuning: 5-10 minutes of audio is enough to adapt to a new speaker
Nano-vLLM: High-throughput async serving
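Before LoRA fine-tuning, it can help to confirm that the collected clips add up to the 5-10 minutes mentioned above. The sketch below sums clip durations using only the Python standard library. The function name and the flat directory of `.wav` files are assumptions, and the `wave` module reads only uncompressed PCM WAV files.

```python
import wave
from pathlib import Path


def total_minutes(wav_dir: str) -> float:
    """Return the combined duration, in minutes, of all .wav files in wav_dir."""
    total_seconds = 0.0
    for path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as clip:
            # Duration of one clip = number of frames / frames per second.
            total_seconds += clip.getnframes() / clip.getframerate()
    return total_seconds / 60.0
```

Calling `total_minutes("my_speaker_clips")` on a hypothetical folder of recordings returns a single number to compare against the 5-10 minute range.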

📦 Get Started Now

pip install voxcpm

Released under the Apache 2.0 license, so commercial use is welcome. Join the 10K+ developers who have starred the project on GitHub and try it today!

Live Playground | Hugging Face Weights