VoxCPM2: 2B Multilingual TTS with Voice Cloning & Design

Discover VoxCPM2, a 2B-parameter tokenizer-free TTS model supporting 30 languages with studio-quality 48 kHz audio. Create voices from text descriptions, clone speakers with high fidelity from reference audio, and synthesize faster than real time (RTF ~0.13 on an RTX 4090). Fully open-source under Apache 2.0, with a Python API, CLI, web demo, LoRA fine-tuning, and production-ready serving. Achieves the highest speaker similarity among the models compared in the benchmark table below.

VoxCPM2: Revolutionizing TTS with Tokenizer-Free Architecture

The Next Generation of Speech Synthesis

VoxCPM2 is a 2B-parameter text-to-speech model built on the MiniCPM-4 backbone. Its diffusion autoregressive architecture works without a discrete tokenizer, removing the usual tokenization bottleneck. Trained on more than 2 million hours of multilingual speech, it delivers studio-quality 48 kHz audio across 30 languages without requiring language tags.

✨ Key Innovations

🎨 Voice Design from Text Alone

Create entirely new voices from a natural-language description. A prompt such as (Young female, warm gentle tone, slight smile) generates a unique voice without any reference audio.
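A minimal sketch of how such a description might be attached to the text to be spoken. It assumes, based on the parenthesized example above, that the description is written in parentheses directly before the text. The helpers `build_design_prompt` and `synthesize_designed_voice` are hypothetical and not part of the voxcpm package; the only library calls used are `VoxCPM.from_pretrained` and `generate` with `cfg_value`, as they appear in the quick-start code in this article.

```python
def build_design_prompt(description: str, text: str) -> str:
    """Prefix the text with a parenthesized voice description.

    Hypothetical helper: the "(description)text" layout is an assumption
    drawn from the example in this article, not a documented contract.
    """
    # Accept descriptions with or without surrounding parentheses.
    description = description.strip().strip("()").strip()
    text = text.strip()
    if not description:
        raise ValueError("description must not be empty")
    if not text:
        raise ValueError("text must not be empty")
    return f"({description}){text}"


def synthesize_designed_voice(description: str, text: str, out_path: str) -> None:
    """Generate speech in a described voice and save it as a 48 kHz WAV."""
    # Imported lazily so the prompt helper works without the model installed.
    import soundfile as sf
    from voxcpm import VoxCPM

    model = VoxCPM.from_pretrained("openbmb/VoxCPM2")
    wav = model.generate(build_design_prompt(description, text), cfg_value=2.0)
    sf.write(out_path, wav, 48000)


prompt = build_design_prompt(
    "Young female, warm gentle tone, slight smile",
    "Welcome to VoxCPM2!",
)
print(prompt)
```

Calling `synthesize_designed_voice` with the same two strings and an output path would then run the model, provided the voxcpm and soundfile packages are installed.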

🎛️ Controllable Voice Cloning

Clone a voice from a short clip while steering emotion, pace, and style. A control hint such as (slightly faster, cheerful) keeps the speaker's timbre and changes only the delivery.

🎙️ Ultimate Cloning Fidelity

Provide reference audio plus its transcript for the most faithful vocal reproduction, capturing the nuances of the speaker's timbre, rhythm, and emotion.

🚀 Lightning-Fast Implementation

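from voxcpm import VoxCPM
import soundfile as sf

# Load the pretrained weights (downloaded on first use)
model = VoxCPM.from_pretrained("openbmb/VoxCPM2")

# Synthesize speech; cfg_value is the guidance scale
wav = model.generate("Hello from VoxCPM2!", cfg_value=2.0)

# Save the waveform as a 48 kHz WAV file
sf.write("output.wav", wav, 48000)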

Performance: RTF ~0.13 on RTX 4090 with Nano-vLLM (batched serving), ~8GB VRAM.
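To make the RTF figure concrete: the real-time factor is synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below works through that arithmetic; the function names are illustrative, and the timings are made-up inputs rather than measurements. Since the quoted figure is for batched serving, single-request latency may differ.

```python
def audio_duration_seconds(num_samples: int, sample_rate: int = 48000) -> float:
    """Duration of a mono waveform given its sample count and sample rate."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return num_samples / sample_rate


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def expected_synthesis_seconds(audio_seconds: float, rtf: float = 0.13) -> float:
    """Estimated compute time for a clip of the given length at a given RTF."""
    return audio_seconds * rtf


# 480,000 samples at 48 kHz is a 10-second clip.
duration = audio_duration_seconds(480_000)

# If that clip took 1.3 s to synthesize, the RTF is about 0.13.
rtf = real_time_factor(1.3, duration)

# At RTF 0.13, one minute of audio needs roughly 7.8 s of compute.
one_minute_cost = expected_synthesis_seconds(60.0)

print(f"duration={duration:.1f}s rtf={rtf:.2f} one_minute={one_minute_cost:.1f}s")
```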

🌍 30-Language Coverage

Arabic, Chinese (including 8+ dialects), English, French, German, Hindi, Japanese, Korean, Spanish, Thai, Vietnamese, and more, for 30 languages in total.

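📊 Benchmark Results

Model          Params   EN WER   ZH CER   SIM
VoxCPM2        2B       1.84%    0.97%    85.4% (EN)
Qwen3-TTS      1.7B     1.23%    1.22%    77.5%
FishAudio S2   4B       0.99%    0.54%    79.7%

WER (word error rate) and CER (character error rate) are lower-is-better; SIM (speaker similarity) is higher-is-better. In this comparison VoxCPM2 has the highest speaker similarity, while FishAudio S2 has the lowest error rates.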
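For readers unfamiliar with the error-rate columns, here is a simplified sketch of how WER and CER are commonly defined: the edit distance between a reference transcript and a recognized transcript, divided by the reference length. It is not the evaluation code behind the table. Real benchmarks typically transcribe the synthesized audio with an ASR model and normalize case and punctuation first, which this example leaves out.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER: word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference: str, hypothesis: str) -> float:
    """CER: character-level edit distance divided by reference length (spaces ignored)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# One wrong character out of four.
cer = char_error_rate("你好世界", "你好世间")

print(f"WER={wer:.2%} CER={cer:.2%}")
```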

🔧 Production Ready

  • CLI: voxcpm clone --reference-audio voice.wav
  • Web Demo: python app.py
  • LoRA Fine-tuning: 5-10 minutes of audio is enough to adapt to a new speaker
  • Nano-vLLM: High-throughput async serving

📦 Get Started Now

pip install voxcpm

VoxCPM2 is released under the Apache 2.0 license, so commercial use is welcome. Join the 10K+ developers who have starred the project on GitHub and try it today!

Live Playground | Hugging Face Weights