Sopro – Lightweight Text‑to‑Speech with Zero‑Shot Voice Cloning
Sopro is a compact, low‑budget English text‑to‑speech model that leverages dilated convolutional networks (à la WaveNet) and lightweight cross‑attention layers instead of the heavy Transformer stack that dominates the space. It’s built by Samuel Vitorino as a side project, trained on a single L40S GPU, and released under the Apache‑2.0 license.
Why Sopro Stands Out
| Feature | Why It Matters |
|---|---|
| 169 M parameters | Small enough to run comfortably on an M3 CPU (0.25 RTF) while still delivering clear, intelligible audio |
| Streaming Synthesis | Real‑time generation for conversational AI & live demos |
| Zero‑shot Voice Cloning | Clone a new voice with just 3‑12 s of reference audio – no fine‑tuning required |
| Fast CPU Generation | 30 s of audio in ~7.5 s on an M3 base model – great for edge devices |
| Cross‑Attention & Conv‑based | Maintains performance without the overhead of Transformer attention |
Sopro is not state‑of‑the‑art in every metric, but it’s a demonstration that you can build a usable TTS system on modest hardware and open‑source it for community use.
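To make the performance figures above concrete: real‑time factor (RTF) is simply synthesis time divided by the duration of the audio produced, so values below 1.0 mean faster‑than‑real‑time generation. A minimal check using the table's own numbers (the helper function below is just for illustration, not part of the Sopro API):

```python
# RTF = time spent synthesizing / duration of audio produced.
# RTF < 1.0 means the model generates audio faster than real time.
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    return synthesis_seconds / audio_seconds

# The table's example: ~7.5 s to generate 30 s of audio on an M3 base model.
rtf = real_time_factor(7.5, 30.0)
print(f"RTF = {rtf:.2f}")  # RTF = 0.25
```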
Installation & Quick Start
From PyPI
```shell
pip install sopro
```
From the repository
```shell
git clone https://github.com/samuel-vitorino/sopro
cd sopro
pip install -e .
```
⚙️ Note: On Apple Silicon you'll benefit from pinning `torch==2.6.0` and omitting `torchvision` for a ~3× speed‑up.
CLI Example
```shell
soprotts \
  --text "Sopro is a lightweight 169 million parameter text-to-speech model. Some of the main features are streaming, zero-shot voice cloning, and 0.25 real-time factor on the CPU." \
  --ref_audio ref.wav \
  --out out.wav
```
Python API – Non‑streaming
```python
from sopro import SoproTTS

tts = SoproTTS.from_pretrained("samuel-vitorino/sopro", device="cpu")
wav = tts.synthesize(
    "Hello! This is a non-streaming Sopro TTS example.",
    ref_audio_path="ref.wav",
)
tts.save_wav("out.wav", wav)
```
Python API – Streaming
```python
import torch
from sopro import SoproTTS

tts = SoproTTS.from_pretrained("samuel-vitorino/sopro", device="cpu")
chunks = []
for chunk in tts.stream(
    "Hello! This is a streaming Sopro TTS example.",
    ref_audio_path="ref.mp3",
):
    chunks.append(chunk.cpu())

wav = torch.cat(chunks, dim=-1)
tts.save_wav("out_stream.wav", wav)
```
Interactive Demo
Sopro ships with a lightweight FastAPI demo that you can run locally or in Docker.
```shell
pip install -r demo/requirements.txt
uvicorn demo.server:app --host 0.0.0.0 --port 8000
```
or with Docker:
```shell
docker build -t sopro-demo .
docker run --rm -p 8000:8000 sopro-demo
```
Open http://localhost:8000 to hear your text spoken in the cloned voice.
Best‑Practice Tips
- Reference Audio – Use a clear, quiet recording with minimal background noise; 3‑12 s is sufficient.
- Parameter Tweaking – `--style_strength` controls the FiLM influence; raise it for stronger voice similarity.
- Stop Head – For short sentences, the early‑stopping head may fail; lower `--stop_threshold` or `--stop_patience` to improve reliability.
- Phoneme‑Based Text – Prefer spelled‑out words over abbreviations where possible; the model handles "CPU" and "TTS" well, but complex symbols can cause hiccups.
- Non‑Streaming Preference – For the highest audio fidelity, use the non‑streaming API; streaming is mainly for lowering UI latency.
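Putting the tuning flags together, a cloning run that pushes voice similarity harder and relaxes the stop head might look like this (the specific values here are illustrative assumptions, not recommended defaults; check `soprotts --help` for the actual ranges):

```shell
# Raise style strength for closer voice similarity; lower the stop
# threshold so short sentences are less likely to be cut off early.
soprotts \
  --text "A short test sentence." \
  --ref_audio ref.wav \
  --style_strength 1.2 \
  --stop_threshold 0.3 \
  --out out.wav
```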
Future Work & Community
- Expand to additional languages – the current architecture supports any phonetic representation.
- Better voice embedding by training on raw audio rather than pre‑tokenized clips.
- Cache convolution states to further speed up repeated synthesis.
- Publish the full training pipeline to enable community contributions.
If you find this project helpful, consider supporting the author: https://buymeacoffee.com/samuelvitorino.
Conclusion
Sopro shows that you can build a functional, fast, open‑source TTS system with a fraction of the resources needed by many commercial models. Whether you're prototyping a voice assistant, generating narration for accessibility, or experimenting with voice‑cloning research, Sopro offers a practical, low‑budget entry point that's ready to run on most modern CPUs.
Happy cloning!