Sopro – Lightweight Text‑to‑Speech with Zero‑Shot Voice Cloning

Sopro is a compact, low‑budget English text‑to‑speech model that leverages dilated convolutional networks (à la WaveNet) and lightweight cross‑attention layers instead of the heavy Transformer stack that dominates the space. It’s built by Samuel Vitorino as a side project, trained on a single L40S GPU, and released under the Apache‑2.0 license.
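
To make that idea concrete, here is a minimal PyTorch sketch of the general pattern: a WaveNet-style dilated 1-D convolution block paired with a lightweight cross-attention layer that attends over reference-audio embeddings. This is purely illustrative; the class names, dimensions, and layer layout are assumptions and do not mirror Sopro's actual code.

import torch
import torch.nn as nn

class DilatedConvBlock(nn.Module):
    # WaveNet-style dilated 1-D convolution with a residual connection.
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.act = nn.GELU()

    def forward(self, x):  # x: (batch, channels, time)
        return x + self.act(self.conv(x))

class CrossAttnConvLayer(nn.Module):
    # Dilated conv followed by lightweight cross-attention over
    # reference-audio (speaker/style) embeddings.
    def __init__(self, channels, dilation, n_heads=4):
        super().__init__()
        self.conv = DilatedConvBlock(channels, dilation)
        self.attn = nn.MultiheadAttention(channels, n_heads, batch_first=True)

    def forward(self, x, ref):  # x: (B, C, T), ref: (B, T_ref, C)
        x = self.conv(x)
        q = x.transpose(1, 2)            # (B, T, C) for attention
        out, _ = self.attn(q, ref, ref)  # content frames attend to the reference voice
        return (q + out).transpose(1, 2)

# Shapes only; real inputs would be text/acoustic features and encoded reference audio.
layer = CrossAttnConvLayer(channels=256, dilation=2)
x = torch.randn(1, 256, 200)   # content features over 200 frames
ref = torch.randn(1, 80, 256)  # embedding of a 3-12 s reference clip
y = layer(x, ref)              # -> (1, 256, 200)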


Why Sopro Stands Out

Feature – Why It Matters
169 M parameters – Small enough to run comfortably on an M3 CPU (0.25 RTF) while still delivering good audio quality
Streaming Synthesis – Real-time generation for conversational AI and live demos
Zero-shot Voice Cloning – Clone a new voice from just 3-12 s of reference audio; no fine-tuning required
Fast CPU Generation – 30 s of audio in ~7.5 s on a base M3 (worked out below); great for edge devices
Cross-Attention & Conv-Based Architecture – Maintains quality without the overhead of a full Transformer attention stack
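
The real-time factor (RTF) figures above are simply compute time divided by audio duration; here is a quick sanity check using the numbers quoted in the table.

rtf = 7.5 / 30.0   # ~7.5 s of compute for 30 s of audio on a base M3
print(rtf)         # 0.25 -> about 4x faster than real time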

Sopro is not state‑of‑the‑art in every metric, but it’s a demonstration that you can build a usable TTS system on modest hardware and open‑source it for community use.


Installation & Quick Start

From PyPI

pip install sopro

From the repository

git clone https://github.com/samuel-vitorino/sopro
cd sopro
pip install -e .

⚙️ Note: On Apple Silicon you’ll benefit from torch==2.6.0 and omitting torchvision for a ~3× speed‑up.
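
For example, one way to apply that note is to install Sopro and then pin the torch version afterwards. This is a sketch, not the official install procedure; the right order and whether torchvision shows up at all depend on your environment.

pip install sopro
pip install "torch==2.6.0"
pip uninstall -y torchvision   # only if another dependency pulled it in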

CLI Example

soprotts \
    --text "Sopro is a lightweight 169 million parameter text-to-speech model. Some of the main features are streaming, zero-shot voice cloning, and 0.25 real-time factor on the CPU." \
    --ref_audio ref.wav \
    --out out.wav

Python API – Non‑streaming

from sopro import SoproTTS

tts = SoproTTS.from_pretrained("samuel-vitorino/sopro", device="cpu")

wav = tts.synthesize(
    "Hello! This is a non-streaming Sopro TTS example.",
    ref_audio_path="ref.wav",
)

tts.save_wav("out.wav", wav)

Python API – Streaming

import torch
from sopro import SoproTTS

tts = SoproTTS.from_pretrained("samuel-vitorino/sopro", device="cpu")

chunks = []
for chunk in tts.stream(
    "Hello! This is a streaming Sopro TTS example.",
    ref_audio_path="ref.mp3",
):
    chunks.append(chunk.cpu())

wav = torch.cat(chunks, dim=-1)
tts.save_wav("out_stream.wav", wav)

Interactive Demo

Sopro ships with a lightweight FastAPI demo that you can run locally or in Docker.

pip install -r demo/requirements.txt
uvicorn demo.server:app --host 0.0.0.0 --port 8000

or with Docker:

docker build -t sopro-demo .
docker run --rm -p 8000:8000 sopro-demo

Open http://localhost:8000 to hear your text spoken in the cloned voice.


Best‑Practice Tips

  1. Reference Audio – Use a clear, quiet recording with minimal background noise. 3‑12 s is sufficient.
  2. Parameter Tweaking – --style_strength controls the FiLM (feature-wise linear modulation) influence; raise it for stronger voice similarity (see the CLI sketch after this list).
  3. Stop Head – For short sentences, the early‑stopping head may fail; lower --stop_threshold or --stop_patience to improve reliability.
  4. Phoneme‑Based Text – Prefer spelled‑out words over abbreviations where possible; the model handles “CPU” and “TTS” well, but complex symbols can cause hiccups.
  5. Non‑Streaming Preference – For highest audio fidelity, use the non‑streaming API; streaming is mainly for UI latency.
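
As a concrete illustration of tips 2 and 3, the flags above can be combined in a single call. The values below are placeholders rather than recommended defaults (check the CLI help for actual ranges), and --stop_patience can be lowered in the same way.

soprotts \
    --text "A short test sentence for tuning." \
    --ref_audio ref.wav \
    --out tuned.wav \
    --style_strength 1.2 \
    --stop_threshold 0.3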

Future Work & Community

  • Expand to additional languages – the current architecture supports any phonetic representation.
  • Better voice embedding by training on raw audio rather than pre‑tokenized clips.
  • Cache convolution states to further speed up repeated synthesis.
  • Publish the full training pipeline to enable community contributions.

If you find this project helpful, consider supporting the author: https://buymeacoffee.com/samuelvitorino.


Conclusion

Sopro shows that you can build a functional, fast, open‑source TTS system with just a fraction of the resources needed by many commercial models. Whether you’re prototyping a voice assistant, generating narration for accessibility, or experimenting with voice‑cloning research, Sopro offers a practical, low‑budget entry point that’s ready to run on most modern CPUs.

Happy cloning!
