Pocket‑TTS: Lightweight CPU‑Only Text‑to‑Speech Library
Text‑to‑Speech (TTS) has become an essential component in applications ranging from digital assistants to accessibility tools. Typical solutions demand powerful GPUs or rely on paid web APIs, making them impractical for edge devices or privacy‑conscious deployments. Pocket‑TTS solves this problem by delivering a high‑quality, low‑latency TTS experience that runs entirely on the CPU.
TL;DR – Pocket‑TTS is a 100 M‑parameter model that performs speech synthesis on 2 CPU cores, achieving ~200 ms first‑chunk latency and ~6× real‑time speed on a MacBook Air M4. Install with pip install pocket-tts or uv add pocket-tts, and call pocket-tts generate from the CLI or TTSModel.load_model() from Python.
Table of Contents
- Why Pocket‑TTS?
- Getting Started
- Voice Cloning
- CLI & HTTP Server
- Python API Integration
- Performance & Benchmarks
- Extending & Contributing
- License & Usage Notes
Why Pocket‑TTS?
| Feature | Pocket‑TTS | Competitor (Typical) |
|---|---|---|
| Model Size | 100 M params | 700 M–1 B+ |
| Runtime | 2 CPU cores | GPUs or TPUs |
| Latency | ~200 ms | 1–2 s |
| Deployment | pip/uv install | Docker + GPU, web API |
| Voice library | 8 pre‑built voices + wav input | Limited or none |
| Language | English only (soon multi‑lang) | Multi‑lang |
Pocket‑TTS was designed around the idea of a TTS that fits in your pocket. The following aspects set it apart:
- CPU‑only – It runs on any modern CPU, no CUDA or GPU required.
- Tiny footprint – A 100 M‑parameter transformer model (~30 MB) keeps the repo lightweight.
- Audio streaming – You can stream audio in real time while the model continues generating.
- Voice cloning support – Provide a wav file to generate a voice‑state tailored to that audio.
- CLI & HTTP API – Simple command‑line commands and a fast local server for integration.
The result is a plug‑and‑play library that works out of the box on laptops, Raspberry Pi‑style SBCs, and edge inference chips.
Getting Started
Pocket‑TTS is a pure‑Python package that requires PyTorch 2.5+. The easiest way to get it is with uv (recommended for isolated envs) or pip.
Installing with uv
uv add pocket-tts
uvx pocket-tts generate "Hello World!"
uvx runs commands in a temporary environment, which makes it well suited to quick tests.
Installing with pip
pip install --upgrade pocket-tts
pocket-tts generate "Hello World!"
The example above downloads the default voice (alba) and writes the result to tts_output.wav.
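If the CLI needs to be driven from a script, one option is to wrap it with subprocess. The sketch below is only an illustration: the helper names build_generate_command and run_generate are hypothetical, and the only CLI details it relies on are the generate subcommand and the --voice flag that appear in this article.

```python
import shutil
import subprocess
from typing import List, Optional


def build_generate_command(text: str, voice: Optional[str] = None) -> List[str]:
    """Build the argument list for `pocket-tts generate` (hypothetical helper)."""
    command = ["pocket-tts", "generate", text]
    if voice is not None:
        # --voice is the flag used in the voice cloning examples of this article.
        command += ["--voice", voice]
    return command


def run_generate(text: str, voice: Optional[str] = None) -> None:
    """Run the CLI; raises if it is not installed or exits with an error."""
    if shutil.which("pocket-tts") is None:
        raise FileNotFoundError("pocket-tts is not on PATH; install it first")
    # Passing a list (no shell=True) avoids quoting problems with the text.
    subprocess.run(build_generate_command(text, voice), check=True)
```

Calling run_generate("Hello World!") should then behave like the shell command above, assuming the executable is on PATH.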
Colab Example
!pip install pocket-tts
from pocket_tts import TTSModel
import scipy.io.wavfile as wav
model = TTSModel.load_model()
voice_state = model.get_state_for_audio_prompt("alba")
audio = model.generate_audio(voice_state, "Hello from Colab!")
wav.write("output.wav", model.sample_rate, audio.cpu().numpy())
The first model load can take ~30 s; after that the model stays in memory, so subsequent calls skip the load.
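The Colab example relies on SciPy to write the file. Where SciPy is not available, a WAV file can also be written with the standard library alone. The sketch below assumes the samples are mono floats in the range [-1.0, 1.0], which this article does not confirm for Pocket‑TTS output; the helper write_pcm16_wav is hypothetical, and a synthetic tone at an arbitrary 24 kHz rate stands in for model output so the block runs without the library.

```python
import math
import os
import struct
import tempfile
import wave
from typing import Iterable


def write_pcm16_wav(path: str, samples: Iterable[float], sample_rate: int) -> None:
    """Write mono float samples (assumed to lie in [-1.0, 1.0]) as 16-bit PCM."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# Stand-in for model output: one second of a 440 Hz tone.
sample_rate = 24000  # arbitrary placeholder; real code would use model.sample_rate
tone = [0.3 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
out_path = os.path.join(tempfile.gettempdir(), "pocket_tts_tone.wav")
write_pcm16_wav(out_path, tone, sample_rate)
```

With the real model, the call would presumably look like write_pcm16_wav("output.wav", audio.tolist(), model.sample_rate), since audio is described as a 1‑D tensor.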
Voice Cloning
Pocket‑TTS supports voice cloning by accepting a local wav file or a Hugging Face hosted wav.
pocket-tts generate "This is my cloned voice" --voice ./my_voice.wav
You can also feed a Hugging Face URL:
pocket-tts generate "I love Hugging Face" \
--voice "hf://kyutai/tts-voices/expresso/ex01-ex02_default_001_channel2_198s.wav"
The voice file must be ~16 kHz, 16‑bit PCM. In the repo you’ll find a list of example voices in the voice_catalog.md file.
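Before passing a recording as a voice prompt, it can help to check its header against the format stated above. The following sketch uses only the standard wave module, which reads uncompressed PCM files and raises wave.Error on other encodings. The names WavInfo, inspect_wav, and matches_stated_format are hypothetical, and the exact 16000 Hz comparison is an assumption, since the article only says "~16 kHz".

```python
import os
import tempfile
import wave
from typing import NamedTuple


class WavInfo(NamedTuple):
    sample_rate: int
    bits_per_sample: int
    channels: int
    duration_s: float


def inspect_wav(path: str) -> WavInfo:
    """Read the header fields of an uncompressed PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        return WavInfo(
            sample_rate=rate,
            bits_per_sample=wav_file.getsampwidth() * 8,
            channels=wav_file.getnchannels(),
            duration_s=wav_file.getnframes() / rate,
        )


def matches_stated_format(info: WavInfo) -> bool:
    """Compare against the format this article states: 16 kHz, 16-bit PCM."""
    return info.sample_rate == 16000 and info.bits_per_sample == 16


# Demo on a generated file: one second of 16 kHz, 16-bit mono silence.
demo_path = os.path.join(tempfile.gettempdir(), "pocket_tts_prompt_check.wav")
with wave.open(demo_path, "wb") as demo:
    demo.setnchannels(1)
    demo.setsampwidth(2)
    demo.setframerate(16000)
    demo.writeframes(b"\x00\x00" * 16000)

info = inspect_wav(demo_path)
print(info, matches_stated_format(info))
```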
Managing Voice States
If you need to use multiple voices in a short span, keep voice states in memory:
model = TTSModel.load_model()
voice_alba = model.get_state_for_audio_prompt("alba")
voice_marius = model.get_state_for_audio_prompt("marius")
# Reuse without re‑initializing the model
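When the set of voices is not known up front, the same idea can be expressed as a small cache that builds each voice state on first use. This is a sketch under assumptions: VoiceStateCache is a hypothetical helper, it assumes a voice state can safely be reused across calls as the snippet above suggests, and a fake loader stands in for model.get_state_for_audio_prompt so the block runs without the library.

```python
from typing import Any, Callable, Dict, List


class VoiceStateCache:
    """Memoize voice states so each voice prompt is processed only once."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader
        self._states: Dict[str, Any] = {}

    def get(self, voice: str) -> Any:
        if voice not in self._states:
            # First request for this voice: build and remember its state.
            self._states[voice] = self._loader(voice)
        return self._states[voice]


# Stand-in loader that records how often it is called.
load_calls: List[str] = []


def fake_loader(voice: str) -> dict:
    load_calls.append(voice)
    return {"voice": voice}


cache = VoiceStateCache(fake_loader)
cache.get("alba")
cache.get("marius")
cache.get("alba")  # served from the cache; the loader is not called again
```

In real use the cache would presumably be created as VoiceStateCache(model.get_state_for_audio_prompt).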
CLI & HTTP Server
CLI
Pocket‑TTS ships with two high‑level commands:
- generate – Generates a wav file from text.
- serve – Runs a local FastAPI HTTP server.
Run the server and visit the web UI:
uvx pocket-tts serve # or pocket-tts serve if pip
Then open http://localhost:8000 in a browser. The in‑browser interface stays responsive because the server keeps the model in memory and generates at ~6× real‑time speed.
HTTP API
The API exposes a /generate endpoint that accepts a JSON payload via POST:
{ "text": "Hello, world!", "voice": "alba" }
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{"text":"Hello","voice":"alba"}' \
--output out.wav
The server also streams audio in chunks if you set stream=true in the request.
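The curl call above can be mirrored from Python with the standard library. The sketch below takes the endpoint path, port, and payload fields from this article as given; the helper names are hypothetical, and the response is assumed to be raw WAV bytes, as the curl --output usage suggests. Reading the body in chunks keeps memory use flat whether or not the server streams.

```python
import json
import urllib.request


def build_generate_request(text: str, voice: str = "alba",
                           base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST request described above without sending it."""
    payload = json.dumps({"text": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, out_path: str, voice: str = "alba",
                       chunk_size: int = 8192) -> None:
    """Send the request and write the response body to disk chunk by chunk."""
    request = build_generate_request(text, voice)
    with urllib.request.urlopen(request, timeout=60) as response, \
            open(out_path, "wb") as out_file:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            out_file.write(chunk)
```

With the server running, synthesize_to_file("Hello", "out.wav") should then be equivalent to the curl command.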
Python API Integration
When embedding TTS in a service, you normally use the library directly.
from pocket_tts import TTSModel
import torch
model = TTSModel.load_model()
voice_state = model.get_state_for_audio_prompt("alba")
text = "Complex sentence that may be very long ..."
# Audio is a 1‑D torch Tensor
audio = model.generate_audio(voice_state, text)
# Save the raw tensor (torch.save does not produce a playable audio file)
torch.save(audio, "demo.pt")
The API is intentionally lightweight: the model and voice states are normal PyTorch objects. Keep them cached if you replay the same voices or run in an async environment.
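In an async service, a blocking synthesis call would stall the event loop if awaited inline. One common pattern is to push it to a worker thread with asyncio.to_thread (Python 3.9+). The sketch below is an assumption-laden illustration: AsyncSynthesizer is a hypothetical wrapper, a fake function stands in for model.generate_audio, and the lock serializes calls because the article does not say whether concurrent calls on one model are safe.

```python
import asyncio
import time
from typing import Any, Callable, List


class AsyncSynthesizer:
    """Run a blocking generate(voice_state, text) call off the event loop."""

    def __init__(self, generate: Callable[[Any, str], Any]) -> None:
        self._generate = generate
        self._lock = asyncio.Lock()  # one synthesis at a time (cautious default)

    async def synthesize(self, voice_state: Any, text: str) -> Any:
        async with self._lock:
            return await asyncio.to_thread(self._generate, voice_state, text)


def fake_generate(voice_state: Any, text: str) -> str:
    """Stand-in for the real model call; sleeps to simulate work."""
    time.sleep(0.05)
    return f"audio for {text!r}"


async def main() -> List[str]:
    # Create the wrapper inside the running loop so the lock binds to it.
    synthesizer = AsyncSynthesizer(fake_generate)
    results = await asyncio.gather(
        synthesizer.synthesize(None, "first"),
        synthesizer.synthesize(None, "second"),
    )
    return list(results)


results = asyncio.run(main())
```

With the real library, the wrapper would presumably be built as AsyncSynthesizer(model.generate_audio) and reuse cached voice states.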
Performance & Benchmarks
The repo’s docs/tech_report.md provides an in‑depth analysis, but key takeaways are:
- Latency – First chunk ~200 ms on an Apple M4; ~350 ms on Intel i7‑12700K.
- Speed – Generates audio at ~6× real‑time speed on the same CPUs.
- CPU Usage – Uses only 2 cores; memory stays under 2 GB.
- Model size – 100 M parameters; ~30 MB on disk, ~120 MB of RAM when loaded.
Because audio is streamed, playback can start before synthesis completes, which is critical for interactive applications.
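To make the latency and speed figures concrete, the arithmetic can be worked through for a sample clip. This sketch uses the ~6× real‑time and ~200 ms first‑chunk figures quoted above, with a 30 s clip chosen purely for illustration; the function names are hypothetical, and the model assumes generation proceeds at a steady rate.

```python
def full_synthesis_time_s(audio_seconds: float, speed_factor: float) -> float:
    """Wall-clock time to synthesize a clip at speed_factor times real time."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_seconds / speed_factor


def wait_before_playback_s(audio_seconds: float, speed_factor: float,
                           first_chunk_latency_s: float, streaming: bool) -> float:
    """Delay before the listener hears anything."""
    if streaming:
        # Playback starts as soon as the first chunk arrives.
        return first_chunk_latency_s
    # Without streaming, the whole clip must be generated first.
    return full_synthesis_time_s(audio_seconds, speed_factor)


clip_seconds = 30.0
batch_wait = wait_before_playback_s(clip_seconds, 6.0, 0.2, streaming=False)
stream_wait = wait_before_playback_s(clip_seconds, 6.0, 0.2, streaming=True)
print(f"batch: {batch_wait:.1f} s, streaming: {stream_wait:.1f} s")
# prints "batch: 5.0 s, streaming: 0.2 s"
```

As long as the speed factor stays above 1, generation outpaces playback, so a streamed clip should not run dry under this steady-rate assumption.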
Extending & Contributing
Pocket‑TTS is open source under the MIT license. Contributions are welcome in several areas:
- New voices – Fork the model, adjust voice_catalog.md, and push a PR.
- Multi‑language support – Port the current English checkpoint.
- Quantization – Add int8 or float16 variants for even smaller runtime.
- WebAssembly – Integrate the Rust Candle version to run in browsers.
Development guidelines:
- Run tests: pytest.
- Build docs: mkdocs build.
- CI uses uv to install dependencies.
For detailed instructions read CONTRIBUTING.md.
License & Usage Notes
Pocket‑TTS is released under the MIT license. Voice assets are subject to individual licenses (see voice_catalog.md).
Disclaimer – The repository explicitly prohibits misuse such as impersonation, disinformation, or abusive content. Always ensure lawful and ethical use.
TL;DR
Pocket‑TTS is a lightweight, CPU‑friendly TTS library that runs on any modern processor. Install, generate audio, or serve via HTTP in minutes. For developers looking for fast, open‑source TTS without GPU dependencies, Pocket‑TTS is an excellent choice.