Pocket‑TTS: Lightweight CPU‑Only Text‑to‑Speech Library

Text‑to‑Speech (TTS) has become an essential component in applications ranging from digital assistants to accessibility tools. Typical solutions demand powerful GPUs or rely on paid web APIs, making them impractical for edge devices or privacy‑conscious deployments. Pocket‑TTS solves this problem by delivering a high‑quality, low‑latency TTS experience that runs entirely on the CPU.

TL;DR – Pocket‑TTS is a 100 M‑parameter model that performs speech synthesis on 2 CPU cores, achieving ~200 ms first‑chunk latency and ~6× real‑time speed on a MacBook Air M4. Install with pip install pocket-tts or uv add pocket-tts and call pocket-tts generate from the CLI or TTSModel.load_model() from Python.

Why Pocket‑TTS?

| Feature | Pocket‑TTS | Competitor (Typical) |
| --- | --- | --- |
| Model size | 100 M params | 700 M–1 B+ |
| Runtime | 2 CPU cores | GPU or TPU |
| Latency | ~200 ms | 1–2 s |
| Deployment | pip/uv install | Docker + GPU, web API |
| Voice library | 8 pre‑built voices + wav input | Limited or none |
| Language | English only (multi‑language planned) | Multi‑language |

Pocket‑TTS was designed around the idea of a TTS that fits in your pocket. The following aspects set it apart:

  1. CPU‑only – It runs on any modern CPU, no CUDA or GPU required.
  2. Tiny footprint – A 100 M‑parameter transformer model (~30 MB) keeps the repo lightweight.
  3. Audio streaming – You can stream audio in real time while the model continues generating.
  4. Voice cloning support – Provide a wav file to generate a voice‑state tailored to that audio.
  5. CLI & HTTP API – Simple command‑line commands and a fast local server for integration.

The result is a plug‑and‑play library that works out of the box on laptops, Raspberry Pi‑style SBCs, and edge inference chips.

Getting Started

Pocket‑TTS is a pure‑Python package that requires PyTorch 2.5+. The easiest way to get it is with uv (recommended for isolated envs) or pip.

Installing with uv

uv add pocket-tts
uvx pocket-tts generate "Hello World!"

uvx runs commands in a temporary environment; this is perfect for quick tests.

Installing with pip

pip install --upgrade pocket-tts
pocket-tts generate "Hello World!"

The example above will download the default voice (alba) and write a file tts_output.wav.

Colab Example

!pip install pocket-tts

from pocket_tts import TTSModel
import scipy.io.wavfile as wav

# Load the model once; it stays in memory for later calls
model = TTSModel.load_model()
voice_state = model.get_state_for_audio_prompt("alba")
audio = model.generate_audio(voice_state, "Hello from Colab!")

# Convert the audio tensor to numpy before writing the wav file
wav.write("output.wav", model.sample_rate, audio.cpu().numpy())

The first model load can take ~30 s; subsequent calls reuse the model already in memory.
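
To listen to the result directly in the notebook, you can use IPython's built‑in audio widget (available in Colab by default). This is plain IPython, not part of the Pocket‑TTS API; audio and model.sample_rate come from the snippet above:

from IPython.display import Audio

# Render an inline audio player for the generated waveform
Audio(audio.cpu().numpy(), rate=model.sample_rate)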

Voice Cloning

Pocket‑TTS supports voice cloning by accepting a local wav file or a Hugging Face hosted wav.

pocket-tts generate "This is my cloned voice" --voice ./my_voice.wav

You can also feed a Hugging Face URL:

pocket-tts generate "I love Hugging Face" \
  --voice "hf://kyutai/tts-voices/expresso/ex01-ex02_default_001_channel2_198s.wav"

The voice file must be ~16 kHz, 16‑bit PCM. In the repo you’ll find a list of example voices in the voice_catalog.md file.
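
If your reference clip is not already 16 kHz PCM, one way to convert it is with torchaudio (install it separately if it is not in your environment). This is a minimal preprocessing sketch, not part of the Pocket‑TTS API:

import torchaudio
import torchaudio.functional as F

# Load the raw reference clip, downmix to mono, and resample to 16 kHz
waveform, sr = torchaudio.load("my_voice_raw.wav")
waveform = waveform.mean(dim=0, keepdim=True)
waveform = F.resample(waveform, orig_freq=sr, new_freq=16_000)

# Write out 16-bit PCM, ready to pass to --voice
torchaudio.save("my_voice.wav", waveform, 16_000, encoding="PCM_S", bits_per_sample=16)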

Managing Voice States

If you need to use multiple voices in a short span, keep voice states in memory:

model = TTSModel.load_model()
voice_alba = model.get_state_for_audio_prompt("alba")
voice_marius = model.get_state_for_audio_prompt("marius")
# Reuse without re‑initializing the model
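
With both states in memory, switching voices is just a matter of passing a different state to generate_audio; nothing is reloaded between calls:

audio_alba = model.generate_audio(voice_alba, "Hello, this is the alba voice.")
audio_marius = model.generate_audio(voice_marius, "And this is marius speaking.")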

CLI & HTTP Server

CLI

Pocket‑TTS ships with two high‑level commands:

  • generate – Generates a wav file from text.
  • serve – Runs a local FastAPI HTTP server.

Run the server and visit the web UI:

uvx pocket-tts serve  # or pocket-tts serve if installed with pip

Open http://localhost:8000 – the in‑browser interface is responsive, and the server keeps the model in memory so generation stays at ~6× real‑time.

HTTP API

The API exposes a /generate endpoint that accepts a POST request with a JSON payload:

{ "text": "Hello, world!", "voice": "alba" }

You can call it with curl:

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"text":"Hello","voice":"alba"}' \
  --output out.wav

The server also streams audio in chunks if you set stream=true in the request.
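
A minimal Python client for the streaming mode could look like the sketch below. The stream field comes from the note above, but the chunked‑response handling is an assumption about the server, so treat this as a starting point rather than the canonical client:

import requests

payload = {"text": "Hello, streaming world!", "voice": "alba", "stream": True}
with requests.post("http://localhost:8000/generate", json=payload, stream=True) as resp:
    resp.raise_for_status()
    with open("out.wav", "wb") as f:
        # Write audio chunks as they arrive instead of waiting for the full body
        for chunk in resp.iter_content(chunk_size=4096):
            f.write(chunk)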

Python API Integration

When embedding TTS in a service, you normally use the library directly.

from pocket_tts import TTSModel
import torch

model = TTSModel.load_model()
voice_state = model.get_state_for_audio_prompt("alba")
text = "Complex sentence that may be very long ..."

# Audio is a 1‑D torch Tensor
audio = model.generate_audio(voice_state, text)
# Save or stream
torch.save(audio, "demo.pt")

The API is intentionally lightweight: the model and voice states are normal PyTorch objects. Keep them cached if you reuse the same voices or run in an async environment.
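
As a sketch of that caching pattern, the snippet below keeps voice states in a dict and pushes the blocking generate_audio call onto a worker thread with asyncio.to_thread (Python 3.9+). The caching layer is an illustration, not something Pocket‑TTS ships:

import asyncio
from pocket_tts import TTSModel

model = TTSModel.load_model()
_voice_cache = {}

def get_voice_state(name: str):
    # Load each voice prompt once and reuse the resulting state
    if name not in _voice_cache:
        _voice_cache[name] = model.get_state_for_audio_prompt(name)
    return _voice_cache[name]

async def synthesize(text: str, voice: str = "alba"):
    # Run the blocking synthesis call off the event loop
    return await asyncio.to_thread(model.generate_audio, get_voice_state(voice), text)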

Performance & Benchmarks

The repo’s docs/tech_report.md provides an in‑depth analysis, but key takeaways are:

  • Latency – First chunk ~200 ms on an Apple M4; ~350 ms on Intel i7‑12700K.
  • Speed – Generates 6× real‑time on the same CPUs.
  • CPU Usage – Uses only 2 cores; memory stays under 2 GB.
  • Model size – 100 M parameters; ~30 MB on disk, ~120 MB of RAM when loaded.
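
To reproduce the real‑time factor on your own machine, all you need is the wall‑clock time and the duration of the generated audio. This rough measurement uses only the Python API shown earlier:

import time
from pocket_tts import TTSModel

model = TTSModel.load_model()
voice_state = model.get_state_for_audio_prompt("alba")

start = time.perf_counter()
audio = model.generate_audio(voice_state, "A sentence long enough to give a stable measurement.")
elapsed = time.perf_counter() - start

# Real-time factor: seconds of audio produced per second of compute
audio_seconds = audio.shape[-1] / model.sample_rate
print(f"~{audio_seconds / elapsed:.1f}x real time")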

Because the audio is streamed, you can start playback before synthesis completes, which is critical for interactive applications.

Extending & Contributing

Pocket‑TTS is open source under the MIT license. Contributions are welcome in several areas:

  1. New voices – Fork the model, adjust the voice_catalog.md, and push a PR.
  2. Multi‑language support – Port the current English checkpoint.
  3. Quantization – Add int8 or float16 variants for even smaller runtime.
  4. WebAssembly – Integrate the Rust Candle version to run in browsers.
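
On the quantization item above, a plausible starting point is PyTorch's dynamic quantization, assuming the loaded model is a standard torch.nn.Module built from nn.Linear layers (an assumption a contributor would need to verify); whether int8 weights preserve audio quality is exactly what such a contribution would have to evaluate:

import torch
from pocket_tts import TTSModel

model = TTSModel.load_model()

# Swap nn.Linear weights for dynamically quantized int8 versions
quantized = torch.ao.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)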

Development guidelines:

  • Run tests: pytest.
  • Build docs: mkdocs build.
  • CI uses uv to install dependencies.

For detailed instructions read CONTRIBUTING.md.

License & Usage Notes

Pocket‑TTS is released under the MIT license. Voice assets are subject to individual licenses (see voice_catalog.md).

Disclaimer – The repository explicitly prohibits misuse such as impersonation, disinformation, or abusive content. Always ensure lawful and ethical use.


TL;DR

Pocket‑TTS is a lightweight, CPU‑friendly TTS library that runs on any modern processor. Install, generate audio, or serve via HTTP in minutes. For developers looking for fast, open‑source TTS without GPU dependencies, Pocket‑TTS is an excellent choice.

