Miso TTS 8B: A High-Quality Open-Source Text-to-Speech Model

Miso TTS 8B is a state-of-the-art, open-source text-to-speech model with 8 billion parameters, offering highly emotive speech generation and voice cloning capabilities.

Miso Labs has released Miso TTS 8B, an open-source text-to-speech model with 8 billion parameters. It is designed to produce emotive, natural-sounding speech for applications ranging from conversational AI to content creation.

What Makes Miso TTS 8B Special?

Miso TTS 8B is not just another TTS model. It's built on a sophisticated architecture that combines a large backbone transformer with a smaller audio decoder, allowing it to generate speech that is both expressive and contextually aware. The model is inspired by the Sesame CSM architecture and uses RVQ (Residual Vector Quantization) to produce high-quality audio codes from text input.
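
To make the RVQ idea concrete, here is a toy sketch in pure Python. It is not Mimi's quantizer or Miso's code: the random codebooks, the 2-dimensional vectors, and the helper names (rvq_encode, rvq_decode, make_toy_codebooks) are illustrative assumptions. Each stage picks the codebook entry nearest to what the previous stages left unexplained, so a vector is stored as one small integer per stage.

```python
import random

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

def make_toy_codebooks(num_stages=4, entries=16, dim=2, seed=0):
    """Random toy codebooks (hypothetical); the scale halves at each stage.

    Every codebook also contains the zero vector, so a stage can never
    make the residual larger than it already was.
    """
    rng = random.Random(seed)
    return [
        [[0.0] * dim]
        + [[rng.uniform(-1, 1) / (2 ** stage) for _ in range(dim)]
           for _ in range(entries - 1)]
        for stage in range(num_stages)
    ]

codebooks = make_toy_codebooks()
codes = rvq_encode([0.37, -0.52], codebooks)   # one index per stage
approx = rvq_decode(codes, codebooks)          # sum of the chosen entries
```

Decoding with only the first few codes gives a coarser approximation, and each additional stage refines it. A real codec learns its codebooks from data rather than sampling them at random.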

Key Features:

  • 8 Billion Parameters: The model's large size is intended to help it capture subtle nuances in speech, which is the basis for its emphasis on expressiveness.
  • Voice Cloning: Miso TTS can condition on prior audio to clone voices, making it ideal for applications that require consistent speaker identity.
  • Conversational Context: The model can take interleaved text and audio tokens, allowing it to generate speech that fits naturally into a conversation history.
  • Watermarking: Generated audio is watermarked by default using SilentCipher, helping to prevent misuse and impersonation.

Architecture Deep Dive

Miso TTS 8B uses two transformer components:

  1. Backbone Transformer (8B parameters): This large model consumes text and audio-frame embeddings, processing the interleaved sequence to understand context and generate appropriate speech patterns.

  2. Audio Decoder (300M parameters): A smaller transformer that autoregressively predicts higher-order audio codebooks within each frame, refining the output from the backbone.
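
The division of labor between the two components can be sketched as a simple loop. This is a control-flow illustration only, not the model's code: backbone_step and decoder_step are hypothetical stand-ins for the two transformers, and the assumption that the backbone supplies the first code of each frame is inferred from the description above and from CSM-style designs rather than confirmed for Miso TTS.

```python
from typing import Callable, List

NUM_CODEBOOKS = 32  # from the article: 32 audio codebooks per frame

def generate_frame(
    backbone_step: Callable[[], int],
    decoder_step: Callable[[List[int]], int],
    num_codebooks: int = NUM_CODEBOOKS,
) -> List[int]:
    """Produce one audio frame as a list of codebook indices.

    backbone_step: hypothetical stand-in that returns the first code.
    decoder_step:  hypothetical stand-in that returns the next code,
                   given all codes already chosen for this frame.
    """
    codes = [backbone_step()]
    while len(codes) < num_codebooks:
        # Autoregressive within the frame: each new code sees the earlier ones.
        codes.append(decoder_step(codes))
    return codes

# Toy stand-ins so the loop can be run without any model weights.
frame = generate_frame(
    backbone_step=lambda: 7,
    decoder_step=lambda codes: (sum(codes) + len(codes)) % 2051,
)
```

The point of the split is cost: the large backbone runs once per frame, while the small decoder handles the remaining per-frame steps.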

The model uses the Mimi audio tokenizer with 32 audio codebooks and a vocabulary of 2,051 audio tokens. The text vocabulary is 128,256 tokens, and the maximum sequence length is 2,048 tokens.
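
These numbers translate into a rough audio budget. The sketch below assumes Mimi's published frame rate of 12.5 frames per second (80 ms per frame) applies unchanged, and that each audio frame occupies one position in the 2,048-token sequence, as in CSM-style models. Both are assumptions about Miso TTS, and the function names are hypothetical.

```python
MIMI_FRAME_RATE_HZ = 12.5  # Mimi's frame rate; assumed unchanged in Miso TTS
MAX_SEQ_LEN = 2048         # from the article
NUM_CODEBOOKS = 32         # from the article

def frames_for_duration(ms: float) -> int:
    """Number of whole audio frames that cover `ms` milliseconds."""
    return int(ms * MIMI_FRAME_RATE_HZ / 1000)

def codes_for_duration(ms: float) -> int:
    """Total codebook indices produced for `ms` milliseconds of audio."""
    return frames_for_duration(ms) * NUM_CODEBOOKS

def remaining_audio_seconds(used_positions: int) -> float:
    """Audio that still fits, assuming one sequence position per frame."""
    free = max(MAX_SEQ_LEN - used_positions, 0)
    return free / MIMI_FRAME_RATE_HZ

print(frames_for_duration(10_000))   # 125 frames for a 10-second clip
print(codes_for_duration(10_000))    # 4000 codes (125 frames x 32 codebooks)
print(remaining_audio_seconds(0))    # 163.84 seconds if nothing else is in context
```

Under these assumptions, text tokens and prior audio in the context both eat into the same budget, so long conversation histories leave less room for new speech.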

Getting Started

To run Miso TTS 8B locally, you'll need a GPU with at least 24 GB of VRAM for bfloat16 inference. Here's how to get started:

Installation

First, install uv if you don't have it:

curl -LsSf https://astral.sh/uv/install.sh | sh

Then clone the repository and set up the environment:

git clone https://github.com/MisoLabsAI/MisoTTS.git
cd MisoTTS
uv sync --python 3.10
source .venv/bin/activate

Basic Usage

Run the example script to generate a conversation:

uv run python run_misotts.py

This will create a file named full_conversation.wav in the repository root.

Python API

For more control, you can use the Python API directly:

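import torch
import torchaudio
from generator import load_miso_8b

# Use the GPU when one is available; CPU inference works but is slow.
device = "cuda" if torch.cuda.is_available() else "cpu"
generator = load_miso_8b(
    device=device,
    model_path_or_repo_id="MisoLabs/MisoTTS",
)

# Synthesize one utterance with no conversational context.
audio = generator.generate(
    text="Hello from Miso.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,  # cap the output at 10 seconds
)

# torchaudio.save expects a [channels, samples] tensor, so add a channel dimension.
torchaudio.save("miso.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)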

Voice Cloning

To clone a voice, provide a prompt audio segment:

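import torchaudio
from generator import Segment, load_miso_8b

generator = load_miso_8b(device="cuda")

# torchaudio.load returns a [channels, samples] tensor; average the channels
# so that both mono and stereo prompts become a 1-D waveform.
prompt_audio, sample_rate = torchaudio.load("prompt.wav")
prompt_audio = torchaudio.functional.resample(
    prompt_audio.mean(dim=0),
    orig_freq=sample_rate,
    new_freq=generator.sample_rate,
)

# The transcript should match what is spoken in the prompt audio.
context = [
    Segment(
        speaker=0,
        text="This is the transcript for the prompt audio.",
        audio=prompt_audio,
    )
]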

audio = generator.generate(
    text="This is the next sentence to synthesize.",
    speaker=0,
    context=context,
    max_audio_length_ms=10_000,
)
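
The same context mechanism extends to multi-turn conversations: each generated utterance can be appended to the context so that later turns are conditioned on earlier ones. The helper below is a sketch, not part of the Miso TTS repository. synthesize_conversation is a hypothetical name, and it relies only on the generate(...) keywords and Segment(speaker, text, audio) fields shown above, passed in as arguments so that it can be exercised without loading the model.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

def synthesize_conversation(
    generator: Any,
    make_segment: Callable[..., Any],
    turns: Sequence[Tuple[int, str]],
    context: Optional[List[Any]] = None,
    max_audio_length_ms: int = 10_000,
) -> Tuple[List[Any], List[Any]]:
    """Generate each (speaker, text) turn, feeding earlier turns back as context.

    Returns the list of generated clips and the final conversation history.
    """
    history = list(context or [])  # copy so the caller's list is not mutated
    clips = []
    for speaker, text in turns:
        audio = generator.generate(
            text=text,
            speaker=speaker,
            context=history,
            max_audio_length_ms=max_audio_length_ms,
        )
        clips.append(audio)
        # Store the new utterance so the next turn can condition on it.
        history.append(make_segment(speaker=speaker, text=text, audio=audio))
    return clips, history
```

With the real objects, a call might look like synthesize_conversation(generator, Segment, [(0, "Hi there."), (1, "Hello!")], context=context), after which the clips can be joined with torch.cat before saving. The history grows with every turn, so long conversations may need older segments dropped to stay within the model's 2,048-token sequence limit.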

System Requirements

Miso TTS 8B is a large model and requires significant hardware:

Precision        Weights (approx.)   Recommended VRAM   Example GPUs
bfloat16/fp16    ~16 GB              24 GB              RTX 3090/4090, A5000, L4
float32          ~33 GB              40 GB+             A100 40 GB, A6000 48 GB, H100

  • CPU: Inference runs but is slow. Budget at least ~20 GB RAM for bfloat16 and ~40 GB for float32.
  • Disk: The first run downloads ~30–40 GB total (model checkpoint, Mimi codec, SilentCipher watermarker, Llama 3.2 tokenizer).
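
The weight sizes in the table follow from simple arithmetic: parameter count times bytes per parameter. The sketch below assumes the 8B backbone and 300M decoder figures are roughly exact and uses decimal gigabytes; weight_size_gb is a hypothetical helper name.

```python
BYTES_PER_PARAM = {"bfloat16": 2, "float16": 2, "float32": 4}

# Backbone plus audio decoder, using the parameter counts quoted earlier.
TOTAL_PARAMS = 8.0e9 + 0.3e9

def weight_size_gb(num_params: float, dtype: str) -> float:
    """Approximate size of the weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in ("bfloat16", "float32"):
    print(dtype, round(weight_size_gb(TOTAL_PARAMS, dtype), 1))
# bfloat16 16.6
# float32 33.2
```

This covers the weights only. Activations, the attention cache, and the Mimi and watermarking models need additional memory, which is consistent with the recommended VRAM being higher than the weight size.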

Safety and Ethical Use

Miso Labs emphasizes responsible use of this technology. The model should not be used to impersonate people, create deceptive audio, commit fraud, or generate harmful content. Generated audio is watermarked by default, and if you deploy this model, you should use your own private watermark key.

Conclusion

Miso TTS 8B represents a significant step forward in open-source text-to-speech technology. Its combination of high parameter count, voice cloning capabilities, and conversational context makes it a powerful tool for developers and researchers. While it requires substantial hardware, the quality of the output is well worth the investment.

For more information, visit the Miso Labs website or check out the model on Hugging Face.

Source

MisoLabsAI/MisoTTS: Miso TTS is an 8 billion, highly emotive text-to-speech model