NeuTTS Air: On-Device Voice AI with Instant Cloning

NeuTTS Air: Revolutionizing On-Device Voice AI

State-of-the-art voice AI has largely been available only through web APIs, which limits where and how it can be used. Neuphonic's NeuTTS Air takes a different route: Neuphonic describes it as the world's first super-realistic, on-device text-to-speech (TTS) speech language model with instant voice cloning.

Built on a 0.5B LLM backbone, NeuTTS Air delivers natural-sounding speech, real-time performance, and built-in safeguards such as output watermarking, all on your local device. This makes it a fit for embedded voice agents, intelligent assistants, interactive toys, and applications that require compliance-safe, offline voice synthesis.

Key Features of NeuTTS Air:

  • Realism for Its Size: Produces natural, human-sounding voices, with audio quality that is notable for a 0.5B model running entirely on local hardware.
  • Optimized for On-Device Deployment: Distributed in GGUF, the model file format of the GGML ecosystem, NeuTTS Air is designed to run on a wide range of devices, including smartphones, laptops, and resource-constrained platforms such as the Raspberry Pi.
  • Instant Voice Cloning: With as little as 3 seconds of audio, you can create a personalized speaker, allowing for dynamic and customized voice interactions.
  • Efficient Architecture: A simple LM + codec architecture built on a 0.5B backbone balances speed, size, and audio quality for real-world applications.
  • Advanced Audio Codec: Uses NeuCodec, Neuphonic's 50 Hz neural audio codec, which maintains high audio fidelity at low bitrates using a single codebook.
  • Watermarked Outputs: For responsible AI use, every audio file generated by NeuTTS Air includes a Perth (Perceptual Threshold) Watermarker.
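
To make "low bitrates using a single codebook" concrete: a single-codebook codec emits one token per frame, so its bitrate is the frame rate multiplied by the bits needed to index one codebook entry. The sketch below uses the 50 Hz rate from the list above; the codebook size of 65,536 entries is an illustrative assumption, not a figure given in this article.

```python
import math

def codec_bitrate_bps(frame_rate_hz: float, codebook_size: int, num_codebooks: int = 1) -> float:
    """Bits per second for a codec that emits one token per codebook per frame."""
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_token

FRAME_RATE_HZ = 50                # stated in the feature list
ASSUMED_CODEBOOK_SIZE = 2 ** 16   # hypothetical value, for illustration only

bitrate = codec_bitrate_bps(FRAME_RATE_HZ, ASSUMED_CODEBOOK_SIZE)
print(f"{bitrate:.0f} bits/s")  # 800 bits/s under the assumed codebook size
```

For comparison, a codec with several codebooks per frame multiplies this figure by the number of codebooks, which is why a single codebook helps keep the token stream short.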

Technical Specifications:

  • Supported Languages: Currently focused on English.
  • Context Window: A 2048-token context window allows processing approximately 30 seconds of audio, including prompt duration.
  • Inference Speed: Delivers real-time generation on mid-range devices.
  • Power Consumption: Optimized for mobile and embedded devices, ensuring energy efficiency.
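
The 30-second figure follows from the codec rate. As a rough sketch, assuming one codec token per 50 Hz frame and that audio tokens share the 2048-token window with the text prompt, the budget can be estimated like this (the function names are illustrative, not part of the project):

```python
import math

CONTEXT_TOKENS = 2048      # context window stated above
CODEC_FRAME_RATE_HZ = 50   # NeuCodec frame rate stated in the feature list

def audio_tokens(seconds: float, frame_rate_hz: int = CODEC_FRAME_RATE_HZ) -> int:
    """Codec tokens needed for `seconds` of audio, assuming one token per frame."""
    return math.ceil(seconds * frame_rate_hz)

def remaining_tokens(reference_s: float, generated_s: float,
                     context: int = CONTEXT_TOKENS) -> int:
    """Tokens left for text and special tokens after reference and generated audio."""
    return context - audio_tokens(reference_s) - audio_tokens(generated_s)

print(audio_tokens(30))            # 1500 audio tokens for 30 seconds
print(remaining_tokens(3, 27))     # 548 tokens left for the text side
```

Under these assumptions, a longer reference clip leaves less room for generated speech, so long passages are better split into shorter segments.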

Getting Started with NeuTTS Air:

Integrating NeuTTS Air into your projects is straightforward. The project provides a clear guide on cloning the repository, installing necessary dependencies like espeak, and setting up Python environments.

Users can run basic examples to synthesize speech from custom text and a reference audio clip. NeuTTS Air also supports a streaming mode that generates audio in chunks, so playback can begin before the full utterance has been synthesized.
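
Chunked output is typically consumed in a loop that handles each piece as it arrives. The following is a minimal sketch of that pattern using only the Python standard library. It does not call NeuTTS Air: `fake_tts_stream` is a hypothetical stand-in that yields 16-bit mono PCM chunks, and the 24 kHz sample rate is an assumption to be checked against the project's README.

```python
import io
import math
import struct
import wave
from typing import Iterable, Iterator

SAMPLE_RATE = 24_000  # assumed output rate; verify against the real model

def fake_tts_stream(duration_s: float, chunk_s: float = 0.5) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call: yields 16-bit mono PCM chunks."""
    total = int(duration_s * SAMPLE_RATE)
    chunk = int(chunk_s * SAMPLE_RATE)
    for start in range(0, total, chunk):
        n = min(chunk, total - start)
        # A 440 Hz tone stands in for synthesized speech.
        samples = (
            int(8000 * math.sin(2 * math.pi * 440 * (start + i) / SAMPLE_RATE))
            for i in range(n)
        )
        yield struct.pack(f"<{n}h", *samples)

def write_stream_to_wav(chunks: Iterable[bytes], target) -> int:
    """Write each chunk as soon as it arrives; return the number of frames written."""
    frames = 0
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)          # 2 bytes per sample = 16-bit PCM
        wav.setframerate(SAMPLE_RATE)
        for chunk in chunks:
            wav.writeframes(chunk)   # an audio player could consume the chunk here instead
            frames += len(chunk) // 2
    return frames

buffer = io.BytesIO()
frames_written = write_stream_to_wav(fake_tts_stream(2.0), buffer)
print(frames_written / SAMPLE_RATE, "seconds written")
```

Swapping the stand-in generator for the project's actual streaming call, and the WAV writer for an audio output device, keeps the same loop structure.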

Quick Start Guide:

  1. Clone the Repository:
    git clone https://github.com/neuphonic/neutts-air.git
    cd neutts-air
    
  2. Install espeak: Follow platform-specific instructions (e.g., brew install espeak for macOS, sudo apt install espeak for Ubuntu/Debian).
  3. Install Python Dependencies:
    pip install -r requirements.txt
    
  4. (Optional) GGUF Support: Install llama-cpp-python (pip install llama-cpp-python) to run GGUF models.
  5. (Optional) ONNX Decoder: Install onnxruntime (pip install onnxruntime) to use the ONNX decoder.
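
After working through the steps above, a quick preflight check can confirm that the prerequisites are visible from the current Python environment. This is a small sketch, not part of the project: it assumes the espeak binary is named `espeak` on your PATH and relies on the import names `llama_cpp` (for llama-cpp-python) and `onnxruntime`.

```python
import importlib.util
import shutil

def preflight() -> dict:
    """Report which Quick Start prerequisites can be found from this environment."""
    return {
        # Assumes the binary is called "espeak"; some systems may name it differently.
        "espeak": shutil.which("espeak") is not None,
        # Optional: llama-cpp-python is imported as "llama_cpp".
        "llama_cpp (optional, GGUF)": importlib.util.find_spec("llama_cpp") is not None,
        # Optional: needed only for the ONNX decoder.
        "onnxruntime (optional, ONNX decoder)": importlib.util.find_spec("onnxruntime") is not None,
    }

for name, found in preflight().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

Only the espeak entry is required by the Quick Start; the other two matter only if you plan to use GGUF models or the ONNX decoder.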

Detailed instructions for running the model, utilizing streaming features, and preparing optimal reference audio for cloning are provided in the project's README.
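
Before cloning a voice, it can help to confirm that a reference clip meets the 3-second minimum mentioned in the feature list. The sketch below checks the duration of a WAV file with the standard library only; it assumes the reference is an uncompressed PCM WAV, and the helper names are illustrative. Any further requirements (sample rate, channel count, maximum length) should be taken from the README rather than from this example.

```python
import io
import wave

MIN_REFERENCE_SECONDS = 3.0  # minimum stated in the feature list

def wav_duration_seconds(source) -> float:
    """Duration of a PCM WAV file, given a path or a binary file-like object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def is_long_enough(source, minimum_s: float = MIN_REFERENCE_SECONDS) -> bool:
    """True if the clip is at least `minimum_s` seconds long."""
    return wav_duration_seconds(source) >= minimum_s

def make_silent_wav(seconds: float, sample_rate: int = 16_000) -> io.BytesIO:
    """Build an in-memory silent mono WAV, used here only to demonstrate the check."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buffer.seek(0)
    return buffer

print(is_long_enough(make_silent_wav(4.0)))  # True: 4 s meets the 3 s minimum
print(is_long_enough(make_silent_wav(2.0)))  # False: 2 s is too short
```

In practice you would pass the path to your own recording, for example `is_long_enough("my_reference.wav")`, where the file name is a placeholder.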

Responsible AI and Future Development:

Neuphonic emphasizes responsible use of NeuTTS Air and is committed to building faster, smaller, and more ethical on-device voice AI solutions. They encourage developers to contribute and adhere to ethical guidelines when deploying this powerful technology.

NeuTTS Air is a notable step toward making advanced voice AI accessible and deployable at the edge, opening the door to applications that need speech synthesis without a network connection.
