Fish-Speech: Advanced Open-Source TTS System
Fish-Speech Rebranded to OpenAudio: Unleashing the Next Generation of TTS
Fish-Speech, a prominent open-source initiative in the field of Text-to-Speech (TTS), has officially rebranded to OpenAudio. This evolution marks a significant step forward, introducing a new series of advanced TTS models, headlined by OpenAudio S1 and OpenAudio S1-mini. Building on the robust foundation of Fish-Speech, these models promise enhanced quality, performance, and capabilities, solidifying their position as cutting-edge solutions in speech synthesis.
Key Highlights of OpenAudio (Fish-Speech):
- State-of-the-Art Quality: OpenAudio S1 boasts impressive performance, achieving a Word Error Rate (WER) of 0.008 and a Character Error Rate (CER) of 0.004 on English text, as evaluated by Seed TTS Eval Metrics. This makes it a leading model for generating natural-sounding speech.
- Top-Ranked in TTS-Arena2: The OpenAudio S1 model secured the first position on TTS-Arena2, a benchmark for evaluating text-to-speech systems, underscoring its superior quality and performance.
- Advanced Speech Control: Beyond basic text-to-speech, OpenAudio S1 offers granular control over speech output. Users can inject specific emotions (e.g., (angry), (sad), (excited)), tones (e.g., (in a hurry tone), (whispering)), and even special audio effects such as laughter ((laughing), (chuckling)) and sighs ((sighing)), allowing for highly expressive and nuanced speech generation.
- Zero-shot & Few-shot TTS: The system supports voice cloning from just a 10-30 second vocal sample, enabling high-quality TTS output in the target voice. This feature significantly lowers the barrier to entry for custom voice synthesis.
- Multilingual and Cross-lingual Capabilities: OpenAudio seamlessly handles multilingual text, supporting English, Japanese, Korean, Chinese, French, German, Arabic, and Spanish. The model's strong generalization allows it to process scripts in these languages directly, without relying on phoneme conversion.
- Efficient and Fast Inference: Optimized with torch.compile, the models achieve a real-time factor of approximately 1:7 on an Nvidia RTX 4090 GPU, ensuring fast, responsive speech generation.
- User-Friendly Interfaces: OpenAudio provides both a Gradio-based WebUI for easy in-browser inference and a PyQt6-based GUI for desktop applications, supporting Windows, Linux, and macOS. Deployment is also streamlined with native inference servers.
Model Availability:
- OpenAudio S1: The flagship model with 4 billion parameters, available on fish.audio.
- OpenAudio S1-mini: A distilled version with 0.5 billion parameters, optimized for core capabilities and available on Hugging Face Spaces.
Both models incorporate online Reinforcement Learning from Human Feedback (RLHF), further refining their output quality. With a strong community backing, extensive documentation, and continuous development evidenced by numerous commits and releases, OpenAudio (formerly Fish-Speech) is a highly recommended project for anyone interested in the cutting edge of Text-to-Speech technology. Explore the project on GitHub to contribute or integrate its powerful features into your own applications.
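To put the ~1:7 real-time factor quoted above in perspective, a rough wall-clock estimate can be sketched as follows. This assumes 1:7 means seven seconds of audio generated per second of compute; the reading of the ratio is an assumption, as conventions for real-time factors vary.

```python
def synthesis_time(audio_seconds: float, rtf: float = 7.0) -> float:
    """Estimate wall-clock synthesis time for a clip, assuming the
    real-time factor means `rtf` seconds of audio per second of compute."""
    return audio_seconds / rtf

# Under this assumption, a 60-second narration takes roughly
# 60 / 7 ~ 8.6 seconds of GPU time on an RTX 4090.
print(round(synthesis_time(60.0), 1))
```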