Voice-Pro: An Open-Source All-in-One AI Audio & Dubbing Suite

For creators and developers, the current landscape of AI audio tools is fragmented. You often find yourself jumping between a YouTube downloader, a separate vocal isolation tool, a transcription service, and a voice cloning platform. Voice-Pro changes that by consolidating these essential tasks into a single, cohesive Gradio-based WebUI.

Voice-Pro began as a commercial project, but its developers have recently open-sourced the entire codebase, making it a free alternative to subscription-based platforms like ElevenLabs or Descript.

What is Voice-Pro?

Voice-Pro is designed as a "Dubbing Studio" that handles the entire pipeline of multimedia content creation. Whether you are a podcaster looking to translate your content into multiple languages or a developer building an automated video processing pipeline, this tool provides a unified interface for the best open-source models available today.

Core Capabilities:

Audio Extraction: Built-in yt-dlp support for downloading and processing YouTube content directly.
Vocal Isolation: Uses Demucs to cleanly separate vocals from background music, essential for high-quality voice cloning.
Speech-to-Text (STT): Supports a variety of Whisper implementations, including Faster-Whisper, Whisper-Timestamped, and WhisperX for high-accuracy, word-level transcription.
Zero-Shot Voice Cloning: Features cutting-edge models like F5-TTS, E2-TTS, and CosyVoice, allowing you to clone voices with minimal reference audio.
Text-to-Speech (TTS): Includes Edge-TTS for high-quality, natural-sounding speech and kokoro, a high-performance TTS model currently trending in the Hugging Face arena.
Translation: Integrated Deep-Translator for instant, multilingual support across 100+ languages.
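
To make the first two stages of that list concrete, here is a minimal Python sketch of how the standalone yt-dlp and Demucs command-line tools are commonly invoked. It is not Voice-Pro's own code. The function names, the `downloads` and `separated` folders, and the `clip` file name are assumptions made for illustration, and the `htdemucs` model name is passed explicitly so the output folder is predictable.

```python
import subprocess
from pathlib import Path


def ytdlp_audio_cmd(url: str, name: str, out_dir: str = "downloads") -> list[str]:
    """Build a yt-dlp command that keeps only the audio track, converted to WAV."""
    return [
        "yt-dlp",
        "-x",                                 # extract audio only (needs FFmpeg on PATH)
        "--audio-format", "wav",
        "-o", f"{out_dir}/{name}.%(ext)s",    # yt-dlp fills in the extension
        url,
    ]


def demucs_vocals_cmd(audio: str, out_dir: str = "separated") -> list[str]:
    """Build a Demucs command that splits a track into vocals and everything else."""
    return ["demucs", "-n", "htdemucs", "--two-stems=vocals", "-o", out_dir, audio]


def expected_vocals_path(audio: str, out_dir: str = "separated") -> Path:
    """Demucs writes stems to <out_dir>/<model name>/<track name>/."""
    return Path(out_dir) / "htdemucs" / Path(audio).stem / "vocals.wav"


def download_and_isolate(url: str, name: str = "clip") -> Path:
    """Run both tools in sequence and return where the isolated vocals should be."""
    subprocess.run(ytdlp_audio_cmd(url, name), check=True)
    wav = f"downloads/{name}.wav"
    subprocess.run(demucs_vocals_cmd(wav), check=True)
    return expected_vocals_path(wav)
```

Keeping the command construction separate from `subprocess.run` makes it easy to log or inspect the exact commands before anything is downloaded.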

Why Developers Should Care

Unlike SaaS platforms that charge per-minute fees, Voice-Pro is a self-hosted solution. If you have an NVIDIA GPU (with a minimum of roughly 4 GB to 8 GB of VRAM, depending on the models you use), you can run these models locally, with no API costs and without sending your audio to a third party.
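
If you are unsure whether your card meets that requirement, a quick check along the following lines can help. This is a sketch that assumes the NVIDIA driver's `nvidia-smi` tool is on your PATH. The helper names and the 4 GiB default threshold are illustrative choices, not part of Voice-Pro.

```python
import shutil
import subprocess


def parse_vram_mib(smi_output: str) -> list[int]:
    """Parse one MiB value per line, skipping anything that is not a plain number."""
    values = []
    for line in smi_output.splitlines():
        line = line.strip()
        if line.isdigit():
            values.append(int(line))
    return values


def query_vram_mib() -> list[int]:
    """Return total VRAM per GPU in MiB, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_vram_mib(result.stdout) if result.returncode == 0 else []


def meets_minimum(vram_mib: list[int], minimum_gib: float = 4.0) -> bool:
    """True if at least one GPU has the requested amount of memory."""
    return any(mib >= minimum_gib * 1024 for mib in vram_mib)
```

Calling `meets_minimum(query_vram_mib())` returns `False` on machines without an NVIDIA GPU, which is a useful signal before starting a multi-gigabyte model download.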

Technical Stack Highlights:

Framework: Built on Python 3.10.15 with Gradio 5.14.0.
Compute: Optimized for CUDA 12.4, ensuring fast inference for heavy tasks like voice cloning and transcription.
Extensibility: Because it is open-source, you can modify the start-voice.py or one_click.py scripts to integrate your own custom models or fine-tuned weights.
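
Before touching the project's scripts, it can be easier to prototype a custom model in a standalone Gradio app. The sketch below is an assumption about how you might do that, not a description of Voice-Pro's internals: `custom_model` is a hypothetical placeholder for your own inference call, and `build_demo` uses only the public `gr.Interface` and `gr.Textbox` components.

```python
def custom_model(text: str) -> str:
    """Hypothetical stand-in for your own model or fine-tuned weights."""
    cleaned = " ".join(text.split())  # collapse stray whitespace in the script
    return f"[custom-model] {cleaned}"


def build_demo():
    """Wrap the function in a small Gradio UI."""
    import gradio as gr  # imported lazily so custom_model works without Gradio installed

    return gr.Interface(
        fn=custom_model,
        inputs=gr.Textbox(label="Script"),
        outputs=gr.Textbox(label="Model output"),
        title="Custom model prototype",
    )
```

Running `build_demo().launch()` starts a local web page for the prototype. Once the function behaves as expected, you can decide where it belongs in the project's own code.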

Getting Started

Installation is designed to be "one-click" for Windows users, though it is also compatible with Linux and Mac environments.

Clone the repository:

git clone https://github.com/abus-aikorea/voice-pro.git

Configure the environment: Run configure.bat (or configure.sh on Linux/Mac). This script handles the heavy lifting of setting up Git, FFmpeg, and the necessary CUDA dependencies.
Launch the UI: Run start.bat. On the first run, the application will download the necessary model weights (such as the 9GB CosyVoice model), so ensure you have a stable internet connection.

Troubleshooting & Optimization

CUDA Out-Of-Memory (OOM): If you hit memory limits, try setting the Denoise level to 0 or 1. Additionally, switching to an integer compute type (for example int8) instead of a floating-point one (for example float16) can significantly reduce VRAM usage at the cost of slight quality degradation.
Subtitle Quality: Transcription accuracy depends heavily on the Whisper model size. If your subtitles are not accurate enough, move up to a larger model: large models provide the best accuracy, but they require more compute and VRAM. If you are processing long-form content on consumer hardware, medium or small models may be the more practical trade-off.
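
The two tips above can be combined into a simple fallback strategy: try the most accurate configuration first and step down when the GPU runs out of memory. The following sketch illustrates the idea outside of Voice-Pro. The ladder ordering and the function names are assumptions for illustration, and it assumes an out-of-memory failure surfaces as a `RuntimeError`. The `faster_whisper_loader` helper uses the Faster-Whisper `WhisperModel` constructor and is only needed if you run this against a real GPU.

```python
# Ordered from most accurate and most VRAM-hungry to lightest.
# This particular ordering is an illustrative assumption.
LADDER = [
    ("large-v3", "float16"),
    ("large-v3", "int8_float16"),
    ("medium", "int8_float16"),
    ("small", "int8"),
]


def load_first_that_fits(loader, ladder=LADDER):
    """Try each (model size, compute type) pair until one loads.

    Returns ((size, compute_type), model). Assumes out-of-memory failures
    are raised as RuntimeError by the underlying library.
    """
    last_error = None
    for size, compute_type in ladder:
        try:
            return (size, compute_type), loader(size, compute_type)
        except RuntimeError as err:
            last_error = err  # remember the failure and try a lighter setup
    raise RuntimeError("no configuration fit in memory") from last_error


def faster_whisper_loader(size: str, compute_type: str):
    """Load a Faster-Whisper model on the GPU (requires the faster-whisper package)."""
    from faster_whisper import WhisperModel  # lazy import keeps the ladder logic standalone

    return WhisperModel(size, device="cuda", compute_type=compute_type)
```

With a real GPU you would call `load_first_that_fits(faster_whisper_loader)`. Passing the loader as an argument also lets you exercise the fallback logic with a fake loader on a machine that has no GPU at all.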

Final Thoughts

Voice-Pro represents the best of the open-source AI community. By wrapping complex models like F5-TTS and WhisperX into a user-friendly WebUI, it lowers the barrier to entry for high-quality content production. Whether you are using it for personal projects or as a base for your own AI-powered application, it is a repository worth exploring.

Check out the project on GitHub (https://github.com/abus-aikorea/voice-pro) to contribute or view the latest updates.

Source

abus-aikorea/voice-pro: Gradio WebUI for creators and developers, featuring key TTS (Edge-TTS, kokoro) and zero-shot Voice Cloning (E2 & F5-TTS, CosyVoice), with Whisper audio processing, YouTube download, Demucs vocal isolation, and multilingual translation.