Voice‑Pro: The All‑In‑One Open‑Source AI Dubbing Studio

The world of AI‑powered media creation is expanding rapidly. If you’ve been hunting for a free, open‑source tool that unifies text‑to‑speech (TTS), voice cloning, real‑time translation, and multimedia processing, Voice‑Pro is worth a close look.

What is Voice‑Pro?

Open‑source Web UI built on Gradio 5.14.0, released under the GPL‑3.0 license.
Speech recognition powered by Whisper, Faster‑Whisper, Whisper‑Timestamped, and WhisperX.
Zero‑shot voice cloning: E2‑TTS, F5‑TTS, CosyVoice, and Kokoro.
Text‑to‑speech: Edge‑TTS (100+ languages, 400+ voices), Kokoro (ranked #2 on HF TTS Arena), and optional paid Azure TTS.
Multilingual translation with Deep‑Translator (100+ languages, optional Azure Translator).
YouTube downloader (yt‑dlp) + audio isolation (Demucs) + subtitle generation.
Supports Windows (NVIDIA GPU), macOS, and Linux.

Who Can Benefit?

Podcasters & YouTubers: Create dubbed episodes with AI voices without paying for subscription plans.
Educators & e‑learning creators: Generate multilingual subtitles and translations for videos.
Developers & researchers: Experiment with cutting‑edge TTS models in a sandbox.
Content creators: Produce karaoke tracks or AI‑generated audiobooks.

Getting Started – Installation

Prerequisites

Component	Minimum	Recommended
OS	Windows 10/11, macOS 10.15+, Ubuntu 20.04+	All
GPU	None (CPU mode); for GPU mode, NVIDIA with CUDA 12.4	NVIDIA with 8 GB+ VRAM
RAM	4 GB	8 GB+
Disk	20 GB free	30 GB+
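
Some of the minimums in the table can be checked before cloning. The following is a minimal sketch and not part of Voice‑Pro: it uses only the Python standard library, takes the 20 GB disk threshold from the table, and the function name check_prerequisites is hypothetical. RAM is left out because the standard library has no portable way to query it, and finding nvidia-smi only suggests that an NVIDIA driver is present.

```python
import shutil

MIN_FREE_DISK_GB = 20  # "Disk: 20 GB free" from the prerequisites table


def check_prerequisites(path="."):
    """Return a dict of simple pre-install checks for the given directory."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return {
        "free_disk_gb": round(free_gb, 1),
        "disk_ok": free_gb >= MIN_FREE_DISK_GB,
        # nvidia-smi on PATH hints at an installed NVIDIA driver;
        # it does not prove which CUDA version is available.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
        "git_found": shutil.which("git") is not None,
    }


report = check_prerequisites()
for name, value in report.items():
    print(f"{name}: {value}")
```

Run it from the drive where you plan to clone the repository, since free space is measured for the given path.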

Clone the Repo

git clone https://github.com/abus-aikorea/voice-pro.git
cd voice-pro

Configure (Windows)

REM Installs ffmpeg, checks CUDA, and downloads models
configure.bat

Configure (macOS/Linux)

chmod +x configure.sh
./configure.sh

Tip: The first run will download large model checkpoints (~10 GB). Ensure a fast Internet connection.

Run the WebUI

REM Windows
start.bat

./start.sh  # macOS/Linux

The Gradio interface will start at http://127.0.0.1:7870/. Open it in your browser.
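
If you launch the start script from another program, it can help to wait until the interface accepts connections before opening the browser. The sketch below assumes the address and port quoted above (127.0.0.1:7870) and uses only the standard library. The helper names is_port_open and wait_for_ui are hypothetical and not part of Voice‑Pro.

```python
import socket
import time


def is_port_open(host="127.0.0.1", port=7870, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections and timeouts alike.
        return False


def wait_for_ui(host="127.0.0.1", port=7870, max_wait=60.0, interval=1.0):
    """Poll until the port accepts connections or max_wait seconds pass."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        if is_port_open(host, port):
            return True
        time.sleep(interval)
    return False
```

Calling wait_for_ui() after starting the server returns True once the port is reachable. An open port only shows that something is listening there, not that all models have finished loading.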

Using Voice‑Pro – Step by Step

Upload Video or Audio – In the Dubbing Studio tab, paste a YouTube URL or upload an MP4/WAV file.
Extract Audio – The tool automatically calls yt‑dlp to pull video audio and Demucs to separate vocals.
Transcribe – WhisperX generates a high‑quality transcript in the language spoken in the source audio (choose from >100 options).
Translate – Deep‑Translator translates the transcript into your target language.
Choose a Voice – Pick an existing voice via Edge‑TTS or clone a reference sample with F5‑TTS/CosyVoice – no fine‑tuning required.
Synthesize – Generate speech with adjustable speed, volume, and pitch, then export as WAV, FLAC, or MP3.
Sync & Export – Automatically creates SRT subtitles and either uploads them back to YouTube or saves them locally.
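
To make the last step concrete, here is a minimal sketch of how timed transcript segments can be written in the SRT format. It is not Voice‑Pro's own exporter: the (start, end, text) tuple layout and the function names are assumptions chosen for illustration, and the demo segments are made up.

```python
def format_timestamp(seconds):
    """Convert seconds to the SRT form HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text.strip()}\n")
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)


# Made-up segments standing in for transcription output.
demo = [(0.0, 2.5, "Hello and welcome."), (2.5, 5.0, "Let's get started.")]
print(segments_to_srt(demo))
```

Each block consists of a running index, a timing line, and the subtitle text, which is the layout most video players and YouTube accept.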

Advanced Features

Zero‑shot cloning: No model training, just supply a short audio clip.
Custom compute type: Choose float32, float16, or int8 (quantized) to trade output quality against GPU memory use.
Real‑time demos: On the Live Translation tab, speak into the mic and watch subtitles appear in real time.
API‑like interface: The Gradio server can be wrapped by other Python scripts; see app/voice_pro.py for examples.
Community voice library: Contributors can add new celebrity voices via GitHub Issues; a curated list is hosted in celebrities30sREADME.
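
The effect of the compute type on memory can be estimated with simple arithmetic: float32 stores each weight in 4 bytes, float16 in 2, and int8 in 1. The sketch below covers model weights only and ignores activations and runtime overhead, so real usage will be higher. The 1.5‑billion‑parameter model size is a hypothetical figure, not a measurement of any model bundled with Voice‑Pro.

```python
BYTES_PER_WEIGHT = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_parameters, compute_type):
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return num_parameters * BYTES_PER_WEIGHT[compute_type] / 1e9


# Hypothetical model size chosen for illustration: 1.5 billion parameters.
PARAMS = 1_500_000_000
for compute_type in BYTES_PER_WEIGHT:
    print(f"{compute_type}: {weight_memory_gb(PARAMS, compute_type):.1f} GB")
```

For this example the weights take about 6.0 GB in float32, 3.0 GB in float16, and 1.5 GB in int8, which is why dropping to int8 is a common first response to out‑of‑memory errors.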

Why Voice‑Pro Outperforms SaaS

Voice‑Pro removes subscription fatigue:

Free for all core features – no per‑minute costs.
Open‑source – you can modify the TTS pipeline or integrate your own models.
GPU flexibility – run on a laptop or deploy to a cloud GPU instance.
Comparable features – Covers the same core tasks (TTS, voice cloning, translation) as commercial services like ElevenLabs, with finer‑grained controls.

Troubleshooting Quick‑Fixes

Issue	Fix
CUDA OOM	Reduce denoise level or switch to int8 compute
Whisper errors	Ensure `requirements-voice-gpu.txt` or `-cpu.txt` is installed; delete `installer_files` then rerun `configure`
Subtitles off‑sync	Use the WhisperX tab to re‑align timestamps
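
When subtitles are late or early by a constant amount, shifting every timestamp by a fixed offset is a simpler alternative to re‑alignment. The sketch below is not what the WhisperX tab does (that re‑aligns against the audio). It is a plain standard‑library helper, and the name shift_srt is hypothetical.

```python
import re

TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def shift_srt(srt_text, offset_seconds):
    """Shift the timestamps on SRT timing lines by offset_seconds (clamped at 0)."""
    offset_ms = round(offset_seconds * 1000)

    def shift(match):
        h, m, s, ms = (int(part) for part in match.groups())
        total = max(0, ((h * 60 + m) * 60 + s) * 1000 + ms + offset_ms)
        h, rest = divmod(total, 3_600_000)
        m, rest = divmod(rest, 60_000)
        s, ms = divmod(rest, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    # Only touch timing lines, so timestamps quoted in subtitle text survive.
    lines = [
        TIMESTAMP.sub(shift, line) if "-->" in line else line
        for line in srt_text.split("\n")
    ]
    return "\n".join(lines)
```

A positive offset delays the subtitles and a negative one moves them earlier. If the drift grows over the course of the video, a constant offset will not fix it and re‑alignment is the better route.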

Community & Next Steps

Check out the GitHub Discussions for feature requests and support.
Contribute by adding new voice samples or optimizing existing models.
Experiment with adding your own Hugging Face pipelines – the modular design makes it straightforward.
Consider sponsoring the repo or buying a “premium” upgrade (Azure TTS/Translator) if you need enterprise‑grade quality.

Final Word

Voice‑Pro is a powerful, zero‑cost alternative to pricey AI dubbing services. Its modular open‑source nature means you’re not locked into a vendor; you control the code, the models, and the output. Whether you’re a YouTuber looking to dub a video in 12 languages, a research lab that needs to prototype voice clones quickly, or a student in a language class, Voice‑Pro gives you the tools to turn speech and text into high‑fidelity audio in minutes.

Get started today and bring the future of AI audio to your projects without paying a dime.