A comprehensive tool for creating custom datasets and training Piper TTS voices on Windows (using Docker/WSL). Now features a modern web interface (Gradio) in addition to the classic desktop GUI.
How it works: It uses Chatterbox voice cloning to generate a custom synthetic dataset based on your reference audio (uploaded or recorded), which is then used to fine-tune a Piper model.
- Dual Interfaces:
- Web UI (Gradio): Modern, browser-based interface with dark mode, audio recording, and real-time logs.
- Desktop App (Tkinter): Classic Windows-native application.
- Express Clone:
- One-click pipeline: Reference Audio (Upload/Record) -> Dataset -> Training -> ONNX Export -> Sample Audio.
- Real-time status updates and estimated completion time.
- Dataset Generator: Automatically generate paired audio/text datasets using a single reference voice file. Supports 1,500+ phrases.
- Startup Checks: Automatically verifies Docker, Piper, Chatterbox, and Hugging Face status on launch to ensure a smooth experience.
- Docker Integration: Trains securely in an isolated container with full GPU support (NVIDIA).
- Easy Training:
- Resume from checkpoints ("Fine-Tuning").
- Start from scratch.
- Additive epoch logic (Train 50 more epochs easily).
- Testing & Export:
- Export checkpoints to ONNX.
- Test voices instantly with local synthesis playback (Zero-Shot & ONNX).
- Windows 10/11 with WSL2 enabled.
- Docker Desktop (configured to use WSL2 backend).
- Python 3.10+ (Added to PATH).
- NVIDIA GPU with updated drivers (for CUDA support).
A Hugging Face token is required to download the Chatterbox text-to-speech model used for voice cloning.
- Get Token: Go to huggingface.co/settings/tokens.
- Create: Create a new token with Read permissions.
- Login: When you launch the app, it will check for your login. If missing, paste your token into the console prompt.
- Clone this repository.
- Run `setup.bat`, which:
- Creates a virtual environment (`venv`).
- Installs Python dependencies.
- Builds the custom Docker image.
- Automatically downloads recommended base checkpoints into the `pretrained` folder.
- Launch the app:
- Web UI (Recommended): Run `run_gradio.bat`.
- Desktop UI: Run `run_gui.bat`.
The fastest way to get a cloned voice.
- Launch `run_gradio.bat` and open the link in your browser.
- Go to the Express Clone tab.
- Settings (Left Column):
- Voice Name: e.g., `MyNewVoice`.
- Language: Select the target language (e.g., `en-us`).
- Quality/Epochs: The defaults (Medium/300) are usually good.
- Checkpoint: Select a base model (e.g., `lessac` for female, `ryan` for male) for best results.
- Reference Audio (Right Column): Upload an audio file OR record directly from your microphone.
- Click START EXPRESS CLONE.
- Watch the Logs and Status indicators update in real time.
- When finished, your new voice model (`.onnx`) is saved in the `exports/` folder.
Run the pipeline headless from the command line.
Basic Usage:
# Activate venv first!
.\venv\Scripts\activate
# Run the script
python cloneToPiper.py MyVoiceName path/to/reference.wav

Advanced Usage:

python cloneToPiper.py MyVoiceName path/to/reference.wav --samples 200 --epochs 500 --quality high --language en-us --checkpoint path/to/base.ckpt

Arguments:

- `voicename`: Name of the voice/dataset.
- `inference`: Path to the reference audio file.
- `--samples`: Number of samples (default: 150).
- `--epochs`: Number of epochs (default: 300).
- `--quality`: `low`, `medium`, or `high`.
- `--language`: Language code (e.g. `en-us`).
- `--checkpoint`: Path to a base `.ckpt` file.
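The arguments above map onto a parser along these lines. This is only a sketch mirroring the documented flags and defaults; the real `cloneToPiper.py` may differ in details:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Mirrors the documented arguments; defaults match the README.
    p = argparse.ArgumentParser(description="Headless clone-and-train pipeline")
    p.add_argument("voicename", help="Name of the voice/dataset")
    p.add_argument("inference", help="Path to the reference audio file")
    p.add_argument("--samples", type=int, default=150, help="Number of samples")
    p.add_argument("--epochs", type=int, default=300, help="Number of epochs")
    p.add_argument("--quality", choices=["low", "medium", "high"], default="medium")
    p.add_argument("--language", default="en-us", help="Language code")
    p.add_argument("--checkpoint", default=None, help="Path to a base .ckpt file")
    return p

# Example invocation parsed programmatically:
args = build_parser().parse_args(
    ["MyVoiceName", "reference.wav", "--epochs", "500", "--quality", "high"]
)
```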
Create a high-quality dataset from a single reference audio file.
- Go to the Dataset Generator tab.
- Reference Audio: Select a clear `.wav` or `.mp3` file of the voice you want to clone.
- Dataset Name: Give your voice a name (e.g., `my_custom_voice`).
- Generator Settings:
- Use the Slider to choose how many samples to generate (5 to 1,500).
- Tip: Start with ~50 for a quick test, or 500+ for better quality.
- Click GENERATE DATASET.
- The app will calculate an estimated runtime.
- Progress is shown in the log window.
- Once done, stick with the defaults or manually inspect the `datasets/` folder.
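A generated dataset boils down to WAV clips plus a metadata file pairing each clip with its text, in the LJSpeech-style format Piper's preprocessing consumes. A minimal sketch of writing that file (file names and layout here are illustrative, not the generator's exact scheme):

```python
import tempfile
from pathlib import Path

def write_metadata(dataset_dir: str, pairs: list[tuple[str, str]]) -> Path:
    # Piper's preprocessing reads LJSpeech-style lines: "clip_id|transcript".
    out = Path(dataset_dir) / "metadata.csv"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(
        "\n".join(f"{clip_id}|{text}" for clip_id, text in pairs) + "\n",
        encoding="utf-8",
    )
    return out

# Illustrative clip ids and transcripts:
path = write_metadata(
    f"{tempfile.mkdtemp()}/my_custom_voice",
    [("utt_0001", "The quick brown fox."), ("utt_0002", "Jumps over the lazy dog.")],
)
```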
Train the model using your generated dataset.
- Go to the Training (Docker) tab.
- Dataset Folder: Select your dataset from the dropdown (e.g., `my_custom_voice`).
- Base Checkpoint (Fine-Tuning):
- CRITICAL STEP: It is highly recommended to start from a pre-trained model.
- The app will automatically select a pre-downloaded checkpoint (from `setup.bat`). Verify it matches your target voice's tone/gender.
- Epochs to Run:
- Enter the total target epochs or the additional epochs you want to run.
- The app automatically calculates the final target (e.g., Checkpoint is 3000 + Input 500 = Stop at 3500).
- Quality: Select High, Medium, or Low.
- IMPORTANT: You MUST match the quality of your base checkpoint.
- If you downloaded a `medium` checkpoint, set this to Medium.
- If you downloaded a `high` checkpoint, set this to High.
- The default is Medium.
- Click START TRAINING.
- A Docker container will spin up.
- Watch the Status Label for "Epoch X/Y" and time estimates.
- Use STOP TRAINING to safely halt the process (checkpoints are saved automatically).
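The additive epoch logic described above is simple arithmetic. A sketch of the behavior as documented (a guess at the rule, not the app's actual code): a small request is treated as additional epochs on top of the checkpoint, while a larger number is treated as the total target.

```python
def resolve_target(checkpoint_epoch: int, requested: int) -> int:
    # If the request is at or below the checkpoint's epoch count, treat it
    # as "additional" epochs, e.g. checkpoint 3000 + input 500 -> stop at 3500.
    # Otherwise treat it as the total target.
    if requested <= checkpoint_epoch:
        return checkpoint_epoch + requested
    return requested
```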
Test your voice and export it for use in other apps.
- Go to the Testing / Export tab.
- Step 1: Export:
- Select a `.ckpt` file from your training logs.
- Click Export to ONNX. This creates a `.onnx` and `.json` file in the `exports` folder.
- Step 2: Test (Local):
- Select your exported `.onnx` model.
- Config (.json): The app should find the matching JSON automatically. If not, browse for it.
- Type some text and click Synthesize & Play.
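For reference, local synthesis with an exported model can also be driven by the standalone `piper` CLI, which reads text on stdin and writes a WAV. A sketch that only builds the argument list (paths are illustrative; the app's own invocation may differ):

```python
def build_piper_cmd(model: str, out_wav: str, config=None) -> list[str]:
    # --config is optional when the .json sits next to the .onnx
    # with a matching name.
    cmd = ["piper", "--model", model, "--output_file", out_wav]
    if config:
        cmd += ["--config", config]
    return cmd

cmd = build_piper_cmd("exports/MyNewVoice.onnx", "test.wav")
```

The resulting list can be handed to `subprocess.run(cmd, input=text.encode())` to synthesize.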
For users who prefer the command line or want to automate multiple trainings, cloneToPiper.py provides a full "headless" pipeline.
What it does:
- Dataset Generation: Uses Chatterbox to create WAVs and metadata.
- Training: Launches the Docker container and monitors epochs.
- Export: Automatically finds the best checkpoint and exports to ONNX.
- Verification: Generates a test audio sample automatically.
How to use:
- Open a terminal in the project root.
- Activate the environment: `.\venv\Scripts\activate`
- Run the command: `python cloneToPiper.py <VoiceName> <PathToAudio> [Options]`
Example:
python cloneToPiper.py MyClonedVoice C:\audio\samples\voice.wav --samples 150 --epochs 300 --quality medium --language en-us --checkpoint C:\pretrained\base.ckpt

By default, the GUI/CLI will use the RTX 50xx-compatible image tag:
chatterbox-piper:nightly-cu128-sm120
If you want to use a different Docker image tag (for testing or custom builds), set the PIPER_TRAIN_IMAGE environment variable.
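Presumably the launcher resolves the tag with a simple environment lookup, along these lines (a sketch, not the app's actual code):

```python
import os

# Fallback mirrors the documented default tag for RTX 50xx cards.
DEFAULT_IMAGE = "chatterbox-piper:nightly-cu128-sm120"
image = os.environ.get("PIPER_TRAIN_IMAGE", DEFAULT_IMAGE)
```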
PowerShell:

$env:PIPER_TRAIN_IMAGE = 'piper-custom-train'

cmd.exe:

set PIPER_TRAIN_IMAGE=piper-custom-train

For best results with a pre-trained voice:
- Target Samples: ~150 phrases.
- Epochs: 300 to 500.
- Note: 300 is often sufficient, but 500 may yield better stability and quality.
Strategy
- Download a recommended base checkpoint (e.g., `lessac` or `ryan` from Hugging Face).
- Place it in the `pretrained/` folder (or select it via the "Get Checkpoints" links in the app).
- Fine-tune for 300-500 epochs.
Timing (tested on an RTX 4070):
- Chatterbox dataset creation (150 phrases): ~1 hour
- Training on a pretrained checkpoint (500 epochs): ~2.5 hours
Language
This application defaults to English (`en-us`). It can train additional languages: edit `phrases.py` to add phrases in your new language, and switch the base checkpoint to one trained on that language for optimal results.
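As an illustration of the kind of edit involved (the actual layout of `phrases.py` may well differ), adding a language could amount to keying in a new phrase list:

```python
# Hypothetical layout -- the real phrases.py may organize its data differently.
PHRASES = {
    "en-us": [
        "The quick brown fox jumps over the lazy dog.",
        "She sells seashells by the seashore.",
    ],
}

# Adding a language means adding a keyed list of phrases in that language:
PHRASES["de-de"] = [
    "Der schnelle braune Fuchs springt über den faulen Hund.",
    "Sie verkauft Muscheln am Meeresufer.",
]
```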
Linux:
This application is currently supported only on Windows; `setup.bat` and `run_gui.bat` are not compatible with Linux and would need to be converted to `.sh` scripts.
Big thanks to the open-source community for making this possible!
- Fixing Dependency Hell: Cal Bryant's Blog
- Docker Settings & Training Bugs:
- Helpful Guide: Create Custom Piper TTS Voice
- Voice Cloning Backend: resemble-ai/chatterbox
- Original Repositories:
- Dataset Phrases
- The LJ Speech Dataset (The History of Printing)
Please use this tool responsibly.
Voice cloning technology is powerful but carries ethical risks.
- Do not clone voices without consent.
- Do not generate content intended to deceive, defraud, or harass.
- Always label AI-generated content appropriately.
By using this software, you agree to assume full responsibility for how you use the voices you train.