
Echoshard/Piper_TTS_Training_Suite


Piper TTS Training Suite 🪈

A comprehensive tool for creating custom datasets and training Piper TTS voices on Windows (using Docker/WSL). Now features a modern web interface (Gradio) in addition to the classic desktop GUI.

How it works: It uses Chatterbox voice cloning to generate a custom synthetic dataset based on your reference audio (uploaded or recorded), which is then used to fine-tune a Piper model.

Features ✨

  • Dual Interfaces:
    • Web UI (Gradio): Modern, browser-based interface with dark mode, audio recording, and real-time logs.
    • Desktop App (Tkinter): Classic Windows-native application.
  • Express Clone:
    • One-click pipeline: Reference Audio (Upload/Record) -> Dataset -> Training -> ONNX Export -> Sample Audio.
    • Real-time status updates and estimated completion time.
  • Dataset Generator: Automatically generate paired audio/text datasets using a single reference voice file. Supports 1,500+ phrases.
  • Startup Checks: Automatically verifies Docker, Piper, Chatterbox, and Hugging Face status on launch to ensure a smooth experience.
  • Docker Integration: Trains securely in an isolated container with full GPU support (NVIDIA).
  • Easy Training:
    • Resume from checkpoints ("Fine-Tuning").
    • Start from scratch.
    • Additive epoch logic (Train 50 more epochs easily).
  • Testing & Export:
    • Export checkpoints to ONNX.
    • Test voices instantly with local synthesis playback (Zero-Shot & ONNX).

Prerequisites 🛠️

  1. Windows 10/11 with WSL2 enabled.
  2. Docker Desktop (configured to use WSL2 backend).
  3. Python 3.10+ (Added to PATH).
  4. NVIDIA GPU with updated drivers (for CUDA support).

Hugging Face Authentication 🔑 (Required)

A Hugging Face token is required to download the Chatterbox text-to-speech model used for voice cloning.

  1. Get Token: Go to huggingface.co/settings/tokens.
  2. Create: Create a new token with Read permissions.
  3. Login: When you launch the app, it will check for your login. If missing, paste your token into the console prompt.
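At startup the app looks for an existing login. A minimal sketch of such a check, assuming the token is exposed via the usual Hugging Face environment variables (the app's real check may differ):

```python
import os

# Hypothetical sketch of a startup token check; the app's actual logic may differ.
# Hugging Face tooling commonly reads HF_TOKEN (or the older
# HUGGING_FACE_HUB_TOKEN) from the environment.
def find_hf_token():
    """Return a Hugging Face token from the environment, or None."""
    for var in ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"):
        token = os.environ.get(var)
        if token:
            return token
    return None

if __name__ == "__main__":
    if find_hf_token() is None:
        print("No Hugging Face token found - paste one at the console prompt.")
```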

Installation 💾

  1. Clone this repository.
  2. Run setup.bat.
    • Creates a virtual environment (venv).
    • Installs Python dependencies.
    • Builds the custom Docker image.
    • Automatically downloads recommended base checkpoints into the pretrained folder.
  3. Launch the App:
    • Web UI (Recommended): Run run_gradio.bat.
    • Desktop UI: Run run_gui.bat.

Usage Guide 📖

1. Express Clone (Web UI) 🚀

The fastest way to get a cloned voice.

  1. Launch run_gradio.bat and open the link in your browser.
  2. Go to the Express Clone tab.
  3. Settings (Left Column):
    • Voice Name: e.g., MyNewVoice.
    • Language: Select target language (e.g., en-us).
    • Quality/Epochs: defaults (Medium/300) are usually good.
    • Checkpoint: Select a base model (e.g., lessac for female, ryan for male) for best results.
  4. Reference Audio (Right Column):
    • Upload an audio file OR Record directly from your microphone.
  5. Click START EXPRESS CLONE.
    • Watch the Logs and Status indicators update in real time.
    • When finished, your new voice model (.onnx) is saved in the exports/ folder.

2. Command Line Interface (CLI) 💻

Run the pipeline headless from the command line.

Basic Usage:

# Activate venv first!
.\venv\Scripts\activate

# Run the script
python cloneToPiper.py MyVoiceName path/to/reference.wav

Advanced Usage:

python cloneToPiper.py MyVoiceName path/to/reference.wav --samples 200 --epochs 500 --quality high --language en-us --checkpoint path/to/base.ckpt

Arguments:

  • voicename: Name of the voice/dataset.
  • inference: Path to reference audio file.
  • --samples: Number of samples (default: 150).
  • --epochs: Number of epochs (default: 300).
  • --quality: low, medium, high.
  • --language: Language code (e.g. en-us).
  • --checkpoint: Path to base .ckpt file.
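The documented interface maps naturally onto argparse. A minimal sketch consistent with the arguments above (the real parser in cloneToPiper.py may differ in details):

```python
import argparse

# Minimal argparse sketch matching the documented CLI of cloneToPiper.py;
# the real script's parser may differ.
def build_parser():
    p = argparse.ArgumentParser(description="Headless clone-to-Piper pipeline")
    p.add_argument("voicename", help="Name of the voice/dataset")
    p.add_argument("inference", help="Path to reference audio file")
    p.add_argument("--samples", type=int, default=150, help="Number of samples")
    p.add_argument("--epochs", type=int, default=300, help="Number of epochs")
    p.add_argument("--quality", choices=["low", "medium", "high"], default="medium")
    p.add_argument("--language", default="en-us", help="Language code")
    p.add_argument("--checkpoint", default=None, help="Path to base .ckpt file")
    return p

args = build_parser().parse_args(
    ["MyVoiceName", "path/to/reference.wav", "--epochs", "500", "--quality", "high"]
)
```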

3. Manual Dataset Generation 🎤

Create a high-quality dataset from a single reference audio file.

  1. Go to the Dataset Generator tab.
  2. Reference Audio: Select a clear .wav or .mp3 file of the voice you want to clone.
  3. Dataset Name: Give your voice a name (e.g., my_custom_voice).
  4. Generator Settings:
    • Use the Slider to choose how many samples to generate (5 to 1,500).
    • Tip: Start with ~50 for a quick test, or 500+ for better quality.
  5. Click GENERATE DATASET.
    • The app will calculate an estimated runtime.
    • Progress is shown in the log window.
  6. Once done, you can keep the defaults or manually inspect the generated files in the datasets/ folder.
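For a rough sense of the runtime estimate, the Timing section notes ~1 hour for 150 phrases on an RTX 4070 (~24 s per phrase). A back-of-the-envelope sketch using that reference point (your hardware will vary; this is illustrative only):

```python
# Back-of-the-envelope dataset-generation estimate, based on the Timing note:
# ~1 hour for 150 phrases on an RTX 4070, i.e. ~24 s per phrase.
SECONDS_PER_PHRASE = 3600 / 150  # RTX 4070 reference point

def estimated_minutes(num_samples: int) -> float:
    """Estimated generation time in minutes for the chosen sample count."""
    return num_samples * SECONDS_PER_PHRASE / 60
```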

4. Training (Docker) 🚂

Train the model using your generated dataset.

  1. Go to the Training (Docker) tab.
  2. Dataset Folder: Select your dataset from the dropdown (e.g., my_custom_voice).
  3. Base Checkpoint (Fine-Tuning):
    • CRITICAL STEP: It is highly recommended to start with a pre-trained model.
    • The app will automatically select a pre-downloaded checkpoint (from setup.bat). Verify it matches your target voice's tone/gender.
  4. Epochs to Run:
    • Enter the total target epochs or the additional epochs you want to run.
    • The app automatically calculates the final target (e.g., Checkpoint is 3000 + Input 500 = Stop at 3500).
  5. Quality: Select High, Medium, or Low.
    • IMPORTANT: You MUST match the quality of your base checkpoint.
    • If you downloaded a medium checkpoint, set this to Medium.
    • If you downloaded a high checkpoint, set this to High.
    • Default is Medium.
  6. Click START TRAINING.
    • A Docker container will spin up.
    • Watch the Status Label for "Epoch X/Y" and time estimates.
    • Use STOP TRAINING to safely halt the process (checkpoints are saved automatically).
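The additive epoch logic from step 4 can be sketched as follows. Treating an entered value larger than the checkpoint's epoch count as a total target is an assumption for illustration, not necessarily the app's exact rule:

```python
# Sketch of the epoch arithmetic described in step 4: an entered value larger
# than the checkpoint's epoch count is read as a total target, otherwise as
# additional epochs on top of the checkpoint. (Assumed rule, for illustration.)
def resolve_target(checkpoint_epoch: int, entered: int) -> int:
    if entered > checkpoint_epoch:
        return entered                    # total target, e.g. 3500
    return checkpoint_epoch + entered     # additive, e.g. 3000 + 500
```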

5. Testing & Export 🧪

Test your voice and export it for use in other apps.

  1. Go to the Testing / Export tab.
  2. Step 1: Export:
    • Select a .ckpt file from your training logs.
    • Click Export to ONNX. This creates a .onnx and .json file in the exports folder.
  3. Step 2: Test (Local):
    • Select your exported .onnx model.
    • Config (.json): The app should find the matching JSON automatically. If not, browse for it.
    • Type some text and click Synthesize & Play.

6. Automated Pipeline (cloneToPiper.py) ⚡

For users who prefer the command line or want to automate multiple trainings, cloneToPiper.py provides a full "headless" pipeline.

What it does:

  1. Dataset Generation: Uses Chatterbox to create WAVs and metadata.
  2. Training: Launches the Docker container and monitors epochs.
  3. Export: Automatically finds the best checkpoint and exports to ONNX.
  4. Verification: Generates a test audio sample automatically.

How to use:

  1. Open a terminal in the project root.
  2. Activate the environment: .\venv\Scripts\activate
  3. Run the command:
    python cloneToPiper.py <VoiceName> <PathToAudio> [Options]

Example:

python cloneToPiper.py MyClonedVoice C:\audio\samples\voice.wav --samples 150 --epochs 300 --quality medium --language en-us --checkpoint C:\pretrained\base.ckpt
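To automate multiple trainings, you can build and run one command per voice. A sketch with placeholder voice names and paths (build_command is a hypothetical helper, not part of the project):

```python
import sys

# Hypothetical helper for batching several headless trainings; voice names
# and audio paths below are placeholders.
def build_command(voice: str, audio: str, **opts) -> list[str]:
    cmd = [sys.executable, "cloneToPiper.py", voice, audio]
    for flag, value in opts.items():
        cmd += [f"--{flag}", str(value)]
    return cmd

jobs = [("Alice", r"C:\audio\alice.wav"), ("Bob", r"C:\audio\bob.wav")]
commands = [build_command(v, a, epochs=300, quality="medium") for v, a in jobs]
# Each command can then be executed sequentially, e.g. with
# subprocess.run(cmd, check=True).
```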

Notes ⚙️

Docker image override

By default, the GUI/CLI will use the RTX 50xx-compatible image tag:

  • chatterbox-piper:nightly-cu128-sm120

If you want to use a different Docker image tag (for testing or custom builds), set the PIPER_TRAIN_IMAGE environment variable.

PowerShell:

$env:PIPER_TRAIN_IMAGE = 'piper-custom-train'

cmd.exe:

set PIPER_TRAIN_IMAGE=piper-custom-train
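Such an override is typically consumed as an environment lookup with a fallback to the default tag. A sketch (the app's actual lookup may differ):

```python
import os

# How an image-tag override like PIPER_TRAIN_IMAGE is typically consumed:
# read the environment variable, falling back to the default tag.
DEFAULT_IMAGE = "chatterbox-piper:nightly-cu128-sm120"

def train_image() -> str:
    return os.environ.get("PIPER_TRAIN_IMAGE", DEFAULT_IMAGE)
```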

For best results with a pre-trained voice:

  • Target Samples: ~150 phrases.
  • Epochs: 300 to 500.
    • Note: 300 is often sufficient, but 500 may yield better stability and quality.

Strategy

  1. Download a recommended base checkpoint (e.g., lessac or ryan from HuggingFace).
  2. Place it in the pretrained/ folder (or select it via the "Get Checkpoints" links in the app).
  3. Fine-tune for 300-500 epochs.

Timing (tested on an RTX 4070)

  • Chatterbox dataset creation (150 phrases): ~1 hour
  • Fine-tuning 500 epochs from a pretrained checkpoint: ~2.5 hours

Language

This application defaults to English (en-us). It can train additional languages: edit phrases.py to add phrases for your new language, and switch the base checkpoint to one trained in that language for optimal results.
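A hypothetical shape for the language table in phrases.py, purely illustrative since the real file's structure isn't shown here:

```python
# Hypothetical structure for phrases.py: keys are language codes, values are
# prompt lists. The real file may be organized differently.
PHRASES = {
    "en-us": [
        "The quick brown fox jumps over the lazy dog.",
        "Please call Stella and ask her to bring these things.",
    ],
    "de-de": [  # example of an added language
        "Der schnelle braune Fuchs springt über den faulen Hund.",
    ],
}

def phrases_for(language: str) -> list[str]:
    # Fall back to the en-us default when a language has no entries.
    return PHRASES.get(language, PHRASES["en-us"])
```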

Linux:

This application is currently supported only on Windows; setup.bat and run_gui.bat are not compatible with Linux and would need to be converted to .sh scripts.


Credits 🏆

Big thanks to the open-source community for making this possible!


⚠️ Ethics & Responsibility

Please use this tool responsibly.

Voice cloning technology is powerful but carries ethical risks.

  • Do not clone voices without consent.
  • Do not generate content intended to deceive, defraud, or harass.
  • Always label AI-generated content appropriately.

By using this software, you agree to assume full responsibility for how you use the voices you train.

About

A workflow for generating a synthetic dataset from a reference WAV and fine-tuning a small Piper voice for use in applications.
