A step-by-step guide to installing PocketTTS (pocket_tts, the Python/ONNX runtime), cloning a voice from a reference audio sample, and setting the cloned voice as your default TTS output.
If you want the separate compiled-native runtime, use pocket_tts_cpp and its installer helper instead. That provider has different runtime expectations and is documented in the general TTS guides, not in this ONNX-specific guide.
PocketTTS is a lightweight, ONNX-based local TTS engine with voice cloning. Key characteristics:
- English only - optimized for English speech synthesis
- Voice cloning required - every request needs a reference audio sample (1-60 seconds)
- CPU-friendly - INT8 quantized models run well on CPU; CUDA optional
- Streaming support - progressive audio delivery for low-latency playback
- 24 kHz output - mono, 16-bit PCM internally; converted to your chosen format (MP3, WAV, OPUS, FLAC, AAC, PCM)
Before starting, make sure you have:
- The server repository cloned and your Python virtual environment active
- FFmpeg installed (`brew install ffmpeg` on macOS, `apt-get install -y ffmpeg` on Linux)
- The server starts successfully (you don't need PocketTTS enabled yet):
python -m uvicorn tldw_Server_API.app.main:app --reload
- Note the `X-API-KEY` printed at startup (single-user mode) - you'll use it in all API calls.
Install PocketTTS runtime dependencies and the HuggingFace Hub CLI (for model download):
pip install -e '.[TTS_pocket_tts]'
pip install -U "huggingface_hub"
Run from the project root:
python Helper_Scripts/TTS_Installers/install_tts_pocket_tts_onnx.py
This downloads the ONNX models, tokenizer, and Python module into models/pocket_tts_onnx/ and automatically updates tts_providers_config.yaml to enable PocketTTS with the correct paths.
Useful flags:
| Flag | Effect |
|---|---|
| `--force` | Re-download even if assets already exist |
| `--no-config-update` | Download models but don't touch the YAML config |
| `--output-dir DIR` | Download to a custom directory instead of models/pocket_tts_onnx |
To download the assets manually instead of using the installer:
hf download KevinAHM/pocket-tts-onnx \
--local-dir models/pocket_tts_onnx \
--local-dir-use-symlinks False \
--include "onnx/**" "tokenizer.model" "pocket_tts_onnx/**" "pocket_tts_onnx.py"
What gets downloaded:
models/pocket_tts_onnx/
  onnx/
    flow_lm_main_int8.onnx    # Main language model (INT8)
    flow_lm_flow_int8.onnx    # Flow model (INT8)
    mimi_decoder_int8.onnx    # Audio decoder (INT8)
    mimi_encoder.onnx         # Audio encoder
    text_conditioner.onnx     # Text conditioning
    flow_lm_main.onnx         # FP32 variants (optional)
    flow_lm_flow.onnx
    mimi_decoder.onnx
  tokenizer.model
  pocket_tts_onnx.py          (or pocket_tts_onnx/)
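A quick way to confirm the download completed is to check for the files listed above. The sketch below is a hypothetical helper, not part of the installer. It assumes the default models/pocket_tts_onnx layout and treats the five INT8-set ONNX files plus tokenizer.model as the required assets, which is an assumption based on the listing rather than on the adapter's own checks.

```python
from pathlib import Path

# File names copied from the listing above. Treating exactly these as the
# required INT8 set is an assumption of this sketch.
INT8_ONNX_FILES = [
    "flow_lm_main_int8.onnx",
    "flow_lm_flow_int8.onnx",
    "mimi_decoder_int8.onnx",
    "mimi_encoder.onnx",
    "text_conditioner.onnx",
]


def missing_assets(root="models/pocket_tts_onnx"):
    """Return the expected asset paths that do not exist under root."""
    root = Path(root)
    expected = [root / "onnx" / name for name in INT8_ONNX_FILES]
    expected.append(root / "tokenizer.model")
    return [str(path) for path in expected if not path.is_file()]
```

Calling `missing_assets()` from the project root returns an empty list when everything is in place; pass a different root if you used `--output-dir`.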
If you used the installer without --no-config-update, the config is already updated. Otherwise, edit tldw_Server_API/Config_Files/tts_providers_config.yaml:
providers:
  pocket_tts:
    enabled: true              # <-- change to true
    model_path: "models/pocket_tts_onnx/onnx"
    tokenizer_path: "models/pocket_tts_onnx/tokenizer.model"
    module_path: "models/pocket_tts_onnx"
    precision: "int8"          # "int8" (CPU optimized) or "fp32"
    device: "auto"             # "auto" | "cpu" | "cuda"
    temperature: 0.7
    lsd_steps: 10
    max_frames: 500
    sample_rate: 24000

| Setting | Default | Description |
|---|---|---|
| `precision` | `int8` | Use `fp32` for slightly higher quality at the cost of speed and memory |
| `device` | `auto` | Force `cpu` or `cuda`; `auto` picks CUDA if available |
| `temperature` | `0.7` | Controls randomness in generation (lower = more deterministic) |
| `max_frames` | `500` | Maximum output length in frames; increase for longer passages |
| `lsd_steps` | `10` | Denoising steps; more steps = higher quality, slower |
Restart the server, then run these checks:
Check that PocketTTS appears as an available provider:
curl -s http://127.0.0.1:8000/api/v1/audio/providers \
-H "X-API-KEY: $SINGLE_USER_API_KEY" | jq
You should see "pocket_tts" in the list.
Check the voice catalog:
curl -s http://127.0.0.1:8000/api/v1/audio/voices/catalog \
-H "X-API-KEY: $SINGLE_USER_API_KEY" | jq '.pocket_tts'
PocketTTS has no built-in voices (voice cloning is required), so this will show an empty or minimal listing. That's expected.
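The provider check can also be scripted. The sketch below is one possible approach using only the standard library. Because the exact shape of the providers response is not described here, it simply searches the returned JSON for the name as a key or string value. The `mentions` and `provider_listed` helpers are hypothetical, and the API key is assumed to be in the `SINGLE_USER_API_KEY` environment variable, as in the curl commands.

```python
import json
import os
from urllib import request

BASE_URL = "http://127.0.0.1:8000"


def mentions(data, name):
    """Return True if name appears as a key or string value anywhere in data."""
    if isinstance(data, dict):
        return any(k == name or mentions(v, name) for k, v in data.items())
    if isinstance(data, list):
        return any(mentions(item, name) for item in data)
    return data == name


def provider_listed(name="pocket_tts"):
    """Fetch the providers endpoint and look for name in the JSON body."""
    req = request.Request(
        BASE_URL + "/api/v1/audio/providers",
        headers={"X-API-KEY": os.environ["SINGLE_USER_API_KEY"]},
    )
    with request.urlopen(req) as resp:
        return mentions(json.load(resp), name)
```

With the server running, `provider_listed()` should return `True` once PocketTTS is enabled.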
PocketTTS needs a reference audio clip of the voice you want to clone.
| Property | Requirement |
|---|---|
| Duration | 1-60 seconds (5-15 seconds recommended for best quality) |
| Channels | Mono |
| Sample rate | 24 kHz (the adapter auto-converts, but native is best) |
| Format | WAV, MP3, FLAC, OGG, or M4A |
| Content | Single speaker, clear speech, minimal background noise |
If your source audio doesn't meet these specs, convert it:
ffmpeg -i input.mp3 -ar 24000 -ac 1 -c:a pcm_s16le voice_sample.wav
To extract a 10-second clip starting at 5 seconds:
ffmpeg -i input.mp3 -ss 5 -t 10 -ar 24000 -ac 1 -c:a pcm_s16le voice_sample.wav
- Record in a quiet room with minimal echo
- Maintain consistent volume and distance from the microphone
- Use natural, conversational speech
- Trim silence from the beginning and end
- Avoid music, other speakers, or background noise
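Before sending a sample, it can help to confirm it matches the requirements table. The following is a minimal sketch using Python's standard `wave` module, so it only handles PCM WAV files such as the ones produced by the FFmpeg commands above. The `check_reference_wav` helper is hypothetical, and its thresholds are taken from the table; the server's own validation may differ.

```python
import wave


def check_reference_wav(path):
    """Return a list of problems found in a WAV voice sample (empty if none)."""
    with wave.open(str(path), "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate

    problems = []
    if not 1.0 <= duration <= 60.0:
        problems.append(f"duration is {duration:.1f}s, expected 1-60s")
    if channels != 1:
        problems.append(f"{channels} channels, expected mono")
    if rate != 24000:
        problems.append(f"sample rate is {rate} Hz, 24000 Hz is preferred")
    return problems
```

For example, `check_reference_wav("voice_sample.wav")` returns an empty list for a clip that is mono, 24 kHz, and between 1 and 60 seconds long.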
With your voice sample ready, send a TTS request with the audio base64-encoded in the voice_reference field.
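# Base64-encode the voice sample; strip the line wraps that GNU base64 adds,
# since raw newlines inside the JSON string make the request body invalid.
# Long samples can exceed the shell's argument-length limit when inlined like
# this; the Python version of this request avoids that.
VOICE_B64=$(base64 < voice_sample.wav | tr -d '\n')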
# Generate speech with the cloned voice
curl -sS -X POST http://127.0.0.1:8000/api/v1/audio/speech \
-H "X-API-KEY: $SINGLE_USER_API_KEY" \
-H "Content-Type: application/json" \
-d "{
\"model\": \"pocket_tts\",
\"input\": \"Hello, this is my cloned voice speaking through PocketTTS.\",
\"voice\": \"clone\",
\"voice_reference\": \"$VOICE_B64\",
\"response_format\": \"mp3\"
}" \
--output cloned_output.mp3
The same request in Python:
import base64
import json
from pathlib import Path
from urllib import request

# Read and encode the voice sample
voice_b64 = base64.b64encode(
    Path("voice_sample.wav").read_bytes()
).decode()

# Build the request
payload = {
    "model": "pocket_tts",
    "input": "Hello, this is my cloned voice speaking through PocketTTS.",
    "voice": "clone",
    "voice_reference": voice_b64,
    "response_format": "mp3",
}
req = request.Request(
    "http://127.0.0.1:8000/api/v1/audio/speech",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "X-API-KEY": "YOUR_API_KEY",
    },
    method="POST",
)

# Send the request and save the returned audio
with request.urlopen(req) as resp:
    Path("cloned_output.mp3").write_bytes(resp.read())
print("Saved cloned_output.mp3")
Add "stream": true to the request body for progressive audio delivery. The response will be a chunked transfer-encoded stream:
VOICE_B64=$(base64 < voice_sample.wav | tr -d '\n')
curl -sS -X POST http://127.0.0.1:8000/api/v1/audio/speech \
-H "X-API-KEY: $SINGLE_USER_API_KEY" \
-H "Content-Type: application/json" \
-d "{
\"model\": \"pocket_tts\",
\"input\": \"Streaming cloned voice output from PocketTTS.\",
\"voice\": \"clone\",
\"voice_reference\": \"$VOICE_B64\",
\"response_format\": \"mp3\",
\"stream\": true
}" \
--output cloned_stream.mp3
You can override generation settings per-request via extra_params:
{
"model": "pocket_tts",
"input": "Custom generation settings.",
"voice": "clone",
"voice_reference": "<base64>",
"extra_params": {
"max_frames": 800,
"temperature": 0.5,
"stream_first_chunk_frames": 3,
"stream_target_buffer_sec": 0.15,
"stream_max_chunk_frames": 20
}
}
Sending base64 audio on every request is inefficient if you reuse the same voice. Upload the sample once and reference it by ID.
curl -sS -X POST http://127.0.0.1:8000/api/v1/audio/voices/upload \
-H "X-API-KEY: $SINGLE_USER_API_KEY" \
-F "file=@voice_sample.wav" \
-F "name=MyVoice" \
-F "provider=pocket_tts"
The response includes a voice_id (e.g., "voice_id": "abc123..."). Save this value.
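The upload can also be done from Python without third-party packages. The sketch below hand-builds a multipart/form-data body with the same three fields as the curl command (`file`, `name`, `provider`). The `build_multipart` and `upload_voice` helpers are hypothetical names introduced for illustration, and the file part is sent as `application/octet-stream` on the assumption that the server does not require a specific audio content type.

```python
import json
import uuid
from pathlib import Path
from urllib import request

UPLOAD_URL = "http://127.0.0.1:8000/api/v1/audio/voices/upload"


def build_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode()
        )
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def upload_voice(path, name, api_key, provider="pocket_tts"):
    """Upload a voice sample and return the parsed JSON response."""
    body, content_type = build_multipart(
        {"name": name, "provider": provider},
        "file",
        Path(path).name,
        Path(path).read_bytes(),
    )
    req = request.Request(
        UPLOAD_URL,
        data=body,
        headers={"Content-Type": content_type, "X-API-KEY": api_key},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

A call such as `upload_voice("voice_sample.wav", "MyVoice", api_key)["voice_id"]` would then yield the ID to store, assuming the response carries the `voice_id` field described above at the top level.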
Use custom:<voice_id> as the voice field - no voice_reference needed:
curl -sS -X POST http://127.0.0.1:8000/api/v1/audio/speech \
-H "X-API-KEY: $SINGLE_USER_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "pocket_tts",
"input": "Reusing my uploaded voice without sending the audio again.",
"voice": "custom:abc123...",
"response_format": "mp3"
}' \
--output reused_voice.mp3
Voice records are persisted in the per-user voice registry database, so they survive server restarts.
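Putting the pieces together, a stored voice can be combined with streaming and per-request overrides. The sketch below is illustrative only: `build_payload` and `synthesize_to_file` are hypothetical helpers, and the field names (`stream`, `extra_params`, `custom:<voice_id>`) are the ones shown in the requests above. The audio is written to disk chunk by chunk as it arrives rather than buffered in memory.

```python
import json
from urllib import request

SPEECH_URL = "http://127.0.0.1:8000/api/v1/audio/speech"


def build_payload(text, voice_id, stream=False, extra_params=None):
    """Build a speech request body that reuses an uploaded voice."""
    payload = {
        "model": "pocket_tts",
        "input": text,
        "voice": f"custom:{voice_id}",
        "response_format": "mp3",
    }
    if stream:
        payload["stream"] = True
    if extra_params:
        payload["extra_params"] = dict(extra_params)
    return payload


def synthesize_to_file(payload, api_key, out_path, chunk_size=8192):
    """POST the payload and write the audio to out_path as it arrives."""
    req = request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", "X-API-KEY": api_key},
        method="POST",
    )
    with request.urlopen(req) as resp, open(out_path, "wb") as out:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
```

For instance, `build_payload("Hello again.", "abc123...", stream=True, extra_params={"temperature": 0.5})` produces a body with no `voice_reference` field, which keeps each request small.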
When a TTS request omits model and voice, the server uses defaults. Here are three ways to make PocketTTS your default.
Edit tldw_Server_API/Config_Files/tts_providers_config.yaml and move pocket_tts to the top of the priority list:
provider_priority:
- pocket_tts # <-- first in line
- openai
- kokoro
# ...
Note: PocketTTS always requires a voice reference. Requests that omit both `voice_reference` and a `custom:<voice_id>` voice will fail. If you want a zero-configuration fallback, keep a non-cloning provider (like `openai` or `kokoro`) in the priority list as well.
Alternatively, set the defaults with environment variables:
export TTS_DEFAULT_PROVIDER=pocket_tts
export TTS_DEFAULT_VOICE=custom:abc123...   # your uploaded voice ID
Or edit tldw_Server_API/Config_Files/config.txt:
[TTS-Settings]
default_provider = pocket_tts
default_voice = custom:abc123...
After any of these changes, restart the server for the new defaults to take effect.
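To double-check what was written to config.txt, the `[TTS-Settings]` section can be read back. The sketch below assumes the file is INI-style and parseable by Python's `configparser`, which matches the snippet above but is not otherwise confirmed here. `read_tts_defaults` is a hypothetical helper, and it does not account for environment variables or the YAML priority list.

```python
import configparser


def read_tts_defaults(path="tldw_Server_API/Config_Files/config.txt"):
    """Return (default_provider, default_voice) from the [TTS-Settings] section."""
    # Interpolation is disabled so that any '%' in other values is left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path, encoding="utf-8")
    if not parser.has_section("TTS-Settings"):
        return None, None
    section = parser["TTS-Settings"]
    return section.get("default_provider"), section.get("default_voice")
```

Both values come back as `None` when the file or the section is missing.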
| Symptom | Likely Cause | Fix |
|---|---|---|
| `pocket_tts` not in provider list | Provider not enabled or assets missing | Verify `enabled: true` in YAML, check model paths exist, restart server |
| "PocketTTS requires voice_reference audio bytes" | No voice sample provided | Include `voice_reference` (base64) or use `custom:<voice_id>` |
| "voice reference validation failed" | Audio outside 1-60s, wrong format, or corrupt file | Check duration and format; re-encode with FFmpeg (see Step 5) |
| `ImportError` for `pocket_tts_onnx` | Runtime deps not installed | Run `pip install -e '.[TTS_pocket_tts]'` |
| "PocketTTS models directory not found" | Wrong `model_path` in config | Verify the path points to the onnx/ subdirectory containing the .onnx files |
| Slow generation | FP32 precision or large `max_frames` | Switch to `precision: int8`, reduce `max_frames`, or try `device: cuda` if GPU available |
| Poor voice quality | Reference audio too short, noisy, or multi-speaker | Use a longer (5-15s), clean, single-speaker sample; adjust temperature (lower = more stable) |
| "PocketTTS ONNX assets missing" | Incomplete download | Re-run installer with `--force`, or manually verify all 5 ONNX files exist |
Install: pip install -e '.[TTS_pocket_tts]' && pip install -U "huggingface_hub"
Download: python Helper_Scripts/TTS_Installers/install_tts_pocket_tts_onnx.py
Config: tldw_Server_API/Config_Files/tts_providers_config.yaml
Providers: GET /api/v1/audio/providers
Voices: GET /api/v1/audio/voices/catalog?provider=pocket_tts
Synthesize: POST /api/v1/audio/speech
Upload: POST /api/v1/audio/voices/upload