SillyTavern — WebSocket TTS Real-Time Streaming

A SillyTavern extension that registers as a proper TTS provider — giving you the Narrate button, per-character voice map, and voice preview — with a bonus real-time streaming mode that speaks each sentence as the LLM generates it, instead of waiting for the full reply.

Works with any TTS server that implements the simple WebSocket interface described in SERVER.md. The reference server implementation is faster-qwen3-tts, but you can use any model (XTTS, Piper, Kokoro, Orpheus, custom, …).

How it works

LLM token stream → WebSocket /ws/tts → TTS server generates per sentence → WAV audio in browser

Real-time streaming mode:

When ST starts an LLM reply, the extension opens a WebSocket to your TTS server.
Each token is forwarded incrementally.
The server buffers tokens, detects sentence boundaries, and generates audio per sentence.
Each WAV is scheduled and played immediately via the Web Audio API.
[END] is sent when the LLM finishes to flush any remaining text.

Result: The first sentence starts playing ~1–3 s after the LLM begins typing.

Narrate / message replay:
The extension also handles the Narrate button and per-message audio replay by collecting the full audio over the same WebSocket and returning it to ST's audio system — same endpoint, no extra server work needed.

Prerequisites

Requirement	Notes
A TTS server implementing `GET /speakers` + `WS /ws/tts`	See SERVER.md for the interface spec. Reference: faster-qwen3-tts
SillyTavern (recent release)	Tested on release branch
GPU recommended	For real-time generation speed; CPU works but adds latency

Installation

Copy the extension folder into ST's third-party extensions directory:

<SillyTavern>/public/scripts/extensions/third-party/SillyTavern-TTS-Real-Time-Streaming/

Docker (volume-mounted): if ST uses a volume mount for third-party, drop the folder there and hard-refresh.

Hard-refresh the browser (Ctrl+Shift+R). The extension loads automatically.

Configuration

Open Extensions panel → TTS (the headphone icon).
In the Provider dropdown, select "WS TTS (Streaming)".
The provider settings panel appears:

Setting	Description	Example
Provider Endpoint	WebSocket URL of your server's `/ws/tts` route	`ws://192.168.1.100:7860/ws/tts`
Language	Language for synthesis	`English`, `Japanese`, …
Volume	Playback volume (0.0 – 2.0)	`1.0`
Real-time Streaming	Speak while typing (vs. wait for Narrate)	✅

Click ↻ Refresh in the TTS panel to load available voices from the server.
Assign voices to characters in the Voice Map section (ST-managed, per character).
Click the 🔊 preview icon next to any voice to test it.

Real-time Streaming + Auto-read aloud

If Real-time Streaming is enabled:

The extension plays audio during generation.
When generation ends, ST's Auto-read aloud would play it a second time.
To avoid double-playback: go to TTS → Settings → uncheck "Auto-read aloud", OR rely only on the Narrate button for past messages (the extension auto-suppresses double-play for the most recent message).

Comparison with other TTS extensions

	XTTSv2 (built-in)	tts-webui	This extension
When TTS starts	After LLM finishes	After LLM finishes	As LLM types (per sentence)
Latency to first word	Full generation time	Full generation time	~1–3 s after LLM starts
Protocol	HTTP GET streaming	HTTP POST	WebSocket
Narrate button	✅	✅	✅
Per-character voice map	✅	✅	✅
Voice preview	✅	✅	✅
HTTPS required	No	Yes (AudioWorklet)	No

Troubleshooting

Provider doesn't appear in the TTS dropdown

Hard-refresh (Ctrl+Shift+R) — the JS module must fully load before the provider registers.
Check browser console for errors (F12).

"Refresh" shows no voices

Confirm the server is reachable: curl http://192.168.1.100:7860/speakers
Ensure the URL in Provider Endpoint is correct (the /ws/tts path; the extension strips it to reach /speakers).

Double audio

Disable Auto-read aloud in TTS settings, or rely on the 8-second dedup window.

First sentence is slow / cut off

The server needs a sentence-ending character before it flushes. Very short openers are normal.
If using Qwen3-TTS, try the lighter Qwen3-TTS-12Hz-0.6B-Base model for lower latency.

Narrate button plays silence

If you clicked Narrate within 8 s of a streamed reply, the dedup window suppressed it. Wait 8 s and try again.

Server API

See SERVER.md for full endpoint documentation, request/response formats, and timing requirements.

Files

SillyTavern-TTS-Real-Time-Streaming/
├── manifest.json   Extension metadata (v1.4.0)
├── index.js        Provider class + real-time streaming logic
├── README.md       This file
└── SERVER.md       Server interface specification

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SillyTavern — WebSocket TTS Real-Time Streaming

How it works

Prerequisites

Installation

Configuration

Real-time Streaming + Auto-read aloud

Comparison with other TTS extensions

Troubleshooting

Server API

Files

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
README.md		README.md
SERVER.md		SERVER.md
index.js		index.js
manifest.json		manifest.json

Folders and files

Latest commit

History

Repository files navigation

SillyTavern — WebSocket TTS Real-Time Streaming

How it works

Prerequisites

Installation

Configuration

Real-time Streaming + Auto-read aloud

Comparison with other TTS extensions

Troubleshooting

Server API

Files

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages