A local, modular harness for experimenting with realtime conversational AI stacks.
The implementation provides a runnable scaffold with clear provider boundaries:
- VAD: Silero VAD (ONNX on CPU) with configurable speech threshold, hysteresis, hangover, and minimum speech duration gating.
- ASR: faster-whisper with CUDA or CPU support, mock fallback included.
- LLM: llama.cpp OpenAI-compatible adapter with runtime-adjustable sampler settings.
- TTS: Pocket TTS and Kokoro ONNX voice output, mock tone fallback included.
- UI: local browser app with Chat and Settings tabs, text/voice turns, event stream, and latency metrics.
The default start profile is profiles/llamacpp-cuda-asr.json, which uses Silero VAD, CUDA faster-whisper ASR, llama.cpp, and Pocket TTS. A CPU ASR fallback is available at profiles/llamacpp-local.json. A no-model smoke-test profile is at profiles/mock-local.json.
Windows:
```powershell
.\install.ps1
.\start.ps1
```

Linux/macOS:

```bash
./install.sh
./start.sh
```

Then open http://127.0.0.1:7860.
- Python 3.11+ (the project venv uses whatever Python created it)
- llama.cpp server running separately for LLM inference (see llama.cpp section)
- CUDA 12+ (optional, for GPU ASR — the harness falls back to CPU if unavailable)
The CUDA ASR profile requires CTranslate2 to find CUDA 12 runtime DLLs. If your system CUDA toolkit is v13+, cublas64_12.dll will be missing. The fix:
```powershell
.\.venv\Scripts\pip install nvidia-cublas-cu12
```

The start.ps1 script automatically adds the NVIDIA bin directories from .venv to PATH before launching. If you see `Library cublas64_12.dll is not found or cannot be loaded`, verify this package is installed.
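As a quick sanity check, something like the following can confirm the DLL is present inside the venv; the `nvidia/cublas/bin` path layout is an assumption about how the pip wheel ships its binaries, not part of the harness:

```python
# Hypothetical check: look for cublas64_12.dll inside the venv's nvidia packages.
# The "**/nvidia/cublas/bin" layout is an assumption about the pip wheel, not the harness.
from pathlib import Path

hits = list(Path(".venv").glob("**/nvidia/cublas/bin/cublas64_12.dll"))
print("cublas64_12.dll found:" if hits else "cublas64_12.dll missing", *hits, sep="\n")
```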
The web UI captures microphone input and streams it to the backend as framed PCM:
- encoding: `pcm_s16le`
- sample rate: profile `audio.sample_rate`, default `16000`
- channels: profile `audio.channels`, default `1`
- frame size: profile `audio.frame_ms`, default `30`
Silero VAD runs on this stream and emits vad.probability, vad.speech_start, and vad.speech_end. Speech start cancels active TTS playback for barge-in support. A minimum speech duration gate (min_speech_duration_ms) filters out short noise bursts like keystrokes before they reach ASR.
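A minimal sketch of the minimum-speech-duration gate, assuming one Silero probability per PCM frame; the threshold, event names, and structure are illustrative, and the real provider's hysteresis and hangover handling is omitted:

```python
# Illustrative gate: only report a speech segment once it has lasted long enough.
# Thresholds and names are assumptions; the harness's real hysteresis/hangover logic differs.
def gate_speech(probs, frame_ms=30, threshold=0.5, min_speech_duration_ms=250):
    """Yield 'speech_start' / 'speech_end', suppressing runs shorter than the gate."""
    run_ms, started = 0, False
    for p in probs:                          # one Silero probability per PCM frame
        if p >= threshold:
            run_ms += frame_ms
            if not started and run_ms >= min_speech_duration_ms:
                started = True
                yield "speech_start"         # barge-in: the harness cancels TTS here
        else:
            if started:
                yield "speech_end"           # buffered utterance goes to ASR
            run_ms, started = 0, False

# Example: a 2-frame keystroke-like burst is filtered; the longer run passes.
events = list(gate_speech([0.1, 0.9, 0.8, 0.1] + [0.9] * 12 + [0.1]))
```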
The default CUDA profile uses faster-whisper with large-v3-turbo on GPU:
"asr": {
"provider": "faster-whisper",
"model": "large-v3-turbo",
"device": "cuda",
"compute_type": "float16"
}A CPU fallback profile is available:
```powershell
$env:HARNESS_PROFILE="profiles/llamacpp-local.json"
.\start.ps1
```

The model is loaded on first spoken utterance. If it is not already cached, faster-whisper/Hugging Face may download it at that point. After Silero emits vad.speech_end, the harness transcribes the buffered utterance and feeds the transcript into the same llama.cpp turn pipeline used by typed messages.
During ASR the UI reports progress stages such as queued, loading, loaded, segment, and complete. If first use appears slow, check server.err.log for Hugging Face download/cache warnings.
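For reference, a standalone faster-whisper call with the same settings as the CUDA profile looks roughly like this; the audio filename is a placeholder, and this bypasses the harness's own ASR provider and progress events:

```python
# Standalone faster-whisper sketch mirroring the profile's settings (not the harness's provider code).
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")
segments, info = model.transcribe("utterance.wav")   # placeholder file; the harness feeds buffered PCM
for seg in segments:                                  # segments stream in as they are decoded
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
print("detected language:", info.language)
```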
The default profile uses Pocket TTS with the voice azelma:
"tts": {
"provider": "pocket-tts",
"voice": "azelma"
}Pocket TTS loads the model and voice state on first use, then reuses them for later turns. The model is gated on Hugging Face, so accept the model terms for kyutai/pocket-tts and make sure your Hugging Face token is available if first use reports an access error.
For smoother playback, the local profiles feed Pocket TTS larger phrase-sized chunks and the browser schedules decoded PCM chunks on a shared Web Audio timeline instead of playing each chunk as a separate audio element.
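A rough sketch of what phrase-sized chunking means in practice: accumulate text and only emit a chunk at sentence punctuation once a minimum length is reached. The function and thresholds are illustrative stand-ins for the `tts_chunk_chars` / `min_tts_chars` settings described below, not the harness's actual splitter:

```python
# Illustrative phrase chunker: accumulate text up to roughly chunk_chars, split at
# sentence punctuation, and hold back fragments shorter than min_chars.
import re

def phrase_chunks(text, chunk_chars=120, min_chars=40):
    pieces = re.split(r"(?<=[.!?;])\s+", text.strip())
    buf = ""
    for piece in pieces:
        buf = f"{buf} {piece}".strip()
        if len(buf) >= min_chars and (len(buf) >= chunk_chars or piece.endswith((".", "!", "?"))):
            yield buf
            buf = ""
    if buf:
        yield buf   # flush any trailing fragment
```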
An additional profile is included for Kokoro v1.0 ONNX on CPU:
```powershell
$env:HARNESS_PROFILE="profiles/llamacpp-kokoro-onnx.json"
.\doctor.ps1
.\start.ps1
```

Kokoro uses local ONNX files and downloads the current v1.0 assets on first use into model-cache/kokoro. The runtime TTS selector in the web UI can switch between Pocket TTS and Kokoro without restarting the harness.
Kokoro is tuned separately from Pocket TTS because its CPU ONNX path benefits from shorter phrase chunks. The Kokoro profile and preset use lower tts_chunk_chars / min_tts_chars values, prewarm English G2P on model load, and expose optional CPU threading controls:
"onnx_intra_op_num_threads": 4,
"onnx_inter_op_num_threads": 1,
"preload_g2p": trueThe UI has a Settings tab (next to Chat in the topbar) with three sections:
Set a display name for yourself and the AI. Names are used:
- In chat bubble labels (replacing "You" and "Assistant")
- In the system prompt sent to the LLM
Adjust llama.cpp sampler parameters at runtime without restarting:
| Parameter | Description |
|---|---|
| Temperature | Sampling randomness (0.0 - 2.0) |
| Top K | Limit to top K tokens |
| Top P | Nucleus sampling threshold |
| Min P | Minimum probability threshold |
| Repeat Penalty | Discourages token repetition |
| Frequency Penalty | Penalizes frequent tokens |
| Presence Penalty | Penalizes present tokens |
| Max Tokens | Max tokens per turn (-1 = unlimited) |
Settings are pre-filled from the llama.cpp server's current defaults on startup. Edits are saved to user_settings.json and persist across restarts. The "Reset to Defaults" button clears all overrides.
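The same overrides can be applied outside the UI through the `PATCH /api/settings` endpoint listed in the HTTP API table below; a sketch, with the JSON field names assumed rather than taken from a documented schema:

```python
# Hypothetical runtime sampler tweak over the documented PATCH /api/settings endpoint.
# Field names ("temperature", "top_p") are assumptions, not a published schema.
import requests

resp = requests.patch(
    "http://127.0.0.1:7860/api/settings",
    json={"temperature": 0.7, "top_p": 0.9},
)
print(resp.status_code, resp.json())
```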
Import TavernAI V2 character cards (.png with embedded data or standalone .json). The card's name, description, personality, scenario, and example dialogue are assembled into the system prompt, replacing any automatic name-based prompt. Template variables {{user}} and {{char}} are resolved to the configured names.
You can add extra instructions in the "Additional instructions" textarea below the card, which are appended after the character data. The manual system prompt textarea on the Chat tab can also supplement the card.
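The variable substitution itself amounts to plain string replacement over the assembled card text; a sketch, where the card field names are assumed to follow the TavernAI V2 layout:

```python
# Sketch of {{user}}/{{char}} resolution and prompt assembly; the card field names
# are assumed from the TavernAI V2 card layout, not the harness's exact schema.
def resolve_card(card: dict, user_name: str, char_name: str) -> str:
    parts = [card.get(k, "") for k in ("description", "personality", "scenario", "mes_example")]
    prompt = "\n\n".join(p for p in parts if p)
    return prompt.replace("{{user}}", user_name).replace("{{char}}", char_name)
```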
The Chat tab has a "Continue" button next to Send. When the LLM response is cut off (by hitting the context limit or max tokens), clicking Continue asks the model to pick up where it left off. The last assistant message stays in the conversation history as a prefix, and new tokens are appended to the existing bubble.
| Endpoint | Method | Purpose |
|---|---|---|
| `/` | GET | Serves the web UI |
| `/api/status` | GET | Profile, providers, TTS runtime, settings |
| `/api/settings` | GET | Current runtime settings |
| `/api/settings` | PATCH | Update sampler/names/additional prompt |
| `/api/settings/character` | POST | Import character card (JSON body) |
| `/api/settings/character/upload` | POST | Upload character card file (PNG or JSON) |
| `/api/settings/character` | DELETE | Clear active character card |
| `/api/tts/select` | POST | Select TTS preset |
| `/api/tts/load` | POST | Load TTS model |
| `/api/tts/unload` | POST | Unload TTS model |
| `/api/tts/voice` | POST | Select TTS voice |
| `/ws/events` | WebSocket | Bidirectional event stream (sketch below) |
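A client can tail the event stream over `/ws/events`; a minimal sketch using the `websockets` package, assuming events arrive as JSON text frames (the exact message schema is not documented here):

```python
# Minimal event-stream tail over /ws/events; assumes events arrive as JSON text frames.
import asyncio
import json
import websockets

async def tail_events():
    async with websockets.connect("ws://127.0.0.1:7860/ws/events") as ws:
        async for message in ws:
            event = json.loads(message)   # schema assumed; adjust to the actual payload
            print(event)

asyncio.run(tail_events())
```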
- `install.*`: verifies Python 3.11+, creates or reuses `.venv`, installs Python dependencies, and prints next steps.
- `start.*`: launches the local web/backend server in the foreground. It honors `HARNESS_PROFILE` and `HARNESS_PORT`, refuses to start if the target port is already occupied, and prints the local URL. Windows also adds NVIDIA package bin directories from `.venv` to `PATH` for CUDA ASR.
- `stop.*`: stops this harness if it is listening on `HARNESS_PORT` (default `7860`). It leaves unrelated processes alone and force-stops only if graceful shutdown times out.
- `doctor.*`: checks Python, optional GPU tooling, optional llama.cpp server, and profile status.
- `update.*`: updates this checkout from `origin/main` only. It refuses to run with uncommitted changes, uses fast-forward-only merge behavior, and reinstalls requirements when `requirements.txt` changed.
Examples:
```powershell
.\start.ps1 -Port 7861 -Profile profiles/mock-local.json
.\update.ps1
```

```bash
HARNESS_PORT=7861 HARNESS_PROFILE=profiles/mock-local.json ./start.sh
./update.sh
```

The update scripts intentionally do not pull from the current branch if it is Beta or another working branch; origin/main is the only update source.
Profiles live in profiles/*.json. Available profiles:
| Profile | ASR | TTS | Notes |
|---|---|---|---|
| `llamacpp-cuda-asr.json` | CUDA float16 | Pocket TTS | Default. Requires CUDA 12 runtime. |
| `llamacpp-local.json` | CPU int8 | Pocket TTS | CPU fallback. No CUDA needed. |
| `llamacpp-kokoro-onnx.json` | CPU int8 | Kokoro ONNX | Kokoro TTS on CPU. |
| `mock-local.json` | Mock | Mock | No-model smoke test for UI and orchestration. |
To use a non-default profile:
```powershell
$env:HARNESS_PROFILE="profiles/mock-local.json"
.\start.ps1
```

The harness talks to llama.cpp through its OpenAI-compatible server API. Start llama-server separately, then use any llamacpp-* profile.
Example CUDA launch:
```powershell
llama-server -m C:\models\your-model.gguf --host 127.0.0.1 --port 8080 --alias local-gguf -ngl 999
```

Then verify:

```powershell
.\doctor.ps1
.\start.ps1
```

Notes:
- The harness uses `/v1/chat/completions` with streaming enabled. `doctor` checks `/health` and `/v1/models` so missing or still-loading models are visible.
- On startup, the harness fetches `/props` from llama.cpp to learn the server's current sampler defaults and pre-fills the Settings UI.
- Vulkan can be useful for llama.cpp on non-CUDA systems, but it is LLM-only here and should be treated as experimental for this voice stack.
- The default profile uses `"model": "auto"` and selects the first model reported by `/v1/models` (see the sketch after this list).
- If you prefer a fixed ID, set `llm.model` to the exact model ID reported by `/v1/models`, or launch `llama-server` with a matching `--alias`.
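A sketch of the `"model": "auto"` behavior plus a streamed turn against llama-server's OpenAI-compatible API, using the `openai` client purely as an illustration (the harness has its own HTTP client):

```python
# Illustration of "model": "auto" plus streaming against llama-server's OpenAI-compatible API.
# Uses the openai client for brevity; the harness's internal client may differ.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")
model_id = client.models.list().data[0].id          # "auto" -> first model from /v1/models
stream = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```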
The server runs a lifespan handler that cleans up on Ctrl+C or SIGTERM:
- Cancels all active TTS playback
- Calls `unload()` on all providers (releases CUDA GPU memory from ASR, drops VAD model)
- Discards all active WebSocket queues
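A condensed sketch of that shutdown shape, assuming a FastAPI-style lifespan; the framework choice and the provider/queue attribute names are assumptions, not the harness's actual code:

```python
# Condensed shutdown sketch; FastAPI lifespan and the provider/queue names are assumptions.
from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    yield                                   # app serves requests here
    app.state.tts.cancel_all_playback()     # hypothetical: stop active TTS playback
    for provider in app.state.providers:    # hypothetical: ASR, VAD, TTS, LLM adapters
        provider.unload()                   # releases CUDA memory, drops loaded models
    app.state.event_queues.clear()          # hypothetical: discard WebSocket queues

app = FastAPI(lifespan=lifespan)
```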
stop.ps1 tries graceful shutdown first with a 5-second timeout, then force-kills if needed.
Run tests:
```powershell
.\.venv\Scripts\python -m pytest
```

Optional live llama.cpp integration test:

```powershell
$env:HARNESS_TEST_LLAMA_CPP="1"
$env:HARNESS_LLAMA_CPP_MODEL="auto"
.\.venv\Scripts\python -m pytest tests\test_llamacpp_live.py
```