Historically, TTS voice and language creation has been a laborious, complex task that only programmers and data scientists could pull off. This is beginning to change thanks to the rise of neural TTS models, but unfortunately not everyone has hardware capable of running these computationally expensive models.
Under certain circumstances, neural models are less practical than older approaches such as formant synthesis, which are designed for speed. Some neural TTS systems and even non-TTS-related applications, such as Subaligner, use formant synthesizers like eSpeak-NG as a backbone. Unfortunately, this means that in order to support multiple languages, Subaligner depends on eSpeak-NG’s language support. eSpeak-NG has good support for the majority of European languages, but it fails to accurately parse many other languages, especially those that use special, non-Latin characters.
As long as languages and voices have to be created and tuned by hand by professionals, many languages might never be improved or even created in the first place. We believe anyone with a decent computer and recording equipment should be able to contribute to voice and language creation for any TTS system. Voice-Creator-Studio, referred to as VCS for the rest of this document, aims to make that happen.
The interface should be similar, but not identical, to that of the Piper Recording Studio interface, referred to as PRS for the rest of this document. Much like PRS, VCS should be a server-hosted web app. The user records their voice, and the app takes care of cleaning up the audio and transcribing it in the background, though the user will be able to fix transcription errors if need be. Where VCS differs from PRS is that the user will be able to create multiple voice projects or download voice projects from a repository if they don’t want to record their own voice. Safety precautions will be paramount to ensure bad actors cannot easily steal other people’s voices.
VCS should allow the user to train voices for:
- Neural engines such as Piper or Tortoise
- HMM-based systems such as RHVoice, Festival, Flite, or Open JTalk
- Concatenative systems such as Festival or Flite
- Singing synthesis engines such as UTAU, DiffSinger, and NNSVS/ENUNU
- Any other system that supports plugins (more on plugins later)
Most notably, VCS should allow the user to create a language configuration for formant systems such as eSpeak-NG simply by recording a voice in that language. Using machine-learning models, VCS will do all the heavy lifting—analyzing intonation and context patterns, writing parameters, etc.—and will compile the synthesizer if need be. The user should never have to worry about any of the complicated parts of voice or language creation for any synthesizer unless they choose to.
Behind the scenes, each engine family expects a different dataset layout (a validation sketch for the neural layout follows this list):
- Neural engines (e.g., Piper, Tortoise): Training data lives in a `wavs/` directory of mono 16 kHz PCM files with a companion `metadata.csv` containing `id|transcript|normalized_text` entries. Language-specific text normalization occurs during preprocessing, and datasets are split into train/validation sets.
- HMM-based engines (e.g., RHVoice, Festival, Flite, Open JTalk): Audio is resampled to 22.05 kHz or 24 kHz and paired with `.lab` files or manifests. Preprocessing produces contextual labels and duration files consumed by the trainer.
- Concatenative engines (Festival/Flite): Larger recordings include annotated label files; preprocessing segments them and extracts pitch and phoneme features for later concatenation.
- Singing-synthesis engines (UTAU, DiffSinger, NNSVS/ENUNU): Datasets combine audio with pitch/timing annotations such as MusicXML or UST. Preprocessing converts these to engine-specific score and acoustic feature representations.
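To make the neural layout above concrete, here is a minimal validation sketch using only the Python standard library; the function name, expected sample rate default, and error messages are hypothetical, not part of a finalized VCS API:

```python
# Hypothetical sketch: sanity-check a Piper-style dataset before queuing a training job.
# The layout (wavs/ + metadata.csv with id|transcript|normalized_text rows) follows the
# description above; everything else here is illustrative.
import csv
import wave
from pathlib import Path


def validate_neural_dataset(root: Path, expected_rate: int = 16_000) -> list[str]:
    """Return a list of human-readable problems found in the dataset."""
    problems: list[str] = []
    metadata = root / "metadata.csv"
    if not metadata.exists():
        return [f"missing {metadata}"]

    with metadata.open(newline="", encoding="utf-8") as fh:
        for row in csv.reader(fh, delimiter="|"):
            if len(row) != 3:
                problems.append(f"malformed row: {row!r}")
                continue
            utt_id, transcript, _normalized = row
            wav_path = root / "wavs" / f"{utt_id}.wav"
            if not wav_path.exists():
                problems.append(f"missing audio for {utt_id}")
                continue
            with wave.open(str(wav_path)) as wav:
                if wav.getnchannels() != 1 or wav.getframerate() != expected_rate:
                    problems.append(f"{utt_id}: expected mono {expected_rate} Hz PCM")
            if not transcript.strip():
                problems.append(f"{utt_id}: empty transcript")
    return problems
```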
Training and cleanup tasks run asynchronously through Celery with Redis as broker and result store. Each job receives a unique ID, and workers are allocated according to available CPU or GPU resources. Dedicated queues and worker concurrency limits prevent oversubscription and allow GPU‑enabled jobs to be routed separately from CPU‑only work.
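A minimal sketch of what that routing could look like with Celery and Redis follows; the queue names, task names, and broker URLs are illustrative assumptions, not the project's actual configuration:

```python
# Illustrative Celery configuration; queue and task names are assumptions,
# not the project's actual identifiers.
from celery import Celery

app = Celery(
    "vcs",
    broker="redis://localhost:6379/0",    # Redis as message broker
    backend="redis://localhost:6379/1",   # Redis as result store
)

app.conf.task_routes = {
    "vcs.tasks.train_voice": {"queue": "gpu"},   # GPU-bound training jobs
    "vcs.tasks.clean_audio": {"queue": "cpu"},   # CPU-only cleanup work
    "vcs.tasks.transcribe": {"queue": "cpu"},
}
app.conf.task_acks_late = True           # requeue work if a worker dies mid-job
app.conf.worker_prefetch_multiplier = 1  # keep workers from hoarding long-running jobs

# Workers would then be started per pool with their own concurrency limits, e.g.:
#   celery -A vcs worker -Q gpu --concurrency=1
#   celery -A vcs worker -Q cpu --concurrency=4
```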
Workers publish progress percentages and log snippets to Redis. Clients poll the /progress/{job_id} endpoint or subscribe to a WebSocket feed to display progress bars. Users can cancel a running task, which preserves intermediate checkpoints; resuming queues a new job that picks up from the saved state.
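A rough sketch of the polling endpoint, assuming workers write a small JSON progress payload to Redis under a per-job key; the key layout and payload shape are assumptions:

```python
# Minimal polling endpoint sketch. The Redis key layout ("progress:<job_id>")
# and payload shape are assumptions for illustration.
import json

import redis.asyncio as aioredis
from fastapi import FastAPI, HTTPException

app = FastAPI()
redis_client = aioredis.Redis(host="localhost", port=6379, db=2)


@app.get("/progress/{job_id}")
async def get_progress(job_id: str) -> dict:
    raw = await redis_client.get(f"progress:{job_id}")
    if raw is None:
        raise HTTPException(status_code=404, detail="unknown job id")
    # Workers are assumed to store something like {"percent": 42, "log_tail": "..."}.
    return json.loads(raw)
```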
VCS is intended to be licensed under the AGPL to ensure all improvements are given back to the open-source community. That said, not all speech engines are licensed under the GPL family of licenses, and the goal of VCS is to make it possible to train voices for as many engines as possible. To that end, a developer should be able to write plugins for new speech engines, and those plugins should be permitted to be licensed separately from VCS itself. We can’t include Piper with VCS due to licensing complications, but we would be able to write a Piper plugin and license it under the MIT license or another similarly permissive license. Plugins will also be able to add other improvements besides support for additional TTS engines.
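One possible shape for that plugin contract is sketched below; the class, method, and entry-point group names are placeholders rather than a finalized API:

```python
# Hypothetical plugin contract; names are placeholders, not a finalized API.
# Requires Python 3.10+ for the entry_points(group=...) signature.
from abc import ABC, abstractmethod
from importlib.metadata import entry_points
from pathlib import Path


class EnginePlugin(ABC):
    """Base class a separately licensed engine plugin would implement."""

    name: str  # e.g. "piper"

    @abstractmethod
    def prepare_dataset(self, project_dir: Path) -> Path:
        """Convert a VCS project into the engine's expected dataset layout."""

    @abstractmethod
    def train(self, dataset_dir: Path, output_dir: Path) -> Path:
        """Run training and return the path of the exported voice."""


def discover_plugins() -> dict[str, type[EnginePlugin]]:
    """Load plugins advertised under a 'vcs.engines' entry-point group."""
    return {ep.name: ep.load() for ep in entry_points(group="vcs.engines")}
```

Distributing plugins as separate packages that register an entry point would let them carry their own licenses while VCS itself stays AGPL.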
- Language/Framework: Python with FastAPI provides an async REST API that can easily integrate with machine-learning libraries used for training and synthesis.
- Packaging: Poetry manages dependencies and virtual environments.
- Audio Cleanup: noisereduce removes background noise and pydub handles normalization.
- Transcription: OpenAI Whisper (medium model) produces transcripts stored as editable JSON with word-level timestamps.
- Asynchronous Jobs: Celery with Redis runs background tasks and streams progress updates to clients via WebSockets (see docs/asynchronous-jobs.md).
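As a concrete illustration of the Audio Cleanup and Transcription items above, here is a rough pipeline sketch assuming noisereduce, pydub, and openai-whisper as listed; the helper name and output layout are hypothetical:

```python
# Rough sketch of the cleanup + transcription pipeline described above.
# The helper name and output layout are illustrative, not the project's API.
import json
from pathlib import Path

import noisereduce as nr
import numpy as np
import whisper
from pydub import AudioSegment
from pydub.effects import normalize


def clean_and_transcribe(wav_path: Path, model_name: str = "medium") -> dict:
    # Load, normalize loudness, and down-mix to mono with pydub.
    audio = normalize(AudioSegment.from_file(str(wav_path))).set_channels(1)

    # Spectral-gating noise reduction on the raw samples (assumes 16-bit PCM input).
    samples = np.array(audio.get_array_of_samples()).astype(np.float32)
    reduced = nr.reduce_noise(y=samples, sr=audio.frame_rate)

    cleaned = AudioSegment(
        data=reduced.astype(np.int16).tobytes(),
        sample_width=2,
        frame_rate=audio.frame_rate,
        channels=1,
    )
    cleaned_path = wav_path.with_suffix(".clean.wav")
    cleaned.export(str(cleaned_path), format="wav")

    # Whisper transcription with word-level timestamps, stored as editable JSON.
    model = whisper.load_model(model_name)
    result = model.transcribe(str(cleaned_path), word_timestamps=True)
    wav_path.with_suffix(".json").write_text(json.dumps(result["segments"], indent=2))
    return result
```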
- Framework: React with TypeScript via Vite for fast development and production builds.
- Styling: Tailwind CSS supplies a utility-first approach that keeps styles consistent and composable.
- Code Quality: `pre-commit` runs Black, isort, Flake8, and MyPy for Python code, plus ESLint and Prettier for the frontend.
- Environment: Docker images ensure reproducible development and deployment setups.
- Continuous Integration: GitHub Actions lint and test both backend and frontend on every pull request.
- Deployment: Successful builds publish a multi-service Docker image that serves the FastAPI backend and the static React frontend.
See docs/deployment.md for container deployment targets, scaling strategies for web traffic and background workers, and hardware requirements for training and inference.
See docs/project-management.md for details on project fields, repository import/export formats, and switching or sharing projects.
Use OAuth 2.0 with JWT tokens:
- Rely on a trusted provider to handle user login and issue short-lived, signed JWTs that accompany every request.
- Store minimal data in each token (user ID, roles, expiration) and rotate signing keys regularly.
- Provide refresh tokens only for long-lived sessions when necessary.
- Roles: `admin`, `creator`, `consumer`, `guest`.
  - `admin`: full administrative access.
  - `creator`: upload/create voices and manage their own content.
  - `consumer`: browse, purchase, and listen to authorized voices.
  - `guest`: limited read-only access to public samples.
- Permissions: Map roles to allowed actions at endpoint or resource level.
- Rate limiting: Apply per-IP and per-token limits (e.g., 100 requests/minute). Use tighter limits for resource-intensive tasks like voice generation or downloads (e.g., 10 generations/hour) and allow higher thresholds for privileged roles.
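A hedged sketch of how per-request verification and role checks might look with FastAPI and PyJWT; the claim names, key handling, and role-to-permission mapping are assumptions for illustration:

```python
# Sketch of per-request JWT verification and role checks with FastAPI and PyJWT.
# Signing-key handling, claim names, and the permission map are assumptions.
import jwt  # PyJWT
from fastapi import Depends, HTTPException
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

bearer = HTTPBearer()
PUBLIC_KEY = "..."  # provider's current public signing key (rotated regularly)

ROLE_PERMISSIONS = {
    "admin": {"manage_users", "train_voice", "download_voice", "listen"},
    "creator": {"train_voice", "download_voice", "listen"},
    "consumer": {"download_voice", "listen"},
    "guest": {"listen"},
}


def current_claims(creds: HTTPAuthorizationCredentials = Depends(bearer)) -> dict:
    try:
        # Verifies signature and expiration; audience/issuer checks could be added.
        return jwt.decode(creds.credentials, PUBLIC_KEY, algorithms=["RS256"])
    except jwt.PyJWTError:
        raise HTTPException(status_code=401, detail="invalid or expired token")


def require(permission: str):
    def checker(claims: dict = Depends(current_claims)) -> dict:
        role = claims.get("role", "guest")
        if permission not in ROLE_PERMISSIONS.get(role, set()):
            raise HTTPException(status_code=403, detail="insufficient permissions")
        return claims
    return checker

# Usage sketch: @app.post("/voices/train", dependencies=[Depends(require("train_voice"))])
```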
- Access Control on Storage – store assets in secure buckets with strict IAM policies, short-lived signed URLs, and encryption at rest and in transit.
- Authorization Checks – verify JWT signature, expiration, and permissions on every request and log access for auditing.
- Download Restrictions – throttle repeated download attempts and watermark files to trace leaks.
- Monitoring & Alerts – watch for unusual download patterns and alert on anomalies.
- Revocation & Expiration – enable rapid revocation of tokens or signed URLs and allow users to revoke their own tokens.
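As an example of the short-lived signed URLs mentioned above, assuming S3-compatible object storage accessed through boto3; the bucket name and expiry are placeholders:

```python
# Example of issuing a short-lived signed download URL, assuming S3-compatible
# object storage via boto3; bucket name and expiry are placeholders.
import boto3

s3 = boto3.client("s3")


def signed_voice_url(voice_key: str, expires_seconds: int = 300) -> str:
    """Return a URL granting read access to a single voice asset for a few minutes."""
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "vcs-voice-assets", "Key": voice_key},
        ExpiresIn=expires_seconds,
    )
```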