Version: v1.2.3 | Status: Active | Last Updated: March 2026
- Name: audio
- Category: Media Processing
- Dependencies: logging_monitoring, environment_setup
Speech-to-Text Transcription
- Input: Audio files (WAV, MP3, FLAC, OGG, M4A, WEBM)
- Output: TranscriptionResult with text, segments, timing
- Provider: Whisper via faster-whisper
Text-to-Speech Synthesis
- Input: Text string
- Output: SynthesisResult with audio data
- Providers: pyttsx3 (offline), edge-tts (neural)
Language Detection
- Input: Audio file
- Output: Language code and confidence score
STT_AVAILABLE # Speech-to-text functionality
TTS_AVAILABLE # Text-to-speech functionality
WHISPER_AVAILABLE # Whisper provider available
PYTTSX3_AVAILABLE # Offline TTS available
EDGE_TTS_AVAILABLE   # Neural TTS available
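A minimal sketch of gating features on these flags, assuming the package is importable as `audio` and re-exports the flags at top level (the no-argument `Transcriber()` constructor is also an assumption):

```python
# Sketch: check availability flags before constructing providers.
# Assumes the `audio` import path and that flags are re-exported at top level.
import audio

if audio.STT_AVAILABLE and audio.WHISPER_AVAILABLE:
    transcriber = audio.Transcriber()
else:
    transcriber = None  # degrade gracefully when Whisper deps are missing

if audio.TTS_AVAILABLE:
    # Prefer neural voices when edge-tts is installed, else offline pyttsx3.
    preferred = "edge-tts" if audio.EDGE_TTS_AVAILABLE else "pyttsx3"
```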
class Transcriber:
    def transcribe(audio_path, language=None, **kwargs) -> TranscriptionResult
    async def transcribe_async(audio_path, **kwargs) -> TranscriptionResult
    async def transcribe_stream(audio_path, **kwargs) -> AsyncIterator[TranscriptionResult]
    def detect_language(audio_path) -> tuple[str, float]
    def transcribe_batch(audio_paths, **kwargs) -> list[TranscriptionResult]
    def get_supported_languages() -> list[str]
    def unload() -> None
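A sketch of the core STT flow using the methods above; the `audio` import path and constructor defaults are assumptions:

```python
# Sketch: detect the language, then transcribe with it fixed.
from audio import Transcriber

transcriber = Transcriber()

# detect_language returns an ISO 639-1 code and a confidence score.
lang, confidence = transcriber.detect_language("meeting.wav")

# Passing the language explicitly skips re-detection during transcription.
result = transcriber.transcribe("meeting.wav", language=lang)
print(result.text)
print(f"{result.language} ({result.language_probability:.2f}), "
      f"{result.duration:.1f}s of audio")

transcriber.unload()  # release model memory when done
```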
class Synthesizer:
    def synthesize(text, voice=None, rate=1.0, **kwargs) -> SynthesisResult
    async def synthesize_async(text, **kwargs) -> SynthesisResult
    def synthesize_to_file(text, output_path, **kwargs) -> Path
    def synthesize_batch(texts, **kwargs) -> list[SynthesisResult]
    def list_voices(language=None) -> list[VoiceInfo]
    def get_voice(voice_id) -> Optional[VoiceInfo]
    def get_supported_languages() -> list[str]
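A corresponding TTS sketch; the import path and constructor defaults are again assumptions:

```python
# Sketch: synthesize in memory, save, or write directly to a file.
from audio import Synthesizer

synth = Synthesizer()

# In-memory synthesis returns a SynthesisResult carrying raw audio bytes.
result = synth.synthesize("Hello, world.", rate=1.2)
result.save("hello.wav")
print(result.provider, result.duration, result.size_kb)

# Or write directly to disk without holding the bytes yourself.
path = synth.synthesize_to_file("Saved directly.", "direct.wav")
```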
TranscriptionResult:
    text: str                     # Full transcription
    segments: list[Segment]       # Timed segments
    language: str                 # ISO 639-1 code
    language_probability: float   # 0.0-1.0
    duration: float               # Seconds
    processing_time: float        # Seconds
    model_size: WhisperModelSize
    source_path: Optional[Path]
    # Methods
    to_srt() -> str
    to_vtt() -> str
    to_json() -> dict
    save_srt(path) -> Path
    save_vtt(path) -> Path
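For example, exporting subtitles and archival JSON with the methods above (import path assumed, file names illustrative):

```python
# Sketch: subtitle and JSON export from a TranscriptionResult.
import json
from audio import Transcriber

result = Transcriber().transcribe("lecture.mp3")

srt_text = result.to_srt()       # SubRip string for further processing
result.save_vtt("lecture.vtt")   # WebVTT file, e.g. for HTML5 <track>

# to_json() is convenient for archiving text alongside timing metadata.
with open("lecture.json", "w") as f:
    json.dump(result.to_json(), f)
```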
SynthesisResult:
    audio_data: bytes
    format: AudioFormat   # WAV or MP3
    duration: float
    sample_rate: int
    voice_id: str
    text: str
    provider: str
    processing_time: float
    # Methods
    save(path) -> Path
    size_bytes: int
    size_kb: float
VoiceInfo:
    id: str               # Unique identifier
    name: str             # Display name
    language: str         # e.g., "en-US"
    gender: VoiceGender   # MALE, FEMALE, NEUTRAL
    is_neural: bool
    provider: str
    sample_rate: int
    styles: list[str]
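A sketch of voice selection driven by this metadata, assuming the same `audio` import path:

```python
# Sketch: pick a neural en-US voice from list_voices() metadata.
from audio import Synthesizer

synth = Synthesizer()
voices = synth.list_voices(language="en-US")
neural = [v for v in voices if v.is_neural]

if neural:
    result = synth.synthesize("Neural voice test.", voice=neural[0].id)
else:
    result = synth.synthesize("Offline fallback test.")  # provider default voice
```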
├── TranscriptionError          # STT failures
├── SynthesisError              # TTS failures
├── AudioFormatError            # Invalid format
├── ModelNotLoadedError         # Model not ready
├── ProviderNotAvailableError   # Missing deps
└── VoiceNotFoundError          # Invalid voice
All exceptions include context:
try:
    result = transcriber.transcribe(audio_path)
except TranscriptionError as e:
    e.context.get("audio_path")
    e.context.get("language")
    e.context.get("model_size")
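The hierarchy also supports graceful fallbacks. A sketch, assuming the exception classes are importable from `audio`; the voice id is illustrative, not a guaranteed-installed voice:

```python
# Sketch: retry with provider defaults when a specific voice or provider fails.
from audio import Synthesizer, ProviderNotAvailableError, VoiceNotFoundError

synth = Synthesizer()
try:
    result = synth.synthesize("Hello", voice="en-US-AriaNeural")
except (ProviderNotAvailableError, VoiceNotFoundError) as e:
    # The context dict carries the failing parameters for logging.
    print("Falling back to defaults:", e.context)
    result = synth.synthesize("Hello")
```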
TranscriptionConfig:
    language: Optional[str]   # None = auto-detect
    task: str                 # "transcribe" or "translate"
    beam_size: int = 5
    word_timestamps: bool = True
    vad_filter: bool = True
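A sketch passing these fields through `**kwargs`; whether `transcribe()` also accepts a `TranscriptionConfig` object directly is not specified here:

```python
# Sketch: transcription options as keyword arguments.
from audio import Transcriber

transcriber = Transcriber()
result = transcriber.transcribe(
    "interview.flac",
    language=None,        # None = auto-detect
    task="translate",     # emit an English translation instead of a transcript
    beam_size=5,
    word_timestamps=True,
    vad_filter=True,      # skip non-speech regions
)
```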
TTSConfig:
    voice: Optional[str]
    language: str = "en-US"
    rate: float = 1.0     # 0.5-2.0
    pitch: float = 1.0
    volume: float = 1.0   # 0.0-1.0
    format: AudioFormat = WAV
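The same `**kwargs` assumption applies on the TTS side; the voice id below is illustrative:

```python
# Sketch: synthesis options as keyword arguments.
from audio import Synthesizer

result = Synthesizer().synthesize(
    "Configured speech.",
    voice="en-US-AriaNeural",  # illustrative voice id
    rate=1.5,                  # 0.5-2.0
    pitch=1.0,
    volume=0.8,                # 0.0-1.0
)
```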
Dependencies
- logging_monitoring - Logging infrastructure
- environment_setup - Dependency validation

Integrates With
- documents - Save transcriptions
- llm - Process transcribed text
- coding - Voice-to-code workflows
Resource Management
- Whisper models: 1-10 GB VRAM depending on size
- Call transcriber.unload() when done
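A sketch of deterministic cleanup, assuming the `audio` import path:

```python
# Sketch: release model VRAM even if transcription raises.
from audio import Transcriber

transcriber = Transcriber()
try:
    result = transcriber.transcribe("long_recording.wav")
finally:
    transcriber.unload()  # frees the loaded Whisper model
```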
Network
- pyttsx3: No network required
- edge-tts: Requires internet
Thread Safety
- Create separate instances per thread
- Use async methods for concurrent processing (see the sketch below)
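A sketch of concurrent processing on one event loop with the async API, per the guidance above; import path and constructor defaults are assumptions:

```python
# Sketch: fan out transcriptions with asyncio.gather on a single instance.
import asyncio
from audio import Transcriber

async def main(paths):
    transcriber = Transcriber()
    try:
        return await asyncio.gather(
            *(transcriber.transcribe_async(p) for p in paths)
        )
    finally:
        transcriber.unload()

results = asyncio.run(main(["a.wav", "b.mp3"]))
```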
Versioning
- Follows semantic versioning
- Breaking changes in major versions only
- Provider additions are minor versions