More advanced models like Zyphra Zonos or Google's en-US-Chirp-HD-D sound as if the whole acoustic environment changes between generations. Google's output is usable, but Kokoro is far more consistent. I'm already passing a seed to Zyphra, yet the audio still varies noticeably between clips.
One way to do this would be to run Whisper on the generated audio and use the SRT timestamps to figure out where the slides should transition. That means handling malformed SRT output and guessing which transcript text matches which slide. It would probably improve quality, but it would also require many more tokens and running Whisper locally for each video.
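A minimal sketch of the matching step, assuming Whisper-style segments (each with a start time and a text field). The function name and the greedy fuzzy match via difflib are my own illustration, not the project's actual code:

```python
import difflib

def slide_transition_times(segments, slide_texts):
    """Guess the timestamp at which each slide should appear.

    segments: Whisper-style list of {"start": float, "text": str}.
    slide_texts: the narration script for each slide, in order.

    Fuzzy matching is needed because transcripts rarely match the
    script verbatim; this greedy pass just picks the best-scoring
    remaining segment for each slide.
    """
    transitions = []
    seg_idx = 0  # never match earlier than the previous slide's segment
    for slide in slide_texts:
        best_idx, best_score = seg_idx, -1.0
        for i in range(seg_idx, len(segments)):
            score = difflib.SequenceMatcher(
                None, slide.lower(), segments[i]["text"].lower()
            ).ratio()
            if score > best_score:
                best_idx, best_score = i, score
        transitions.append(segments[best_idx]["start"])
        seg_idx = best_idx + 1  # enforce slide order
    return transitions
```

A real pipeline would also need to handle segments that match no slide at all, which is where the "guessing which text is the right match" problem shows up.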