More advanced models like Zyphra Zonos or Google's en-US-Chirp-HD-D sound as if the whole acoustic environment changes between generations. Google's output is usable, but Kokoro is far more consistent. I'm already passing a seed to Zyphra, yet the audio still varies noticeably between clips.
One way to do this would be to run Whisper on the generated audio and use the SRT timestamps to figure out where the slides should transition. That means handling malformed SRT output and guessing which transcript text matches which slide. It would probably improve quality, but it would also require many more tokens and running Whisper locally for each video.
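A minimal sketch of the matching step, assuming Whisper-style segments (each with a start time and a text field). The function name and the greedy fuzzy match via difflib are my own illustration, not the project's actual code:

```python
import difflib

def slide_transition_times(segments, slide_texts):
    """Guess the timestamp at which each slide should appear.

    segments: Whisper-style list of {"start": float, "text": str}.
    slide_texts: the narration script for each slide, in order.

    Fuzzy matching is needed because transcripts rarely match the
    script verbatim; this greedy pass just picks the best-scoring
    remaining segment for each slide.
    """
    transitions = []
    seg_idx = 0  # never match earlier than the previous slide's segment
    for slide in slide_texts:
        best_idx, best_score = seg_idx, -1.0
        for i in range(seg_idx, len(segments)):
            score = difflib.SequenceMatcher(
                None, slide.lower(), segments[i]["text"].lower()
            ).ratio()
            if score > best_score:
                best_idx, best_score = i, score
        transitions.append(segments[best_idx]["start"])
        seg_idx = best_idx + 1  # enforce slide order
    return transitions
```

A real pipeline would also need to handle segments that match no slide at all, which is where the "guessing which text is the right match" problem shows up.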