Magpie TTS duplicates audio at end of generation #15300

@poptrb

Describe the bug

Magpie TTS intermittently glitches and duplicates audio at the end of a generation, sometimes several times, with no apparent pattern. The repetitions always have a different intonation from the original. For example, the text "Hello" might produce "Hello! ello!", "Hello! HELLOU!" (a different, shouting intonation), or "Helouloulouloulalala" (gibberish pronunciation, hallucinated words). I initially thought this only affected very short utterances, but it also occurs with longer ones, such as sentences of five or more words. As the text gets longer the issue appears less frequently, but it still occurs. In very rare cases the repeated audio is not the last segment but the second or third to last, still very close to the end.

Tests were conducted in the official nvcr.io/nvidia/nemo containers, versions 25.11.01 and 25.09, as suggested in another issue I opened.

Steps/Code to reproduce bug

Here's a minimal script for testing. I'm dropping into pdb to generate examples on demand.

import asyncio

from nemo.collections.tts.models import MagpieTTSModel
from loguru import logger
import soundfile as sf
import torch
import os
import time

async def load_model(model_id: str = "nvidia/magpie_tts_multilingual_357m"):
    """Load Magpie TTS model."""

    logger.info(f"Loading Magpie TTS model: {model_id}")

    def _load():

        hf_token = os.environ.get("HUGGINGFACE_ACCESS_TOKEN") or os.environ.get("HF_TOKEN")

        if hf_token:
            os.environ["HF_TOKEN"] = hf_token

        model = MagpieTTSModel.from_pretrained(model_id)
        model = model.cuda()
        model.eval()
        return model

    start = time.time()
    _model = await asyncio.to_thread(_load)
    elapsed = time.time() - start
    logger.info(f"Magpie TTS model loaded in {elapsed:.1f}s")

    return _model


async def main():
    model = await load_model()

    text = 'Hello! Hi! How are you?'

    def generate(text: str):
        with torch.no_grad():
            audio, audio_len = model.do_tts(
                text,
                language="en",
                speaker_index=2,
                apply_TN=False,
            )

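        # Move the waveform to the CPU and drop singleton dimensions before writing.
        audio_bytes = audio.float().detach().cpu().numpy()
        audio_bytes = audio_bytes.squeeze()
        sf.write("/app/debug/test.wav", audio_bytes, 22000)

    # Drop into pdb so generate() can be called interactively with different texts.
    import pdb; pdb.set_trace()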



if __name__ == "__main__":
    asyncio.run(main())

Then:

generate('Hi!')
# produces: Hi! HI! (second Hi different intonation, a bit cut off)
generate('Wow that\'s crazy!')
# produces: Wow that's crazy zy!
# or: Wow that's crazy crazy!
generate('Sure! Here are some interesting facts about GPUs')
# produces: Sure! Here are some interesting facts about GPUs about GPUsZszs' (repetition, gibberish)
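
Because the glitch is intermittent, one way to quantify it is to generate the same text several times and flag runs whose audio is noticeably longer than the typical run, since a duplicated ending lengthens the output. The helper below is only a sketch of that idea and is not part of NeMo: the function name, the 1.3 tolerance, and the example durations are all hypothetical. The durations are assumed to be measured in seconds, for instance by dividing the number of generated samples by the sample rate used when writing the file.

```python
import statistics


def flag_duration_outliers(durations_s, tolerance=1.3):
    """Return indices of runs whose duration exceeds tolerance x the median.

    durations_s: durations in seconds of repeated generations of the SAME text.
    tolerance:   hypothetical threshold; tune it for your own data.
    """
    median = statistics.median(durations_s)
    return [i for i, d in enumerate(durations_s) if d > tolerance * median]


# Hypothetical durations for five runs of the same prompt (not measured data).
example = [0.62, 0.60, 1.15, 0.63, 0.61]
print(flag_duration_outliers(example))
```

This only catches repetitions that lengthen the audio relative to most runs; if the majority of runs are affected, the median itself is inflated and listening to the files remains necessary.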

Expected behavior

Magpie TTS produces audio with no repetitions for short to medium utterances.

Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of NeMo install: from source (NeMo repository mounted inside the container). Tried branches master (527b8c4), v2.6.0, and v2.6.1, with no difference.
  • If method of install is [Docker], provide docker pull & docker run commands used:
docker pull nvcr.io/nvidia/nemo:25.11.01
docker pull nvcr.io/nvidia/nemo:25.09

Environment details

If NVIDIA docker image is used you don't need to specify these.
Otherwise, please provide:

  • OS version
  • PyTorch version
  • Python version

Additional context

Add any other context about the problem here.
Example: GPU model: RTX 3090
