Description
Describe the bug
Magpie TTS intermittently glitches and generates duplicated audio at the end of a generation, sometimes several times over, with no apparent pattern. The repetitions always have a different intonation from the original. For example, the text "Hello" might produce "Hello! ello!", "Hello! HELLOU!" (a different, shouting intonation), or "Helouloulouloulalala" (gibberish pronunciation, hallucinated words). I initially thought this was a problem with very short utterances, but it also shows up for longer ones, such as sentences of five words or more. As the text gets longer the issue appears less frequently, but it still occurs. In very rare cases the repeated audio is not the last segment but the second or third to last, still very close to the end.
Tests were conducted in the official nvcr.io/nvidia/nemo containers, versions 25.11.01 and 25.09, as you suggested in another issue I opened.
Steps/Code to reproduce bug
Here's a minimal script for testing. I'm dropping into pdb to generate examples on demand.
import asyncio
import os
import time

import soundfile as sf
import torch
from loguru import logger
from nemo.collections.tts.models import MagpieTTSModel


async def load_model(model_id: str = "nvidia/magpie_tts_multilingual_357m"):
    """Load Magpie TTS model."""
    logger.info(f"Loading Magpie TTS model: {model_id}")

    def _load():
        # Accept either environment variable name for the Hugging Face token.
        hf_token = os.environ.get("HUGGINGFACE_ACCESS_TOKEN") or os.environ.get("HF_TOKEN")
        if hf_token:
            os.environ["HF_TOKEN"] = hf_token
        model = MagpieTTSModel.from_pretrained(model_id)
        model = model.cuda()
        model.eval()
        return model

    start = time.time()
    # Load in a worker thread so the event loop is not blocked.
    _model = await asyncio.to_thread(_load)
    elapsed = time.time() - start
    logger.info(f"Magpie TTS model loaded in {elapsed:.1f}s")
    return _model


async def main():
    model = await load_model()
    text = 'Hello! Hi! How are you?'

    def generate(text: str):
        with torch.no_grad():
            audio, audio_len = model.do_tts(
                text,
                language="en",
                speaker_index=2,
                apply_TN=False,
            )
        audio_bytes = audio.float().detach().cpu().numpy()
        audio_bytes = audio_bytes.squeeze()
        sf.write("/app/debug/test.wav", audio_bytes, 22000)

    # Drop into the debugger so generate(...) can be called on demand.
    import pdb; pdb.set_trace()


if __name__ == "__main__":
    asyncio.run(main())
Then:
generate('Hi!')
# produces: Hi! HI! (second Hi different intonation, a bit cut off)
generate('Wow that\'s crazy!')
# produces: Wow that's crazy zy!
# or: Wow that's crazy crazy!
generate('Sure! Here are some interesting facts about GPUs')
# produces: Sure! Here are some interesting facts about GPUs about GPUsZszs' (repetition, gibberish)
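Because the glitch appends extra audio, affected generations should come out noticeably longer than clean ones for the same text. A rough way to estimate how often it happens is to generate the same text many times, record each clip's duration (for example `len(audio_bytes) / sample_rate`), and flag the outliers. The sketch below is only an illustration of that idea: `flag_suspect_runs` and its `tolerance` threshold are hypothetical names and values, not part of NeMo, and the approach assumes that most runs are clean so the median is a usable baseline.

```python
from statistics import median


def flag_suspect_runs(durations_s, tolerance=1.3):
    """Return indices of runs whose duration exceeds tolerance * median.

    durations_s: clip durations in seconds for repeated generations of
                 the SAME input text.
    tolerance:   hypothetical threshold; tune it for your own data.
    """
    baseline = median(durations_s)
    return [i for i, d in enumerate(durations_s) if d > tolerance * baseline]


# Made-up durations for five generations of the same text.
durations = [1.0, 1.1, 1.0, 1.9, 1.05]
suspects = flag_suspect_runs(durations)
print(f"{len(suspects)}/{len(durations)} runs look suspiciously long: {suspects}")
```

Flagged clips still need to be checked by ear, since a long clip may only reflect slower pacing rather than a repetition.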
Expected behavior
Magpie TTS produces audio with no repetitions for short to medium utterances.
Environment overview (please complete the following information)
- Environment location: Docker
- Method of NeMo install: NeMo repository mounted inside the container. Tried branches master (527b8c4), v2.6.0, and v2.6.1, with no difference.
- If method of install is [Docker], provide docker pull & docker run commands used:
docker pull nvcr.io/nvidia/nemo:25.11.01
docker pull nvcr.io/nvidia/nemo:25.09
Additional context
GPU model: RTX 3090