[BUG]: VevoSing AR Inference Crash: Generated Token ID 17925 (BOS) Triggers CUDA OOB Error in Flow Matching #476

@Arain233

Description

Hi, thank you for the great work on Amphion and VevoSing.

I am currently trying to reproduce the VevoSing inference pipeline using the provided pretrained models. During the AR inference stage (inference_ar_and_fm), I encountered a RuntimeError: CUDA error: device-side assert triggered.

After debugging, I found that the AR model generates out-of-bounds token IDs (specifically 17925) in predicted_coco_codecs. When these tokens are passed to the Flow Matching model's embedding layer (self.fmt_model.cond_emb), which has only 16384 entries, the lookup goes out of bounds and triggers the device-side assert.

Error Logs

1. The generated tensor containing abnormal tokens:
During inference_ar_and_fm, the predicted_coco_codecs tensor contains 17925:

tensor([[16064,  9219,  3187,  9222, 12787,  4636,  4356,  5628,  5901, 13781,
          9878,  3230,   782, 13419,  5683,   864, 13715,  2591,    39,   146,
          2108,  9606,  9455, 16096,   714,  7614, 10896,  3992,  3992, 14441,
         15921, 15736, 12926,  5172, 14970, 10007, 12979,  1559, 11735, 12023,
          2871,  3636,  4928,  1025,  7544, **17925**,  4058, **17925**,  3923,  8141,
          6793,  1936, 15592, 13071, 13634, 11571]], device='cuda:0')

2. Traceback:

pytorch/aten/src/ATen/native/cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [67,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  ...
  File "/mnt/data/menghao/Code/Amphion/models/svc/vevosing/infer_vevosing_ar.py", line 98, in vevosing_melody_control
    gen_audio = inference_pipeline.inference_ar_and_fm(
  File "/mnt/data/menghao/Code/Amphion/models/svc/vevosing/vevosing_utils.py", line 733, in inference_ar_and_fm
    diffusion_cond = self.fmt_model.cond_emb(diffusion_input_codecs)  # [1, T, D]
  File "...", line 190, in forward
    return F.embedding(
  File "...", line 2551, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered

Analysis

Based on models/svc/autoregressive_transformer/ar_model.py:

  • content_vocab_size = 1024
  • style_vocab_size = 512
  • content_style_vocab_size = 16384
  • pad_token_id = 1024 + 512 + 16384 = 17920
  • content_style_bos_token_id = pad_token_id + 5 = 17925
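For reference, the offsets above can be checked directly; the constants below are copied from the values reported in this issue (as read from ar_model.py), not from the config itself:

```python
# Token ID layout as reported for models/svc/autoregressive_transformer/ar_model.py.
content_vocab_size = 1024
style_vocab_size = 512
content_style_vocab_size = 16384

# Special tokens sit above the three concatenated codebooks.
pad_token_id = content_vocab_size + style_vocab_size + content_style_vocab_size
content_style_bos_token_id = pad_token_id + 5

print(pad_token_id, content_style_bos_token_id)  # → 17920 17925
```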

It seems the AR model generates the BOS special token (17925) in the middle of the sequence. However, the downstream Flow Matching model (fmt_model) expects inputs strictly within [0, 16384), i.e. the content-style codec codebook, and has no way to handle the AR model's special tokens.
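The failure mode can be reproduced in isolation with plain PyTorch (run on CPU, where the error surfaces as a readable IndexError instead of a device-side assert); the embedding size mirrors the 16384-entry codebook described above:

```python
import torch
import torch.nn as nn

# Embedding sized like fmt_model.cond_emb (16384 codebook entries).
cond_emb = nn.Embedding(num_embeddings=16384, embedding_dim=8)

# 17925 >= 16384, so this index is out of bounds for the embedding table.
codecs = torch.tensor([[100, 17925, 200]])

try:
    cond_emb(codecs)
except IndexError as e:
    print("IndexError:", e)
```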

Questions

  1. Why does the AR model generate the BOS token (17925) in the middle of inference? Is this expected behavior for the VevoSing AR model, or does it indicate an issue with the input prompt/configuration?
  2. How should this be handled?
  • Should we apply a LogitsProcessor during ar_model.generate to mask out special tokens (indices >= 16384) so they are never sampled?
  • Or should we manually filter/clamp the predicted_coco_codecs in vevosing_utils.py before passing them to the Flow Matching model?
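As a sketch of the two options above (the function names and the exact integration points in ar_model.generate / vevosing_utils.py are assumptions, not the actual Amphion API): Option A masks special-token logits before sampling so IDs >= 16384 can never be drawn; Option B sanitizes the generated sequence afterward.

```python
import torch

SPECIAL_START = 16384  # first special-token ID, per the analysis above

def mask_special_tokens(logits: torch.Tensor) -> torch.Tensor:
    """Option A: set special-token logits to -inf before sampling."""
    logits = logits.clone()
    logits[..., SPECIAL_START:] = float("-inf")
    return logits

def filter_codecs(codecs: torch.Tensor) -> torch.Tensor:
    """Option B: drop any special-token IDs from a [1, T] sequence."""
    return codecs[codecs < SPECIAL_START].unsqueeze(0)

codecs = torch.tensor([[7544, 17925, 4058, 17925, 3923]])
print(filter_codecs(codecs))  # → tensor([[7544, 4058, 3923]])
```

Filtering (rather than clamping) avoids injecting an arbitrary valid codec where the BOS token appeared, at the cost of shortening the sequence; masking at generation time is the cleaner fix if the BOS emission is indeed unintended.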

Any insights or recommended fixes would be appreciated!


    Labels

    bug (Something isn't working)
