Description
Hi, thank you for the great work on Amphion and VevoSing.
I am currently trying to reproduce the VevoSing inference pipeline using the provided pretrained models. During the AR inference stage (`inference_ar_and_fm`), I encountered a `RuntimeError: CUDA error: device-side assert triggered`.
After debugging, I found that the AR model generates out-of-bounds token IDs (specifically `17925`) in `predicted_coco_codecs`. When these tokens are passed to the Flow Matching model's embedding layer (`self.fmt_model.cond_emb`), which has an embedding size of 16384, the lookup indexes out of bounds.
Error Logs
1. The generated tensor containing abnormal tokens:

During `inference_ar_and_fm`, the `predicted_coco_codecs` tensor contains `17925`:

```
tensor([[16064, 9219, 3187, 9222, 12787, 4636, 4356, 5628, 5901, 13781,
9878, 3230, 782, 13419, 5683, 864, 13715, 2591, 39, 146,
2108, 9606, 9455, 16096, 714, 7614, 10896, 3992, 3992, 14441,
15921, 15736, 12926, 5172, 14970, 10007, 12979, 1559, 11735, 12023,
2871, 3636, 4928, 1025, 7544, **17925**, 4058, **17925**, 3923, 8141,
6793, 1936, 15592, 13071, 13634, 11571]], device='cuda:0')
```

2. Traceback:

```
pytorch/aten/src/ATen/native/cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [67,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  ...
  File "/mnt/data/menghao/Code/Amphion/models/svc/vevosing/infer_vevosing_ar.py", line 98, in vevosing_melody_control
    gen_audio = inference_pipeline.inference_ar_and_fm(
  File "/mnt/data/menghao/Code/Amphion/models/svc/vevosing/vevosing_utils.py", line 733, in inference_ar_and_fm
    diffusion_cond = self.fmt_model.cond_emb(diffusion_input_codecs)  # [1, T, D]
  File "...", line 190, in forward
    return F.embedding(
  File "...", line 2551, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
```
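For context, the failure mode reproduces outside the pipeline. Below is a toy stand-in for `fmt_model.cond_emb` (only the sizes are taken from the report; the layer itself is hypothetical). On CPU the same lookup raises a synchronous `IndexError` at the real call site, which is easier to debug than the asynchronous CUDA assert:

```python
import torch
import torch.nn as nn

# Toy stand-in for fmt_model.cond_emb: 16384 entries, small embedding dim.
cond_emb = nn.Embedding(16384, 8)

# A codec sequence containing the out-of-range special token 17925.
bad_codecs = torch.tensor([[7544, 17925, 4058]])

try:
    cond_emb(bad_codecs)  # CPU: synchronous IndexError at the lookup
except IndexError as exc:
    print("embedding lookup failed:", exc)

# On CUDA the same lookup instead surfaces later as the asynchronous
# "device-side assert triggered" seen in the traceback above.
```

Running the original pipeline with `CUDA_LAUNCH_BLOCKING=1` (or moving the tensors to CPU) similarly pins the error to the `cond_emb` call.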
Analysis
Based on `models/svc/autoregressive_transformer/ar_model.py`:

- `content_vocab_size` = 1024
- `style_vocab_size` = 512
- `content_style_vocab_size` = 16384
- `pad_token_id` = 1024 + 512 + 16384 = 17920
- `content_style_bos_token_id` = `pad_token_id` + 5 = 17925
It seems the AR model is generating the BOS special token (17925) in the middle of the sequence. However, the downstream Flow Matching model (fmt_model) expects inputs strictly within the range [0, 16384), representing the content-style codec codebook. It does not know how to handle the AR model's special tokens.
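As a stopgap on the Flow Matching side, the special tokens could be stripped or overwritten in `vevosing_utils.py` before the `cond_emb` lookup. A minimal sketch (a workaround, not a fix for the AR-side cause), assuming `predicted_coco_codecs` has shape `[1, T]`; both helper names are hypothetical:

```python
import torch

CONTENT_STYLE_VOCAB_SIZE = 16384  # valid ID range for fmt_model.cond_emb

def drop_special_tokens(codecs: torch.Tensor) -> torch.Tensor:
    """Remove AR special tokens (IDs >= 16384) from a [1, T] sequence.
    Note this shortens T, which is only safe if no other conditioning
    tensor is aligned to the sequence length."""
    keep = codecs < CONTENT_STYLE_VOCAB_SIZE
    return codecs[keep].unsqueeze(0)

def clamp_special_tokens(codecs: torch.Tensor, fill_id: int = 0) -> torch.Tensor:
    """Length-preserving alternative: overwrite special tokens with a
    placeholder ID (fill_id=0 is an arbitrary choice here)."""
    return torch.where(
        codecs < CONTENT_STYLE_VOCAB_SIZE,
        codecs,
        torch.full_like(codecs, fill_id),
    )
```

Either variant keeps the embedding lookup in range, but both paper over whatever makes the AR model emit `17925` in the first place.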
Questions
- Why does the AR model generate the BOS token (`17925`) in the middle of inference? Is this expected behavior for the VevoSing AR model, or does it indicate an issue with the input prompt/configuration?
- How should this be handled?
  - Should we apply a `LogitsProcessor` during `ar_model.generate` to mask out special tokens (indices >= 16384) so they are never sampled?
  - Or should we manually filter/clamp the `predicted_coco_codecs` in `vevosing_utils.py` before passing them to the Flow Matching model?
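On the first option: if `ar_model.generate` follows the HuggingFace `generate` API (an assumption; I have not verified the Amphion code path), a processor with the standard `(input_ids, scores)` signature could mask the special-token logits. `first_special_id` and the EOS allow-list are assumptions in this sketch:

```python
import torch

class MaskSpecialTokens:
    """Callable with the HuggingFace LogitsProcessor signature.
    Sets the logits of IDs >= first_special_id to -inf so they can never
    be sampled, except for an allow-list (e.g. the EOS token, which must
    stay samplable or generation can never terminate)."""

    def __init__(self, first_special_id: int = 16384, allowed_ids: tuple = ()):
        self.first_special_id = first_special_id
        self.allowed_ids = tuple(allowed_ids)

    def __call__(self, input_ids: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
        masked = scores.clone()
        masked[:, self.first_special_id:] = float("-inf")
        for tid in self.allowed_ids:
            masked[:, tid] = scores[:, tid]  # restore allowed special tokens
        return masked
```

If `generate` accepts a `logits_processor` argument, this could be passed via `transformers.LogitsProcessorList([MaskSpecialTokens(16384, allowed_ids=(eos_id,))])`, where `eos_id` is whatever EOS the AR model actually uses.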
Any insights or recommended fixes would be appreciated!