Description
Hi, thank you for the great work on Amphion and VevoSing.
I am currently trying to reproduce the VevoSing inference pipeline using the provided pretrained models. During the AR inference stage (`inference_ar_and_fm`), I encountered a `RuntimeError: CUDA error: device-side assert triggered`.
After debugging, I found that the AR model generates out-of-bounds token IDs (specifically `17925`) in `predicted_coco_codecs`. When these tokens are passed to the Flow Matching model's embedding layer (`self.fmt_model.cond_emb`), which has an embedding size of 16384, the lookup indexes out of bounds.
Error Logs
1. The generated tensor containing abnormal tokens:

During `inference_ar_and_fm`, the `predicted_coco_codecs` tensor contains `17925`:

```
tensor([[16064, 9219, 3187, 9222, 12787, 4636, 4356, 5628, 5901, 13781,
9878, 3230, 782, 13419, 5683, 864, 13715, 2591, 39, 146,
2108, 9606, 9455, 16096, 714, 7614, 10896, 3992, 3992, 14441,
15921, 15736, 12926, 5172, 14970, 10007, 12979, 1559, 11735, 12023,
2871, 3636, 4928, 1025, 7544, **17925**, 4058, **17925**, 3923, 8141,
6793, 1936, 15592, 13071, 13634, 11571]], device='cuda:0')
```

2. Traceback:

```
pytorch/aten/src/ATen/native/cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [67,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  ...
  File "/mnt/data/menghao/Code/Amphion/models/svc/vevosing/infer_vevosing_ar.py", line 98, in vevosing_melody_control
    gen_audio = inference_pipeline.inference_ar_and_fm(
  File "/mnt/data/menghao/Code/Amphion/models/svc/vevosing/vevosing_utils.py", line 733, in inference_ar_and_fm
    diffusion_cond = self.fmt_model.cond_emb(diffusion_input_codecs)  # [1, T, D]
  File "...", line 190, in forward
    return F.embedding(
  File "...", line 2551, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
```
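For context, the failure mode reproduces outside the pipeline. Below is a toy stand-in for `fmt_model.cond_emb` (only the sizes are taken from the report; the layer itself is hypothetical). On CPU the same lookup raises a synchronous `IndexError` at the real call site, which is easier to debug than the asynchronous CUDA assert:

```python
import torch
import torch.nn as nn

# Toy stand-in for fmt_model.cond_emb: 16384 entries, small embedding dim.
cond_emb = nn.Embedding(16384, 8)

# A codec sequence containing the out-of-range special token 17925.
bad_codecs = torch.tensor([[7544, 17925, 4058]])

try:
    cond_emb(bad_codecs)  # CPU: synchronous IndexError at the lookup
except IndexError as exc:
    print("embedding lookup failed:", exc)

# On CUDA the same lookup instead surfaces later as the asynchronous
# "device-side assert triggered" seen in the traceback above.
```

Running the original pipeline with `CUDA_LAUNCH_BLOCKING=1` (or moving the tensors to CPU) similarly pins the error to the `cond_emb` call.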
Analysis
Based on `models/svc/autoregressive_transformer/ar_model.py`:

- `content_vocab_size` = 1024
- `style_vocab_size` = 512
- `content_style_vocab_size` = 16384
- `pad_token_id` = 1024 + 512 + 16384 = 17920
- `content_style_bos_token_id` = `pad_token_id` + 5 = 17925
It seems the AR model is generating the BOS special token (17925) in the middle of the sequence. However, the downstream Flow Matching model (fmt_model) expects inputs strictly within the range [0, 16384), representing the content-style codec codebook. It does not know how to handle the AR model's special tokens.
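As a stopgap on the Flow Matching side, the special tokens could be stripped or overwritten in `vevosing_utils.py` before the `cond_emb` lookup. A minimal sketch (a workaround, not a fix for the AR-side cause), assuming `predicted_coco_codecs` has shape `[1, T]`; both helper names are hypothetical:

```python
import torch

CONTENT_STYLE_VOCAB_SIZE = 16384  # valid ID range for fmt_model.cond_emb

def drop_special_tokens(codecs: torch.Tensor) -> torch.Tensor:
    """Remove AR special tokens (IDs >= 16384) from a [1, T] sequence.
    Note this shortens T, which is only safe if no other conditioning
    tensor is aligned to the sequence length."""
    keep = codecs < CONTENT_STYLE_VOCAB_SIZE
    return codecs[keep].unsqueeze(0)

def clamp_special_tokens(codecs: torch.Tensor, fill_id: int = 0) -> torch.Tensor:
    """Length-preserving alternative: overwrite special tokens with a
    placeholder ID (fill_id=0 is an arbitrary choice here)."""
    return torch.where(
        codecs < CONTENT_STYLE_VOCAB_SIZE,
        codecs,
        torch.full_like(codecs, fill_id),
    )
```

Either variant keeps the embedding lookup in range, but both paper over whatever makes the AR model emit `17925` in the first place.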
Questions
- Why does the AR model generate the BOS token (`17925`) in the middle of inference? Is this expected behavior for the VevoSing AR model, or does it indicate an issue with the input prompt/configuration?
- How should this be handled?
  - Should we apply a `LogitsProcessor` during `ar_model.generate` to mask out special tokens (indices >= 16384) so they are never sampled?
  - Or should we manually filter/clamp the `predicted_coco_codecs` in `vevosing_utils.py` before passing them to the Flow Matching model?
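On the first option: if `ar_model.generate` follows the HuggingFace `generate` API (an assumption; I have not verified the Amphion code path), a processor with the standard `(input_ids, scores)` signature could mask the special-token logits. `first_special_id` and the EOS allow-list are assumptions in this sketch:

```python
import torch

class MaskSpecialTokens:
    """Callable with the HuggingFace LogitsProcessor signature.
    Sets the logits of IDs >= first_special_id to -inf so they can never
    be sampled, except for an allow-list (e.g. the EOS token, which must
    stay samplable or generation can never terminate)."""

    def __init__(self, first_special_id: int = 16384, allowed_ids: tuple = ()):
        self.first_special_id = first_special_id
        self.allowed_ids = tuple(allowed_ids)

    def __call__(self, input_ids: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
        masked = scores.clone()
        masked[:, self.first_special_id:] = float("-inf")
        for tid in self.allowed_ids:
            masked[:, tid] = scores[:, tid]  # restore allowed special tokens
        return masked
```

If `generate` accepts a `logits_processor` argument, this could be passed via `transformers.LogitsProcessorList([MaskSpecialTokens(16384, allowed_ids=(eos_id,))])`, where `eos_id` is whatever EOS the AR model actually uses.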
Any insights or recommended fixes would be appreciated!