DeepSeek V3 16B crashes with 'tensor data not allocated' during backward with flex_attention + compile #3128

@acisseJZhong

Description

Bug

DeepSeek V3 16B training crashes during loss.backward() with RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet when using flex_attention. The crash occurs whether or not torch.compile is enabled.

Reproduction

# With compile (default config has compile=True, components=["loss"])
NCCL_NVLS_ENABLE=0 torchrun --nnodes 1 --nproc-per-node 8 -m torchtitan.train \
  --module deepseek_v3 --config deepseek_v3_16b \
  --parallelism.tensor_parallel_degree 8 \
  --parallelism.context_parallel_degree 1 \
  --parallelism.expert_parallel_degree 2 \
  --training.steps 10 \
  --dataloader.dataset c4_test

# Without compile (same error)
NCCL_NVLS_ENABLE=0 torchrun --nnodes 1 --nproc-per-node 8 -m torchtitan.train \
  --module deepseek_v3 --config deepseek_v3_16b \
  --parallelism.tensor_parallel_degree 8 \
  --parallelism.context_parallel_degree 1 \
  --parallelism.expert_parallel_degree 2 \
  --training.steps 10 \
  --dataloader.dataset c4_test \
  --compile.enable false

Works on commit: 73680eedb7a03635b246a598f3126ca3d945a710
Broken on commit: 786e26f8ee47ffecb523a661535e71031583ff60

Environment

  • 8x H100 GPUs
  • PyTorch: 2.13.0.dev20260417+cu126
  • torchao: nightly

Error

RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet.
If you're using torch.compile/export/fx, it is likely that we are erroneously tracing into a custom kernel.

The error occurs during loss.backward() on all ranks.

Status: Done