## Bug

DeepSeek V3 16B training crashes during `loss.backward()` with `RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet` when using flex_attention. The crash occurs regardless of whether `torch.compile` is enabled.
## Reproduction

```shell
# With compile (default config has compile=True, components=["loss"])
NCCL_NVLS_ENABLE=0 torchrun --nnodes 1 --nproc-per-node 8 -m torchtitan.train \
  --module deepseek_v3 --config deepseek_v3_16b \
  --parallelism.tensor_parallel_degree 8 \
  --parallelism.context_parallel_degree 1 \
  --parallelism.expert_parallel_degree 2 \
  --training.steps 10 \
  --dataloader.dataset c4_test

# Without compile (same error)
NCCL_NVLS_ENABLE=0 torchrun --nnodes 1 --nproc-per-node 8 -m torchtitan.train \
  --module deepseek_v3 --config deepseek_v3_16b \
  --parallelism.tensor_parallel_degree 8 \
  --parallelism.context_parallel_degree 1 \
  --parallelism.expert_parallel_degree 2 \
  --training.steps 10 \
  --dataloader.dataset c4_test \
  --compile.enable false
```
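For triage, a minimal single-GPU sketch of the failing pattern may help isolate whether the regression is in the flex_attention backward itself, independent of torchtitan's TP/EP setup. This is a hypothetical reduction, not the actual torchtitan code path: the shapes and causal mask are placeholders rather than the DeepSeek V3 configuration, and it may not trigger the parallelism interaction.

```python
# Hypothetical minimal reduction (assumption: the failure lives in the
# flex_attention backward, not in the TP/EP wrapping). Single GPU only.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

def causal(b, h, q_idx, kv_idx):
    # Standard causal mask_mod: a query attends to itself and earlier keys.
    return q_idx >= kv_idx

B, H, S, D = 2, 8, 128, 64  # placeholder shapes, not the 16B model's
q, k, v = (torch.randn(B, H, S, D, device="cuda", requires_grad=True)
           for _ in range(3))
block_mask = create_block_mask(causal, B, H, S, S, device="cuda")

for compiled in (False, True):
    attn = torch.compile(flex_attention) if compiled else flex_attention
    out = attn(q, k, v, block_mask=block_mask)
    # The reported crash happens during backward on both code paths.
    out.sum().backward()
    print(f"compiled={compiled}: backward OK")
```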
- Works on commit: `73680eedb7a03635b246a598f3126ca3d945a710`
- Broken on commit: `786e26f8ee47ffecb523a661535e71031583ff60`
## Environment

- 8x H100 GPUs
- PyTorch: `2.13.0.dev20260417+cu126`
- torchao: nightly
## Error

```
RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet.
If you're using torch.compile/export/fx, it is likely that we are erroneously tracing into a custom kernel.
```

The error occurs during `loss.backward()` on all ranks.
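Since the error also reproduces with compile disabled, autograd anomaly detection may help pinpoint which forward op produced the failing backward node. A sketch with a stand-in model and loss (substitute the real training step); in the actual run it should point at the flex_attention node:

```python
# Stand-in model/loss, not the torchtitan training step. With anomaly
# detection enabled, the RuntimeError raised in backward is accompanied by
# the forward-pass stack trace of the op whose gradient computation failed.
import torch

model = torch.nn.Linear(16, 16, device="cuda")
loss = model(torch.randn(4, 16, device="cuda")).sum()

with torch.autograd.detect_anomaly():
    loss.backward()
```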