Hi, I noticed the `load_model_weights` path fires `distribute_tensor(src_data_rank=0)` → `dist.scatter()` during startup, even though every rank already reads the full checkpoint from disk. Passing `src_data_rank=None` eliminates the scatter with zero correctness impact: I A/B tested across 5 models (Qwen2/Qwen3/MoE, 0.5B–14B) and got bit-for-bit identical losses.
The startup-time improvement scales with model size (−10% at 0.5B, −22% at 3B, −18% at MoE-14B). Details and fix below.
## What does this bug do?
In `_build_fsdp2_model`, `load_model_weights` is called with the default `dtensor_factory=distribute_tensor`. This calls `distribute_tensor(full_tensor, mesh, [Shard(0)])` with the default `src_data_rank=0`, which triggers:
```
load_model_weights()
  → _dispatch_parameter()
    → distribute_tensor(full_tensor, mesh, [Shard(0)])   # src_data_rank=0 (default)
      → Shard._shard_tensor()
        → mesh_scatter(output, chunks, group_src=0)      # ← fires dist.scatter()
```
However, `load_model_weights` already reads the full checkpoint on every rank independently; the existing log even says:

> "Every rank would read weights from disk and expect this to be slow!"
So every rank holds the complete tensor before `distribute_tensor` is called. The scatter from rank 0 is entirely redundant: rank 0 sends chunks to ranks that already have identical data locally.
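To make the redundancy concrete, here is a minimal standalone repro (not VeOmni code; launch with something like `torchrun --nproc-per-node=2 repro.py`). It assumes CUDA devices and a PyTorch build whose `distribute_tensor` accepts `src_data_rank`:

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
torch.cuda.set_device(local_rank)

# init_device_mesh sets up the default process group if it is not initialized yet.
mesh = init_device_mesh("cuda", (world_size,))

# Every rank builds the same full tensor, mimicking the every-rank-reads checkpoint path.
full = torch.arange(16, dtype=torch.float32, device="cuda").reshape(4, 4)

# Default path: rank 0 scatters chunks that the other ranks already hold.
dt_scatter = distribute_tensor(full, mesh, [Shard(0)])

# Proposed path: each rank slices its own chunk locally, no collective at all.
dt_local = distribute_tensor(full, mesh, [Shard(0)], src_data_rank=None)

assert torch.equal(dt_scatter.to_local(), dt_local.to_local())
dist.destroy_process_group()
```

Both calls produce the same local shard on every rank; the only difference is that the default path issues a `dist.scatter()` per tensor while the `src_data_rank=None` path issues no collective at all.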
## Impact
**On all backends:** wastes up to `(world_size - 1) / world_size × model_size` of network bandwidth per training startup, and the overhead grows with model size and parameter count:
| Model | Family | Scatter calls | Wire bytes wasted | Load time (DEFAULT → FIX) |
|---|---|---|---|---|
| Qwen2.5-0.5B | qwen2 dense | 290 | 0.988 GB | 0.936 s → 0.841 s (−10%) |
| Qwen2.5-1.5B | qwen2 dense | 338 | 3.087 GB | 2.746 s → 2.369 s (−14%) |
| Qwen3-0.6B | qwen3 dense | 310 | 1.192 GB | 1.261 s → 0.976 s (−23%) |
| Qwen2.5-3B | qwen2 dense | 434 | 6.172 GB | 5.832 s → 4.520 s (−22%) |
| Qwen1.5-MoE-A2.7B | qwen2_moe MoE | 4,659 | 28.632 GB | 22.4 s → 18.4 s (−18%) |
**On backends without P2P IPC support** (e.g. XCCL on PCIe): `dist.scatter()` causes a hard hang.
## Fix
Pass `src_data_rank=None` so PyTorch performs a local tensor split with zero communication. Per the PyTorch DTensor docs, when `src_data_rank=None` each rank slices its own local chunk, which is correct here because every rank already holds the full tensor.
In `veomni/distributed/torch_parallelize.py`, inside `_build_fsdp2_model`:
**BEFORE** (VeOmni 0.1.4):

```python
load_model_weights(model, weights_path, get_device_type(), dtensor_factory=distribute_tensor)
```

**AFTER:**

```python
import functools

# Every rank already read the full checkpoint from disk, so the scatter is redundant.
# src_data_rank=None → local split only, zero collective communication.
_dt_local_split = functools.partial(distribute_tensor, src_data_rank=None)
load_model_weights(model, weights_path, get_device_type(), dtensor_factory=_dt_local_split)
```
⚠️ Scope: this fix applies only to the `load_model_weights` (every-rank-reads) path. The `rank0_load_and_broadcast_weights` path should keep the default `src_data_rank=0`, since that path legitimately has only rank 0 reading the checkpoint.
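If both paths ever share a call site, one way to keep them consistent is to pick the factory per path. A minimal sketch under that assumption (the `reads_full_checkpoint_on_every_rank` flag is hypothetical, not an existing VeOmni variable):

```python
import functools

from torch.distributed.tensor import distribute_tensor

def pick_dtensor_factory(reads_full_checkpoint_on_every_rank: bool):
    # Hypothetical selector: only the every-rank-reads path skips the scatter.
    if reads_full_checkpoint_on_every_rank:
        # load_model_weights path: all ranks hold the full tensor, split locally.
        return functools.partial(distribute_tensor, src_data_rank=None)
    # rank0_load_and_broadcast_weights path: only rank 0 has real data, keep the scatter.
    return distribute_tensor
```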
## Correctness verification (A100 PCIe, 2-GPU)
A/B tested on 5 models across 2 architecture families, in DP=2 and SP=2 modes. Losses match bit-for-bit (shown here to 6 decimal places) between DEFAULT and FIX; a minimal diff sketch follows the table:
| Model | Step 1 | Step 2 | Step 3 | Step 4 | Step 5 | Match |
|---|---|---|---|---|---|---|
| Qwen2.5-0.5B (DP=2) | 2.515763 | 0.822851 | 3.626765 | 2.051105 | 0.154655 | ✅ |
| Qwen2.5-0.5B (SP=2) | 4.134645 | 1.723882 | 0.514100 | 0.124871 | 0.009825 | ✅ |
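The comparison boils down to a per-step equality check on the logged losses. A sketch of such a diff, assuming each run writes one loss per line to a text file (the file names below are placeholders, not VeOmni output paths):

```python
# Hypothetical loss-trace diff: one loss per line in each file, compared exactly.
def read_losses(path: str) -> list[float]:
    with open(path) as f:
        return [float(line) for line in f if line.strip()]

default_losses = read_losses("losses_default.txt")
fix_losses = read_losses("losses_src_data_rank_none.txt")

assert len(default_losses) == len(fix_losses), "runs logged different step counts"
for step, (a, b) in enumerate(zip(default_losses, fix_losses), start=1):
    assert a == b, f"step {step}: {a} != {b}"
print(f"all {len(fix_losses)} steps identical")
```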
## Why `src_data_rank=None` is the correct API here
From PyTorch's DTensor source, `Shard._shard_tensor` (simplified):
```python
if src_data_rank is None:
    # NO communication: local split only
    chunks = self._split_tensor(tensor, num_chunks, ...)
    return chunks[my_rank]  # ← local slice, zero network I/O
else:
    # COMMUNICATION: scatter from src_data_rank
    mesh_scatter(output, chunks, mesh, group_src=src_data_rank)  # ← fires dist.scatter()
```
`src_data_rank=None` is a first-class parameter of the public `distribute_tensor` API, explicitly designed for the case where every rank already has the full tensor. It is not a workaround.
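Because older PyTorch builds predate this parameter, the patch could guard on it before wiring in the partial. A hedged sketch, not existing VeOmni code:

```python
# Feature check: only apply the fix when the installed PyTorch's distribute_tensor
# actually accepts src_data_rank; otherwise fall back to the default behavior.
import functools
import inspect

from torch.distributed.tensor import distribute_tensor

if "src_data_rank" in inspect.signature(distribute_tensor).parameters:
    dtensor_factory = functools.partial(distribute_tensor, src_data_rank=None)
else:
    dtensor_factory = distribute_tensor  # older PyTorch: keep the scatter path
```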
## Cross-framework comparison
No other major training framework uses scatter for weight loading:
| Framework | Who reads from disk | How weights reach each rank | Uses scatter? |
|---|---|---|---|
| TorchTitan | Each rank reads its shard only (DCP) | Direct per-rank shard read | No |
| VERL FSDP | Rank 0 only | `set_model_state_dict(broadcast_from_rank0=True)` | No |
| Megatron-LM | Each rank reads its shard only (pre-sharded files) | Direct per-rank file read | No |
| VeOmni | Every rank reads the full model | `distribute_tensor(src_data_rank=0)` → scatter | Yes (this bug) |
VeOmni is unique in the "every rank reads the full checkpoint" pattern. For that pattern, `src_data_rank=None` is the correct API to use.