[MoE][SAC] Use deterministic ops in MoE routing #3146
songhappy wants to merge 1 commit into pytorch:main
Conversation
Two related changes that fix non-deterministic MoE routing under selective activation checkpointing (SAC) on backends where torch.histc / torch.topk are not guaranteed deterministic (e.g. XPU):

1. Replace torch.histc with torch.bincount in TokenChoiceTopKRouter and TokenReorderer. histc can produce different counts between the forward and recompute passes on some backends, while bincount is deterministic and functionally equivalent for integer expert indices.

2. Add aten.topk.default to the SAC save list. topk can also be non-deterministic on recompute on some backends. Saving its output (top-k scores + indices per token) is cheap and guarantees stable expert assignments across forward and recompute.

Both changes are no-ops on backends where these ops are already deterministic, and avoid silent gradient corruption on those that aren't.

Signed-off-by: guoqiong song <guoqiong.song@intel.com>
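A minimal sketch of change 1, with illustrative values (the tensor name and shapes are assumptions, not taken from the torchtitan code): for integer expert indices, torch.bincount yields the same per-expert token counts as torch.histc binned over [0, num_experts), but bincount is deterministic across forward and SAC recompute.

```python
import torch

num_experts = 4
# Illustrative flattened expert assignments, shape (num_tokens * top_k,)
selected_experts_indices = torch.tensor([0, 2, 2, 1, 3, 0, 2])

# Old: histc over float values, one unit-width bin per expert
counts_histc = torch.histc(
    selected_experts_indices.float(), bins=num_experts, min=0, max=num_experts
)

# New: bincount directly over the integer indices
counts_bincount = torch.bincount(selected_experts_indices, minlength=num_experts)

print(counts_histc)     # tensor([2., 1., 3., 1.])
print(counts_bincount)  # tensor([2, 1, 3, 1])
```

Note minlength=num_experts, which keeps trailing zero counts for experts that received no tokens, matching histc's fixed bin count.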
The following ciflow label(s) have been added but CI has not been triggered yet because the workflows are awaiting approval:

Once a maintainer approves the workflows (scroll to the bottom of the PR page), the corresponding CI jobs will be triggered automatically. Please ping one of the reviewers if you do not have access to approve and run workflows.
for the torch.histc -> bincount change, wondering could you measure speed / MFU before and after? another approach we've been seeing used is
# FlexAttention (torch.ops.higher_order.flex_attention is the same object)
torch._higher_order_ops.flex_attention,
torch.ops.aten.linear.default,
# topk can be non-deterministic on some backends; save to keep MoE
Wondering, did you try turning on torch determinism mode? Is torch.topk still non-deterministic even under determinism mode?
Wondering, is it possible to add this to the save list only for non-deterministic backends (e.g. XPU) instead of doing it for all backends?
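A hedged sketch of the backend-conditional approach this comment suggests. The names here are illustrative, not the actual torchtitan save list: the idea is simply to guard the aten.topk.default entry behind a backend probe so CUDA/CPU behavior is unchanged.

```python
import torch

# Illustrative base save list (assumed contents, not torchtitan's real one).
_base_save_list = {
    torch.ops.aten.mm.default,  # matmuls are commonly saved under SAC
}

def build_save_list():
    save_list = set(_base_save_list)
    # Hypothetical backend probe: only save topk output where its recompute
    # is known to be non-deterministic, e.g. on Intel XPU.
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        save_list.add(torch.ops.aten.topk.default)
    return save_list
```

On a CUDA or CPU box, build_save_list() returns the base list untouched, so this would keep the change scoped to the affected backend.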
MoE routing uses torch.histc and relies on torch.topk being recomputed under selective activation checkpointing. Neither op is deterministic on all backends (notably Intel XPU), so SAC recompute can produce different expert assignments than the original forward, silently corrupting gradients.
This PR:
Replaces torch.histc with torch.bincount in TokenChoiceTopKRouter and TokenReorderer — bincount is deterministic and equivalent for integer indices.
Adds aten.topk.default to the SAC save list so its output is reused on recompute.
Both changes are no-ops on backends where these ops are already deterministic (e.g. CUDA).
Changes
torchtitan/models/common/moe/moe.py: histc → bincount (×2)
torchtitan/distributed/activation_checkpoint.py: save aten.topk.default
tests/unit_tests/test_moe_routing.py: routing-count and recompute-consistency tests
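A sketch in the spirit of the recompute-consistency test (function and variable names are illustrative, not from the actual test file): gradients through a checkpointed top-k routing step should match the non-checkpointed reference, which only holds if topk picks the same elements when the region is recomputed on backward.

```python
import torch
from torch.utils.checkpoint import checkpoint

def route(scores):
    # Toy stand-in for top-k routing: pick 2 of 4 "experts" per token.
    top_scores, top_indices = torch.topk(scores, k=2, dim=-1)
    # Reduce through the selected scores so backward depends on the choice.
    return (top_scores ** 2).sum(), top_indices

torch.manual_seed(0)
base = torch.randn(8, 4)
a = base.clone().requires_grad_(True)
b = base.clone().requires_grad_(True)

loss_ref, idx_ref = route(a)
loss_ref.backward()

loss_ckpt, idx_ckpt = checkpoint(route, b, use_reentrant=False)
loss_ckpt.backward()  # recomputes route(); topk must repeat its choice

assert torch.equal(idx_ref, idx_ckpt)
torch.testing.assert_close(a.grad, b.grad)
```

On a backend where topk recompute can diverge, the gradient comparison is the assertion that would fail without saving topk's output.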
Testing
python -m unittest tests.unit_tests.test_moe_routing: 3/3 pass on CUDA and Intel XPU. DeepSeek-V3 training on CUDA/XPU now runs cleanly under SAC.

Risk
No API changes. bincount and histc produce identical integer counts for valid inputs (covered by tests). Saving topk adds negligible memory.
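A back-of-envelope check of the "negligible memory" claim, under assumed shapes that are not taken from the PR (8192 tokens per microbatch, top_k = 8, fp32 scores at 4 B plus int64 indices at 8 B per selection):

```python
# Assumed shapes for illustration only.
tokens, top_k = 8192, 8
bytes_saved = tokens * top_k * (4 + 8)  # 4 B score + 8 B index per selection
print(bytes_saved / 2**20)  # 0.75 (MiB per MoE layer)
```

Under these assumptions, well under a mebibyte per MoE layer, which is small next to the activations SAC already saves.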