Add Support for Nemotron Diffusion #3105
…e model utilities

- Updated README to include DMinistral and its features.
- Added new modules for DMinistral3 model conversion and utilities, including `DMinistral3Bridge`, `DMinistral3ModelProvider`, and `FlexDotProductAttention`.
- Introduced `dgpt_step.py` for training steps specific to diffusion language models.
- Implemented masking and attention mechanisms tailored for diffusion models in `dllm.py` and `flex_dot_product_attention.py`.
- Established directory structure for DMinistral components and updated conversion bridges accordingly.

Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
Introduces the full Nemotron Diffusion recipe, including:

- AR-to-DLM conversion pipeline for 3B and 8B models
- Continuous pretraining (CPT) support
- `NemotronDiffusionAttention` (renamed from `FlexDotProductAttention`)
- Inference support with KV cache and lm-eval integration
- HF bridge conversion for Nemotron Diffusion checkpoints
- Training documentation and example configs

Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
…sion modules

Clean up formatting and linting issues in AR-to-DLM conversion, continuous pretraining, evaluation, model inference, attention, and checkpoint conversion components.

Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
- Fix ar_to_dlm key recipe path in README - Update data-paths examples to point to DCLM dataset - Fix tokenizer_model to use HF model ID instead of local path Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
Force-pushed from `d60258a` to `bbf3a50`.
📝 Walkthrough

Adds comprehensive NemotronDiffusion support to Megatron's diffusion bridge, including the training pipeline (continuous pretraining and AR-to-DLM conversion), inference with dual AR/diffusion generation modes, specialized attention and loss computation for diffusion language models, the model provider architecture, HuggingFace↔Megatron checkpoint conversion, and supporting documentation and configurations.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as Training Entry
    participant ConfigBuilder as Config Builder
    participant Bridge as AutoBridge
    participant Step as DGPTStep
    participant Model as NemotronDiffusion
    User->>ConfigBuilder: nemotron_diffusion3_pretrain_config()
    ConfigBuilder->>Bridge: AutoBridge.from_hf_pretrained()
    Bridge->>ConfigBuilder: NemotronDiffusionModelProvider
    ConfigBuilder->>User: ConfigContainer
    User->>Step: forward_step_func=DGPTStep()
    User->>Model: pretrain(config, forward_step)
    Model->>Step: __call__(state, data_iterator)
    Step->>Step: forward_process_simple_masking()
    Note over Step: Create [noisy | clean] concat
    Step->>Model: forward([noisy | clean])
    Model->>Model: NemotronDiffusionAttention
    Model->>Step: logits
    Step->>Step: Split logits, compute losses
    Step->>Step: combine(dlm_loss, ar_loss)
    Step->>Model: return loss
```
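The `[noisy | clean]` construction in the training flow above can be sketched in plain Python. This is a minimal illustration under assumptions (a uniform per-token mask probability, list-based batches); the repository's `forward_process_simple_masking` and `DGPTStep` operate on tensors and differ in detail:

```python
import random

def forward_mask(input_ids, mask_token_id, p_mask=0.5, rng=None):
    """Replace each token with mask_token_id with probability p_mask.

    Returns the noisy sequence and the boolean mask of corrupted positions.
    """
    rng = rng or random.Random(0)
    masked = [rng.random() < p_mask for _ in input_ids]
    noisy = [mask_token_id if m else t for t, m in zip(input_ids, masked)]
    return noisy, masked

def build_training_batch(input_ids, mask_token_id, p_mask=0.5):
    """Concatenate [noisy | clean] so a single forward pass serves both the
    DLM (denoising) loss on the noisy half and the AR loss on the clean half."""
    noisy, masked = forward_mask(input_ids, mask_token_id, p_mask)
    return noisy + list(input_ids), masked
```

With `p_mask=1.0` every position in the noisy half becomes the mask token; the clean half is always the original sequence, which is what lets the step split the logits and compute the two losses separately.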
```mermaid
sequenceDiagram
    participant App as Evaluation
    participant Model as MegatronDLLM
    participant Prompt as Prompt Input
    participant Gen as Generator
    participant Inference as Inference Utils
    App->>Model: register_model("megatron_dllm")
    App->>Model: __init__(eval_mode="dllm"|"ar")
    Model->>Inference: set_tp_group(tp_rank_group)
    Prompt->>Gen: generate(prompt_ids)
    alt eval_mode == "ar"
        Gen->>Inference: generate_ar(model, prompt)
        Inference->>Model: iterative forward + sample
    else eval_mode == "dllm"
        Gen->>Inference: generate_dllm(model, prompt)
        Inference->>Model: denoise blocks iteratively
        Inference->>Inference: get_transfer_index() select masked positions
        Inference->>Model: update KV cache per block
    end
    Model->>App: generated_tokens
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed (2 warnings)
Actionable comments posted: 17
Note: Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
🟡 Minor comments (5)
`src/megatron/bridge/diffusion/recipes/nemotron_diffusion/continuous_pretraining.py` (1)

114-118: ⚠️ Potential issue | 🟡 Minor: Docstring/code mismatch for 8B config.
The docstring states "Default: TP=2, MBS=1" but the code sets `tensor_model_parallel_size=4`.

📝 Proposed fix
```diff
 def nemotron_diffusion_8b_finetune_config(
     data_paths=None,
     data_args_path=None,
     hf_path=None,
     peft=None,
 ) -> ConfigContainer:
-    """Return a CPT config for Nemotron-Diffusion 8B. Default: TP=2, MBS=1."""
+    """Return a CPT config for Nemotron-Diffusion 8B. Default: TP=4, MBS=1."""
     return _nemotron_diffusion_cpt_config(
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/diffusion/recipes/nemotron_diffusion/continuous_pretraining.py` around lines 114-118: the docstring claims the default is TP=2, MBS=1 but the call to `_nemotron_diffusion_cpt_config` sets `tensor_model_parallel_size=4`; change `tensor_model_parallel_size` to 2 (or alternatively update the docstring to match 4) to keep the docstring and code consistent.

`src/megatron/bridge/diffusion/recipes/nemotron_diffusion/continuous_pretraining.py` (1)

97-102: ⚠️ Potential issue | 🟡 Minor: Docstring/code mismatch for 3B config.
The docstring states "Default: TP=1, MBS=2" but the code sets `micro_batch_size=1`.

📝 Proposed fix
```diff
 def nemotron_diffusion_3b_finetune_config(
     data_paths=None,
     data_args_path=None,
     hf_path=None,
     peft=None,
 ) -> ConfigContainer:
-    """Return a CPT config for Nemotron-Diffusion 3B. Default: TP=1, MBS=2."""
+    """Return a CPT config for Nemotron-Diffusion 3B. Default: TP=1, MBS=1."""
     return _nemotron_diffusion_cpt_config(
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/diffusion/recipes/nemotron_diffusion/continuous_pretraining.py` around lines 97-102: the docstring for the 3B config says "Default: TP=1, MBS=2" but the call to `_nemotron_diffusion_cpt_config` sets `micro_batch_size=1`; set `micro_batch_size=2` so the code matches the docstring (or alternatively update the docstring if 1 is intended).

`examples/diffusion/recipes/nemotron_diffusion/conf/ar_to_dlm_3b_dlm.yaml` (1)

1-3: ⚠️ Potential issue | 🟡 Minor: Update the header to the new entrypoint name.
This still points users at `pretrain_dlm_3b.py`, but the new workflow in this PR uses `ar_to_dlm.py`. Leaving the old name here will misdirect anyone starting from the config file.

💡 Proposed fix
```diff
-# Use with pretrain_dlm_3b.py
+# Use with examples/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/diffusion/recipes/nemotron_diffusion/conf/ar_to_dlm_3b_dlm.yaml` around lines 1-3: the config header incorrectly references the old entrypoint filename `pretrain_dlm_3b.py`; change the header comment to "Use with ar_to_dlm.py" so users are directed to the current workflow, and verify there are no other occurrences of `pretrain_dlm_3b.py` in the file (such as example commands or docs within the same YAML).

`src/megatron/bridge/diffusion/common/dllm.py` (1)
42-47: ⚠️ Potential issue | 🟡 Minor: Keep `p_mask` consistent with ignored tokens.

Once `loss_mask` forces a token to stay unmasked, its effective mask probability becomes 0, but the returned `p_mask` still reports the sampled value. That makes the two outputs disagree for ignored positions.

💡 Proposed fix
```diff
 if loss_mask is not None:
     masked_indices[loss_mask == 0] = 0
+    p_mask = p_mask.masked_fill(loss_mask == 0, 0)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/diffusion/common/dllm.py` around lines 42-47: the returned `p_mask` must be updated to reflect tokens forced unmasked by `loss_mask`; after applying `loss_mask` to `masked_indices` (the line `masked_indices[loss_mask == 0] = 0`), also zero out `p_mask` at the same positions (e.g., `p_mask[loss_mask == 0] = 0` or `p_mask = p_mask * loss_mask`, taking care to preserve tensor dtype/device) so `p_mask` matches the effective mask probability for ignored tokens. This keeps `noisy_batch`, `masked_indices`, and `p_mask` consistent.

`tests/unit_tests/diffusion/model/nemotron_diffusion/test_dgpt_step.py` (1)
171-171: ⚠️ Potential issue | 🟡 Minor: Prefix unused unpacked variables with `_`.

Ruff detects multiple RUF059 violations in this test module:

- Line 171: `masked_indices` and `p_mask` are unused
- Line 257: `num_tokens` and `report` are unused
- Line 565: `num_tokens` and `report` are unused

Prefix each unused variable with an underscore (e.g., `_masked_indices`, `_num_tokens`, `_report`) to resolve these linting violations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit_tests/diffusion/model/nemotron_diffusion/test_dgpt_step.py` at line 171, The unpacked return values from step._apply_noise and other test calls include variables that are unused; rename the unused locals by prefixing them with an underscore to satisfy Ruff RUF059 (e.g., change masked_indices and p_mask returned from step._apply_noise to _masked_indices and _p_mask, and rename the unused num_tokens and report variables at the other test locations to _num_tokens and _report). Locate the unpack in the test function using step._apply_noise and any other assertions/assignments at the reported spots and update only the variable names to their underscored equivalents so the tests' behavior remains unchanged while resolving the lint warnings.
🧹 Nitpick comments (5)
tests/unit_tests/diffusion/model/nemotron_diffusion/test_nemotron_diffusion_attention.py (2)
186-190: Unused variable `k_e` should be prefixed with underscore.

Static analysis flagged that `k_e` is unpacked but never used. Prefix with underscore to indicate intentional non-use.

🔧 Proposed fix
```diff
-    q_e, k_e = apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=0)
+    q_e, _k_e = apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=0)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit_tests/diffusion/model/nemotron_diffusion/test_nemotron_diffusion_attention.py` around lines 186 - 190, The test unpacks two outputs from apply_rotary_pos_emb into q_e and k_e but never uses k_e; change the unpacked variable to _k_e (or prefix it with an underscore) in the test where apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=0) is called so static analysis recognizes the unused value is intentional and keeps the rest of the assertions (q_e checks) unchanged.
386-391: Use explicit `int | None` type annotation.

PEP 484 prohibits implicit `Optional`. When a parameter defaults to `None`, the type annotation should explicitly include `None`.

🔧 Proposed fix
```diff
-    def _qkv(self, sq: int = None):
+    def _qkv(self, sq: int | None = None):
         sq = sq or self.SEQ_LEN
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit_tests/diffusion/model/nemotron_diffusion/test_nemotron_diffusion_attention.py` around lines 386-391: the `_qkv` helper declares a parameter defaulting to `None` without an explicit None-bearing type; update its signature to explicitly allow `None` (e.g., change the annotation of `sq` to `int | None = None` or `Optional[int] = None`), leaving the body and return unchanged.

`examples/diffusion/recipes/nemotron_diffusion/conf/cpt_3b.yaml` (1)
35-39: Inconsistent warmup configuration: `lr_warmup_fraction` vs `lr_warmup_iters`.

Setting `lr_warmup_fraction: 0.01` suggests a 1% warmup period, but `lr_warmup_iters: 0` explicitly disables warmup iterations. This is potentially confusing: if warmup is intended, `lr_warmup_iters` should be computed from the fraction; if no warmup is intended, `lr_warmup_fraction` should be 0 or removed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/diffusion/recipes/nemotron_diffusion/conf/cpt_3b.yaml` around lines 35-39: the scheduler warmup settings are inconsistent (`lr_warmup_fraction: 0.01` while `lr_warmup_iters: 0`); decide the intended behavior and make them consistent by either (A) keeping a fractional warmup and computing `lr_warmup_iters` from it (e.g., `lr_warmup_iters = int(total_training_iters * lr_warmup_fraction)` wherever the scheduler config is assembled), or (B) disabling warmup by setting `lr_warmup_fraction: 0` (or removing it) and leaving `lr_warmup_iters: 0`, so only one source of truth is used.

`examples/diffusion/recipes/nemotron_diffusion/conf/cpt_8b.yaml` (1)
35-39: Same warmup configuration inconsistency as `cpt_3b.yaml`.

`lr_warmup_fraction: 0.01` conflicts with `lr_warmup_iters: 0`. Consider aligning these across all CPT configs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/diffusion/recipes/nemotron_diffusion/conf/cpt_8b.yaml` around lines 35-39: in the scheduler block, align the warmup settings so they aren't contradictory (either set `lr_warmup_fraction` to 0 if `lr_warmup_iters` should be 0, or set `lr_warmup_iters` to the appropriate non-zero value derived from `lr_warmup_fraction`), matching the approach used in `cpt_3b.yaml` for consistency across CPT configs.

`examples/diffusion/recipes/nemotron_diffusion/conf/cpt_14b.yaml` (1)
35-39: Same warmup configuration inconsistency as other CPT configs.

Consider standardizing the warmup configuration across all CPT YAML files.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/diffusion/recipes/nemotron_diffusion/conf/cpt_14b.yaml` around lines 35 - 39, The scheduler block in this YAML uses inconsistent warmup fields (lr_warmup_fraction: 0.01 and lr_warmup_iters: 0) compared to other CPT configs; update the scheduler keys (lr_warmup_fraction and lr_warmup_iters) to the canonical values used across CPT YAMLs so the warmup behavior is identical, keeping lr_decay_style and lr_wsd_decay_iters as-is; locate the scheduler block (symbols: scheduler, lr_warmup_fraction, lr_warmup_iters, lr_decay_style, lr_wsd_decay_iters) and copy the standard warmup values from any other CPT config into this file.
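As the warmup comments above suggest, one way to keep a single source of truth is to derive the iteration count at config-assembly time. A minimal sketch; `resolve_warmup_iters` is a hypothetical helper, not part of the recipes:

```python
def resolve_warmup_iters(train_iters, lr_warmup_fraction=0.0, lr_warmup_iters=0):
    """Use the explicit iteration count when given; otherwise derive it
    from the fraction, so the two fields can never contradict each other."""
    if lr_warmup_iters > 0:
        return lr_warmup_iters
    return int(train_iters * lr_warmup_fraction)
```

With this convention, `lr_warmup_fraction: 0.01` and `lr_warmup_iters: 0` would mean "1% of training iterations", while a non-zero `lr_warmup_iters` always wins.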
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py`:
- Around line 39-41: The CLI's --model-size is ignored because main() always
calls nemotron_diffusion3_pretrain_config(...) regardless of args.model_size;
update main() (and any config construction calls around the existing uses at
59-69 and 112-119) to thread args.model_size into config creation by either (a)
mapping values "3B"/"8B"/"14B" to distinct factory functions (e.g.,
nemotron_diffusion3_pretrain_config, nemotron_diffusion8_pretrain_config,
nemotron_diffusion14_pretrain_config) and selecting the correct one based on
args.model_size, or (b) extend nemotron_diffusion3_pretrain_config to accept a
model_size parameter and use that inside the function to build the appropriate
config; ensure every place that previously called
nemotron_diffusion3_pretrain_config now uses the selected factory or passes
args.model_size so the generated run config actually reflects the CLI flag.
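Option (a) from the prompt above can be sketched as a small dispatch table. The stub factories below stand in for the real per-size config builders (only `nemotron_diffusion3_pretrain_config` exists in the PR; the 8B/14B factory names are assumptions):

```python
# Stub factories standing in for the real per-size config builders.
def nemotron_diffusion3_pretrain_config(**kw):
    return {"size": "3B", **kw}

def nemotron_diffusion8_pretrain_config(**kw):
    return {"size": "8B", **kw}

def nemotron_diffusion14_pretrain_config(**kw):
    return {"size": "14B", **kw}

def build_pretrain_config(model_size, **kwargs):
    """Dispatch the CLI --model-size value to the matching config factory,
    so the flag actually affects the generated run config."""
    factories = {
        "3B": nemotron_diffusion3_pretrain_config,
        "8B": nemotron_diffusion8_pretrain_config,
        "14B": nemotron_diffusion14_pretrain_config,
    }
    try:
        factory = factories[model_size]
    except KeyError:
        raise ValueError(
            f"Unsupported --model-size {model_size!r}; "
            f"expected one of {sorted(factories)}"
        ) from None
    return factory(**kwargs)
```

Every call site that previously hardcoded the 3B factory would go through `build_pretrain_config(args.model_size, ...)` instead.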
In `@examples/diffusion/recipes/nemotron_diffusion/eval_megatron.py`:
- Around line 494-507: The per-rank JSON read-modify-write in _log_nfe (and
similarly _log_latency) races in distributed runs; modify these functions to
avoid concurrent writes by either: 1) gathering nfe/latency values from all
ranks to rank 0 (use the existing distributed backend utilities) and have only
rank 0 perform the read-modify-write to self.nfe_log_path/self.latency_log_path,
or 2) write per-rank files (e.g., append rank id to the filenames) so each rank
writes independently and a separate merge step combines them later; update
_log_nfe and _log_latency to implement one of these two strategies and ensure
you only perform json.load/json.dump on a single writer to prevent corrupted
JSON.
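Strategy 2 (per-rank files plus a merge step) can be sketched with the standard library alone; the file naming and record shape here are illustrative, not the real `_log_nfe` format:

```python
import glob
import json
import os

def log_metric(log_dir, rank, record):
    """Each rank appends to its own file, so no two processes ever
    read-modify-write the same JSON concurrently."""
    path = os.path.join(log_dir, f"nfe_rank{rank}.json")
    entries = []
    if os.path.exists(path):
        with open(path) as f:
            entries = json.load(f)
    entries.append(record)
    with open(path, "w") as f:
        json.dump(entries, f)

def merge_logs(log_dir):
    """Post-hoc merge of the per-rank files into a single list."""
    merged = []
    for path in sorted(glob.glob(os.path.join(log_dir, "nfe_rank*.json"))):
        with open(path) as f:
            merged.extend(json.load(f))
    return merged
```

The alternative (gather to rank 0 and have a single writer) avoids the merge step but requires a collective; the per-rank-file variant needs no communication at logging time.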
- Around line 469-492: The code currently collapses a batched generation by
taking only tokenizer.batch_decode(...)[0], so generated_tokens, tokenized_out
and generate_until drop all but the first item and misapply per-request stop
logic (e.g., gen_args[0]["until"]) for batch_size>1; update the logic in the
decode/return path (referencing generated_tokens, tokenized_out,
tokenizer.batch_decode and generate_until/gen_args) to decode the entire batch
into a list of strings (one per batch item), preserve None->"" handling per
element, and return that list so downstream per-item until/stop processing and
latency/accounting remain aligned with each input; apply the same change to the
analogous block around lines 543-551.
- Around line 179-180: Validate tensor-parallel (tp) and pipeline-parallel (pp)
values immediately after tp = int(tp) and pp = int(pp): ensure tp > 0 and pp >
0, and ensure world_size is divisible by tp * pp (so DP size = world_size //
(tp*pp) is an integer); if any check fails, raise a clear ValueError (or log and
sys.exit) with a message mentioning tp, pp and world_size before any
process-group construction (the sections that build DP/TP/PP groups around the
global_rank/world_size logic in the subsequent block lines 189-227 and 233-235).
This prevents tp==0 (which causes global_rank % tp crashes) and non-divisible
world sizes that would create incomplete/hanging collectives.
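The suggested fail-fast validation might look like this (a sketch; the real code would call it right after parsing `tp`/`pp` and before building any DP/TP/PP process groups):

```python
def validate_parallelism(world_size, tp, pp):
    """Fail fast before any process groups are built.

    Returns the data-parallel size when the configuration is valid.
    """
    if tp <= 0 or pp <= 0:
        raise ValueError(f"tp and pp must be positive, got tp={tp}, pp={pp}")
    if world_size % (tp * pp) != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp={tp * pp}; "
            "data-parallel size would not be an integer"
        )
    return world_size // (tp * pp)  # data-parallel size
```

Raising here turns a cryptic `global_rank % tp` crash or a hanging collective into an immediate, readable error.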
- Around line 608-614: Replace the current dummy-return behavior in the methods
loglikelihood and loglikelihood_rolling: instead of logging a warning and
returning constant zeros, raise a descriptive NotImplementedError explaining
that likelihood evaluation is not supported for diffusion models; update both
functions (loglikelihood and loglikelihood_rolling) to raise that error (include
a message referencing the diffusion model limitation) so callers fail fast
rather than receiving fabricated scores.
In
`@src/megatron/bridge/diffusion/conversion/nemotron_diffusion/nemotron_diffusion_bridge.py`:
- Around line 46-56: The bridge decorator is being evaluated even when
Mistral3ForConditionalGeneration is None, causing an AttributeError during
import; guard the registration by only defining/decorating
NemotronDiffusionBridge when HAS_MISTRAL3 is True: wrap the
MegatronModelBridge.register_bridge(...) decorator and the
NemotronDiffusionBridge class definition inside an if HAS_MISTRAL3: block so the
decorator call (and any use of source.__name__) is skipped when the optional
dependency is unavailable.
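The guarded-registration pattern can be sketched as follows; the registry and decorator here are stand-ins for the real `MegatronModelBridge.register_bridge` API, not its actual signature:

```python
# Guarding a decorator-based registration behind an optional import.
try:
    from transformers import Mistral3ForConditionalGeneration  # optional dep
    HAS_MISTRAL3 = True
except ImportError:
    Mistral3ForConditionalGeneration = None
    HAS_MISTRAL3 = False

BRIDGE_REGISTRY = {}

def register_bridge(source):
    def decorator(cls):
        # source.__name__ would raise AttributeError if source were None,
        # which is exactly why the decorator must not run in that case.
        BRIDGE_REGISTRY[source.__name__] = cls
        return cls
    return decorator

if HAS_MISTRAL3:
    @register_bridge(source=Mistral3ForConditionalGeneration)
    class NemotronDiffusionBridge:
        pass
```

Because the decorator call sits inside the `if HAS_MISTRAL3:` block, importing the module never touches `source.__name__` when the optional dependency is missing.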
In `@src/megatron/bridge/diffusion/models/common/dgpt_step.py`:
- Around line 261-277: The divide_by_masked_tokens flag is threaded into
_create_loss_function_sbd but never consumed, so update
_masked_loss_sbd_block_diff to accept a divide_by_masked_tokens parameter and,
when true, compute the number of masked tokens from the provided loss_mask
(avoid zero by max(count, 1)) and divide the DLM loss term by that count (leave
AR loss behavior unchanged unless desired); ensure the function signature name
_masked_loss_sbd_block_diff includes divide_by_masked_tokens and that any other
factory/usage sites referenced in the diff region (the other create helper
around lines 280-327) are similarly wired to pass the flag through.
- Around line 63-67: DGPTStep currently assumes pipeline parallelism of 1 and
will hit None.clone() when PP>1; add a fail-fast guard in DGPTStep.__call__ (or
at start of get_batch_from_iterator) that checks pipeline model parallel world
size via parallel_state.get_pipeline_model_parallel_world_size() (or equivalent)
and raises NotImplementedError with a clear message like "DGPTStep does not
support pipeline parallelism > 1" when the size > 1; reference
DGPTStep.__call__, get_batch_from_iterator, and _apply_noise to ensure the error
triggers before any cloning or calls to _apply_noise so stages that only
materialize tokens or labels will not access None.
- Around line 294-327: The current validation only checks dlm_loss but the final
objective is loss = dlm_loss * dlm_loss_weight + ar_loss * ar_loss_weight, so
add the same rerun_state_machine.validate_result checks (NaN, Inf, and spiky via
is_unexpectedly_large) on the combined objective tensor (e.g., validate the
computed loss tensor before it is returned/used) using the same
messages/tolerance/fatal flags; locate get_rerun_state_machine(), dlm_loss,
ar_loss, dlm_loss_weight, ar_loss_weight, rerun_state_machine.validate_result
and the loss assignment to perform the validations on that combined loss (keep
existing per-branch semantics: fatal for NaN/Inf, non-fatal for spiky).
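The idea of validating the combined objective, not just `dlm_loss`, can be sketched on scalars; the real code would run `rerun_state_machine.validate_result` on tensors, and the `FloatingPointError` here is illustrative:

```python
import math

def validate_loss(loss, name="loss"):
    """Fatal checks mirroring the NaN/Inf validation, returning the value
    unchanged when it is finite."""
    if math.isnan(loss):
        raise FloatingPointError(f"{name} is NaN")
    if math.isinf(loss):
        raise FloatingPointError(f"{name} is Inf")
    return loss

def combine_losses(dlm_loss, ar_loss, dlm_loss_weight=1.0, ar_loss_weight=1.0):
    """Validate the per-branch loss and the final weighted objective."""
    validate_loss(dlm_loss, "dlm_loss")
    loss = dlm_loss * dlm_loss_weight + ar_loss * ar_loss_weight
    return validate_loss(loss, "combined loss")
```

Checking the combined value catches the case where `ar_loss` alone carries the NaN/Inf that the dlm-only validation would miss.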
In `@src/megatron/bridge/diffusion/models/common/nemotron_diffusion_attention.py`:
- Around line 193-206: The RoPE initialization is currently nested inside
apply_llama4_style_query_key_layer_scaling causing self.rope_embedding_module,
self.beta and self.max_position_embeddings to be undefined when that flag is
false; refactor the constructor so that creation of the rotary module
(Ministral3RotaryEmbedding) and assignment of self.rope_embedding_module and
self.max_position_embeddings always happens (using a suitable hf_text_config or
default), and move only the extra Llama-4 specific assignments (self.beta and
modifying rope_parameters["factor"] / original_max_position_embeddings) into the
apply_llama4_style_query_key_layer_scaling branch; then update any
attention/forward code that multiplies queries by the Llama-4 scale to guard
that multiplication with if self.beta is not None so scaling is applied only
when initialized.
In
`@src/megatron/bridge/diffusion/models/nemotron_diffusion/inference_nemotron_diffusion.py`:
- Around line 237-248: The current token loop using max_new_tokens, next_logits,
temperature and generated only breaks when every batch row hits eos_token_id
simultaneously; change it to maintain a per-sample finished mask (e.g., finished
= torch.zeros(batch_size, dtype=torch.bool, device=logits.device)), update
finished |= (next_token.squeeze(-1) == eos_token_id) each step, use torch.where
to keep next_token for already finished rows (or replace with eos_token_id/pad)
so you do not overwrite completed rows when concatenating into generated, and
break the loop when finished.all(); ensure places referencing next_logits,
torch.multinomial, and generated use the mask so sampling/updating only occurs
for non-finished samples.
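The per-sample finished mask can be sketched without tensors; `step_fn` stands in for the model forward plus sampling, and finished rows keep emitting a pad token instead of being overwritten:

```python
def greedy_generate(step_fn, batch_size, max_new_tokens, eos_token_id, pad_token_id):
    """step_fn(step) returns one next token per sample (a stand-in for the
    model forward + multinomial sampling)."""
    finished = [False] * batch_size
    generated = [[] for _ in range(batch_size)]
    for step in range(max_new_tokens):
        next_tokens = step_fn(step)
        for i in range(batch_size):
            # Finished rows are frozen: emit pad instead of a fresh sample.
            token = pad_token_id if finished[i] else next_tokens[i]
            generated[i].append(token)
            if token == eos_token_id:
                finished[i] = True
        # Stop once every row has hit EOS, not only when they hit it together.
        if all(finished):
            break
    return generated
```

The tensor version replaces the inner loop with `torch.where(finished, pad, next_token)` and the exit test with `finished.all()`, but the control flow is the same.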
- Around line 430-448: The loop that updates KV caches can leave output_logits
None when model_schedule omits the primary model, causing dream-style
multi-block generation to crash at next_logits_context; to fix, ensure
output_logits is always populated by choosing a deterministic source: after
computing _logits in the kv_update_models loop (using _model_forward on each
kv_model), set output_logits to _logits if kv_model is model else if
output_logits is still None set it to _logits (i.e., default to the first
kv_model's logits), or alternatively validate at the start of the function that
model_schedule/_unique_sched_models contains model and raise a clear error;
update references around kv_update_models, _unique_sched_models, model,
output_logits, _model_forward, dream_style, and next_logits_context accordingly.
- Around line 472-510: Add a guard at the start of tp_follower_loop to verify TP
communication was initialized (e.g., that _TP_GROUP and _TP_SRC_GLOBAL_RANK have
been set by set_tp_group or equivalent) and raise a descriptive RuntimeError if
not so broadcasts won't run on the default world group; also add a final else
branch for unknown cmd values that raises a descriptive RuntimeError identifying
the unsupported command code and expected TP command set. Locate these changes
in tp_follower_loop and the command-handling logic where
_TP_GROUP/_TP_SRC_GLOBAL_RANK are referenced and where cmd is matched.
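The two guards (uninitialized TP group, unknown command code) can be sketched as a small dispatcher; the command codes and the group handle are placeholders for `_TP_GROUP` and the real TP command set:

```python
CMD_FORWARD, CMD_SHUTDOWN = 0, 1

def handle_command(cmd, tp_group):
    """Follower-side dispatch: refuse to run without an initialized TP
    group, and reject unknown command codes loudly."""
    if tp_group is None:
        raise RuntimeError(
            "TP communication not initialized; call set_tp_group() before "
            "entering the follower loop"
        )
    if cmd == CMD_FORWARD:
        return "forward"
    if cmd == CMD_SHUTDOWN:
        return "shutdown"
    raise RuntimeError(
        f"Unknown TP command {cmd}; expected one of "
        f"CMD_FORWARD={CMD_FORWARD}, CMD_SHUTDOWN={CMD_SHUTDOWN}"
    )
```

Without the first guard, a broadcast would silently run on the default world group; without the final `raise`, a corrupted command code would be dropped instead of surfacing.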
In
`@src/megatron/bridge/diffusion/models/nemotron_diffusion/nemotron_diffusion_provider.py`:
- Around line 40-52: The current provide method mutates
self.transformer_layer_spec to the resolved ModuleSpec which prevents later
calls with different vp_stage from being re-resolved; instead, keep the original
callable and only use a patched spec for the duration of the parent call: store
original = self.transformer_layer_spec, resolve transformer_layer_spec (calling
it with vp_stage if it's callable and not a ModuleSpec), apply the patch (set
submodules...core_attention = NemotronDiffusionAttention) on that resolved spec
(or a shallow copy if needed), set self.transformer_layer_spec to the patched
spec, call result = super().provide(pre_process, post_process, vp_stage), then
restore self.transformer_layer_spec = original and return result; reference
provide, transformer_layer_spec, ModuleSpec, NemotronDiffusionAttention, and
super().provide when making the change.
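The patch-then-restore discipline the comment asks `provide()` to follow is the classic temporary-attribute-override pattern; a generic sketch (not the provider code itself):

```python
from contextlib import contextmanager

@contextmanager
def patched_attr(obj, name, value):
    """Temporarily override an attribute, restoring the original even when
    the body raises; this keeps a callable spec re-resolvable on later calls."""
    original = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield obj
    finally:
        setattr(obj, name, original)
```

In `provide()`, the resolved-and-patched `ModuleSpec` would be installed only for the duration of the `super().provide(...)` call, so a subsequent call with a different `vp_stage` still sees the original callable.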
In `@src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py`:
- Around line 159-164: The helper call to
distributed_fused_adam_with_cosine_annealing_dllm inside
_nemotron_diffusion3_common is not receiving the selected precision, so the
optimizer falls back to its BF16 default and produces wrong
OptimizerConfig.bf16/fp16 flags; update the call sites (including the similar
block around lines 233-281) to accept the precision_config parameter from
_nemotron_diffusion3_common, normalize its string to the repo convention (e.g.,
map variants to "bf16_mixed" or "fp16_mixed" as appropriate), and pass that
normalized precision into distributed_fused_adam_with_cosine_annealing_dllm so
it can derive the correct optimizer flags (bf16/fp16) rather than defaulting to
BF16. Ensure the normalization logic is used consistently wherever optimizer
precision is determined.
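The precision normalization could be centralized in a small helper; the alias table and the flag derivation are assumptions about the repo convention (`"bf16_mixed"` / `"fp16_mixed"`), not the actual optimizer API:

```python
def normalize_precision(precision):
    """Map common precision spellings onto the repo-style names."""
    aliases = {
        "bf16": "bf16_mixed", "bfloat16": "bf16_mixed", "bf16_mixed": "bf16_mixed",
        "fp16": "fp16_mixed", "float16": "fp16_mixed", "fp16_mixed": "fp16_mixed",
    }
    try:
        return aliases[precision.lower()]
    except KeyError:
        raise ValueError(f"Unknown precision {precision!r}") from None

def optimizer_flags(precision):
    """Derive the bf16/fp16 optimizer flags from the normalized name,
    instead of silently defaulting to BF16."""
    p = normalize_precision(precision)
    return {"bf16": p == "bf16_mixed", "fp16": p == "fp16_mixed"}
```

Passing the normalized name into the optimizer factory makes every call site agree on how `bf16`/`fp16` are derived.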
- Around line 216-224: The CheckpointConfig currently forces finetune=True
(CheckpointConfig(..., finetune=True,...)), which breaks normal resume semantics
when later loading from checkpoint.load; change the assignment so finetune is
not hardcoded — either accept a parameter or set finetune =
bool(pretrained_checkpoint) (or finetune = pretrained_checkpoint is not None) so
finetune is true only when bootstrapping from an AR pretrained_checkpoint and
false for normal restarts from checkpoint_dir, preserving optimizer/iteration
resume state; update the construction that references CheckpointConfig,
pretrained_checkpoint and checkpoint_dir accordingly.
- Around line 52-53: The function signature currently types model_provider as
NemotronDiffusionModelProvider | None but the code calls model_provider(), so
change the parameter type to a Callable (e.g., Callable[...,
NemotronDiffusionModelProvider] | None) and add an explicit validation at the
start of the function that ensures exactly one of hf_path or model_provider is
provided; if both are None (or both provided) raise a clear ValueError
describing that hf_path or a model_provider callable is required. Update any
references to model_provider usage (the place where model_provider() is invoked)
to rely on the validated callable and keep NemotronDiffusionModelProvider as the
returned/expected type.
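The exactly-one-of validation plus callable check can be sketched like this; the return values stand in for `AutoBridge.from_hf_pretrained` and the provider instance, and the function name is hypothetical:

```python
def resolve_model_provider(hf_path=None, model_provider=None):
    """Require exactly one source of the model config: an HF path or a
    callable returning a provider instance."""
    if (hf_path is None) == (model_provider is None):
        raise ValueError(
            "Provide exactly one of hf_path or model_provider "
            f"(got hf_path={hf_path!r}, model_provider={model_provider!r})"
        )
    if model_provider is not None:
        if not callable(model_provider):
            raise TypeError("model_provider must be callable")
        return model_provider()
    return f"loaded-from:{hf_path}"  # stand-in for AutoBridge.from_hf_pretrained
```

Typing the parameter as `Callable[..., NemotronDiffusionModelProvider] | None` then matches how it is actually used: the code calls `model_provider()`, so an already-constructed provider instance would fail here instead of deep inside training.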
---
Minor comments:
In `@examples/diffusion/recipes/nemotron_diffusion/conf/ar_to_dlm_3b_dlm.yaml`:
- Around line 1-3: The config header incorrectly references the old entrypoint
filename "pretrain_dlm_3b.py"; update that string to the new entrypoint
"ar_to_dlm.py" so users are directed to the current workflow; locate the header
comment containing "Use with pretrain_dlm_3b.py" and change it to "Use with
ar_to_dlm.py" (verify there are no other occurrences of "pretrain_dlm_3b.py" in
the file such as example commands or docs within the same YAML).
In `@src/megatron/bridge/diffusion/common/dllm.py`:
- Around line 42-47: The returned p_mask must be updated to reflect tokens
forced unmasked by loss_mask; after you apply loss_mask to masked_indices (the
line masked_indices[loss_mask == 0] = 0), also zero out or mask p_mask at the
same positions so p_mask matches the effective mask probability for ignored
tokens (e.g., set p_mask[loss_mask == 0] = 0 or p_mask = p_mask * loss_mask,
taking care to preserve tensor dtype/device). This keeps noisy_batch,
masked_indices and p_mask consistent (refer to variables masked_indices,
loss_mask, p_mask, noisy_batch, mask_token_id, input_ids in dllm.py).
In
`@src/megatron/bridge/diffusion/recipes/nemotron_diffusion/continuous_pretraining.py`:
- Around line 114-118: The docstring for the function in
continuous_pretraining.py claims the default TP=2, MBS=1 but the call to
_nemotron_diffusion_cpt_config sets tensor_model_parallel_size=4; update the
call so tensor_model_parallel_size is 2 (or alternatively update the docstring
to match 4) to keep the docstring and code consistent—look for the return in the
function that calls _nemotron_diffusion_cpt_config and change
tensor_model_parallel_size to 2 if you want to match "Default: TP=2, MBS=1".
- Around line 97-102: The docstring for the 3B config in
continuous_pretraining.py says "Default: TP=1, MBS=2" but the call to
_nemotron_diffusion_cpt_config sets micro_batch_size=1; update the call to
_nemotron_diffusion_cpt_config (the 3B config factory) to set micro_batch_size=2
so the code matches the docstring (or alternatively update the docstring if 1 is
intended) — reference the _nemotron_diffusion_cpt_config invocation in the 3B
config return.
In `@tests/unit_tests/diffusion/model/nemotron_diffusion/test_dgpt_step.py`:
- Line 171: The unpacked return values from step._apply_noise and other test
calls include variables that are unused; rename the unused locals by prefixing
them with an underscore to satisfy Ruff RUF059 (e.g., change masked_indices and
p_mask returned from step._apply_noise to _masked_indices and _p_mask, and
rename the unused num_tokens and report variables at the other test locations to
_num_tokens and _report). Locate the unpack in the test function using
step._apply_noise and any other assertions/assignments at the reported spots and
update only the variable names to their underscored equivalents so the tests'
behavior remains unchanged while resolving the lint warnings.
---
Nitpick comments:
In `@examples/diffusion/recipes/nemotron_diffusion/conf/cpt_14b.yaml`:
- Around line 35-39: The scheduler block in this YAML uses inconsistent warmup
fields (lr_warmup_fraction: 0.01 and lr_warmup_iters: 0) compared to other CPT
configs; update the scheduler keys (lr_warmup_fraction and lr_warmup_iters) to
the canonical values used across CPT YAMLs so the warmup behavior is identical,
keeping lr_decay_style and lr_wsd_decay_iters as-is; locate the scheduler block
(symbols: scheduler, lr_warmup_fraction, lr_warmup_iters, lr_decay_style,
lr_wsd_decay_iters) and copy the standard warmup values from any other CPT
config into this file.
In `@examples/diffusion/recipes/nemotron_diffusion/conf/cpt_3b.yaml`:
- Around line 35-39: The scheduler warmup settings are inconsistent:
lr_warmup_fraction is set to 0.01 while lr_warmup_iters is 0; decide the
intended behavior and make them consistent by either (A) keeping a fractional
warmup and computing/setting lr_warmup_iters accordingly (e.g., set
lr_warmup_iters = int(total_training_iters * lr_warmup_fraction) wherever the
scheduler config is assembled, updating the scheduler config generation code
that reads lr_warmup_fraction and lr_warmup_iters), or (B) disabling warmup by
setting lr_warmup_fraction: 0 (or removing it) and leaving lr_warmup_iters: 0;
ensure the scheduler config (symbols: lr_warmup_fraction, lr_warmup_iters,
scheduler) is updated so only one source of truth is used.
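Option (A) above can be sketched in one helper; the names are illustrative, not the repo's actual config-assembly code:

```python
# Derive lr_warmup_iters from lr_warmup_fraction so the scheduler keeps
# a single source of truth for warmup length.
def resolve_warmup_iters(total_training_iters: int, lr_warmup_fraction: float) -> int:
    return int(total_training_iters * lr_warmup_fraction)
```

With `lr_warmup_fraction: 0.01` and 100k training iterations this yields 1000 warmup iterations, replacing the contradictory `lr_warmup_iters: 0`.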
In `@examples/diffusion/recipes/nemotron_diffusion/conf/cpt_8b.yaml`:
- Around line 35-39: In the scheduler block, align the warmup settings so they
aren't contradictory: ensure lr_warmup_fraction and lr_warmup_iters in the
scheduler section are consistent (either set lr_warmup_fraction to 0 if
lr_warmup_iters should be 0, or set lr_warmup_iters to the appropriate non-zero
value derived from lr_warmup_fraction), matching the approach used in the other
CPT config (cpt_3b) for consistency across CPT configs; update the
lr_warmup_fraction and/or lr_warmup_iters values accordingly.
In
`@tests/unit_tests/diffusion/model/nemotron_diffusion/test_nemotron_diffusion_attention.py`:
- Around line 186-190: The test unpacks two outputs from apply_rotary_pos_emb
into q_e and k_e but never uses k_e; change the unpacked variable to _k_e (or
prefix it with an underscore) in the test where apply_rotary_pos_emb(q, k, cos,
sin, unsqueeze_dim=0) is called so static analysis recognizes the unused value
is intentional and keeps the rest of the assertions (q_e checks) unchanged.
- Around line 386-391: The _qkv helper declares a parameter defaulting to None
without an explicit None-bearing type; update its signature to explicitly allow
None (e.g., change the parameter annotation in _qkv to "sq: int | None = None"
or "Optional[int] = None") so the type indicates None is permitted, leaving the
body and return unchanged; reference the _qkv method and its sq parameter when
making this edit.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 25508144-bb9b-494f-8eb8-37ed0de852a9
📒 Files selected for processing (32)
- examples/diffusion/recipes/nemotron_diffusion/README.md
- examples/diffusion/recipes/nemotron_diffusion/__init__.py
- examples/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py
- examples/diffusion/recipes/nemotron_diffusion/conf/ar_to_dlm_14b_dlm.yaml
- examples/diffusion/recipes/nemotron_diffusion/conf/ar_to_dlm_3b_dlm.yaml
- examples/diffusion/recipes/nemotron_diffusion/conf/ar_to_dlm_8b_dlm.yaml
- examples/diffusion/recipes/nemotron_diffusion/conf/cpt_14b.yaml
- examples/diffusion/recipes/nemotron_diffusion/conf/cpt_3b.yaml
- examples/diffusion/recipes/nemotron_diffusion/conf/cpt_8b.yaml
- examples/diffusion/recipes/nemotron_diffusion/continuous_pretraining.py
- examples/diffusion/recipes/nemotron_diffusion/eval_megatron.py
- src/megatron/bridge/diffusion/README.md
- src/megatron/bridge/diffusion/common/__init__.py
- src/megatron/bridge/diffusion/common/dllm.py
- src/megatron/bridge/diffusion/conversion/nemotron_diffusion/__init__.py
- src/megatron/bridge/diffusion/conversion/nemotron_diffusion/nemotron_diffusion_bridge.py
- src/megatron/bridge/diffusion/models/common/dgpt_step.py
- src/megatron/bridge/diffusion/models/common/nemotron_diffusion_attention.py
- src/megatron/bridge/diffusion/models/nemotron_diffusion/__init__.py
- src/megatron/bridge/diffusion/models/nemotron_diffusion/inference_nemotron_diffusion.py
- src/megatron/bridge/diffusion/models/nemotron_diffusion/nemotron_diffusion_provider.py
- src/megatron/bridge/diffusion/recipes/nemotron_diffusion/__init__.py
- src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py
- src/megatron/bridge/diffusion/recipes/nemotron_diffusion/continuous_pretraining.py
- tests/unit_tests/diffusion/model/nemotron_diffusion/__init__.py
- tests/unit_tests/diffusion/model/nemotron_diffusion/conversion/__init__.py
- tests/unit_tests/diffusion/model/nemotron_diffusion/conversion/test_nemotron_diffusion_bridge.py
- tests/unit_tests/diffusion/model/nemotron_diffusion/test_dgpt_step.py
- tests/unit_tests/diffusion/model/nemotron_diffusion/test_inference.py
- tests/unit_tests/diffusion/model/nemotron_diffusion/test_nemotron_diffusion_attention.py
- tests/unit_tests/diffusion/recipes/nemotron_diffusion/__init__.py
- tests/unit_tests/diffusion/recipes/nemotron_diffusion/test_nemotron_diffusion_provider.py
```python
def tp_follower_loop(model):
    """Blocking loop for TP-non-rank-0 processes.

    Waits for commands from the TP-rank-0 process and mirrors all model
    operations (set_inference_mode, set_inference_params, clear_kv_cache,
    model forward) so Megatron TP communication stays in sync.
    """
    while True:
        cmd = torch.zeros(1, dtype=torch.long, device="cuda")
        torch.distributed.broadcast(cmd, src=_TP_SRC_GLOBAL_RANK, group=_TP_GROUP)
        cmd = cmd.item()

        if cmd == _CMD_STOP:
            break
        elif cmd == _CMD_FORWARD:
            input_ids = _broadcast_tensor(
                torch.zeros(1, 1, dtype=torch.long, device="cuda"),
                _TP_SRC_GLOBAL_RANK,
                _TP_GROUP,
            )
            seq_len = input_ids.shape[1]
            position_ids = torch.arange(seq_len, device=input_ids.device).unsqueeze(0).expand_as(input_ids)
            model(input_ids=input_ids, position_ids=position_ids, attention_mask=None)
        elif cmd == _CMD_SET_INF_MODE_ON:
            for attn in _get_core_attentions(model):
                attn.set_inference_mode(True)
        elif cmd == _CMD_SET_INF_MODE_OFF:
            for attn in _get_core_attentions(model):
                attn.set_inference_mode(False)
        elif cmd == _CMD_SET_PARAMS:
            extra = torch.zeros(2, dtype=torch.long, device="cuda")
            torch.distributed.broadcast(extra, src=_TP_SRC_GLOBAL_RANK, group=_TP_GROUP)
            causal = bool(extra[0].item())
            cache_enabled = bool(extra[1].item())
            for attn in _get_core_attentions(model):
                attn.set_inference_params(causal, cache_enabled)
        elif cmd == _CMD_CLEAR_CACHE:
            for attn in _get_core_attentions(model):
                attn.clear_kv_cache()
```
Guard tp_follower_loop() before broadcasting on the wrong group.
If a follower enters this loop before set_tp_group(), every broadcast here runs with group=None, i.e. the default world group. In a TP job that can hang the whole process set instead of failing fast. Add an explicit setup check at the top and raise a descriptive RuntimeError (same for unknown command codes). Based on learnings, when a feature is not supported, raise an explicit error with a descriptive message that identifies the unsupported feature and the expected behavior.
Suggested fix

```diff
 def tp_follower_loop(model):
     """Blocking loop for TP-non-rank-0 processes.
@@
     Waits for commands from the TP-rank-0 process and mirrors all model
     operations (set_inference_mode, set_inference_params, clear_kv_cache,
     model forward) so Megatron TP communication stays in sync.
     """
+    if _TP_GROUP is None:
+        raise RuntimeError("set_tp_group() must be called before tp_follower_loop().")
+
     while True:
         cmd = torch.zeros(1, dtype=torch.long, device="cuda")
         torch.distributed.broadcast(cmd, src=_TP_SRC_GLOBAL_RANK, group=_TP_GROUP)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def tp_follower_loop(model):
    """Blocking loop for TP-non-rank-0 processes.

    Waits for commands from the TP-rank-0 process and mirrors all model
    operations (set_inference_mode, set_inference_params, clear_kv_cache,
    model forward) so Megatron TP communication stays in sync.
    """
    if _TP_GROUP is None:
        raise RuntimeError("set_tp_group() must be called before tp_follower_loop().")

    while True:
        cmd = torch.zeros(1, dtype=torch.long, device="cuda")
        torch.distributed.broadcast(cmd, src=_TP_SRC_GLOBAL_RANK, group=_TP_GROUP)
        cmd = cmd.item()
        if cmd == _CMD_STOP:
            break
        elif cmd == _CMD_FORWARD:
            input_ids = _broadcast_tensor(
                torch.zeros(1, 1, dtype=torch.long, device="cuda"),
                _TP_SRC_GLOBAL_RANK,
                _TP_GROUP,
            )
            seq_len = input_ids.shape[1]
            position_ids = torch.arange(seq_len, device=input_ids.device).unsqueeze(0).expand_as(input_ids)
            model(input_ids=input_ids, position_ids=position_ids, attention_mask=None)
        elif cmd == _CMD_SET_INF_MODE_ON:
            for attn in _get_core_attentions(model):
                attn.set_inference_mode(True)
        elif cmd == _CMD_SET_INF_MODE_OFF:
            for attn in _get_core_attentions(model):
                attn.set_inference_mode(False)
        elif cmd == _CMD_SET_PARAMS:
            extra = torch.zeros(2, dtype=torch.long, device="cuda")
            torch.distributed.broadcast(extra, src=_TP_SRC_GLOBAL_RANK, group=_TP_GROUP)
            causal = bool(extra[0].item())
            cache_enabled = bool(extra[1].item())
            for attn in _get_core_attentions(model):
                attn.set_inference_params(causal, cache_enabled)
        elif cmd == _CMD_CLEAR_CACHE:
            for attn in _get_core_attentions(model):
                attn.clear_kv_cache()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@src/megatron/bridge/diffusion/models/nemotron_diffusion/inference_nemotron_diffusion.py`
around lines 472 - 510, Add a guard at the start of tp_follower_loop to verify
TP communication was initialized (e.g., that _TP_GROUP and _TP_SRC_GLOBAL_RANK
have been set by set_tp_group or equivalent) and raise a descriptive
RuntimeError if not so broadcasts won't run on the default world group; also add
a final else branch for unknown cmd values that raises a descriptive
RuntimeError identifying the unsupported command code and expected TP command
set. Locate these changes in tp_follower_loop and the command-handling logic
where _TP_GROUP/_TP_SRC_GLOBAL_RANK are referenced and where cmd is matched.
src/megatron/bridge/diffusion/models/nemotron_diffusion/nemotron_diffusion_provider.py (outdated; comment resolved)
```python
    model_provider: NemotronDiffusionModelProvider | None = None,
    hf_path: str | None = None,
```
🧩 Analysis chain
🏁 Script executed:

```shell
# First, check if the file exists and read the relevant sections
head -60 src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py | cat -n
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 2652

🏁 Script executed:

```shell
# Also check the "Also applies to" section
sed -n '130,150p' src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py | cat -n
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1119

🏁 Script executed:

```shell
# Check the imports at the top of the file
head -20 src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py | cat -n
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 942

🏁 Script executed:

```shell
# Search for how model_provider is used in the file
rg "model_provider" src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py -A 2 -B 1
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 642

🏁 Script executed:

```shell
# Get full context of the function to see all usages of model_provider
sed -n '50,200p' src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py | cat -n
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 8084

🏁 Script executed:

```shell
# Check if there are any other usages of model_provider() calls
rg "model_provider\(\)" src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 107

🏁 Script executed:

```shell
# Check the full function signature and docstring
sed -n '51,150p' src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py | cat -n
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 5828

🏁 Script executed:

```shell
# Search for other function definitions with model_provider parameter
rg "def.*model_provider" src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 53

🏁 Script executed:

```shell
# Check for any other parameter definitions with model_provider in the file
rg "model_provider.*:" src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py -B 3
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 355

🏁 Script executed:

```shell
# Get full file line count to understand scope
wc -l src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 144

🏁 Script executed:

```shell
# Check what's at lines 134-142 in the actual file
sed -n '130,145p' src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py | cat -n
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 850

🏁 Script executed:

```shell
# Search for any other model_provider() calls in the entire file
rg "model_provider\(\)" src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py -B 5 -A 1
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 274
Change model_provider parameter type to Callable and add explicit None check.
The parameter is annotated as NemotronDiffusionModelProvider | None, but the code calls model_provider() at line 92. This mismatch causes TypeError if callers follow the signature and pass an instance. Additionally, there is no None check before the call; both hf_path and model_provider can be None simultaneously, leading to a cryptic error. Add explicit validation to fail fast with a descriptive error message.
Suggested fix

```diff
 import os
-from typing import Optional
+from typing import Callable, Optional
@@
 def _nemotron_diffusion3_common(
-    model_provider: NemotronDiffusionModelProvider | None = None,
+    model_provider: Callable[[], NemotronDiffusionModelProvider] | None = None,
@@
     if hf_path is not None:
         bridge = AutoBridge.from_hf_pretrained(hf_path)
         model_cfg = bridge.to_megatron_provider(
             load_weights=bool(pretrained_checkpoint),
             hf_path=pretrained_checkpoint,
         )
     else:
+        if model_provider is None:
+            raise ValueError("Either `hf_path` or `model_provider` must be provided.")
         model_cfg = model_provider()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py` around
lines 52 - 53, The function signature currently types model_provider as
NemotronDiffusionModelProvider | None but the code calls model_provider(), so
change the parameter type to a Callable (e.g., Callable[...,
NemotronDiffusionModelProvider] | None) and add an explicit validation at the
start of the function that ensures exactly one of hf_path or model_provider is
provided; if both are None (or both provided) raise a clear ValueError
describing that hf_path or a model_provider callable is required. Update any
references to model_provider usage (the place where model_provider() is invoked)
to rely on the validated callable and keep NemotronDiffusionModelProvider as the
returned/expected type.
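The exactly-one-of validation described above can be sketched with stand-in return values in place of the real Megatron config objects:

```python
from typing import Callable, Optional

# Sketch only: exactly one of hf_path / model_provider must be given.
# The tuple return is a placeholder for the AutoBridge conversion path.
def resolve_model_cfg(
    hf_path: Optional[str],
    model_provider: Optional[Callable[[], object]],
):
    if (hf_path is None) == (model_provider is None):
        raise ValueError("Provide exactly one of `hf_path` or `model_provider`.")
    if hf_path is not None:
        return ("bridge", hf_path)  # placeholder for AutoBridge-derived config
    return model_provider()
```

Typing the parameter as a zero-argument `Callable` matches how the function actually uses it (`model_provider()`), and the equality check rejects both the both-`None` and both-provided cases with one clear message.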
```python
    opt_cfg, scheduler_cfg = distributed_fused_adam_with_cosine_annealing_dllm(
        lr_warmup_iters=lr_warmup_iters,
        lr_decay_iters=lr_decay_iters,
        max_lr=lr,
        min_lr=min_lr,
    )
```
The optimizer precision is disconnected from precision_config.
_nemotron_diffusion3_common() accepts precision_config, but this helper call never forwards it, so distributed_fused_adam_with_cosine_annealing_dllm() always falls back to its BF16 default. That means non-BF16 recipes will still set OptimizerConfig.bf16/fp16 incorrectly, and the helper’s hyphenated literals do not match the repo’s "bf16_mixed" spelling used elsewhere in this file. Please thread the selected precision through and normalize the string before deriving the optimizer flags.
Also applies to: 233-281
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py` around
lines 159 - 164, The helper call to
distributed_fused_adam_with_cosine_annealing_dllm inside
_nemotron_diffusion3_common is not receiving the selected precision, so the
optimizer falls back to its BF16 default and produces wrong
OptimizerConfig.bf16/fp16 flags; update the call sites (including the similar
block around lines 233-281) to accept the precision_config parameter from
_nemotron_diffusion3_common, normalize its string to the repo convention (e.g.,
map variants to "bf16_mixed" or "fp16_mixed" as appropriate), and pass that
normalized precision into distributed_fused_adam_with_cosine_annealing_dllm so
it can derive the correct optimizer flags (bf16/fp16) rather than defaulting to
BF16. Ensure the normalization logic is used consistently wherever optimizer
precision is determined.
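The normalization step suggested above can be sketched as a small helper; the recognized spellings here are assumptions for illustration, not the repo's canonical list:

```python
# Normalize a precision string (e.g. "bf16-mixed", "BF16_mixed") to the
# underscore spelling before deriving OptimizerConfig-style flags.
def optimizer_precision_flags(precision: str) -> dict:
    normalized = precision.strip().lower().replace("-", "_")
    return {
        "bf16": normalized == "bf16_mixed",
        "fp16": normalized == "fp16_mixed",
    }
```

Threading the selected precision through to the optimizer factory and normalizing it once means both the hyphenated and underscore spellings resolve to the same flags, instead of silently defaulting to BF16.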
cuichenx left a comment
Couple of overall comments
- please run lint on all files
- let's use copyright 2026 throughout the PR
- might be worth it to add a functional test in addition to the unit tests
Please also see comments below
```python
    HAS_MISTRAL3 = False


@MegatronModelBridge.register_bridge(source=Mistral3ForConditionalGeneration, target=GPTModel)
```
will there be any issues in non-diffusion ministral 3 workloads if this bridge is imported instead?
Thanks for catching this. I think I need to ask for the name of the model in HF.
```python
    Select model size with --model-size {3b,8b,14b} (default: 14b).
    Use --hf-path to override the HuggingFace model ID or local model path.
    Configuration is overridden via YAML and CLI in the same way as pretrain_ministral3_14b.py.
    """
```
add commands for example usage here, similar to other examples
we generally don't want to have these yaml config files in the repo. Can we include all the hyperparameters in the recipe function? Then, if anything else is needed, use a hydra override in the example command.
sure, makes sense. I'll remove them. Btw, other diffusion models like wan and flux seem to have their yaml files in the repo: https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/diffusion/recipes/wan/conf
/claude review
src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py (outdated; comment resolved)
src/megatron/bridge/diffusion/recipes/nemotron_diffusion/continuous_pretraining.py (outdated; comment resolved)
```python
    # --- Denoise the block iteratively ---
    for i in range(steps_per_block):
        step_model = _step_models[i] if _step_models is not None else model
        print(f"Denoising step {i} of {steps_per_block} using {type(step_model).__name__}")
```
This looks like a debug print that was left in. It will fire on every denoising step for every block during generation, which is quite noisy. Consider removing it or changing to logger.debug(...).
```python
        print(f"Denoising step {i} of {steps_per_block} using {type(step_model).__name__}")
```
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
- Add TODO comment to check for dedicated HF class for NemotronDiffusion bridge - Remove unused divide_by_masked_tokens parameter from dgpt_step and provider - Fix AR loss validation: extend NaN/Inf/spiky checks to cover ar_loss, not just dlm_loss - Fix RoPE init: always initialize rope_embedding_module unconditionally; guard Llama-4 query scaling with `if self.beta is not None` in both forward paths - Fix VP parallelism bug in provider: restore transformer_layer_spec callable after each provide() call so subsequent vp_stage invocations resolve correctly - Remove model_schedule from generate_dllm (unused, not wired up in CLI) - Add usage examples to ar_to_dlm.py and continuous_pretraining.py docstrings Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> Signed-off-by: Sajad Norouzi <sajad.n@gmail.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> Signed-off-by: Sajad Norouzi <sajad.n@gmail.com>
Scripts referenced in README but missing from the PR. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
Ministral3Model wraps GPTModel under .language_model, so _unwrap() needs to unwrap it to reach .decoder.layers for NemotronDiffusionAttention access. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
What does this PR do ?
Adds support for the nemotron-diffusion model to megatron-bridge.
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information
Summary by CodeRabbit
Release Notes
New Features
Documentation
Tests