Add Support for Nemotron Diffusion #3105
…e model utilities

- Updated README to include DMinistral and its features.
- Added new modules for DMinistral3 model conversion and utilities, including `DMinistral3Bridge`, `DMinistral3ModelProvider`, and `FlexDotProductAttention`.
- Introduced `dgpt_step.py` for training steps specific to diffusion language models.
- Implemented masking and attention mechanisms tailored for diffusion models in `dllm.py` and `flex_dot_product_attention.py`.
- Established directory structure for DMinistral components and updated conversion bridges accordingly.

Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
Introduces the full Nemotron Diffusion recipe, including:

- AR-to-DLM conversion pipeline for 3B and 8B models
- Continuous pretraining (CPT) support
- `NemotronDiffusionAttention` (renamed from `FlexDotProductAttention`)
- Inference support with KV cache and lm-eval integration
- HF bridge conversion for Nemotron Diffusion checkpoints
- Training documentation and example configs

Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
…sion modules

Clean up formatting and linting issues in AR-to-DLM conversion, continuous pretraining, evaluation, model inference, attention, and checkpoint conversion components.

Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
- Fix ar_to_dlm key recipe path in README - Update data-paths examples to point to DCLM dataset - Fix tokenizer_model to use HF model ID instead of local path Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
Force-pushed from `d60258a` to `bbf3a50`.
📝 Walkthrough

Adds comprehensive NemotronDiffusion support to Megatron's diffusion bridge, including the training pipeline (continuous pretraining and AR-to-DLM conversion), inference with dual AR/diffusion generation modes, specialized attention and loss computation for diffusion language models, the model provider architecture, HuggingFace↔Megatron checkpoint conversion, and supporting documentation and configurations.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as Training Entry
    participant ConfigBuilder as Config Builder
    participant Bridge as AutoBridge
    participant Step as DGPTStep
    participant Model as NemotronDiffusion
    User->>ConfigBuilder: nemotron_diffusion3_pretrain_config()
    ConfigBuilder->>Bridge: AutoBridge.from_hf_pretrained()
    Bridge->>ConfigBuilder: NemotronDiffusionModelProvider
    ConfigBuilder->>User: ConfigContainer
    User->>Step: forward_step_func=DGPTStep()
    User->>Model: pretrain(config, forward_step)
    Model->>Step: __call__(state, data_iterator)
    Step->>Step: forward_process_simple_masking()
    Note over Step: Create [noisy | clean] concat
    Step->>Model: forward([noisy | clean])
    Model->>Model: NemotronDiffusionAttention
    Model->>Step: logits
    Step->>Step: Split logits, compute losses
    Step->>Step: combine(dlm_loss, ar_loss)
    Step->>Model: return loss
```
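The `[noisy | clean]` construction in the training flow above can be sketched in plain Python. This is a minimal illustration under assumptions (a uniform per-token mask probability, list-based batches); the repository's `forward_process_simple_masking` and `DGPTStep` operate on tensors and differ in detail:

```python
import random

def forward_mask(input_ids, mask_token_id, p_mask=0.5, rng=None):
    """Replace each token with mask_token_id with probability p_mask.

    Returns the noisy sequence and the boolean mask of corrupted positions.
    """
    rng = rng or random.Random(0)
    masked = [rng.random() < p_mask for _ in input_ids]
    noisy = [mask_token_id if m else t for t, m in zip(input_ids, masked)]
    return noisy, masked

def build_training_batch(input_ids, mask_token_id, p_mask=0.5):
    """Concatenate [noisy | clean] so a single forward pass serves both the
    DLM (denoising) loss on the noisy half and the AR loss on the clean half."""
    noisy, masked = forward_mask(input_ids, mask_token_id, p_mask)
    return noisy + list(input_ids), masked
```

With `p_mask=1.0` every position in the noisy half becomes the mask token; the clean half is always the original sequence, which is what lets the step split the logits and compute the two losses separately.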
```mermaid
sequenceDiagram
    participant App as Evaluation
    participant Model as MegatronDLLM
    participant Prompt as Prompt Input
    participant Gen as Generator
    participant Inference as Inference Utils
    App->>Model: register_model("megatron_dllm")
    App->>Model: __init__(eval_mode="dllm"|"ar")
    Model->>Inference: set_tp_group(tp_rank_group)
    Prompt->>Gen: generate(prompt_ids)
    alt eval_mode == "ar"
        Gen->>Inference: generate_ar(model, prompt)
        Inference->>Model: iterative forward + sample
    else eval_mode == "dllm"
        Gen->>Inference: generate_dllm(model, prompt)
        Inference->>Model: denoise blocks iteratively
        Inference->>Inference: get_transfer_index() select masked positions
        Inference->>Model: update KV cache per block
    end
    Model->>App: generated_tokens
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed (2 warnings)
Actionable comments posted: 17
Note: Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
🟡 Minor comments (5)
`src/megatron/bridge/diffusion/recipes/nemotron_diffusion/continuous_pretraining.py` (1)

114-118: ⚠️ Potential issue | 🟡 Minor: Docstring/code mismatch for 8B config.
The docstring states "Default: TP=2, MBS=1" but the code sets `tensor_model_parallel_size=4`.

📝 Proposed fix
```diff
 def nemotron_diffusion_8b_finetune_config(
     data_paths=None,
     data_args_path=None,
     hf_path=None,
     peft=None,
 ) -> ConfigContainer:
-    """Return a CPT config for Nemotron-Diffusion 8B. Default: TP=2, MBS=1."""
+    """Return a CPT config for Nemotron-Diffusion 8B. Default: TP=4, MBS=1."""
     return _nemotron_diffusion_cpt_config(
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/diffusion/recipes/nemotron_diffusion/continuous_pretraining.py` around lines 114-118: the docstring claims the default is TP=2, MBS=1 but the call to `_nemotron_diffusion_cpt_config` sets `tensor_model_parallel_size=4`; change `tensor_model_parallel_size` to 2 (or alternatively update the docstring to match 4) to keep the docstring and code consistent.

`src/megatron/bridge/diffusion/recipes/nemotron_diffusion/continuous_pretraining.py` (1)

97-102: ⚠️ Potential issue | 🟡 Minor: Docstring/code mismatch for 3B config.
The docstring states "Default: TP=1, MBS=2" but the code sets `micro_batch_size=1`.

📝 Proposed fix
```diff
 def nemotron_diffusion_3b_finetune_config(
     data_paths=None,
     data_args_path=None,
     hf_path=None,
     peft=None,
 ) -> ConfigContainer:
-    """Return a CPT config for Nemotron-Diffusion 3B. Default: TP=1, MBS=2."""
+    """Return a CPT config for Nemotron-Diffusion 3B. Default: TP=1, MBS=1."""
     return _nemotron_diffusion_cpt_config(
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/diffusion/recipes/nemotron_diffusion/continuous_pretraining.py` around lines 97-102: the docstring for the 3B config says "Default: TP=1, MBS=2" but the call to `_nemotron_diffusion_cpt_config` sets `micro_batch_size=1`; set `micro_batch_size=2` so the code matches the docstring (or alternatively update the docstring if 1 is intended).

`examples/diffusion/recipes/nemotron_diffusion/conf/ar_to_dlm_3b_dlm.yaml` (1)

1-3: ⚠️ Potential issue | 🟡 Minor: Update the header to the new entrypoint name.
This still points users at `pretrain_dlm_3b.py`, but the new workflow in this PR uses `ar_to_dlm.py`. Leaving the old name here will misdirect anyone starting from the config file.

💡 Proposed fix
```diff
-# Use with pretrain_dlm_3b.py
+# Use with examples/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/diffusion/recipes/nemotron_diffusion/conf/ar_to_dlm_3b_dlm.yaml` around lines 1-3: the config header incorrectly references the old entrypoint filename `pretrain_dlm_3b.py`; change the header comment to "Use with ar_to_dlm.py" so users are directed to the current workflow, and verify there are no other occurrences of `pretrain_dlm_3b.py` in the file (such as example commands or docs within the same YAML).

`src/megatron/bridge/diffusion/common/dllm.py` (1)
42-47: ⚠️ Potential issue | 🟡 Minor: Keep `p_mask` consistent with ignored tokens.

Once `loss_mask` forces a token to stay unmasked, its effective mask probability becomes 0, but the returned `p_mask` still reports the sampled value. That makes the two outputs disagree for ignored positions.

💡 Proposed fix
```diff
 if loss_mask is not None:
     masked_indices[loss_mask == 0] = 0
+    p_mask = p_mask.masked_fill(loss_mask == 0, 0)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/diffusion/common/dllm.py` around lines 42-47: the returned `p_mask` must be updated to reflect tokens forced unmasked by `loss_mask`; after applying `loss_mask` to `masked_indices` (the line `masked_indices[loss_mask == 0] = 0`), also zero out `p_mask` at the same positions (e.g., `p_mask[loss_mask == 0] = 0` or `p_mask = p_mask * loss_mask`, taking care to preserve tensor dtype/device) so `p_mask` matches the effective mask probability for ignored tokens. This keeps `noisy_batch`, `masked_indices`, and `p_mask` consistent.

`tests/unit_tests/diffusion/model/nemotron_diffusion/test_dgpt_step.py` (1)
171-171: ⚠️ Potential issue | 🟡 Minor: Prefix unused unpacked variables with `_`.

Ruff detects multiple RUF059 violations in this test module:

- Line 171: `masked_indices` and `p_mask` are unused
- Line 257: `num_tokens` and `report` are unused
- Line 565: `num_tokens` and `report` are unused

Prefix each unused variable with an underscore (e.g., `_masked_indices`, `_num_tokens`, `_report`) to resolve these linting violations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit_tests/diffusion/model/nemotron_diffusion/test_dgpt_step.py` at line 171, The unpacked return values from step._apply_noise and other test calls include variables that are unused; rename the unused locals by prefixing them with an underscore to satisfy Ruff RUF059 (e.g., change masked_indices and p_mask returned from step._apply_noise to _masked_indices and _p_mask, and rename the unused num_tokens and report variables at the other test locations to _num_tokens and _report). Locate the unpack in the test function using step._apply_noise and any other assertions/assignments at the reported spots and update only the variable names to their underscored equivalents so the tests' behavior remains unchanged while resolving the lint warnings.
🧹 Nitpick comments (5)
tests/unit_tests/diffusion/model/nemotron_diffusion/test_nemotron_diffusion_attention.py (2)
186-190: Unused variable `k_e` should be prefixed with underscore.

Static analysis flagged that `k_e` is unpacked but never used. Prefix with underscore to indicate intentional non-use.

🔧 Proposed fix
```diff
-    q_e, k_e = apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=0)
+    q_e, _k_e = apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=0)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit_tests/diffusion/model/nemotron_diffusion/test_nemotron_diffusion_attention.py` around lines 186 - 190, The test unpacks two outputs from apply_rotary_pos_emb into q_e and k_e but never uses k_e; change the unpacked variable to _k_e (or prefix it with an underscore) in the test where apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=0) is called so static analysis recognizes the unused value is intentional and keeps the rest of the assertions (q_e checks) unchanged.
386-391: Use explicit `int | None` type annotation.

PEP 484 prohibits implicit `Optional`. When a parameter defaults to `None`, the type annotation should explicitly include `None`.

🔧 Proposed fix
```diff
-    def _qkv(self, sq: int = None):
+    def _qkv(self, sq: int | None = None):
         sq = sq or self.SEQ_LEN
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit_tests/diffusion/model/nemotron_diffusion/test_nemotron_diffusion_attention.py` around lines 386-391: the `_qkv` helper declares a parameter defaulting to `None` without an explicit None-bearing type; update its signature to explicitly allow `None` (e.g., change the annotation of `sq` to `int | None = None` or `Optional[int] = None`), leaving the body and return unchanged.

`examples/diffusion/recipes/nemotron_diffusion/conf/cpt_3b.yaml` (1)
35-39: Inconsistent warmup configuration: `lr_warmup_fraction` vs `lr_warmup_iters`.

Setting `lr_warmup_fraction: 0.01` suggests a 1% warmup period, but `lr_warmup_iters: 0` explicitly disables warmup iterations. This is potentially confusing: if warmup is intended, `lr_warmup_iters` should be computed from the fraction; if no warmup is intended, `lr_warmup_fraction` should be 0 or removed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/diffusion/recipes/nemotron_diffusion/conf/cpt_3b.yaml` around lines 35-39: the scheduler warmup settings are inconsistent (`lr_warmup_fraction: 0.01` while `lr_warmup_iters: 0`); decide the intended behavior and make them consistent by either (A) keeping a fractional warmup and computing `lr_warmup_iters` from it (e.g., `lr_warmup_iters = int(total_training_iters * lr_warmup_fraction)` wherever the scheduler config is assembled), or (B) disabling warmup by setting `lr_warmup_fraction: 0` (or removing it) and leaving `lr_warmup_iters: 0`, so only one source of truth is used.

`examples/diffusion/recipes/nemotron_diffusion/conf/cpt_8b.yaml` (1)
35-39: Same warmup configuration inconsistency as `cpt_3b.yaml`.

`lr_warmup_fraction: 0.01` conflicts with `lr_warmup_iters: 0`. Consider aligning these across all CPT configs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/diffusion/recipes/nemotron_diffusion/conf/cpt_8b.yaml` around lines 35-39: in the scheduler block, align the warmup settings so they aren't contradictory (either set `lr_warmup_fraction` to 0 if `lr_warmup_iters` should be 0, or set `lr_warmup_iters` to the appropriate non-zero value derived from `lr_warmup_fraction`), matching the approach used in `cpt_3b.yaml` for consistency across CPT configs.

`examples/diffusion/recipes/nemotron_diffusion/conf/cpt_14b.yaml` (1)
35-39: Same warmup configuration inconsistency as other CPT configs.

Consider standardizing the warmup configuration across all CPT YAML files.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/diffusion/recipes/nemotron_diffusion/conf/cpt_14b.yaml` around lines 35 - 39, The scheduler block in this YAML uses inconsistent warmup fields (lr_warmup_fraction: 0.01 and lr_warmup_iters: 0) compared to other CPT configs; update the scheduler keys (lr_warmup_fraction and lr_warmup_iters) to the canonical values used across CPT YAMLs so the warmup behavior is identical, keeping lr_decay_style and lr_wsd_decay_iters as-is; locate the scheduler block (symbols: scheduler, lr_warmup_fraction, lr_warmup_iters, lr_decay_style, lr_wsd_decay_iters) and copy the standard warmup values from any other CPT config into this file.
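As the warmup comments above suggest, one way to keep a single source of truth is to derive the iteration count at config-assembly time. A minimal sketch; `resolve_warmup_iters` is a hypothetical helper, not part of the recipes:

```python
def resolve_warmup_iters(train_iters, lr_warmup_fraction=0.0, lr_warmup_iters=0):
    """Use the explicit iteration count when given; otherwise derive it
    from the fraction, so the two fields can never contradict each other."""
    if lr_warmup_iters > 0:
        return lr_warmup_iters
    return int(train_iters * lr_warmup_fraction)
```

With this convention, `lr_warmup_fraction: 0.01` and `lr_warmup_iters: 0` would mean "1% of training iterations", while a non-zero `lr_warmup_iters` always wins.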
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py`:
- Around line 39-41: The CLI's --model-size is ignored because main() always
calls nemotron_diffusion3_pretrain_config(...) regardless of args.model_size;
update main() (and any config construction calls around the existing uses at
59-69 and 112-119) to thread args.model_size into config creation by either (a)
mapping values "3B"/"8B"/"14B" to distinct factory functions (e.g.,
nemotron_diffusion3_pretrain_config, nemotron_diffusion8_pretrain_config,
nemotron_diffusion14_pretrain_config) and selecting the correct one based on
args.model_size, or (b) extend nemotron_diffusion3_pretrain_config to accept a
model_size parameter and use that inside the function to build the appropriate
config; ensure every place that previously called
nemotron_diffusion3_pretrain_config now uses the selected factory or passes
args.model_size so the generated run config actually reflects the CLI flag.
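Option (a) from the prompt above can be sketched as a small dispatch table. The stub factories below stand in for the real per-size config builders (only `nemotron_diffusion3_pretrain_config` exists in the PR; the 8B/14B factory names are assumptions):

```python
# Stub factories standing in for the real per-size config builders.
def nemotron_diffusion3_pretrain_config(**kw):
    return {"size": "3B", **kw}

def nemotron_diffusion8_pretrain_config(**kw):
    return {"size": "8B", **kw}

def nemotron_diffusion14_pretrain_config(**kw):
    return {"size": "14B", **kw}

def build_pretrain_config(model_size, **kwargs):
    """Dispatch the CLI --model-size value to the matching config factory,
    so the flag actually affects the generated run config."""
    factories = {
        "3B": nemotron_diffusion3_pretrain_config,
        "8B": nemotron_diffusion8_pretrain_config,
        "14B": nemotron_diffusion14_pretrain_config,
    }
    try:
        factory = factories[model_size]
    except KeyError:
        raise ValueError(
            f"Unsupported --model-size {model_size!r}; "
            f"expected one of {sorted(factories)}"
        ) from None
    return factory(**kwargs)
```

Every call site that previously hardcoded the 3B factory would go through `build_pretrain_config(args.model_size, ...)` instead.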
In `@examples/diffusion/recipes/nemotron_diffusion/eval_megatron.py`:
- Around line 494-507: The per-rank JSON read-modify-write in _log_nfe (and
similarly _log_latency) races in distributed runs; modify these functions to
avoid concurrent writes by either: 1) gathering nfe/latency values from all
ranks to rank 0 (use the existing distributed backend utilities) and have only
rank 0 perform the read-modify-write to self.nfe_log_path/self.latency_log_path,
or 2) write per-rank files (e.g., append rank id to the filenames) so each rank
writes independently and a separate merge step combines them later; update
_log_nfe and _log_latency to implement one of these two strategies and ensure
you only perform json.load/json.dump on a single writer to prevent corrupted
JSON.
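Strategy 2 (per-rank files plus a merge step) can be sketched with the standard library alone; the file naming and record shape here are illustrative, not the real `_log_nfe` format:

```python
import glob
import json
import os

def log_metric(log_dir, rank, record):
    """Each rank appends to its own file, so no two processes ever
    read-modify-write the same JSON concurrently."""
    path = os.path.join(log_dir, f"nfe_rank{rank}.json")
    entries = []
    if os.path.exists(path):
        with open(path) as f:
            entries = json.load(f)
    entries.append(record)
    with open(path, "w") as f:
        json.dump(entries, f)

def merge_logs(log_dir):
    """Post-hoc merge of the per-rank files into a single list."""
    merged = []
    for path in sorted(glob.glob(os.path.join(log_dir, "nfe_rank*.json"))):
        with open(path) as f:
            merged.extend(json.load(f))
    return merged
```

The alternative (gather to rank 0 and have a single writer) avoids the merge step but requires a collective; the per-rank-file variant needs no communication at logging time.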
- Around line 469-492: The code currently collapses a batched generation by
taking only tokenizer.batch_decode(...)[0], so generated_tokens, tokenized_out
and generate_until drop all but the first item and misapply per-request stop
logic (e.g., gen_args[0]["until"]) for batch_size>1; update the logic in the
decode/return path (referencing generated_tokens, tokenized_out,
tokenizer.batch_decode and generate_until/gen_args) to decode the entire batch
into a list of strings (one per batch item), preserve None->"" handling per
element, and return that list so downstream per-item until/stop processing and
latency/accounting remain aligned with each input; apply the same change to the
analogous block around lines 543-551.
- Around line 179-180: Validate tensor-parallel (tp) and pipeline-parallel (pp)
values immediately after tp = int(tp) and pp = int(pp): ensure tp > 0 and pp >
0, and ensure world_size is divisible by tp * pp (so DP size = world_size //
(tp*pp) is an integer); if any check fails, raise a clear ValueError (or log and
sys.exit) with a message mentioning tp, pp and world_size before any
process-group construction (the sections that build DP/TP/PP groups around the
global_rank/world_size logic in the subsequent block lines 189-227 and 233-235).
This prevents tp==0 (which causes global_rank % tp crashes) and non-divisible
world sizes that would create incomplete/hanging collectives.
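The suggested fail-fast validation might look like this (a sketch; the real code would call it right after parsing `tp`/`pp` and before building any DP/TP/PP process groups):

```python
def validate_parallelism(world_size, tp, pp):
    """Fail fast before any process groups are built.

    Returns the data-parallel size when the configuration is valid.
    """
    if tp <= 0 or pp <= 0:
        raise ValueError(f"tp and pp must be positive, got tp={tp}, pp={pp}")
    if world_size % (tp * pp) != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp={tp * pp}; "
            "data-parallel size would not be an integer"
        )
    return world_size // (tp * pp)  # data-parallel size
```

Raising here turns a cryptic `global_rank % tp` crash or a hanging collective into an immediate, readable error.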
- Around line 608-614: Replace the current dummy-return behavior in the methods
loglikelihood and loglikelihood_rolling: instead of logging a warning and
returning constant zeros, raise a descriptive NotImplementedError explaining
that likelihood evaluation is not supported for diffusion models; update both
functions (loglikelihood and loglikelihood_rolling) to raise that error (include
a message referencing the diffusion model limitation) so callers fail fast
rather than receiving fabricated scores.
In
`@src/megatron/bridge/diffusion/conversion/nemotron_diffusion/nemotron_diffusion_bridge.py`:
- Around line 46-56: The bridge decorator is being evaluated even when
Mistral3ForConditionalGeneration is None, causing an AttributeError during
import; guard the registration by only defining/decorating
NemotronDiffusionBridge when HAS_MISTRAL3 is True: wrap the
MegatronModelBridge.register_bridge(...) decorator and the
NemotronDiffusionBridge class definition inside an if HAS_MISTRAL3: block so the
decorator call (and any use of source.__name__) is skipped when the optional
dependency is unavailable.
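The guarded-registration pattern can be sketched as follows; the registry and decorator here are stand-ins for the real `MegatronModelBridge.register_bridge` API, not its actual signature:

```python
# Guarding a decorator-based registration behind an optional import.
try:
    from transformers import Mistral3ForConditionalGeneration  # optional dep
    HAS_MISTRAL3 = True
except ImportError:
    Mistral3ForConditionalGeneration = None
    HAS_MISTRAL3 = False

BRIDGE_REGISTRY = {}

def register_bridge(source):
    def decorator(cls):
        # source.__name__ would raise AttributeError if source were None,
        # which is exactly why the decorator must not run in that case.
        BRIDGE_REGISTRY[source.__name__] = cls
        return cls
    return decorator

if HAS_MISTRAL3:
    @register_bridge(source=Mistral3ForConditionalGeneration)
    class NemotronDiffusionBridge:
        pass
```

Because the decorator call sits inside the `if HAS_MISTRAL3:` block, importing the module never touches `source.__name__` when the optional dependency is missing.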
In `@src/megatron/bridge/diffusion/models/common/dgpt_step.py`:
- Around line 261-277: The divide_by_masked_tokens flag is threaded into
_create_loss_function_sbd but never consumed, so update
_masked_loss_sbd_block_diff to accept a divide_by_masked_tokens parameter and,
when true, compute the number of masked tokens from the provided loss_mask
(avoid zero by max(count, 1)) and divide the DLM loss term by that count (leave
AR loss behavior unchanged unless desired); ensure the function signature name
_masked_loss_sbd_block_diff includes divide_by_masked_tokens and that any other
factory/usage sites referenced in the diff region (the other create helper
around lines 280-327) are similarly wired to pass the flag through.
- Around line 63-67: DGPTStep currently assumes pipeline parallelism of 1 and
will hit None.clone() when PP>1; add a fail-fast guard in DGPTStep.__call__ (or
at start of get_batch_from_iterator) that checks pipeline model parallel world
size via parallel_state.get_pipeline_model_parallel_world_size() (or equivalent)
and raises NotImplementedError with a clear message like "DGPTStep does not
support pipeline parallelism > 1" when the size > 1; reference
DGPTStep.__call__, get_batch_from_iterator, and _apply_noise to ensure the error
triggers before any cloning or calls to _apply_noise so stages that only
materialize tokens or labels will not access None.
- Around line 294-327: The current validation only checks dlm_loss but the final
objective is loss = dlm_loss * dlm_loss_weight + ar_loss * ar_loss_weight, so
add the same rerun_state_machine.validate_result checks (NaN, Inf, and spiky via
is_unexpectedly_large) on the combined objective tensor (e.g., validate the
computed loss tensor before it is returned/used) using the same
messages/tolerance/fatal flags; locate get_rerun_state_machine(), dlm_loss,
ar_loss, dlm_loss_weight, ar_loss_weight, rerun_state_machine.validate_result
and the loss assignment to perform the validations on that combined loss (keep
existing per-branch semantics: fatal for NaN/Inf, non-fatal for spiky).
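The idea of validating the combined objective, not just `dlm_loss`, can be sketched on scalars; the real code would run `rerun_state_machine.validate_result` on tensors, and the `FloatingPointError` here is illustrative:

```python
import math

def validate_loss(loss, name="loss"):
    """Fatal checks mirroring the NaN/Inf validation, returning the value
    unchanged when it is finite."""
    if math.isnan(loss):
        raise FloatingPointError(f"{name} is NaN")
    if math.isinf(loss):
        raise FloatingPointError(f"{name} is Inf")
    return loss

def combine_losses(dlm_loss, ar_loss, dlm_loss_weight=1.0, ar_loss_weight=1.0):
    """Validate the per-branch loss and the final weighted objective."""
    validate_loss(dlm_loss, "dlm_loss")
    loss = dlm_loss * dlm_loss_weight + ar_loss * ar_loss_weight
    return validate_loss(loss, "combined loss")
```

Checking the combined value catches the case where `ar_loss` alone carries the NaN/Inf that the dlm-only validation would miss.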
In `@src/megatron/bridge/diffusion/models/common/nemotron_diffusion_attention.py`:
- Around line 193-206: The RoPE initialization is currently nested inside
apply_llama4_style_query_key_layer_scaling causing self.rope_embedding_module,
self.beta and self.max_position_embeddings to be undefined when that flag is
false; refactor the constructor so that creation of the rotary module
(Ministral3RotaryEmbedding) and assignment of self.rope_embedding_module and
self.max_position_embeddings always happens (using a suitable hf_text_config or
default), and move only the extra Llama-4 specific assignments (self.beta and
modifying rope_parameters["factor"] / original_max_position_embeddings) into the
apply_llama4_style_query_key_layer_scaling branch; then update any
attention/forward code that multiplies queries by the Llama-4 scale to guard
that multiplication with if self.beta is not None so scaling is applied only
when initialized.
In
`@src/megatron/bridge/diffusion/models/nemotron_diffusion/inference_nemotron_diffusion.py`:
- Around line 237-248: The current token loop using max_new_tokens, next_logits,
temperature and generated only breaks when every batch row hits eos_token_id
simultaneously; change it to maintain a per-sample finished mask (e.g., finished
= torch.zeros(batch_size, dtype=torch.bool, device=logits.device)), update
finished |= (next_token.squeeze(-1) == eos_token_id) each step, use torch.where
to keep next_token for already finished rows (or replace with eos_token_id/pad)
so you do not overwrite completed rows when concatenating into generated, and
break the loop when finished.all(); ensure places referencing next_logits,
torch.multinomial, and generated use the mask so sampling/updating only occurs
for non-finished samples.
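The per-sample finished mask can be sketched without tensors; `step_fn` stands in for the model forward plus sampling, and finished rows keep emitting a pad token instead of being overwritten:

```python
def greedy_generate(step_fn, batch_size, max_new_tokens, eos_token_id, pad_token_id):
    """step_fn(step) returns one next token per sample (a stand-in for the
    model forward + multinomial sampling)."""
    finished = [False] * batch_size
    generated = [[] for _ in range(batch_size)]
    for step in range(max_new_tokens):
        next_tokens = step_fn(step)
        for i in range(batch_size):
            # Finished rows are frozen: emit pad instead of a fresh sample.
            token = pad_token_id if finished[i] else next_tokens[i]
            generated[i].append(token)
            if token == eos_token_id:
                finished[i] = True
        # Stop once every row has hit EOS, not only when they hit it together.
        if all(finished):
            break
    return generated
```

The tensor version replaces the inner loop with `torch.where(finished, pad, next_token)` and the exit test with `finished.all()`, but the control flow is the same.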
- Around line 430-448: The loop that updates KV caches can leave output_logits
None when model_schedule omits the primary model, causing dream-style
multi-block generation to crash at next_logits_context; to fix, ensure
output_logits is always populated by choosing a deterministic source: after
computing _logits in the kv_update_models loop (using _model_forward on each
kv_model), set output_logits to _logits if kv_model is model else if
output_logits is still None set it to _logits (i.e., default to the first
kv_model's logits), or alternatively validate at the start of the function that
model_schedule/_unique_sched_models contains model and raise a clear error;
update references around kv_update_models, _unique_sched_models, model,
output_logits, _model_forward, dream_style, and next_logits_context accordingly.
- Around line 472-510: Add a guard at the start of tp_follower_loop to verify TP
communication was initialized (e.g., that _TP_GROUP and _TP_SRC_GLOBAL_RANK have
been set by set_tp_group or equivalent) and raise a descriptive RuntimeError if
not so broadcasts won't run on the default world group; also add a final else
branch for unknown cmd values that raises a descriptive RuntimeError identifying
the unsupported command code and expected TP command set. Locate these changes
in tp_follower_loop and the command-handling logic where
_TP_GROUP/_TP_SRC_GLOBAL_RANK are referenced and where cmd is matched.
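The two guards (uninitialized TP group, unknown command code) can be sketched as a small dispatcher; the command codes and the group handle are placeholders for `_TP_GROUP` and the real TP command set:

```python
CMD_FORWARD, CMD_SHUTDOWN = 0, 1

def handle_command(cmd, tp_group):
    """Follower-side dispatch: refuse to run without an initialized TP
    group, and reject unknown command codes loudly."""
    if tp_group is None:
        raise RuntimeError(
            "TP communication not initialized; call set_tp_group() before "
            "entering the follower loop"
        )
    if cmd == CMD_FORWARD:
        return "forward"
    if cmd == CMD_SHUTDOWN:
        return "shutdown"
    raise RuntimeError(
        f"Unknown TP command {cmd}; expected one of "
        f"CMD_FORWARD={CMD_FORWARD}, CMD_SHUTDOWN={CMD_SHUTDOWN}"
    )
```

Without the first guard, a broadcast would silently run on the default world group; without the final `raise`, a corrupted command code would be dropped instead of surfacing.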
In
`@src/megatron/bridge/diffusion/models/nemotron_diffusion/nemotron_diffusion_provider.py`:
- Around line 40-52: The current provide method mutates
self.transformer_layer_spec to the resolved ModuleSpec which prevents later
calls with different vp_stage from being re-resolved; instead, keep the original
callable and only use a patched spec for the duration of the parent call: store
original = self.transformer_layer_spec, resolve transformer_layer_spec (calling
it with vp_stage if it's callable and not a ModuleSpec), apply the patch (set
submodules...core_attention = NemotronDiffusionAttention) on that resolved spec
(or a shallow copy if needed), set self.transformer_layer_spec to the patched
spec, call result = super().provide(pre_process, post_process, vp_stage), then
restore self.transformer_layer_spec = original and return result; reference
provide, transformer_layer_spec, ModuleSpec, NemotronDiffusionAttention, and
super().provide when making the change.
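The patch-then-restore discipline the comment asks `provide()` to follow is the classic temporary-attribute-override pattern; a generic sketch (not the provider code itself):

```python
from contextlib import contextmanager

@contextmanager
def patched_attr(obj, name, value):
    """Temporarily override an attribute, restoring the original even when
    the body raises; this keeps a callable spec re-resolvable on later calls."""
    original = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield obj
    finally:
        setattr(obj, name, original)
```

In `provide()`, the resolved-and-patched `ModuleSpec` would be installed only for the duration of the `super().provide(...)` call, so a subsequent call with a different `vp_stage` still sees the original callable.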
In `@src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py`:
- Around line 159-164: The helper call to
distributed_fused_adam_with_cosine_annealing_dllm inside
_nemotron_diffusion3_common is not receiving the selected precision, so the
optimizer falls back to its BF16 default and produces wrong
OptimizerConfig.bf16/fp16 flags; update the call sites (including the similar
block around lines 233-281) to accept the precision_config parameter from
_nemotron_diffusion3_common, normalize its string to the repo convention (e.g.,
map variants to "bf16_mixed" or "fp16_mixed" as appropriate), and pass that
normalized precision into distributed_fused_adam_with_cosine_annealing_dllm so
it can derive the correct optimizer flags (bf16/fp16) rather than defaulting to
BF16. Ensure the normalization logic is used consistently wherever optimizer
precision is determined.
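The precision normalization could be centralized in a small helper; the alias table and the flag derivation are assumptions about the repo convention (`"bf16_mixed"` / `"fp16_mixed"`), not the actual optimizer API:

```python
def normalize_precision(precision):
    """Map common precision spellings onto the repo-style names."""
    aliases = {
        "bf16": "bf16_mixed", "bfloat16": "bf16_mixed", "bf16_mixed": "bf16_mixed",
        "fp16": "fp16_mixed", "float16": "fp16_mixed", "fp16_mixed": "fp16_mixed",
    }
    try:
        return aliases[precision.lower()]
    except KeyError:
        raise ValueError(f"Unknown precision {precision!r}") from None

def optimizer_flags(precision):
    """Derive the bf16/fp16 optimizer flags from the normalized name,
    instead of silently defaulting to BF16."""
    p = normalize_precision(precision)
    return {"bf16": p == "bf16_mixed", "fp16": p == "fp16_mixed"}
```

Passing the normalized name into the optimizer factory makes every call site agree on how `bf16`/`fp16` are derived.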
- Around line 216-224: The CheckpointConfig currently forces finetune=True
(CheckpointConfig(..., finetune=True,...)), which breaks normal resume semantics
when later loading from checkpoint.load; change the assignment so finetune is
not hardcoded — either accept a parameter or set finetune =
bool(pretrained_checkpoint) (or finetune = pretrained_checkpoint is not None) so
finetune is true only when bootstrapping from an AR pretrained_checkpoint and
false for normal restarts from checkpoint_dir, preserving optimizer/iteration
resume state; update the construction that references CheckpointConfig,
pretrained_checkpoint and checkpoint_dir accordingly.
- Around line 52-53: The function signature currently types model_provider as
NemotronDiffusionModelProvider | None but the code calls model_provider(), so
change the parameter type to a Callable (e.g., Callable[...,
NemotronDiffusionModelProvider] | None) and add an explicit validation at the
start of the function that ensures exactly one of hf_path or model_provider is
provided; if both are None (or both provided) raise a clear ValueError
describing that hf_path or a model_provider callable is required. Update any
references to model_provider usage (the place where model_provider() is invoked)
to rely on the validated callable and keep NemotronDiffusionModelProvider as the
returned/expected type.
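The exactly-one-of validation plus callable check can be sketched like this; the return values stand in for `AutoBridge.from_hf_pretrained` and the provider instance, and the function name is hypothetical:

```python
def resolve_model_provider(hf_path=None, model_provider=None):
    """Require exactly one source of the model config: an HF path or a
    callable returning a provider instance."""
    if (hf_path is None) == (model_provider is None):
        raise ValueError(
            "Provide exactly one of hf_path or model_provider "
            f"(got hf_path={hf_path!r}, model_provider={model_provider!r})"
        )
    if model_provider is not None:
        if not callable(model_provider):
            raise TypeError("model_provider must be callable")
        return model_provider()
    return f"loaded-from:{hf_path}"  # stand-in for AutoBridge.from_hf_pretrained
```

Typing the parameter as `Callable[..., NemotronDiffusionModelProvider] | None` then matches how it is actually used: the code calls `model_provider()`, so an already-constructed provider instance would fail here instead of deep inside training.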
---
Minor comments:
In `@examples/diffusion/recipes/nemotron_diffusion/conf/ar_to_dlm_3b_dlm.yaml`:
- Around line 1-3: The config header incorrectly references the old entrypoint
filename "pretrain_dlm_3b.py"; update that string to the new entrypoint
"ar_to_dlm.py" so users are directed to the current workflow; locate the header
comment containing "Use with pretrain_dlm_3b.py" and change it to "Use with
ar_to_dlm.py" (verify there are no other occurrences of "pretrain_dlm_3b.py" in
the file such as example commands or docs within the same YAML).
In `@src/megatron/bridge/diffusion/common/dllm.py`:
- Around line 42-47: The returned p_mask must be updated to reflect tokens
forced unmasked by loss_mask; after you apply loss_mask to masked_indices (the
line masked_indices[loss_mask == 0] = 0), also zero out or mask p_mask at the
same positions so p_mask matches the effective mask probability for ignored
tokens (e.g., set p_mask[loss_mask == 0] = 0 or p_mask = p_mask * loss_mask,
taking care to preserve tensor dtype/device). This keeps noisy_batch,
masked_indices and p_mask consistent (refer to variables masked_indices,
loss_mask, p_mask, noisy_batch, mask_token_id, input_ids in dllm.py).
In
`@src/megatron/bridge/diffusion/recipes/nemotron_diffusion/continuous_pretraining.py`:
- Around line 114-118: The docstring for the function in
continuous_pretraining.py claims the default TP=2, MBS=1 but the call to
_nemotron_diffusion_cpt_config sets tensor_model_parallel_size=4; update the
call so tensor_model_parallel_size is 2 (or alternatively update the docstring
to match 4) to keep the docstring and code consistent—look for the return in the
function that calls _nemotron_diffusion_cpt_config and change
tensor_model_parallel_size to 2 if you want to match "Default: TP=2, MBS=1".
- Around line 97-102: The docstring for the 3B config in
continuous_pretraining.py says "Default: TP=1, MBS=2" but the call to
_nemotron_diffusion_cpt_config sets micro_batch_size=1; update the call to
_nemotron_diffusion_cpt_config (the 3B config factory) to set micro_batch_size=2
so the code matches the docstring (or alternatively update the docstring if 1 is
intended) — reference the _nemotron_diffusion_cpt_config invocation in the 3B
config return.
In `@tests/unit_tests/diffusion/model/nemotron_diffusion/test_dgpt_step.py`:
- Line 171: The unpacked return values from step._apply_noise and other test
calls include variables that are unused; rename the unused locals by prefixing
them with an underscore to satisfy Ruff RUF059 (e.g., change masked_indices and
p_mask returned from step._apply_noise to _masked_indices and _p_mask, and
rename the unused num_tokens and report variables at the other test locations to
_num_tokens and _report). Locate the unpack in the test function using
step._apply_noise and any other assertions/assignments at the reported spots and
update only the variable names to their underscored equivalents so the tests'
behavior remains unchanged while resolving the lint warnings.
---
Nitpick comments:
In `@examples/diffusion/recipes/nemotron_diffusion/conf/cpt_14b.yaml`:
- Around line 35-39: The scheduler block in this YAML uses inconsistent warmup
fields (lr_warmup_fraction: 0.01 and lr_warmup_iters: 0) compared to other CPT
configs; update the scheduler keys (lr_warmup_fraction and lr_warmup_iters) to
the canonical values used across CPT YAMLs so the warmup behavior is identical,
keeping lr_decay_style and lr_wsd_decay_iters as-is; locate the scheduler block
(symbols: scheduler, lr_warmup_fraction, lr_warmup_iters, lr_decay_style,
lr_wsd_decay_iters) and copy the standard warmup values from any other CPT
config into this file.
In `@examples/diffusion/recipes/nemotron_diffusion/conf/cpt_3b.yaml`:
- Around line 35-39: The scheduler warmup settings are inconsistent:
lr_warmup_fraction is set to 0.01 while lr_warmup_iters is 0; decide the
intended behavior and make them consistent by either (A) keeping a fractional
warmup and computing/setting lr_warmup_iters accordingly (e.g., set
lr_warmup_iters = int(total_training_iters * lr_warmup_fraction) wherever the
scheduler config is assembled, updating the scheduler config generation code
that reads lr_warmup_fraction and lr_warmup_iters), or (B) disabling warmup by
setting lr_warmup_fraction: 0 (or removing it) and leaving lr_warmup_iters: 0;
ensure the scheduler config (symbols: lr_warmup_fraction, lr_warmup_iters,
scheduler) is updated so only one source of truth is used.
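Option (A) above can be sketched in one helper; the names are illustrative, not the repo's actual config-assembly code:

```python
# Derive lr_warmup_iters from lr_warmup_fraction so the scheduler keeps
# a single source of truth for warmup length.
def resolve_warmup_iters(total_training_iters: int, lr_warmup_fraction: float) -> int:
    return int(total_training_iters * lr_warmup_fraction)
```

With `lr_warmup_fraction: 0.01` and 100k training iterations this yields 1000 warmup iterations, replacing the contradictory `lr_warmup_iters: 0`.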
In `@examples/diffusion/recipes/nemotron_diffusion/conf/cpt_8b.yaml`:
- Around line 35-39: In the scheduler block, align the warmup settings so they
aren't contradictory: ensure lr_warmup_fraction and lr_warmup_iters in the
scheduler section are consistent (either set lr_warmup_fraction to 0 if
lr_warmup_iters should be 0, or set lr_warmup_iters to the appropriate non-zero
value derived from lr_warmup_fraction), matching the approach used in the other
CPT config (cpt_3b) for consistency across CPT configs; update the
lr_warmup_fraction and/or lr_warmup_iters values accordingly.
In
`@tests/unit_tests/diffusion/model/nemotron_diffusion/test_nemotron_diffusion_attention.py`:
- Around line 186-190: The test unpacks two outputs from apply_rotary_pos_emb
into q_e and k_e but never uses k_e; change the unpacked variable to _k_e (or
prefix it with an underscore) in the test where apply_rotary_pos_emb(q, k, cos,
sin, unsqueeze_dim=0) is called so static analysis recognizes the unused value
is intentional and keeps the rest of the assertions (q_e checks) unchanged.
- Around line 386-391: The _qkv helper declares a parameter defaulting to None
without an explicit None-bearing type; update its signature to explicitly allow
None (e.g., change the parameter annotation in _qkv to "sq: int | None = None"
or "Optional[int] = None") so the type indicates None is permitted, leaving the
body and return unchanged; reference the _qkv method and its sq parameter when
making this edit.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 25508144-bb9b-494f-8eb8-37ed0de852a9
📒 Files selected for processing (32)
- examples/diffusion/recipes/nemotron_diffusion/README.md
- examples/diffusion/recipes/nemotron_diffusion/__init__.py
- examples/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py
- examples/diffusion/recipes/nemotron_diffusion/conf/ar_to_dlm_14b_dlm.yaml
- examples/diffusion/recipes/nemotron_diffusion/conf/ar_to_dlm_3b_dlm.yaml
- examples/diffusion/recipes/nemotron_diffusion/conf/ar_to_dlm_8b_dlm.yaml
- examples/diffusion/recipes/nemotron_diffusion/conf/cpt_14b.yaml
- examples/diffusion/recipes/nemotron_diffusion/conf/cpt_3b.yaml
- examples/diffusion/recipes/nemotron_diffusion/conf/cpt_8b.yaml
- examples/diffusion/recipes/nemotron_diffusion/continuous_pretraining.py
- examples/diffusion/recipes/nemotron_diffusion/eval_megatron.py
- src/megatron/bridge/diffusion/README.md
- src/megatron/bridge/diffusion/common/__init__.py
- src/megatron/bridge/diffusion/common/dllm.py
- src/megatron/bridge/diffusion/conversion/nemotron_diffusion/__init__.py
- src/megatron/bridge/diffusion/conversion/nemotron_diffusion/nemotron_diffusion_bridge.py
- src/megatron/bridge/diffusion/models/common/dgpt_step.py
- src/megatron/bridge/diffusion/models/common/nemotron_diffusion_attention.py
- src/megatron/bridge/diffusion/models/nemotron_diffusion/__init__.py
- src/megatron/bridge/diffusion/models/nemotron_diffusion/inference_nemotron_diffusion.py
- src/megatron/bridge/diffusion/models/nemotron_diffusion/nemotron_diffusion_provider.py
- src/megatron/bridge/diffusion/recipes/nemotron_diffusion/__init__.py
- src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py
- src/megatron/bridge/diffusion/recipes/nemotron_diffusion/continuous_pretraining.py
- tests/unit_tests/diffusion/model/nemotron_diffusion/__init__.py
- tests/unit_tests/diffusion/model/nemotron_diffusion/conversion/__init__.py
- tests/unit_tests/diffusion/model/nemotron_diffusion/conversion/test_nemotron_diffusion_bridge.py
- tests/unit_tests/diffusion/model/nemotron_diffusion/test_dgpt_step.py
- tests/unit_tests/diffusion/model/nemotron_diffusion/test_inference.py
- tests/unit_tests/diffusion/model/nemotron_diffusion/test_nemotron_diffusion_attention.py
- tests/unit_tests/diffusion/recipes/nemotron_diffusion/__init__.py
- tests/unit_tests/diffusion/recipes/nemotron_diffusion/test_nemotron_diffusion_provider.py
```python
def tp_follower_loop(model):
    """Blocking loop for TP-non-rank-0 processes.

    Waits for commands from the TP-rank-0 process and mirrors all model
    operations (set_inference_mode, set_inference_params, clear_kv_cache,
    model forward) so Megatron TP communication stays in sync.
    """
    while True:
        cmd = torch.zeros(1, dtype=torch.long, device="cuda")
        torch.distributed.broadcast(cmd, src=_TP_SRC_GLOBAL_RANK, group=_TP_GROUP)
        cmd = cmd.item()

        if cmd == _CMD_STOP:
            break
        elif cmd == _CMD_FORWARD:
            input_ids = _broadcast_tensor(
                torch.zeros(1, 1, dtype=torch.long, device="cuda"),
                _TP_SRC_GLOBAL_RANK,
                _TP_GROUP,
            )
            seq_len = input_ids.shape[1]
            position_ids = torch.arange(seq_len, device=input_ids.device).unsqueeze(0).expand_as(input_ids)
            model(input_ids=input_ids, position_ids=position_ids, attention_mask=None)
        elif cmd == _CMD_SET_INF_MODE_ON:
            for attn in _get_core_attentions(model):
                attn.set_inference_mode(True)
        elif cmd == _CMD_SET_INF_MODE_OFF:
            for attn in _get_core_attentions(model):
                attn.set_inference_mode(False)
        elif cmd == _CMD_SET_PARAMS:
            extra = torch.zeros(2, dtype=torch.long, device="cuda")
            torch.distributed.broadcast(extra, src=_TP_SRC_GLOBAL_RANK, group=_TP_GROUP)
            causal = bool(extra[0].item())
            cache_enabled = bool(extra[1].item())
            for attn in _get_core_attentions(model):
                attn.set_inference_params(causal, cache_enabled)
        elif cmd == _CMD_CLEAR_CACHE:
            for attn in _get_core_attentions(model):
                attn.clear_kv_cache()
```
Guard tp_follower_loop() before broadcasting on the wrong group.
If a follower enters this loop before set_tp_group(), every broadcast here runs with group=None, i.e. the default world group. In a TP job that can hang the whole process set instead of failing fast. Add an explicit setup check at the top and raise a descriptive RuntimeError (same for unknown command codes). Based on learnings, when a feature is not supported, raise an explicit error with a descriptive message that identifies the unsupported feature and the expected behavior.
Suggested fix

```diff
 def tp_follower_loop(model):
     """Blocking loop for TP-non-rank-0 processes.
@@
     Waits for commands from the TP-rank-0 process and mirrors all model
     operations (set_inference_mode, set_inference_params, clear_kv_cache,
     model forward) so Megatron TP communication stays in sync.
     """
+    if _TP_GROUP is None:
+        raise RuntimeError("set_tp_group() must be called before tp_follower_loop().")
+
     while True:
         cmd = torch.zeros(1, dtype=torch.long, device="cuda")
         torch.distributed.broadcast(cmd, src=_TP_SRC_GLOBAL_RANK, group=_TP_GROUP)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def tp_follower_loop(model):
    """Blocking loop for TP-non-rank-0 processes.

    Waits for commands from the TP-rank-0 process and mirrors all model
    operations (set_inference_mode, set_inference_params, clear_kv_cache,
    model forward) so Megatron TP communication stays in sync.
    """
    if _TP_GROUP is None:
        raise RuntimeError("set_tp_group() must be called before tp_follower_loop().")

    while True:
        cmd = torch.zeros(1, dtype=torch.long, device="cuda")
        torch.distributed.broadcast(cmd, src=_TP_SRC_GLOBAL_RANK, group=_TP_GROUP)
        cmd = cmd.item()
        if cmd == _CMD_STOP:
            break
        elif cmd == _CMD_FORWARD:
            input_ids = _broadcast_tensor(
                torch.zeros(1, 1, dtype=torch.long, device="cuda"),
                _TP_SRC_GLOBAL_RANK,
                _TP_GROUP,
            )
            seq_len = input_ids.shape[1]
            position_ids = torch.arange(seq_len, device=input_ids.device).unsqueeze(0).expand_as(input_ids)
            model(input_ids=input_ids, position_ids=position_ids, attention_mask=None)
        elif cmd == _CMD_SET_INF_MODE_ON:
            for attn in _get_core_attentions(model):
                attn.set_inference_mode(True)
        elif cmd == _CMD_SET_INF_MODE_OFF:
            for attn in _get_core_attentions(model):
                attn.set_inference_mode(False)
        elif cmd == _CMD_SET_PARAMS:
            extra = torch.zeros(2, dtype=torch.long, device="cuda")
            torch.distributed.broadcast(extra, src=_TP_SRC_GLOBAL_RANK, group=_TP_GROUP)
            causal = bool(extra[0].item())
            cache_enabled = bool(extra[1].item())
            for attn in _get_core_attentions(model):
                attn.set_inference_params(causal, cache_enabled)
        elif cmd == _CMD_CLEAR_CACHE:
            for attn in _get_core_attentions(model):
                attn.clear_kv_cache()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@src/megatron/bridge/diffusion/models/nemotron_diffusion/inference_nemotron_diffusion.py`
around lines 472 - 510, Add a guard at the start of tp_follower_loop to verify
TP communication was initialized (e.g., that _TP_GROUP and _TP_SRC_GLOBAL_RANK
have been set by set_tp_group or equivalent) and raise a descriptive
RuntimeError if not so broadcasts won't run on the default world group; also add
a final else branch for unknown cmd values that raises a descriptive
RuntimeError identifying the unsupported command code and expected TP command
set. Locate these changes in tp_follower_loop and the command-handling logic
where _TP_GROUP/_TP_SRC_GLOBAL_RANK are referenced and where cmd is matched.
src/megatron/bridge/diffusion/models/nemotron_diffusion/nemotron_diffusion_provider.py (outdated; comment resolved)
```python
    model_provider: NemotronDiffusionModelProvider | None = None,
    hf_path: str | None = None,
```
🧩 Analysis chain
🏁 Script executed:

```shell
# First, check if the file exists and read the relevant sections
head -60 src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py | cat -n
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 2652

🏁 Script executed:

```shell
# Also check the "Also applies to" section
sed -n '130,150p' src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py | cat -n
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1119

🏁 Script executed:

```shell
# Check the imports at the top of the file
head -20 src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py | cat -n
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 942

🏁 Script executed:

```shell
# Search for how model_provider is used in the file
rg "model_provider" src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py -A 2 -B 1
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 642

🏁 Script executed:

```shell
# Get full context of the function to see all usages of model_provider
sed -n '50,200p' src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py | cat -n
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 8084

🏁 Script executed:

```shell
# Check if there are any other usages of model_provider() calls
rg "model_provider\(\)" src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 107

🏁 Script executed:

```shell
# Check the full function signature and docstring
sed -n '51,150p' src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py | cat -n
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 5828

🏁 Script executed:

```shell
# Search for other function definitions with model_provider parameter
rg "def.*model_provider" src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 53

🏁 Script executed:

```shell
# Check for any other parameter definitions with model_provider in the file
rg "model_provider.*:" src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py -B 3
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 355

🏁 Script executed:

```shell
# Get full file line count to understand scope
wc -l src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 144

🏁 Script executed:

```shell
# Check what's at lines 134-142 in the actual file
sed -n '130,145p' src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py | cat -n
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 850

🏁 Script executed:

```shell
# Search for any other model_provider() calls in the entire file
rg "model_provider\(\)" src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py -B 5 -A 1
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 274
Change model_provider parameter type to Callable and add explicit None check.
The parameter is annotated as NemotronDiffusionModelProvider | None, but the code calls model_provider() at line 92. This mismatch causes TypeError if callers follow the signature and pass an instance. Additionally, there is no None check before the call; both hf_path and model_provider can be None simultaneously, leading to a cryptic error. Add explicit validation to fail fast with a descriptive error message.
Suggested fix

```diff
 import os
-from typing import Optional
+from typing import Callable, Optional
@@
 def _nemotron_diffusion3_common(
-    model_provider: NemotronDiffusionModelProvider | None = None,
+    model_provider: Callable[[], NemotronDiffusionModelProvider] | None = None,
@@
     if hf_path is not None:
         bridge = AutoBridge.from_hf_pretrained(hf_path)
         model_cfg = bridge.to_megatron_provider(
             load_weights=bool(pretrained_checkpoint),
             hf_path=pretrained_checkpoint,
         )
     else:
+        if model_provider is None:
+            raise ValueError("Either `hf_path` or `model_provider` must be provided.")
         model_cfg = model_provider()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py` around
lines 52 - 53, The function signature currently types model_provider as
NemotronDiffusionModelProvider | None but the code calls model_provider(), so
change the parameter type to a Callable (e.g., Callable[...,
NemotronDiffusionModelProvider] | None) and add an explicit validation at the
start of the function that ensures exactly one of hf_path or model_provider is
provided; if both are None (or both provided) raise a clear ValueError
describing that hf_path or a model_provider callable is required. Update any
references to model_provider usage (the place where model_provider() is invoked)
to rely on the validated callable and keep NemotronDiffusionModelProvider as the
returned/expected type.
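The exactly-one-of validation described above can be sketched with stand-in return values in place of the real Megatron config objects:

```python
from typing import Callable, Optional

# Sketch only: exactly one of hf_path / model_provider must be given.
# The tuple return is a placeholder for the AutoBridge conversion path.
def resolve_model_cfg(
    hf_path: Optional[str],
    model_provider: Optional[Callable[[], object]],
):
    if (hf_path is None) == (model_provider is None):
        raise ValueError("Provide exactly one of `hf_path` or `model_provider`.")
    if hf_path is not None:
        return ("bridge", hf_path)  # placeholder for AutoBridge-derived config
    return model_provider()
```

Typing the parameter as a zero-argument `Callable` matches how the function actually uses it (`model_provider()`), and the equality check rejects both the both-`None` and both-provided cases with one clear message.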
```python
    opt_cfg, scheduler_cfg = distributed_fused_adam_with_cosine_annealing_dllm(
        lr_warmup_iters=lr_warmup_iters,
        lr_decay_iters=lr_decay_iters,
        max_lr=lr,
        min_lr=min_lr,
    )
```
The optimizer precision is disconnected from precision_config.
_nemotron_diffusion3_common() accepts precision_config, but this helper call never forwards it, so distributed_fused_adam_with_cosine_annealing_dllm() always falls back to its BF16 default. That means non-BF16 recipes will still set OptimizerConfig.bf16/fp16 incorrectly, and the helper’s hyphenated literals do not match the repo’s "bf16_mixed" spelling used elsewhere in this file. Please thread the selected precision through and normalize the string before deriving the optimizer flags.
Also applies to: 233-281
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py` around
lines 159 - 164, The helper call to
distributed_fused_adam_with_cosine_annealing_dllm inside
_nemotron_diffusion3_common is not receiving the selected precision, so the
optimizer falls back to its BF16 default and produces wrong
OptimizerConfig.bf16/fp16 flags; update the call sites (including the similar
block around lines 233-281) to accept the precision_config parameter from
_nemotron_diffusion3_common, normalize its string to the repo convention (e.g.,
map variants to "bf16_mixed" or "fp16_mixed" as appropriate), and pass that
normalized precision into distributed_fused_adam_with_cosine_annealing_dllm so
it can derive the correct optimizer flags (bf16/fp16) rather than defaulting to
BF16. Ensure the normalization logic is used consistently wherever optimizer
precision is determined.
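The normalization step suggested above can be sketched as a small helper; the recognized spellings here are assumptions for illustration, not the repo's canonical list:

```python
# Normalize a precision string (e.g. "bf16-mixed", "BF16_mixed") to the
# underscore spelling before deriving OptimizerConfig-style flags.
def optimizer_precision_flags(precision: str) -> dict:
    normalized = precision.strip().lower().replace("-", "_")
    return {
        "bf16": normalized == "bf16_mixed",
        "fp16": normalized == "fp16_mixed",
    }
```

Threading the selected precision through to the optimizer factory and normalizing it once means both the hyphenated and underscore spellings resolve to the same flags, instead of silently defaulting to BF16.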
cuichenx left a comment
Couple of overall comments
- please run lint on all files
- let's use copyright 2026 throughout the PR
- might be worth it to add a functional test in addition to the unit tests
Please also see comments below
```python
    HAS_MISTRAL3 = False


@MegatronModelBridge.register_bridge(source=Mistral3ForConditionalGeneration, target=GPTModel)
```
will there be any issues in non-diffusion ministral 3 workloads if this bridge is imported instead?
Thanks for catching this. I think I need to ask for the name of the model in HF.
```python
    Select model size with --model-size {3b,8b,14b} (default: 14b).
    Use --hf-path to override the HuggingFace model ID or local model path.
    Configuration is overridden via YAML and CLI in the same way as pretrain_ministral3_14b.py.
    """
```
add commands for example usage here, similar to other examples
we generally don't want to have these yaml config files in the repo. Can we include all the hyperparameters in the recipe function? Then, if anything else is needed, use a hydra override in the example command.
sure, makes sense. I'll remove them. Btw, other diffusion models like wan and flux seem to have their yaml files in the repo: https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/diffusion/recipes/wan/conf
/claude review
src/megatron/bridge/diffusion/recipes/nemotron_diffusion/ar_to_dlm.py (outdated; comment resolved)
src/megatron/bridge/diffusion/recipes/nemotron_diffusion/continuous_pretraining.py (outdated; comment resolved)
```python
    # --- Denoise the block iteratively ---
    for i in range(steps_per_block):
        step_model = _step_models[i] if _step_models is not None else model
        print(f"Denoising step {i} of {steps_per_block} using {type(step_model).__name__}")
```
This looks like a debug print that was left in. It will fire on every denoising step for every block during generation, which is quite noisy. Consider removing it or changing to logger.debug(...).
```python
        print(f"Denoising step {i} of {steps_per_block} using {type(step_model).__name__}")
```
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
- Add TODO comment to check for dedicated HF class for NemotronDiffusion bridge - Remove unused divide_by_masked_tokens parameter from dgpt_step and provider - Fix AR loss validation: extend NaN/Inf/spiky checks to cover ar_loss, not just dlm_loss - Fix RoPE init: always initialize rope_embedding_module unconditionally; guard Llama-4 query scaling with `if self.beta is not None` in both forward paths - Fix VP parallelism bug in provider: restore transformer_layer_spec callable after each provide() call so subsequent vp_stage invocations resolve correctly - Remove model_schedule from generate_dllm (unused, not wired up in CLI) - Add usage examples to ar_to_dlm.py and continuous_pretraining.py docstrings Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> Signed-off-by: Sajad Norouzi <sajad.n@gmail.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> Signed-off-by: Sajad Norouzi <sajad.n@gmail.com>
Scripts referenced in README but missing from the PR. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
Ministral3Model wraps GPTModel under .language_model, so _unwrap() needs to unwrap it to reach .decoder.layers for NemotronDiffusionAttention access. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sajad Norouzi <snorouzi@nvidia.com>
What does this PR do ?
Adds support for the nemotron-diffusion model to megatron-bridge.
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information
Summary by CodeRabbit
Release Notes
New Features
Documentation
Tests