
cp: [ci] feat: fix Qwen3 PEFT verify test and add missing launch scripts (3070) into r0.4.0#3150

Merged
cuichenx merged 1 commit into r0.4.0 from cherry-pick-3070-r0.4.0 on Apr 4, 2026

Conversation

@svcnvidia-nemo-ci
Contributor

@svcnvidia-nemo-ci svcnvidia-nemo-ci commented Apr 3, 2026

beep boop [🤖]: Hi @cuichenx 👋,

we've cherry-picked #3070 into r0.4.0 for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for GLM-4.5V vision-language model training and conversion
    • Enhanced vision-language model configuration with improved token handling
  • Bug Fixes

    • Improved data collation padding consistency across different model types
    • Fixed distributed test isolation to prevent cross-test state leakage
  • Tests

    • Expanded functional test coverage for vision-language models and recipe training
    • Enhanced test reliability and error recovery mechanisms

…3070)

Signed-off-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@svcnvidia-nemo-ci
Contributor Author

/ok to test 1be2696

@copy-pr-bot

copy-pr-bot bot commented Apr 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Apr 3, 2026

📝 Walkthrough


The PR adds support for GLM-4.5V vision-language model processing by implementing specialized collation functions, extending model signatures to handle multi-modal token type IDs, refining QKV weight-splitting logic with feature dimension handling, and introducing functional test infrastructure with proper test isolation.

Changes

• Data Collation Functions — src/megatron/bridge/data/vlm_datasets/collate.py
  Added a new glm4v_collate_fn with position-ID synthesis, loss-mask construction with -100 masking, and multi-modal token type ID wrapping. Updated nemotron_nano_v2_vl_collate_fn to enforce padding and handle pad-token setup. Modified default_collate_fn to conditionally apply apply_chat_template based on padding capability. Registered the Glm4vProcessor mapping.

• Model Conversion & QKV Splitting — src/megatron/bridge/models/conversion/param_mapping.py, src/megatron/bridge/models/conversion/peft_bridge.py
  Extended split_qkv_weights with an optional feature_dim parameter to correctly reshape interleaved QKV tensors. Updated the PEFT bridge to compute and pass feature_dim from linear weight dimensions, disambiguating the LoRA rank from the FP8-compressed hidden size.

• GLM-4.5V Model — src/megatron/bridge/models/glm_vl/modeling_glm_45v.py
  Added an mm_token_type_ids parameter to the forward signature. Bound the HuggingFace get_vision_position_ids method to the instance. Changed RoPE position computation to pad mm_token_type_ids, construct a binary attention mask from input_ids, and pass both to get_rope_index.

• Visual Input Handling — src/megatron/bridge/training/utils/visual_inputs.py
  Added an optional mm_token_type_ids field to the GenericVisualInputs dataclass, automatically propagated through as_model_kwargs().

• Test Launch Scripts — tests/functional_tests/launch_scripts/active/L0_Launch_models_glm_vl.sh, L0_Launch_recipes_glm_vl.sh, L0_Launch_recipes_deepseek.sh, L0_Launch_recipes_gemma.sh, L0_Launch_recipes_perf_config.sh
  Added new executable Bash scripts configuring multi-GPU distributed test execution with coverage tracking, logging, and early-failure stops. The scripts set CUDA_VISIBLE_DEVICES and invoke torch.distributed.run or direct pytest with standardized pytest options.

• Test Utilities & Qwen Updates — tests/functional_tests/launch_scripts/active/L0_Launch_models_qwen3.sh, tests/functional_tests/test_groups/models/qwen/test_qwen3_peft_verify.py
  Extended the Qwen3 launch script to include additional PEFT export/verify test modules. Added a _find_free_port() helper and an autouse _cleanup_distributed() fixture to ensure isolated distributed state per test, with unique MASTER_PORT assignment.

• Test Infrastructure & Utilities — tests/functional_tests/test_groups/models/glm_vl/test_glm_45v_conversion.py, tests/functional_tests/test_groups/recipes/test_glm_45v_recipes_finetune.py, tests/functional_tests/test_groups/recipes/utils.py
  Added model-directory creation in the GLM-4.5V conversion fixture. Introduced a _skip_if_model_unavailable() helper for conditional test skipping when the HuggingFace model is unavailable. Refactored the VLM recipe test utility to disable checkpoint save/load by default and verify checkpoints only when enabled.
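The data-collation changes above revolve around padding variable-length sequences and masking the padded positions with -100 so they are excluded from the loss. A minimal, library-free sketch of that pattern (hypothetical helper names; not the bridge's actual glm4v_collate_fn, which works on tokenizer output):

```python
# Hypothetical sketch of the pad-then-mask pattern described above.
IGNORE_INDEX = -100  # positions carrying this label are excluded from the loss


def pad_and_mask(sequences: list[list[int]], pad_id: int) -> tuple[list[list[int]], list[list[int]]]:
    """Right-pad variable-length token sequences to a common length and build
    matching labels, marking every pad position with IGNORE_INDEX."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, labels = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        labels.append(seq + [IGNORE_INDEX] * n_pad)
    return input_ids, labels


ids, labels = pad_and_mask([[5, 6, 7], [8, 9]], pad_id=0)
```

The real collate functions additionally wrap visual tensors and mm_token_type_ids, but the loss-mask mechanics are the same.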

Sequence Diagram

```mermaid
sequenceDiagram
    participant DataLoader
    participant CollateFunc as Collate Function<br/>(glm4v_collate_fn)
    participant VisualInputs as GenericVisualInputs
    participant Model as GLM45VModel.forward
    participant RopeIndex as RoPE Position<br/>Computation

    DataLoader->>CollateFunc: examples list
    activate CollateFunc
    CollateFunc->>CollateFunc: apply_chat_template<br/>with padding
    CollateFunc->>CollateFunc: synthesize position_ids
    CollateFunc->>CollateFunc: build loss_mask<br/>with -100 masking
    CollateFunc->>VisualInputs: wrap visual tensors<br/>+ mm_token_type_ids
    CollateFunc-->>DataLoader: batch dict
    deactivate CollateFunc

    DataLoader->>Model: input_ids, mm_token_type_ids,<br/>visual_inputs
    activate Model
    Model->>Model: pad mm_token_type_ids<br/>if needed
    Model->>Model: construct hf_attention_mask<br/>from input_ids
    Model->>RopeIndex: hf_attention_mask,<br/>mm_token_type_ids
    activate RopeIndex
    RopeIndex-->>Model: position_ids
    deactivate RopeIndex
    Model-->>DataLoader: model output
    deactivate Model
```
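The test-isolation change described above assigns a unique MASTER_PORT per test via a _find_free_port() helper. A common way to implement such a helper (an assumption about its shape, not the repository's code) is to bind to port 0 and let the OS pick an unused ephemeral port:

```python
import socket

# Assumed shape of the _find_free_port() helper mentioned above: bind to
# port 0 so the OS chooses an unused ephemeral port, then return its number.
def find_free_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


port = find_free_port()  # e.g. os.environ["MASTER_PORT"] = str(port)
```

Because the socket is closed before the port is used, a race is theoretically possible, but in practice this is the standard trick for giving each distributed test its own rendezvous port.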

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

feature, needs-review

Suggested reviewers

  • cuichenx
  • ko3n1g
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

• Test Results For Major Changes — ⚠️ Warning
  Explanation: The PR contains major changes to model signatures, collate functions, and parameter-conversion logic, but the PR description lacks any documented test results, regression verification, or performance benchmarks.
  Resolution: Add test results to the PR description confirming that existing tests pass, functional test outcomes for the GLM-4.5V collate-function changes, and numerical-accuracy validation for the parameter-mapping modifications.
✅ Passed checks (3 passed)
• Description Check — ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
• Title Check — ✅ Passed. The PR title accurately describes the main change: fixing the Qwen3 PEFT verify test and adding missing launch scripts, with a cherry-pick reference to PR #3070 into the r0.4.0 branch.
• Docstring Coverage — ✅ Passed. Docstring coverage is 95.00%, above the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/megatron/bridge/models/conversion/param_mapping.py (1)

2558-2595: ⚠️ Potential issue | 🟡 Minor

Add unit test coverage for the new feature_dim parameter path.

The new conditional branch (line 2592) handling explicit feature_dim values is already used in production (peft_bridge.py:312 for LoRA tensors) but lacks explicit test coverage. The existing test only covers the default feature_dim=None case. Add a unit test for split_qkv_weights(..., feature_dim=<rank>) with a 2D LoRA tensor to ensure the new reshape semantics work correctly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/conversion/param_mapping.py` around lines 2558 -
2595, Add a unit test that calls split_qkv_weights(provider, qkv,
feature_dim=<rank>) with a 2D LoRA-style tensor to exercise the feature_dim
branch: construct a TransformerConfig-like provider (set num_attention_heads,
num_query_groups, kv_channels or hidden_size, and attention_output_gate as
needed), compute qkv_total_dim and head_size per the same formula, create a 2D
qkv tensor of shape (qkv_total_dim * head_size, feature_dim) with known values,
call split_qkv_weights(..., feature_dim=feature_dim) and assert the returned Q,
K, V tensors have the expected shapes (qkv_total_dim/head splits × head_size ×
feature_dim or equivalent per attention_output_gate) and that their contents
match the corresponding slices from the original tensor to verify reshape/split
semantics.
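To make the suggested test concrete, here is a small pure-Python illustration of splitting an interleaved QKV layout by query group. The names and layout are assumptions for illustration only, not the signature of split_qkv_weights:

```python
# Pure-Python illustration (hypothetical names, not the bridge's API) of an
# interleaved QKV layout: per query group the rows are
# [q_head_0 .. q_head_{h-1}, k_head, v_head], each head_size rows tall.
# Each "row" may itself be a feature_dim-wide vector (e.g. a LoRA rank),
# which rides along unchanged -- the point of an explicit feature_dim.
def split_qkv(qkv_rows, num_query_groups, heads_per_group, head_size):
    group_rows = (heads_per_group + 2) * head_size
    q, k, v = [], [], []
    for g in range(num_query_groups):
        base = g * group_rows
        q += qkv_rows[base : base + heads_per_group * head_size]
        k += qkv_rows[base + heads_per_group * head_size : base + (heads_per_group + 1) * head_size]
        v += qkv_rows[base + (heads_per_group + 1) * head_size : base + group_rows]
    return q, k, v


# With 2 groups, 2 query heads per group, head_size 1, rows labeled 0..7:
q, k, v = split_qkv(list(range(8)), num_query_groups=2, heads_per_group=2, head_size=1)
```

A unit test for the real function would build a 2D tensor of shape (qkv_total_dim * head_size, feature_dim) with known values and assert the split slices match, as the comment above describes.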
🧹 Nitpick comments (5)
tests/functional_tests/test_groups/recipes/test_glm_45v_recipes_finetune.py (1)

25-33: Consider returning the config to avoid duplicate instantiation.

The helper returns config_func() but the callers on lines 83 and 94 discard the return value. This means config_func() is called twice: once here for the availability check and again inside run_pretrain_vl_recipe_test. For expensive config functions (e.g., those that download HF configs), this doubles the overhead.

♻️ Proposed optimization

Either modify the helper to not return the value (making the intent clearer):

```diff
 def _skip_if_model_unavailable(config_func):
     """Call config_func and skip the test if the HF model config is not available (e.g. offline CI)."""
     try:
-        return config_func()
+        config_func()
     except (ValueError, OSError) as e:
```

Or, if you want to avoid double instantiation, modify run_pretrain_vl_recipe_test to accept an optional pre-built config.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/test_groups/recipes/test_glm_45v_recipes_finetune.py`
around lines 25 - 33, The helper _skip_if_model_unavailable currently calls and
returns config_func() but callers ignore the return causing config_func() to run
twice; change the flow so the validated config is reused: make
_skip_if_model_unavailable return the built config (the config object from
config_func), then update the callers that currently call
_skip_if_model_unavailable(...) and later call run_pretrain_vl_recipe_test(...)
to capture the returned config and pass it into run_pretrain_vl_recipe_test via
a new optional parameter (e.g., prebuilt_config) so run_pretrain_vl_recipe_test
uses the provided config instead of reconstructing it; adjust
run_pretrain_vl_recipe_test signature to accept and prefer a prebuilt_config
when present and update its callers accordingly.
tests/functional_tests/test_groups/recipes/utils.py (1)

243-244: Checkpoint verification is unconditionally disabled for VL recipe tests.

Setting checkpoint.save and checkpoint.load to None means run_pretrain_vl_recipe_test will never verify checkpoint creation (the conditional on line 294 will always skip). This differs from run_pretrain_recipe_test which still verifies checkpoints.

If this is intentional to simplify VL tests (e.g., to avoid checkpoint overhead or because mock datasets don't need checkpoint validation), consider adding a comment explaining the rationale. Otherwise, this reduces test coverage compared to non-VL recipe tests.

📝 Suggested documentation
```diff
         # Set up output directories after instantiation
         run_output_dir = shared_base_dir / f"{recipe_name}_functional_test"
         tensorboard_dir = run_output_dir / "tb_logs"
+        # Disable checkpointing for VL recipe tests to reduce overhead;
+        # these tests focus on forward pass correctness with mock data.
         config.checkpoint.save = None
         config.checkpoint.load = None
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/test_groups/recipes/utils.py` around lines 243 - 244,
The VL recipe test currently unconditionally disables checkpoint verification by
setting config.checkpoint.save and config.checkpoint.load to None, which
prevents run_pretrain_vl_recipe_test from ever checking checkpoints (unlike
run_pretrain_recipe_test); either restore checkpoint behavior by removing or not
overriding those assignments so the existing conditional in
run_pretrain_vl_recipe_test can execute, or if disabling is intentional add a
clear comment above the assignments explaining why checkpoint verification is
skipped for VL tests (referencing config.checkpoint.save,
config.checkpoint.load, and run_pretrain_vl_recipe_test in the comment).
src/megatron/bridge/data/vlm_datasets/collate.py (2)

474-492: Consider adding pad_token fallback for consistency.

The nemotron_nano_v2_vl_collate_fn (lines 318-320) and default_collate_fn (lines 531-534) both set pad_token = eos_token as a fallback. For consistency and robustness, consider adding the same safeguard here before calling apply_chat_template with padding=True.

♻️ Proposed fix
```diff
 def glm4v_collate_fn(examples: list, processor) -> dict[str, torch.Tensor]:
     """Collate function for GLM-4.5V model.
     ...
     """
     skipped_tokens = extract_skipped_token_ids(processor)

+    # Ensure a pad_token is set so padding can produce uniform-length tensors.
+    if processor.tokenizer.pad_token is None:
+        processor.tokenizer.pad_token = processor.tokenizer.eos_token
+
     batch = processor.apply_chat_template(
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/data/vlm_datasets/collate.py` around lines 474 - 492, The
glm4v_collate_fn should mirror the other collate functions by ensuring the
tokenizer has a pad_token fallback before calling processor.apply_chat_template;
check processor.tokenizer.pad_token (or processor.tokenizer.pad_token_id) and if
missing set it to processor.tokenizer.eos_token (or eos_token_id) so
padding=True won’t fail or produce inconsistent behavior—update glm4v_collate_fn
to perform this pad_token = eos_token fallback prior to the apply_chat_template
call (similar to nemotron_nano_v2_vl_collate_fn and default_collate_fn).

505-508: Consider adding strict=True to zip().

Per Ruff B905, adding strict=True ensures examples and batch["input_ids"] have matching lengths, catching potential mismatches early.

♻️ Proposed fix
```diff
     loss_masks = [
         create_multiturn_loss_mask_by_search(example, input_ids, processor, skipped_tokens)
-        for example, input_ids in zip(examples, batch["input_ids"])
+        for example, input_ids in zip(examples, batch["input_ids"], strict=True)
     ]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/data/vlm_datasets/collate.py` around lines 505 - 508, The
list comprehension that builds loss_masks uses zip(examples, batch["input_ids"])
which can silently drop items if lengths differ; update that zip call to
zip(examples, batch["input_ids"], strict=True) so mismatched lengths raise an
error early; change the zip used in the loss_masks comprehension that calls
create_multiturn_loss_mask_by_search(example, input_ids, processor,
skipped_tokens) accordingly.
src/megatron/bridge/models/glm_vl/modeling_glm_45v.py (1)

236-236: Prefix unused variable with underscore.

rope_deltas is unpacked but never used. Prefix with _ to indicate intentional discard.

♻️ Proposed fix
```diff
-        position_ids, rope_deltas = self.get_rope_index(
+        position_ids, _rope_deltas = self.get_rope_index(
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/glm_vl/modeling_glm_45v.py` at line 236, The
unpacked variable rope_deltas from the call to self.get_rope_index (the line
"position_ids, rope_deltas = self.get_rope_index(...)") is unused; rename it to
a throwaway name (e.g., _rope_deltas or _) to signal intentional discard and
silence linter warnings. Update that assignment in the same scope (where
position_ids is used) so only position_ids remains used while the second return
value is prefixed with an underscore.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 62f07f6e-3701-4ecc-bdf5-3af5c4e87fd9

📥 Commits

Reviewing files that changed from the base of the PR and between bde8388 and 1be2696.

📒 Files selected for processing (15)
  • src/megatron/bridge/data/vlm_datasets/collate.py
  • src/megatron/bridge/models/conversion/param_mapping.py
  • src/megatron/bridge/models/conversion/peft_bridge.py
  • src/megatron/bridge/models/glm_vl/modeling_glm_45v.py
  • src/megatron/bridge/training/utils/visual_inputs.py
  • tests/functional_tests/launch_scripts/active/L0_Launch_models_glm_vl.sh
  • tests/functional_tests/launch_scripts/active/L0_Launch_models_qwen3.sh
  • tests/functional_tests/launch_scripts/active/L0_Launch_recipes_deepseek.sh
  • tests/functional_tests/launch_scripts/active/L0_Launch_recipes_gemma.sh
  • tests/functional_tests/launch_scripts/active/L0_Launch_recipes_glm_vl.sh
  • tests/functional_tests/launch_scripts/active/L0_Launch_recipes_perf_config.sh
  • tests/functional_tests/test_groups/models/glm_vl/test_glm_45v_conversion.py
  • tests/functional_tests/test_groups/models/qwen/test_qwen3_peft_verify.py
  • tests/functional_tests/test_groups/recipes/test_glm_45v_recipes_finetune.py
  • tests/functional_tests/test_groups/recipes/utils.py

Comment on lines +1 to +16
```shell
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#!/bin/bash
set -xeuo pipefail # Exit immediately if a command exits with a non-zero status
```
Contributor


⚠️ Potential issue | 🟠 Major

Shebang must be on the first line for the script to execute correctly.

The copyright header precedes the shebang (#!/bin/bash), which can cause the shell to misinterpret the script. The shebang must be on line 1 for proper interpreter selection.

🐛 Proposed fix
```diff
+#!/bin/bash
 # Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
@@ -12,7 +13,6 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-#!/bin/bash
 set -xeuo pipefail # Exit immediately if a command exits with a non-zero status
```
🧰 Tools
🪛 Shellcheck (0.11.0)

[error] 15-15: The shebang must be on the first line. Delete blanks and move comments.

(SC1128)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/launch_scripts/active/L0_Launch_models_glm_vl.sh`
around lines 1 - 16, Move the shebang to the very first line of the script so
the kernel picks the correct interpreter: place "#!/bin/bash" as line 1, then
follow with the copyright header and the existing "set -xeuo pipefail" line;
ensure there are no blank lines or characters before the shebang and keep the
rest of the script unchanged.

Comment on lines +1 to +15
```shell
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#!/bin/bash
```
Contributor


⚠️ Potential issue | 🟡 Minor

Move shebang to first line.

The shebang (#!/bin/bash) must be on the first line of the script for the shell to recognize the interpreter. Currently it's on line 15 after the copyright header. This will cause execution failures on systems that rely on the shebang for interpreter selection.

Compare with the other launch scripts in this PR (e.g., L0_Launch_recipes_gemma.sh) which correctly place the shebang first.

🔧 Proposed fix
```diff
+#!/bin/bash
 # Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
@@ ... @@
 # limitations under the License.

-#!/bin/bash
 set -xeuo pipefail # Exit immediately if a command exits with a non-zero status
```
🧰 Tools
🪛 Shellcheck (0.11.0)

[error] 15-15: The shebang must be on the first line. Delete blanks and move comments.

(SC1128)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/functional_tests/launch_scripts/active/L0_Launch_recipes_perf_config.sh`
around lines 1 - 15, The shebang is currently at the end of the header; move the
interpreter line (#!/bin/bash) to be the very first line of the script (no
preceding blank lines or comments) so the system recognizes the shell, then keep
the copyright header below it; update the top of
L0_Launch_recipes_perf_config.sh so the shebang is the first line and the rest
of the file content (copyright block and script body) follows.

Comment on lines +180 to +181
```python
env = {**os.environ, "MASTER_PORT": str(_find_free_port())}
result = subprocess.run(cmd, capture_output=True, text=True, cwd=PROJECT_ROOT, env=env)
```
Contributor


⚠️ Potential issue | 🟠 Major

Add timeouts to all subprocess invocations to prevent CI hangs.

Lines 181, 205, and 244 call subprocess.run(...) without timeout, so a stalled child process can block the job indefinitely.

⏱️ Proposed fix
```diff
+SUBPROCESS_TIMEOUT_SECONDS = 900
+
@@
-        result = subprocess.run(cmd, capture_output=True, text=True, cwd=PROJECT_ROOT, env=env)
+        result = subprocess.run(
+            cmd, capture_output=True, text=True, cwd=PROJECT_ROOT, env=env, timeout=SUBPROCESS_TIMEOUT_SECONDS
+        )
@@
-        result = subprocess.run(cmd, capture_output=True, text=True, cwd=PROJECT_ROOT, env=env)
+        result = subprocess.run(
+            cmd, capture_output=True, text=True, cwd=PROJECT_ROOT, env=env, timeout=SUBPROCESS_TIMEOUT_SECONDS
+        )
@@
-        result = subprocess.run(cmd, capture_output=True, text=True, cwd=PROJECT_ROOT, env=env)
+        result = subprocess.run(
+            cmd, capture_output=True, text=True, cwd=PROJECT_ROOT, env=env, timeout=SUBPROCESS_TIMEOUT_SECONDS
+        )
```

Also applies to: 204-205, 243-244

🧰 Tools
🪛 Ruff (0.15.9)

[error] 181-181: subprocess call: check for execution of untrusted input

(S603)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/test_groups/models/qwen/test_qwen3_peft_verify.py`
around lines 180 - 181, The subprocess.run invocations that create `result =
subprocess.run(cmd, capture_output=True, text=True, cwd=PROJECT_ROOT, env=env)`
(used at the three places invoking `subprocess.run` with `cmd`, `env`, and
`PROJECT_ROOT`) need a timeout to avoid CI hangs: add a `timeout=` argument
(e.g., 300 seconds) to each call and wrap the call in a try/except for
`subprocess.TimeoutExpired` to fail the test (or raise) with a clear message;
apply this change to all three occurrences where `subprocess.run` is used in
this file.
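A hedged sketch of the timeout-and-fail-loudly pattern recommended above (the 300-second default and the echo command are placeholders, not the test's real invocation):

```python
import subprocess

# Sketch of the pattern suggested in the review: bound every child process
# with a timeout so a stalled subprocess cannot hang CI indefinitely.
def run_with_timeout(cmd: list[str], timeout: float = 300.0) -> subprocess.CompletedProcess:
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired as exc:
        # Surface a clear failure message instead of blocking the job.
        raise RuntimeError(f"command timed out after {timeout}s: {exc.cmd}") from exc


result = run_with_timeout(["echo", "ok"])
```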

@cuichenx cuichenx merged commit 77b77db into r0.4.0 Apr 4, 2026
51 of 52 checks passed
@cuichenx cuichenx deleted the cherry-pick-3070-r0.4.0 branch April 4, 2026 01:36