Conversation

…3070)
Signed-off-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>

/ok to test 1be2696
📝 Walkthrough

The PR adds support for GLM-4.5V vision-language model processing by implementing specialized collation functions, extending model signatures to handle multi-modal token type IDs, refining QKV weight-splitting logic with feature-dimension handling, and introducing functional test infrastructure with proper test isolation.
Sequence Diagram

```mermaid
sequenceDiagram
    participant DataLoader
    participant CollateFunc as Collate Function<br/>(glm4v_collate_fn)
    participant VisualInputs as GenericVisualInputs
    participant Model as GLM45VModel.forward
    participant RopeIndex as RoPE Position<br/>Computation
    DataLoader->>CollateFunc: examples list
    activate CollateFunc
    CollateFunc->>CollateFunc: apply_chat_template<br/>with padding
    CollateFunc->>CollateFunc: synthesize position_ids
    CollateFunc->>CollateFunc: build loss_mask<br/>with -100 masking
    CollateFunc->>VisualInputs: wrap visual tensors<br/>+ mm_token_type_ids
    CollateFunc-->>DataLoader: batch dict
    deactivate CollateFunc
    DataLoader->>Model: input_ids, mm_token_type_ids,<br/>visual_inputs
    activate Model
    Model->>Model: pad mm_token_type_ids<br/>if needed
    Model->>Model: construct hf_attention_mask<br/>from input_ids
    Model->>RopeIndex: hf_attention_mask,<br/>mm_token_type_ids
    activate RopeIndex
    RopeIndex-->>Model: position_ids
    deactivate RopeIndex
    Model-->>DataLoader: model output
    deactivate Model
```
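The collate steps in the diagram above (pad, synthesize `position_ids`, mask padding out of the loss with `-100`) can be sketched with plain-Python stand-ins. `toy_collate` and its shapes are illustrative assumptions, not the actual `glm4v_collate_fn`:

```python
IGNORE_INDEX = -100  # label positions with this value are excluded from the loss


def toy_collate(examples, pad_id=0):
    """Illustrative stand-in for the collate flow: pad to the longest
    sequence, synthesize position ids, and build a padding loss mask."""
    max_len = max(len(e) for e in examples)
    input_ids = [e + [pad_id] * (max_len - len(e)) for e in examples]
    loss_mask = [[1] * len(e) + [0] * (max_len - len(e)) for e in examples]
    # One monotone position range per sequence.
    position_ids = [list(range(max_len)) for _ in examples]
    # Replace padded positions with IGNORE_INDEX so they carry no loss.
    labels = [[tok if m else IGNORE_INDEX for tok, m in zip(row, mask)]
              for row, mask in zip(input_ids, loss_mask)]
    return {"input_ids": input_ids, "position_ids": position_ids,
            "labels": labels, "loss_mask": loss_mask}


batch = toy_collate([[5, 6, 7], [8, 9]])
```

The real collate function additionally wraps visual tensors and `mm_token_type_ids`, which this sketch omits.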
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/megatron/bridge/models/conversion/param_mapping.py (1)
2558-2595: ⚠️ Potential issue | 🟡 Minor — Add unit test coverage for the new `feature_dim` parameter path.

The new conditional branch (line 2592) handling explicit `feature_dim` values is already used in production (`peft_bridge.py:312` for LoRA tensors) but lacks explicit test coverage. The existing test only covers the default `feature_dim=None` case. Add a unit test for `split_qkv_weights(..., feature_dim=<rank>)` with a 2D LoRA tensor to ensure the new reshape semantics work correctly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/conversion/param_mapping.py` around lines 2558 - 2595, Add a unit test that calls split_qkv_weights(provider, qkv, feature_dim=<rank>) with a 2D LoRA-style tensor to exercise the feature_dim branch: construct a TransformerConfig-like provider (set num_attention_heads, num_query_groups, kv_channels or hidden_size, and attention_output_gate as needed), compute qkv_total_dim and head_size per the same formula, create a 2D qkv tensor of shape (qkv_total_dim * head_size, feature_dim) with known values, call split_qkv_weights(..., feature_dim=feature_dim) and assert the returned Q, K, V tensors have the expected shapes (qkv_total_dim/head splits × head_size × feature_dim or equivalent per attention_output_gate) and that their contents match the corresponding slices from the original tensor to verify reshape/split semantics.
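The interleaved split such a test would exercise can be sketched in plain Python. The head counts, the per-group `[q…q, k, v]` slot layout, and all names below are illustrative assumptions, not the exact `param_mapping.py` code:

```python
num_heads, num_groups, head_size, feature_dim = 4, 2, 3, 5
q_per_group = num_heads // num_groups        # query heads per KV group
qkv_total_dim = num_heads + 2 * num_groups   # interleaved head slots

# Fused tensor as a nested list: (qkv_total_dim * head_size) rows,
# feature_dim columns (e.g. a 2D LoRA factor of rank feature_dim).
qkv = [[r * feature_dim + c for c in range(feature_dim)]
       for r in range(qkv_total_dim * head_size)]


def head(i):
    """Rows belonging to interleaved head slot i."""
    return qkv[i * head_size:(i + 1) * head_size]


# Walk the assumed per-group layout [q, ..., q, k, v] and split.
q, k, v = [], [], []
for g in range(num_groups):
    base = g * (q_per_group + 2)
    for j in range(q_per_group):
        q += head(base + j)
    k += head(base + q_per_group)
    v += head(base + q_per_group + 1)
```

A unit test along these lines would assert both the output shapes and that each split row matches its slice of the fused tensor.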
🧹 Nitpick comments (5)
tests/functional_tests/test_groups/recipes/test_glm_45v_recipes_finetune.py (1)
25-33: Consider returning the config to avoid duplicate instantiation.

The helper returns `config_func()`, but the callers on lines 83 and 94 discard the return value. This means `config_func()` is called twice: once here for the availability check and again inside `run_pretrain_vl_recipe_test`. For expensive config functions (e.g., those that download HF configs), this doubles the overhead.

♻️ Proposed optimization

Either modify the helper to not return the value (making the intent clearer):

```diff
 def _skip_if_model_unavailable(config_func):
     """Call config_func and skip the test if the HF model config is not available (e.g. offline CI)."""
     try:
-        return config_func()
+        config_func()
     except (ValueError, OSError) as e:
```

Or, if you want to avoid double instantiation, modify `run_pretrain_vl_recipe_test` to accept an optional pre-built config.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/test_groups/recipes/test_glm_45v_recipes_finetune.py` around lines 25 - 33, The helper _skip_if_model_unavailable currently calls and returns config_func() but callers ignore the return, causing config_func() to run twice; change the flow so the validated config is reused: make _skip_if_model_unavailable return the built config, then update the callers that currently call _skip_if_model_unavailable(...) and later call run_pretrain_vl_recipe_test(...) to capture the returned config and pass it in via a new optional parameter (e.g., prebuilt_config); adjust the run_pretrain_vl_recipe_test signature to accept and prefer a prebuilt_config when present and update its callers accordingly.

tests/functional_tests/test_groups/recipes/utils.py (1)
243-244: Checkpoint verification is unconditionally disabled for VL recipe tests.

Setting `checkpoint.save` and `checkpoint.load` to `None` means `run_pretrain_vl_recipe_test` will never verify checkpoint creation (the conditional on line 294 will always skip). This differs from `run_pretrain_recipe_test`, which still verifies checkpoints.

If this is intentional to simplify VL tests (e.g., to avoid checkpoint overhead or because mock datasets don't need checkpoint validation), consider adding a comment explaining the rationale. Otherwise, this reduces test coverage compared to non-VL recipe tests.

📝 Suggested documentation

```diff
 # Set up output directories after instantiation
 run_output_dir = shared_base_dir / f"{recipe_name}_functional_test"
 tensorboard_dir = run_output_dir / "tb_logs"
+# Disable checkpointing for VL recipe tests to reduce overhead;
+# these tests focus on forward pass correctness with mock data.
 config.checkpoint.save = None
 config.checkpoint.load = None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/test_groups/recipes/utils.py` around lines 243 - 244, The VL recipe test currently unconditionally disables checkpoint verification by setting config.checkpoint.save and config.checkpoint.load to None, which prevents run_pretrain_vl_recipe_test from ever checking checkpoints (unlike run_pretrain_recipe_test); either restore checkpoint behavior by removing those assignments so the existing conditional in run_pretrain_vl_recipe_test can execute, or, if disabling is intentional, add a clear comment above the assignments explaining why checkpoint verification is skipped for VL tests (referencing config.checkpoint.save, config.checkpoint.load, and run_pretrain_vl_recipe_test in the comment).

src/megatron/bridge/data/vlm_datasets/collate.py (2)
474-492: Consider adding a pad_token fallback for consistency.

Both `nemotron_nano_v2_vl_collate_fn` (lines 318-320) and `default_collate_fn` (lines 531-534) set `pad_token = eos_token` as a fallback. For consistency and robustness, consider adding the same safeguard here before calling `apply_chat_template` with `padding=True`.

♻️ Proposed fix

```diff
 def glm4v_collate_fn(examples: list, processor) -> dict[str, torch.Tensor]:
     """Collate function for GLM-4.5V model.
     ...
     """
     skipped_tokens = extract_skipped_token_ids(processor)
+    # Ensure a pad_token is set so padding can produce uniform-length tensors.
+    if processor.tokenizer.pad_token is None:
+        processor.tokenizer.pad_token = processor.tokenizer.eos_token
+
     batch = processor.apply_chat_template(
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/data/vlm_datasets/collate.py` around lines 474 - 492, The glm4v_collate_fn should mirror the other collate functions by ensuring the tokenizer has a pad_token fallback before calling processor.apply_chat_template; check processor.tokenizer.pad_token (or processor.tokenizer.pad_token_id) and if missing set it to processor.tokenizer.eos_token (or eos_token_id) so padding=True won’t fail or produce inconsistent behavior—update glm4v_collate_fn to perform this pad_token = eos_token fallback prior to the apply_chat_template call (similar to nemotron_nano_v2_vl_collate_fn and default_collate_fn).
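The fallback pattern the comment asks for can be sketched with a stand-in tokenizer object; `DummyTokenizer` and `ensure_pad_token` are hypothetical names used only for illustration:

```python
class DummyTokenizer:
    """Stand-in for a HF tokenizer whose checkpoint defines no pad token."""

    def __init__(self):
        self.eos_token = "</s>"
        self.pad_token = None  # common for decoder-only checkpoints


def ensure_pad_token(tokenizer):
    # Mirror the safeguard used by the other collate functions:
    # padding=True needs a pad token, so reuse EOS when none is set.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer


tok = ensure_pad_token(DummyTokenizer())
```

In the real collate function this check would run once, just before the `apply_chat_template(..., padding=True)` call.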
505-508: Consider adding `strict=True` to `zip()`.

Per Ruff B905, adding `strict=True` ensures `examples` and `batch["input_ids"]` have matching lengths, catching potential mismatches early.

♻️ Proposed fix

```diff
 loss_masks = [
     create_multiturn_loss_mask_by_search(example, input_ids, processor, skipped_tokens)
-    for example, input_ids in zip(examples, batch["input_ids"])
+    for example, input_ids in zip(examples, batch["input_ids"], strict=True)
 ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/data/vlm_datasets/collate.py` around lines 505 - 508, The list comprehension that builds loss_masks uses zip(examples, batch["input_ids"]) which can silently drop items if lengths differ; update that zip call to zip(examples, batch["input_ids"], strict=True) so mismatched lengths raise an error early; change the zip used in the loss_masks comprehension that calls create_multiturn_loss_mask_by_search(example, input_ids, processor, skipped_tokens) accordingly.src/megatron/bridge/models/glm_vl/modeling_glm_45v.py (1)
src/megatron/bridge/models/glm_vl/modeling_glm_45v.py (1)

236-236: Prefix unused variable with underscore.

`rope_deltas` is unpacked but never used. Prefix with `_` to indicate intentional discard.

♻️ Proposed fix

```diff
-        position_ids, rope_deltas = self.get_rope_index(
+        position_ids, _rope_deltas = self.get_rope_index(
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/glm_vl/modeling_glm_45v.py` at line 236, The unpacked variable rope_deltas from the call to self.get_rope_index (the line "position_ids, rope_deltas = self.get_rope_index(...)") is unused; rename it to a throwaway name (e.g., _rope_deltas or _) to signal intentional discard and silence linter warnings. Update that assignment in the same scope (where position_ids is used) so only position_ids remains used while the second return value is prefixed with an underscore.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/functional_tests/launch_scripts/active/L0_Launch_models_glm_vl.sh`:
- Around line 1-16: Move the shebang to the very first line of the script so the
kernel picks the correct interpreter: place "#!/bin/bash" as line 1, then follow
with the copyright header and the existing "set -xeuo pipefail" line; ensure
there are no blank lines or characters before the shebang and keep the rest of
the script unchanged.
In
`@tests/functional_tests/launch_scripts/active/L0_Launch_recipes_perf_config.sh`:
- Around line 1-15: The shebang is currently at the end of the header; move the
interpreter line (#!/bin/bash) to be the very first line of the script (no
preceding blank lines or comments) so the system recognizes the shell, then keep
the copyright header below it; update the top of
L0_Launch_recipes_perf_config.sh so the shebang is the first line and the rest
of the file content (copyright block and script body) follows.
In `@tests/functional_tests/test_groups/models/qwen/test_qwen3_peft_verify.py`:
- Around line 180-181: The subprocess.run invocations that create `result =
subprocess.run(cmd, capture_output=True, text=True, cwd=PROJECT_ROOT, env=env)`
(used at the three places invoking `subprocess.run` with `cmd`, `env`, and
`PROJECT_ROOT`) need a timeout to avoid CI hangs: add a `timeout=` argument
(e.g., 300 seconds) to each call and wrap the call in a try/except for
`subprocess.TimeoutExpired` to fail the test (or raise) with a clear message;
apply this change to all three occurrences where `subprocess.run` is used in
this file.
ℹ️ Review info

⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 62f07f6e-3701-4ecc-bdf5-3af5c4e87fd9

📒 Files selected for processing (15)
- src/megatron/bridge/data/vlm_datasets/collate.py
- src/megatron/bridge/models/conversion/param_mapping.py
- src/megatron/bridge/models/conversion/peft_bridge.py
- src/megatron/bridge/models/glm_vl/modeling_glm_45v.py
- src/megatron/bridge/training/utils/visual_inputs.py
- tests/functional_tests/launch_scripts/active/L0_Launch_models_glm_vl.sh
- tests/functional_tests/launch_scripts/active/L0_Launch_models_qwen3.sh
- tests/functional_tests/launch_scripts/active/L0_Launch_recipes_deepseek.sh
- tests/functional_tests/launch_scripts/active/L0_Launch_recipes_gemma.sh
- tests/functional_tests/launch_scripts/active/L0_Launch_recipes_glm_vl.sh
- tests/functional_tests/launch_scripts/active/L0_Launch_recipes_perf_config.sh
- tests/functional_tests/test_groups/models/glm_vl/test_glm_45v_conversion.py
- tests/functional_tests/test_groups/models/qwen/test_qwen3_peft_verify.py
- tests/functional_tests/test_groups/recipes/test_glm_45v_recipes_finetune.py
- tests/functional_tests/test_groups/recipes/utils.py
In tests/functional_tests/launch_scripts/active/L0_Launch_models_glm_vl.sh:

```sh
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#!/bin/bash
set -xeuo pipefail  # Exit immediately if a command exits with a non-zero status
```
Shebang must be on the first line for the script to execute correctly.
The copyright header precedes the shebang (#!/bin/bash), which can cause the shell to misinterpret the script. The shebang must be on line 1 for proper interpreter selection.
🐛 Proposed fix

```diff
+#!/bin/bash
 # Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
@@ -12,7 +13,6 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-#!/bin/bash
 set -xeuo pipefail  # Exit immediately if a command exits with a non-zero status
```
🧰 Tools
🪛 Shellcheck (0.11.0)
[error] 15-15: The shebang must be on the first line. Delete blanks and move comments.
(SC1128)
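A runnable sketch of the corrected layout: shebang on line 1, header below it, then verify and execute. The file names and temp paths are throwaway, not the real launch script:

```shell
# Build a launch script with the shebang on the first line,
# then check the layout and run it.
workdir="$(mktemp -d)"
{
  echo '#!/bin/bash'
  echo '# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.'
  echo 'set -euo pipefail'
  echo 'echo ok'
} > "$workdir/launch.sh"
chmod +x "$workdir/launch.sh"

first_line="$(head -n 1 "$workdir/launch.sh")"  # must be the shebang
out="$("$workdir/launch.sh")"                   # kernel picks bash from line 1
rm -rf "$workdir"
```

With the shebang anywhere but line 1, the kernel treats the file as having no interpreter line at all, which is exactly what SC1128 flags.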
In tests/functional_tests/launch_scripts/active/L0_Launch_recipes_perf_config.sh:

```sh
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#!/bin/bash
```
Move shebang to first line.
The shebang (#!/bin/bash) must be on the first line of the script for the shell to recognize the interpreter. Currently it's on line 15 after the copyright header. This will cause execution failures on systems that rely on the shebang for interpreter selection.
Compare with the other launch scripts in this PR (e.g., L0_Launch_recipes_gemma.sh) which correctly place the shebang first.
🔧 Proposed fix

```diff
+#!/bin/bash
 # Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
@@ ... @@
 # limitations under the License.
-#!/bin/bash
 set -xeuo pipefail  # Exit immediately if a command exits with a non-zero status
```
🧰 Tools
🪛 Shellcheck (0.11.0)
[error] 15-15: The shebang must be on the first line. Delete blanks and move comments.
(SC1128)
In tests/functional_tests/test_groups/models/qwen/test_qwen3_peft_verify.py:

```python
env = {**os.environ, "MASTER_PORT": str(_find_free_port())}
result = subprocess.run(cmd, capture_output=True, text=True, cwd=PROJECT_ROOT, env=env)
```
Add timeouts to all subprocess invocations to prevent CI hangs.
Lines 181, 205, and 244 call subprocess.run(...) without timeout, so a stalled child process can block the job indefinitely.
⏱️ Proposed fix
```diff
+SUBPROCESS_TIMEOUT_SECONDS = 900
+
@@
-    result = subprocess.run(cmd, capture_output=True, text=True, cwd=PROJECT_ROOT, env=env)
+    result = subprocess.run(
+        cmd, capture_output=True, text=True, cwd=PROJECT_ROOT, env=env, timeout=SUBPROCESS_TIMEOUT_SECONDS
+    )
@@
-    result = subprocess.run(cmd, capture_output=True, text=True, cwd=PROJECT_ROOT, env=env)
+    result = subprocess.run(
+        cmd, capture_output=True, text=True, cwd=PROJECT_ROOT, env=env, timeout=SUBPROCESS_TIMEOUT_SECONDS
+    )
@@
-    result = subprocess.run(cmd, capture_output=True, text=True, cwd=PROJECT_ROOT, env=env)
+    result = subprocess.run(
+        cmd, capture_output=True, text=True, cwd=PROJECT_ROOT, env=env, timeout=SUBPROCESS_TIMEOUT_SECONDS
+    )
```

Also applies to: 204-205, 243-244
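The timeout-plus-TimeoutExpired pattern can be wrapped in a small helper; `run_checked` and the 5-second value are illustrative choices, not the suggested CI constant:

```python
import subprocess
import sys

TIMEOUT_SECONDS = 5  # CI would use a much larger value (e.g. 900)


def run_checked(cmd, timeout=TIMEOUT_SECONDS):
    """Run a child process, failing loudly instead of hanging the job."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(f"{cmd!r} exceeded {timeout}s") from exc


# A well-behaved child completes normally.
result = run_checked([sys.executable, "-c", "print('done')"])

# A stalled child now raises promptly instead of blocking forever.
try:
    run_checked([sys.executable, "-c", "import time; time.sleep(60)"], timeout=1)
    timed_out = False
except RuntimeError:
    timed_out = True
```

Note that on timeout, `subprocess.run` kills the child before raising, so no orphan process is left behind.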
🧰 Tools
🪛 Ruff (0.15.9)
[error] 181-181: subprocess call: check for execution of untrusted input
(S603)
beep boop [🤖]: Hi @cuichenx 👋,
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Tests