[DO NOT MERGE]: Enhance OpenMPI-CUDA configuration #145
kapil27 wants to merge 3 commits into opendatahub-io:main from
Conversation
- Added a new image reference for OpenMPI-CUDA in params.env.
- Updated kustomization.yaml to include image replacement for OpenMPI-CUDA in ClusterTrainingRuntime.
- Included openmpi_cuda.yaml in the runtimes kustomization.

This update ensures the correct image is used for OpenMPI-CUDA deployments.
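The image-replacement wiring described above typically uses a kustomize `replacements` stanza that reads the value from the params.env-backed ConfigMap; a hypothetical sketch (the ConfigMap name, runtime name, and field paths below are assumptions, not the actual contents of this PR):

```yaml
# Sketch of a params.env-driven image replacement in
# manifests/rhoai/kustomization.yaml (names and paths are assumed).
replacements:
  - source:
      kind: ConfigMap
      name: trainer-parameters        # assumed ConfigMap generated from params.env
      fieldPath: data.odh-openmpi-cuda-image
    targets:
      - select:
          kind: ClusterTrainingRuntime
          name: openmpi-cuda          # assumed runtime name
        fieldPaths:
          - spec.template.spec.replicatedJobs.*.template.spec.template.spec.containers.*.image
```

This pattern keeps the image reference in one place (params.env) so downstream builds can substitute it without patching the runtime manifest itself.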
Important: Review skipped. Ignored keyword(s) in the title. ⛔ Ignored keywords (2)
Please check the settings in the CodeRabbit UI or the ⚙️ Run configuration.
📝 Walkthrough
Adds an OpenMPI CUDA ClusterTrainingRuntime named
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (1 passed)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@manifests/rhoai/params.env`:
- Line 7: The odh-openmpi-cuda-image entry uses a mutable tag
("odh-openmpi-cuda-image=quay.io/ksuta/odh-mpi-cuda:0.0.14"); replace this with
the image's immutable digest (e.g., quay.io/ksuta/odh-mpi-cuda@sha256:<digest>)
to prevent mutable-tag risks, update any manifests or helm values that consume
odh-openmpi-cuda-image to use the digest form, and verify the chosen sha256
digest by pulling or inspecting the registry before committing.
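One way to carry out that verification step is sketched below. The `skopeo inspect --format` invocation and registry access are assumptions about the contributor's tooling, and the digest is left as a placeholder rather than guessed:

```shell
# Resolve the digest behind the tag out-of-band, e.g.:
#   skopeo inspect --format '{{.Digest}}' docker://quay.io/ksuta/odh-mpi-cuda:0.0.14
# The value below is a placeholder; substitute the real resolved digest.
digest="sha256:<digest>"

# The pinned params.env entry uses '@<digest>' in place of ':<tag>':
echo "odh-openmpi-cuda-image=quay.io/ksuta/odh-mpi-cuda@${digest}"
```

Pinning by digest makes the deployment immune to the tag being re-pushed, at the cost of requiring an explicit bump for every new image build.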
In `@manifests/rhoai/runtimes/openmpi_cuda.yaml`:
- Around line 42-46: The manifest currently only specifies GPU resources
(nvidia.com/gpu) in the resources.limits/requests blocks; add explicit CPU and
memory requests and limits for both replicated job specs so the scheduler can
make proper placement and eviction decisions. Update the resources section
alongside the existing nvidia.com/gpu entries (the resources.limits and
resources.requests blocks in the openmpi_cuda.yaml job/container specs) to
include something like cpu and memory limit and request values appropriate for
the workload, and apply the same additions to the second occurrence referenced
in the file so both replicated jobs have matching CPU/memory resource
boundaries.
- Around line 34-46: The container specs for the workload (container name "node"
using image quay.io/ksuta/odh-mpi-cuda:0.0.14 and the sshd container mentioned)
lack an explicit securityContext; add a securityContext block to each workload
container (including the "node" and the sshd container) setting runAsNonRoot:
true, runAsUser to a non-root UID, readOnlyRootFilesystem: true,
allowPrivilegeEscalation: false, noNewPrivileges: true, drop all capabilities
(and only add minimal needed if any), and include a seccompProfile
(RuntimeDefault) to harden the container and prevent privilege escalation and
lateral movement. Ensure the same pattern is applied to both container
definitions referenced in the diff.
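A hardened container spec along the lines requested might look like the sketch below. Note one correction to the review text: the Kubernetes container securityContext has no `noNewPrivileges` field; `allowPrivilegeEscalation: false` is what sets `no_new_privs`. The UID and the writable-mount comment are assumptions that depend on how the image was built:

```yaml
# Sketch of a hardened securityContext for the "node" container
# (apply the same block to the sshd container).
containers:
  - name: node
    image: quay.io/ksuta/odh-mpi-cuda:0.0.14
    securityContext:
      runAsNonRoot: true
      runAsUser: 1001                    # assumed non-root UID present in the image
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false    # this is what sets no_new_privs
      capabilities:
        drop: ["ALL"]
      seccompProfile:
        type: RuntimeDefault
    # With a read-only root filesystem, sshd typically still needs writable
    # host-key and runtime directories, e.g. emptyDir mounts at /etc/ssh
    # and /run (image-dependent; verify before adopting).
```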
ℹ️ Review info
⚙️ Run configuration
- Configuration used: Central YAML (base), Organization UI (inherited)
- Review profile: CHILL
- Plan: Pro
- Run ID: 019092a5-8260-4213-a873-71c10e229541
📒 Files selected for processing (4)
- manifests/rhoai/kustomization.yaml
- manifests/rhoai/params.env
- manifests/rhoai/runtimes/kustomization.yaml
- manifests/rhoai/runtimes/openmpi_cuda.yaml
odh-th06-cuda130-torch291-py312-image=quay.io/opendatahub/odh-th06-cuda130-torch291-py312:odh-3.4-ea2
odh-th06-rocm64-torch291-py312-image=quay.io/opendatahub/odh-th06-rocm64-torch291-py312:odh-3.4-ea2
odh-th06-cpu-torch291-py312-image=quay.io/opendatahub/odh-th06-cpu-torch291-py312:odh-3.4-ea2
odh-openmpi-cuda-image=quay.io/ksuta/odh-mpi-cuda:0.0.14
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Verify mutable tag usage for this image in the changed manifests.
rg -n 'quay\.io/ksuta/odh-mpi-cuda:[^ @]+' manifests/rhoai/params.env manifests/rhoai/runtimes/openmpi_cuda.yaml
# Expected after fix: no matches above; only `@sha256` form remains.
rg -n 'quay\.io/ksuta/odh-mpi-cuda@sha256:' manifests/rhoai/params.env manifests/rhoai/runtimes/openmpi_cuda.yaml
Repository: opendatahub-io/trainer
Length of output: 369
Pin the image to an immutable digest (CWE-494, Severity: Major).
Line 7 uses a mutable tag. If 0.0.14 is re-pushed or the registry account is compromised, newly scheduled pods pull attacker-controlled content.
Remediation patch
-odh-openmpi-cuda-image=quay.io/ksuta/odh-mpi-cuda:0.0.14
+odh-openmpi-cuda-image=quay.io/ksuta/odh-mpi-cuda@sha256:<verified_digest_for_0.0.14>
Thanks. We’re keeping this as-is for now because this image has not yet been onboarded to the managed ODH/MODH registry flow that the other runtime images use. Once that onboarding is complete, we plan to align it with the existing image management pattern, including using the managed registry reference rather than treating this temporary external image reference as final.
@kapil27, understood — that context makes sense. I'll note this for future reviews so the mutable tag isn't re-flagged while the onboarding is in progress.
Would you like me to open a follow-up GitHub issue to track the onboarding of quay.io/ksuta/odh-mpi-cuda to the managed ODH/MODH registry and the subsequent switch to an immutable digest reference?
resources:
  limits:
    nvidia.com/gpu: "1"
  requests:
    nvidia.com/gpu: "1"
Define CPU/memory requests and limits for both replicated jobs (Severity: Major).
Lines 42-46 and 73-77 set GPU resources only. Without CPU/memory bounds, scheduling and eviction behavior is unstable under contention, and noisy-neighbor impact increases.
Remediation patch
 resources:
   limits:
+    cpu: "4"
+    memory: "16Gi"
     nvidia.com/gpu: "1"
   requests:
+    cpu: "2"
+    memory: "8Gi"
     nvidia.com/gpu: "1"
@@
 resources:
   limits:
+    cpu: "4"
+    memory: "16Gi"
     nvidia.com/gpu: "1"
   requests:
+    cpu: "2"
+    memory: "8Gi"
     nvidia.com/gpu: "1"

As per coding guidelines: REVIEW PRIORITIES: 4. Performance problems.
Also applies to: 73-77
Thanks. I’m keeping CPU and memory unspecified in this runtime for now. The GPU request is the resource requirement that is functionally necessary for this MPI runtime, while CPU and memory sizing is workload-dependent and can vary significantly by training job. Hardcoding guessed CPU/memory values at the runtime layer would make this manifest less reusable without clear evidence that these defaults are appropriate.
@kapil27, understood — that's a reasonable design call. ClusterTrainingRuntime serves as a reusable base template, and deferring CPU/memory sizing to the individual TrainJob level keeps the runtime flexible across different workload profiles. Acknowledged.
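With sizing deferred to the job level, a user could supply CPU/memory alongside the GPU when creating a TrainJob. A sketch against the Kubeflow Trainer v2 API; the job name and runtime name are hypothetical, and field names such as `resourcesPerNode` should be checked against the installed CRD version:

```yaml
# Hypothetical TrainJob that sizes CPU/memory per node while reusing
# the OpenMPI-CUDA runtime (names and values are illustrative).
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: mpi-cuda-train          # hypothetical job name
spec:
  runtimeRef:
    name: openmpi-cuda          # assumed ClusterTrainingRuntime name
  trainer:
    numNodes: 2
    resourcesPerNode:
      requests:
        cpu: "2"
        memory: 8Gi
        nvidia.com/gpu: "1"
      limits:
        cpu: "4"
        memory: 16Gi
        nvidia.com/gpu: "1"
```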
Signed-off-by: Kapil Nema <knema@redhat.com>
- name: OMPI_MCA_mtl
  value: ^ofi
- name: OMPI_MCA_btl
  value: tcp,self
IMHO we can remove these env vars; they could be provided in the TrainJob. I have provided them for testing purposes, for environments without RDMA. WDYT?
Agreed. These variables can be provided by the user via the TrainJob; removed them to keep the runtime definition generic.
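With the variables removed from the runtime, a user targeting an environment without RDMA could supply them per job. A sketch against the Kubeflow Trainer v2 TrainJob API; the job and runtime names are hypothetical, and the `trainer.env` field should be verified against the installed CRD version:

```yaml
# Hypothetical TrainJob supplying the non-RDMA OpenMPI transport settings.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: mpi-cuda-no-rdma        # hypothetical job name
spec:
  runtimeRef:
    name: openmpi-cuda          # assumed ClusterTrainingRuntime name
  trainer:
    env:
      - name: OMPI_MCA_mtl      # exclude the OFI MTL (no RDMA fabric present)
        value: "^ofi"
      - name: OMPI_MCA_btl      # fall back to TCP plus loopback transports
        value: "tcp,self"
```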
- Eliminated OMPI_MCA_mtl and OMPI_MCA_btl environment variables from the node and sshd containers in openmpi_cuda.yaml.
- This streamlines the configuration and reduces potential conflicts in runtime settings.

This update ensures the correct image is used for OpenMPI-CUDA deployments.