
[DO NOT MERGE]: Enhance OpenMPI-CUDA configuration#145

Open
kapil27 wants to merge 3 commits into opendatahub-io:main from kapil27:add-mpi-ctr

Conversation


kapil27 commented Apr 13, 2026

  • Added a new image reference for OpenMPI-CUDA in params.env.
  • Updated kustomization.yaml to include image replacement for OpenMPI-CUDA in ClusterTrainingRuntime.
  • Included openmpi_cuda.yaml in the runtimes kustomization.

This update ensures the correct image is used for OpenMPI-CUDA deployments.

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

Summary by CodeRabbit

  • New Features
    • A new OpenMPI CUDA training runtime is available for GPU-accelerated distributed training, with SSH-based node communication, GPU resource requests, and a readiness probe for node health.
    • Configuration updated to include and deploy the new runtime image (quay.io/ksuta/odh-mpi-cuda:0.0.14).

- Added a new image reference for OpenMPI-CUDA in params.env.
- Updated kustomization.yaml to include image replacement for OpenMPI-CUDA in ClusterTrainingRuntime.
- Included openmpi_cuda.yaml in the runtimes kustomization.

This update ensures the correct image is used for OpenMPI-CUDA deployments.

coderabbitai Bot commented Apr 13, 2026

Important

Review skipped

Ignore keyword(s) in the title.

⛔ Ignored keywords (2)
  • WIP
  • DO NOT MERGE

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: aa70ee5a-3a19-4a7e-b402-e546c03219ff

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Adds an OpenMPI CUDA ClusterTrainingRuntime named openmpi-cuda (new manifest), registers it in runtimes kustomization, adds an image param odh-openmpi-cuda-image set to quay.io/ksuta/odh-mpi-cuda:0.0.14 in params.env, and inserts an image replacement rule in the top-level kustomization to source that image from the rhoai-config ConfigMap and replace two container image fields in the runtime’s replicated jobs.
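The replacement rule described above could look roughly like the following kustomize sketch. This is illustrative only: the exact field paths and target selectors in manifests/rhoai/kustomization.yaml are assumptions, not copied from the PR diff.

```yaml
# Hypothetical sketch of a kustomize replacement sourcing the image from the
# rhoai-config ConfigMap into both replicated jobs of the runtime.
replacements:
  - source:
      kind: ConfigMap
      name: rhoai-config
      fieldPath: data.odh-openmpi-cuda-image
    targets:
      - select:
          kind: ClusterTrainingRuntime
          name: openmpi-cuda
        fieldPaths:
          # Assumed paths for the launcher and node container images.
          - spec.template.spec.replicatedJobs.0.template.spec.template.spec.containers.0.image
          - spec.template.spec.replicatedJobs.1.template.spec.template.spec.containers.0.image
```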

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Actionable Issues

  • openmpi_cuda.yaml — launcher and node containers lack securityContext constraints (e.g., runAsNonRoot: true, allowPrivilegeEscalation: false, readOnlyRootFilesystem: true). Verify mpiuser UID/GID and add explicit restrictions. CWE-276, CWE-250.
  • openmpi_cuda.yaml — no imagePullPolicy set on containers. Explicitly set imagePullPolicy: IfNotPresent or Always to avoid unexpected caching behavior.
  • manifests/rhoai/params.env — image uses a mutable tag from an external registry (quay.io/ksuta/...:0.0.14) without pinned digest. Pin by digest (@sha256:...) to ensure image integrity and provenance. CWE-494.
  • openmpi_cuda.yaml — launcher replica has no resource requests/limits defined. Add resources.requests/limits (CPU/memory) consistent with node replica to prevent scheduler/resource contention.
  • openmpi_cuda.yaml — sshd config path (/etc/ssh/sshd_config_mpi) and invocation of sshd assumed present in image. Confirm the image contains this file and that SSH keys/permissions are correctly provisioned; avoid embedding secrets in plain files. CWE-276.
  • openmpi_cuda.yaml — containers request GPU (nvidia.com/gpu: 1) but do not specify tolerations or nodeSelector/affinity for GPU nodes. Ensure scheduling constraints are present where necessary.
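Taken together, the hardening items above could be sketched as a single container spec. This is illustrative only; the UID, toleration, and pull-policy values are assumptions and would need to be verified against the actual image and cluster.

```yaml
# Hedged sketch combining the suggested securityContext, imagePullPolicy,
# and GPU scheduling constraints; not taken from the actual manifest.
containers:
  - name: node
    image: quay.io/ksuta/odh-mpi-cuda:0.0.14
    imagePullPolicy: IfNotPresent
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000          # assumed mpiuser UID; verify against the image
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
      seccompProfile:
        type: RuntimeDefault
    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        nvidia.com/gpu: "1"
tolerations:
  - key: nvidia.com/gpu       # assumed GPU node taint key
    operator: Exists
    effect: NoSchedule
```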
🚥 Pre-merge checks | ✅ 1 | ❌ 1

❌ Failed checks (1 warning)

Title check — ⚠️ Warning
Explanation: The title '[DO NOT MERGE]: Enhance OpenMPI-CUDA configuration' is misleading and contradicts the PR's actual intent; the '[DO NOT MERGE]' prefix suggests blocking status while the PR is open and contains substantive changes ready for review.
Resolution: Remove the '[DO NOT MERGE]' prefix. Use 'Add OpenMPI-CUDA runtime configuration' or similar to accurately reflect the addition of runtime manifests, kustomization rules, and image parameters.
✅ Passed checks (1 passed)
Description check — ✅ Passed
Explanation: Check skipped — CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@manifests/rhoai/params.env`:
- Line 7: The odh-openmpi-cuda-image entry uses a mutable tag
("odh-openmpi-cuda-image=quay.io/ksuta/odh-mpi-cuda:0.0.14"); replace this with
the image's immutable digest (e.g., quay.io/ksuta/odh-mpi-cuda@sha256:<digest>)
to prevent mutable-tag risks, update any manifests or helm values that consume
odh-openmpi-cuda-image to use the digest form, and verify the chosen sha256
digest by pulling or inspecting the registry before committing.

In `@manifests/rhoai/runtimes/openmpi_cuda.yaml`:
- Around line 42-46: The manifest currently only specifies GPU resources
(nvidia.com/gpu) in the resources.limits/requests blocks; add explicit CPU and
memory requests and limits for both replicated job specs so the scheduler can
make proper placement and eviction decisions. Update the resources section
alongside the existing nvidia.com/gpu entries (the resources.limits and
resources.requests blocks in the openmpi_cuda.yaml job/container specs) to
include something like cpu and memory limit and request values appropriate for
the workload, and apply the same additions to the second occurrence referenced
in the file so both replicated jobs have matching CPU/memory resource
boundaries.
- Around line 34-46: The container specs for the workload (container name "node"
using image quay.io/ksuta/odh-mpi-cuda:0.0.14 and the sshd container mentioned)
lack an explicit securityContext; add a securityContext block to each workload
container (including the "node" and the sshd container) setting runAsNonRoot:
true, runAsUser to a non-root UID, readOnlyRootFilesystem: true,
allowPrivilegeEscalation: false, noNewPrivileges: true, drop all capabilities
(and only add minimal needed if any), and include a seccompProfile
(RuntimeDefault) to harden the container and prevent privilege escalation and
lateral movement. Ensure the same pattern is applied to both container
definitions referenced in the diff.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 019092a5-8260-4213-a873-71c10e229541

📥 Commits

Reviewing files that changed from the base of the PR and between c2c1ebc and 0f8b08c.

📒 Files selected for processing (4)
  • manifests/rhoai/kustomization.yaml
  • manifests/rhoai/params.env
  • manifests/rhoai/runtimes/kustomization.yaml
  • manifests/rhoai/runtimes/openmpi_cuda.yaml

Comment thread manifests/rhoai/params.env

odh-th06-cuda130-torch291-py312-image=quay.io/opendatahub/odh-th06-cuda130-torch291-py312:odh-3.4-ea2
odh-th06-rocm64-torch291-py312-image=quay.io/opendatahub/odh-th06-rocm64-torch291-py312:odh-3.4-ea2
odh-th06-cpu-torch291-py312-image=quay.io/opendatahub/odh-th06-cpu-torch291-py312:odh-3.4-ea2
odh-openmpi-cuda-image=quay.io/ksuta/odh-mpi-cuda:0.0.14

coderabbitai Bot commented Apr 13, 2026


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify mutable tag usage for this image in the changed manifests.
rg -n 'quay\.io/ksuta/odh-mpi-cuda:[^ @]+' manifests/rhoai/params.env manifests/rhoai/runtimes/openmpi_cuda.yaml

# Expected after fix: no matches above; only `@sha256` form remains.
rg -n 'quay\.io/ksuta/odh-mpi-cuda@sha256:' manifests/rhoai/params.env manifests/rhoai/runtimes/openmpi_cuda.yaml

Repository: opendatahub-io/trainer

Length of output: 369


Pin the image to an immutable digest (CWE-494, Severity: Major).

Line 7 uses a mutable tag. If 0.0.14 is re-pushed or the registry account is compromised, newly scheduled pods pull attacker-controlled content.

Remediation patch
-odh-openmpi-cuda-image=quay.io/ksuta/odh-mpi-cuda:0.0.14
+odh-openmpi-cuda-image=quay.io/ksuta/odh-mpi-cuda@sha256:<verified_digest_for_0.0.14>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@manifests/rhoai/params.env` at line 7, The odh-openmpi-cuda-image entry uses
a mutable tag ("odh-openmpi-cuda-image=quay.io/ksuta/odh-mpi-cuda:0.0.14");
replace this with the image's immutable digest (e.g.,
quay.io/ksuta/odh-mpi-cuda@sha256:<digest>) to prevent mutable-tag risks, update
any manifests or helm values that consume odh-openmpi-cuda-image to use the
digest form, and verify the chosen sha256 digest by pulling or inspecting the
registry before committing.

Author


Thanks. We’re keeping this as-is for now because this image has not yet been onboarded to the managed ODH/MODH registry flow that the other runtime images use. Once that onboarding is complete, we plan to align it with the existing image management pattern, including using the managed registry reference rather than treating this temporary external image reference as final.



@kapil27, understood — that context makes sense. I'll note this for future reviews so the mutable tag isn't re-flagged while the onboarding is in progress.

Would you like me to open a follow-up GitHub issue to track the onboarding of quay.io/ksuta/odh-mpi-cuda to the managed ODH/MODH registry and the subsequent switch to an immutable digest reference?



Comment thread manifests/rhoai/runtimes/openmpi_cuda.yaml
Comment on lines +42 to +46
resources:
limits:
nvidia.com/gpu: "1"
requests:
nvidia.com/gpu: "1"

coderabbitai Bot commented Apr 13, 2026


⚠️ Potential issue | 🟠 Major

Define CPU/memory requests and limits for both replicated jobs (Severity: Major).

Lines 42-46 and 73-77 set GPU resources only. Without CPU/memory bounds, scheduling and eviction behavior is unstable under contention, and noisy-neighbor impact increases.

Remediation patch
                       resources:
                         limits:
+                          cpu: "4"
+                          memory: "16Gi"
                           nvidia.com/gpu: "1"
                         requests:
+                          cpu: "2"
+                          memory: "8Gi"
                           nvidia.com/gpu: "1"
@@
                       resources:
                         limits:
+                          cpu: "4"
+                          memory: "16Gi"
                           nvidia.com/gpu: "1"
                         requests:
+                          cpu: "2"
+                          memory: "8Gi"
                           nvidia.com/gpu: "1"

As per coding guidelines (review priority 4: performance problems).

Also applies to: 73-77

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@manifests/rhoai/runtimes/openmpi_cuda.yaml` around lines 42 - 46, The
manifest currently only specifies GPU resources (nvidia.com/gpu) in the
resources.limits/requests blocks; add explicit CPU and memory requests and
limits for both replicated job specs so the scheduler can make proper placement
and eviction decisions. Update the resources section alongside the existing
nvidia.com/gpu entries (the resources.limits and resources.requests blocks in
the openmpi_cuda.yaml job/container specs) to include something like cpu and
memory limit and request values appropriate for the workload, and apply the same
additions to the second occurrence referenced in the file so both replicated
jobs have matching CPU/memory resource boundaries.

Author


Thanks. I’m keeping CPU and memory unspecified in this runtime for now. The GPU request is the resource requirement that is functionally necessary for this MPI runtime, while CPU and memory sizing is workload-dependent and can vary significantly by training job. Hardcoding guessed CPU/memory values at the runtime layer would make this manifest less reusable without clear evidence that these defaults are appropriate.



@kapil27, understood — that's a reasonable design call. ClusterTrainingRuntime serves as a reusable base template, and deferring CPU/memory sizing to the individual TrainJob level keeps the runtime flexible across different workload profiles. Acknowledged.



Signed-off-by: Kapil Nema <knema@redhat.com>
kapil27 changed the title from "Enhance OpenMPI-CUDA configuration" to "[DO NOT MERGE]: Enhance OpenMPI-CUDA configuration" on Apr 13, 2026
Comment on lines +65 to +68
- name: OMPI_MCA_mtl
value: ^ofi
- name: OMPI_MCA_btl
value: tcp,self
Collaborator


IMHO we can remove these env vars, they could be provided in TrainJob. I have provided them for testing purposes for environments without RDMA. WDYT?

Author


Agreed. These variables can be provided by the user via the TrainJob, so I removed them to keep the runtime definition generic.

- Eliminated OMPI_MCA_mtl and OMPI_MCA_btl environment variables from the node and sshd containers in openmpi_cuda.yaml.
- This streamlines the configuration and reduces potential conflicts in runtime settings.
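For reference, the removed transport settings could be re-supplied per job roughly as follows. This is a hedged sketch for a cluster without RDMA: the API group, version, and `trainer.env` field names are assumptions based on the Kubeflow Trainer v2 API and may not match this repository exactly.

```yaml
# Hypothetical TrainJob re-adding the OMPI MCA overrides at the job level.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: mpi-cuda-job           # hypothetical name
spec:
  runtimeRef:
    name: openmpi-cuda         # the ClusterTrainingRuntime added in this PR
  trainer:
    env:
      - name: OMPI_MCA_mtl
        value: ^ofi            # exclude the OFI MTL (no RDMA fabric)
      - name: OMPI_MCA_btl
        value: tcp,self        # fall back to TCP transport
```

Keeping these out of the runtime and in the TrainJob means RDMA-capable clusters are not pinned to TCP by default.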
Collaborator

sutaakar left a comment


/lgtm
