
[DO NOT MERGE]: Enhance OpenMPI-CUDA configuration#145

Open
kapil27 wants to merge 3 commits into opendatahub-io:main from kapil27:add-mpi-ctr

Conversation


kapil27 commented Apr 13, 2026

  • Added a new image reference for OpenMPI-CUDA in params.env.
  • Updated kustomization.yaml to include image replacement for OpenMPI-CUDA in ClusterTrainingRuntime.
  • Included openmpi_cuda.yaml in the runtimes kustomization.

This update ensures the correct image is used for OpenMPI-CUDA deployments.

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

Summary by CodeRabbit

  • New Features
    • A new OpenMPI CUDA training runtime is available for GPU-accelerated distributed training, with SSH-based node communication, GPU resource requests, and a readiness probe for node health.
    • Configuration updated to include and deploy the new runtime image (quay.io/ksuta/odh-mpi-cuda:0.0.14).

- Added a new image reference for OpenMPI-CUDA in params.env.
- Updated kustomization.yaml to include image replacement for OpenMPI-CUDA in ClusterTrainingRuntime.
- Included openmpi_cuda.yaml in the runtimes kustomization.

This update ensures the correct image is used for OpenMPI-CUDA deployments.

coderabbitai Bot commented Apr 13, 2026

Important

Review skipped

Ignore keyword(s) in the title.

⛔ Ignored keywords (2)
  • WIP
  • DO NOT MERGE

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: aa70ee5a-3a19-4a7e-b402-e546c03219ff

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Adds an OpenMPI CUDA ClusterTrainingRuntime named openmpi-cuda (new manifest), registers it in runtimes kustomization, adds an image param odh-openmpi-cuda-image set to quay.io/ksuta/odh-mpi-cuda:0.0.14 in params.env, and inserts an image replacement rule in the top-level kustomization to source that image from the rhoai-config ConfigMap and replace two container image fields in the runtime’s replicated jobs.
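The replacement rule described above could look roughly like the following kustomize sketch. This is illustrative only: the exact field paths and target selectors in manifests/rhoai/kustomization.yaml are assumptions, not copied from the PR diff.

```yaml
# Hypothetical sketch of a kustomize replacement sourcing the image from the
# rhoai-config ConfigMap into both replicated jobs of the runtime.
replacements:
  - source:
      kind: ConfigMap
      name: rhoai-config
      fieldPath: data.odh-openmpi-cuda-image
    targets:
      - select:
          kind: ClusterTrainingRuntime
          name: openmpi-cuda
        fieldPaths:
          # Assumed paths for the launcher and node container images.
          - spec.template.spec.replicatedJobs.0.template.spec.template.spec.containers.0.image
          - spec.template.spec.replicatedJobs.1.template.spec.template.spec.containers.0.image
```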

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Actionable Issues

  • openmpi_cuda.yaml — launcher and node containers lack securityContext constraints (e.g., runAsNonRoot: true, allowPrivilegeEscalation: false, readOnlyRootFilesystem: true). Verify mpiuser UID/GID and add explicit restrictions. CWE-276, CWE-250.
  • openmpi_cuda.yaml — no imagePullPolicy set on containers. Explicitly set imagePullPolicy: IfNotPresent or Always to avoid unexpected caching behavior.
  • manifests/rhoai/params.env — image uses a mutable tag from an external registry (quay.io/ksuta/...:0.0.14) without pinned digest. Pin by digest (@sha256:...) to ensure image integrity and provenance. CWE-494.
  • openmpi_cuda.yaml — launcher replica has no resource requests/limits defined. Add resources.requests/limits (CPU/memory) consistent with node replica to prevent scheduler/resource contention.
  • openmpi_cuda.yaml — sshd config path (/etc/ssh/sshd_config_mpi) and invocation of sshd assumed present in image. Confirm the image contains this file and that SSH keys/permissions are correctly provisioned; avoid embedding secrets in plain files. CWE-276.
  • openmpi_cuda.yaml — containers request GPU (nvidia.com/gpu: 1) but do not specify tolerations or nodeSelector/affinity for GPU nodes. Ensure scheduling constraints are present where necessary.
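Taken together, the hardening items above could be sketched as a single container spec. This is illustrative only; the UID, toleration, and pull-policy values are assumptions and would need to be verified against the actual image and cluster.

```yaml
# Hedged sketch combining the suggested securityContext, imagePullPolicy,
# and GPU scheduling constraints; not taken from the actual manifest.
containers:
  - name: node
    image: quay.io/ksuta/odh-mpi-cuda:0.0.14
    imagePullPolicy: IfNotPresent
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000          # assumed mpiuser UID; verify against the image
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
      seccompProfile:
        type: RuntimeDefault
    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        nvidia.com/gpu: "1"
tolerations:
  - key: nvidia.com/gpu       # assumed GPU node taint key
    operator: Exists
    effect: NoSchedule
```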
🚥 Pre-merge checks | ✅ 1 | ❌ 1

❌ Failed checks (1 warning)

Title check — ⚠️ Warning
Explanation: The title '[DO NOT MERGE]: Enhance OpenMPI-CUDA configuration' is misleading and contradicts the PR's actual intent; the '[DO NOT MERGE]' prefix suggests blocking status while the PR is open and contains substantive changes ready for review.
Resolution: Remove the '[DO NOT MERGE]' prefix. Use 'Add OpenMPI-CUDA runtime configuration' or similar to accurately reflect the addition of runtime manifests, kustomization rules, and image parameters.
✅ Passed checks (1 passed)
Description check — ✅ Passed
Explanation: Check skipped — CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@manifests/rhoai/params.env`:
- Line 7: The odh-openmpi-cuda-image entry uses a mutable tag
("odh-openmpi-cuda-image=quay.io/ksuta/odh-mpi-cuda:0.0.14"); replace this with
the image's immutable digest (e.g., quay.io/ksuta/odh-mpi-cuda@sha256:<digest>)
to prevent mutable-tag risks, update any manifests or helm values that consume
odh-openmpi-cuda-image to use the digest form, and verify the chosen sha256
digest by pulling or inspecting the registry before committing.

In `@manifests/rhoai/runtimes/openmpi_cuda.yaml`:
- Around line 42-46: The manifest currently only specifies GPU resources
(nvidia.com/gpu) in the resources.limits/requests blocks; add explicit CPU and
memory requests and limits for both replicated job specs so the scheduler can
make proper placement and eviction decisions. Update the resources section
alongside the existing nvidia.com/gpu entries (the resources.limits and
resources.requests blocks in the openmpi_cuda.yaml job/container specs) to
include something like cpu and memory limit and request values appropriate for
the workload, and apply the same additions to the second occurrence referenced
in the file so both replicated jobs have matching CPU/memory resource
boundaries.
- Around line 34-46: The container specs for the workload (container name "node"
using image quay.io/ksuta/odh-mpi-cuda:0.0.14 and the sshd container mentioned)
lack an explicit securityContext; add a securityContext block to each workload
container (including the "node" and the sshd container) setting runAsNonRoot:
true, runAsUser to a non-root UID, readOnlyRootFilesystem: true,
allowPrivilegeEscalation: false, noNewPrivileges: true, drop all capabilities
(and only add minimal needed if any), and include a seccompProfile
(RuntimeDefault) to harden the container and prevent privilege escalation and
lateral movement. Ensure the same pattern is applied to both container
definitions referenced in the diff.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 019092a5-8260-4213-a873-71c10e229541

📥 Commits

Reviewing files that changed from the base of the PR and between c2c1ebc and 0f8b08c.

📒 Files selected for processing (4)
  • manifests/rhoai/kustomization.yaml
  • manifests/rhoai/params.env
  • manifests/rhoai/runtimes/kustomization.yaml
  • manifests/rhoai/runtimes/openmpi_cuda.yaml

Comment thread manifests/rhoai/params.env

odh-th06-cuda130-torch291-py312-image=quay.io/opendatahub/odh-th06-cuda130-torch291-py312:odh-3.4-ea2
odh-th06-rocm64-torch291-py312-image=quay.io/opendatahub/odh-th06-rocm64-torch291-py312:odh-3.4-ea2
odh-th06-cpu-torch291-py312-image=quay.io/opendatahub/odh-th06-cpu-torch291-py312:odh-3.4-ea2
odh-openmpi-cuda-image=quay.io/ksuta/odh-mpi-cuda:0.0.14

coderabbitai Bot commented Apr 13, 2026


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify mutable tag usage for this image in the changed manifests.
rg -n 'quay\.io/ksuta/odh-mpi-cuda:[^ @]+' manifests/rhoai/params.env manifests/rhoai/runtimes/openmpi_cuda.yaml

# Expected after fix: no matches above; only `@sha256` form remains.
rg -n 'quay\.io/ksuta/odh-mpi-cuda@sha256:' manifests/rhoai/params.env manifests/rhoai/runtimes/openmpi_cuda.yaml

Repository: opendatahub-io/trainer

Length of output: 369


Pin the image to an immutable digest (CWE-494, Severity: Major).

Line 7 uses a mutable tag. If 0.0.14 is re-pushed or the registry account is compromised, newly scheduled pods pull attacker-controlled content.

Remediation patch
-odh-openmpi-cuda-image=quay.io/ksuta/odh-mpi-cuda:0.0.14
+odh-openmpi-cuda-image=quay.io/ksuta/odh-mpi-cuda@sha256:<verified_digest_for_0.0.14>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@manifests/rhoai/params.env` at line 7, The odh-openmpi-cuda-image entry uses
a mutable tag ("odh-openmpi-cuda-image=quay.io/ksuta/odh-mpi-cuda:0.0.14");
replace this with the image's immutable digest (e.g.,
quay.io/ksuta/odh-mpi-cuda@sha256:<digest>) to prevent mutable-tag risks, update
any manifests or helm values that consume odh-openmpi-cuda-image to use the
digest form, and verify the chosen sha256 digest by pulling or inspecting the
registry before committing.

Author


Thanks. We’re keeping this as-is for now because this image has not yet been onboarded to the managed ODH/MODH registry flow that the other runtime images use. Once that onboarding is complete, we plan to align it with the existing image management pattern, including using the managed registry reference rather than treating this temporary external image reference as final.



@kapil27, understood — that context makes sense. I'll note this for future reviews so the mutable tag isn't re-flagged while the onboarding is in progress.

Would you like me to open a follow-up GitHub issue to track the onboarding of quay.io/ksuta/odh-mpi-cuda to the managed ODH/MODH registry and the subsequent switch to an immutable digest reference?



Comment thread manifests/rhoai/runtimes/openmpi_cuda.yaml
Comment on lines +42 to +46
resources:
limits:
nvidia.com/gpu: "1"
requests:
nvidia.com/gpu: "1"

coderabbitai Bot commented Apr 13, 2026


⚠️ Potential issue | 🟠 Major

Define CPU/memory requests and limits for both replicated jobs (Severity: Major).

Lines 42-46 and 73-77 set GPU resources only. Without CPU/memory bounds, scheduling and eviction behavior is unstable under contention, and noisy-neighbor impact increases.

Remediation patch
                       resources:
                         limits:
+                          cpu: "4"
+                          memory: "16Gi"
                           nvidia.com/gpu: "1"
                         requests:
+                          cpu: "2"
+                          memory: "8Gi"
                           nvidia.com/gpu: "1"
@@
                       resources:
                         limits:
+                          cpu: "4"
+                          memory: "16Gi"
                           nvidia.com/gpu: "1"
                         requests:
+                          cpu: "2"
+                          memory: "8Gi"
                           nvidia.com/gpu: "1"

As per coding guidelines (review priority 4: performance problems).

Also applies to: 73-77

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@manifests/rhoai/runtimes/openmpi_cuda.yaml` around lines 42 - 46, The
manifest currently only specifies GPU resources (nvidia.com/gpu) in the
resources.limits/requests blocks; add explicit CPU and memory requests and
limits for both replicated job specs so the scheduler can make proper placement
and eviction decisions. Update the resources section alongside the existing
nvidia.com/gpu entries (the resources.limits and resources.requests blocks in
the openmpi_cuda.yaml job/container specs) to include something like cpu and
memory limit and request values appropriate for the workload, and apply the same
additions to the second occurrence referenced in the file so both replicated
jobs have matching CPU/memory resource boundaries.

Author


Thanks. I’m keeping CPU and memory unspecified in this runtime for now. The GPU request is the resource requirement that is functionally necessary for this MPI runtime, while CPU and memory sizing is workload-dependent and can vary significantly by training job. Hardcoding guessed CPU/memory values at the runtime layer would make this manifest less reusable without clear evidence that these defaults are appropriate.



@kapil27, understood — that's a reasonable design call. ClusterTrainingRuntime serves as a reusable base template, and deferring CPU/memory sizing to the individual TrainJob level keeps the runtime flexible across different workload profiles. Acknowledged.



Signed-off-by: Kapil Nema <knema@redhat.com>
kapil27 changed the title from "Enhance OpenMPI-CUDA configuration" to "[DO NOT MERGE]: Enhance OpenMPI-CUDA configuration" on Apr 13, 2026
Comment on lines +65 to +68
- name: OMPI_MCA_mtl
value: ^ofi
- name: OMPI_MCA_btl
value: tcp,self
Collaborator


IMHO we can remove these env vars, they could be provided in TrainJob. I have provided them for testing purposes for environments without RDMA. WDYT?

Author


Agreed. These variables can be provided by the user via the TrainJob, so I removed them to keep the runtime definition generic.

- Eliminated OMPI_MCA_mtl and OMPI_MCA_btl environment variables from the node and sshd containers in openmpi_cuda.yaml.
- This streamlines the configuration and reduces potential conflicts in runtime settings.
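For reference, the removed transport settings could be re-supplied per job roughly as follows. This is a hedged sketch for a cluster without RDMA: the API group, version, and `trainer.env` field names are assumptions based on the Kubeflow Trainer v2 API and may not match this repository exactly.

```yaml
# Hypothetical TrainJob re-adding the OMPI MCA overrides at the job level.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: mpi-cuda-job           # hypothetical name
spec:
  runtimeRef:
    name: openmpi-cuda         # the ClusterTrainingRuntime added in this PR
  trainer:
    env:
      - name: OMPI_MCA_mtl
        value: ^ofi            # exclude the OFI MTL (no RDMA fabric)
      - name: OMPI_MCA_btl
        value: tcp,self        # fall back to TCP transport
```

Keeping these out of the runtime and in the TrainJob means RDMA-capable clusters are not pinned to TCP by default.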
Collaborator

sutaakar left a comment


/lgtm
