fix: handle startup-probe-phase pods in rolling update categorization#435

Open
gflarity wants to merge 3 commits into ai-dynamo:main from gflarity:rolling_update_stuck_400

Conversation

@gflarity
Contributor

/kind bug

What this PR does / why we need it:

Fixes a bug where rolling updates get permanently stuck or prematurely complete when an old-hash pod is in the startup probe phase. computeUpdateWork had no category for pods with Started=false (startup probe not yet passed), causing them to silently fall through all classification branches.

This PR:

  • Adds a HasAnyContainerNotStarted utility to detect pods still in the startup probe phase
  • Adds explicit oldTemplateHashStartingPods and oldTemplateHashUncategorizedPods buckets to updateWork
  • Deletes all non-ready old-hash pods (pending, unhealthy, starting, uncategorized) immediately rather than only pending/unhealthy
  • Adds unit tests for computeUpdateWork categorization and HasAnyContainerNotStarted
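The new utility described in the first bullet can be sketched as follows. This is a minimal, self-contained approximation: the real implementation takes a `*corev1.Pod` from k8s.io/api/core/v1, and the stand-in types here mirror only the fields the check needs.

```go
package main

import "fmt"

// ContainerStatus is a stand-in for the corev1 field this utility reads.
// Started is nil until the kubelet reports a startup-probe result, then
// false until the probe passes.
type ContainerStatus struct {
	Started *bool
}

// PodStatus and Pod are stand-ins for the corresponding corev1 types.
type PodStatus struct {
	ContainerStatuses []ContainerStatus
}

type Pod struct {
	Status PodStatus
}

// HasAnyContainerNotStarted reports whether any container in the pod has not
// yet passed its startup probe, i.e. Started is nil or false.
func HasAnyContainerNotStarted(pod *Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Started == nil || !*cs.Started {
			return true
		}
	}
	return false
}

func main() {
	yes, no := true, false
	starting := &Pod{Status: PodStatus{ContainerStatuses: []ContainerStatus{{Started: &no}}}}
	ready := &Pod{Status: PodStatus{ContainerStatuses: []ContainerStatus{{Started: &yes}}}}
	fmt.Println(HasAnyContainerNotStarted(starting)) // true: startup probe not yet passed
	fmt.Println(HasAnyContainerNotStarted(ready))    // false: all containers started
}
```

Treating a nil `Started` the same as false matters because the kubelet may not have reported container status yet for a freshly scheduled pod.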

Which issue(s) this PR fixes:

Fixes #400

Special notes for your reviewer:

The root cause: when a second spec change arrives while a replacement pod from a previous change is still in its startup probe phase, the replacement pod (now old-hash) has Phase=Running, Started=false, Ready=false. It fails all existing predicates (not Pending, not Started-but-not-Ready, not Ready) and is dropped from the update work entirely. This causes either an infinite requeue loop or premature updateEndedAt marking.
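The fall-through described above, and where the new buckets catch it, can be illustrated with a simplified classifier. The bucket names come from this PR; the exact predicates, branch order, and types are assumptions for illustration, not the operator's actual code.

```go
package main

import "fmt"

// PodView is a simplified view of an old-template-hash pod's status
// (assumption: the real code inspects corev1.Pod phase and container statuses).
type PodView struct {
	Phase   string // "Pending", "Running", ...
	Started bool   // all startup probes passed
	Ready   bool   // readiness probes passing
}

// categorize sketches the post-fix branch order. Before this PR there was no
// "starting" branch, so a Running/Started=false/Ready=false pod matched no
// predicate and was silently dropped from the update work.
func categorize(p PodView) string {
	switch {
	case p.Phase == "Pending":
		return "oldTemplateHashPendingPods"
	case p.Phase != "Running":
		return "oldTemplateHashUncategorizedPods" // e.g. Unknown phase
	case !p.Started:
		return "oldTemplateHashStartingPods" // new bucket added by this PR
	case !p.Ready:
		return "oldTemplateHashUnhealthyPods" // started but not yet ready
	default:
		return "oldTemplateHashReadyPods"
	}
}

func main() {
	// The replacement pod from the previous spec change, now old-hash:
	stuck := PodView{Phase: "Running", Started: false, Ready: false}
	fmt.Println(categorize(stuck)) // oldTemplateHashStartingPods
}
```

With every old-hash pod guaranteed to land in some bucket, the controller can delete all non-ready ones instead of looping forever on a pod it cannot see.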

Does this PR introduce an API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE

copy-pr-bot bot commented Feb 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

julienmancuso previously approved these changes Feb 12, 2026

Copilot AI left a comment


Pull request overview

Fixes rolling update categorization so pods in the startup-probe phase (Running + Started=false/nil + Ready=false) are no longer silently dropped from computeUpdateWork, preventing updates from getting stuck or ending prematurely (issue #400).

Changes:

  • Add HasAnyContainerNotStarted to detect pods whose containers have not passed startup probe.
  • Expand computeUpdateWork with oldTemplateHashStartingPods and oldTemplateHashUncategorizedPods buckets, and ensure old-hash pods always land in a bucket.
  • Delete all non-ready old-hash pods (pending/unhealthy/starting/uncategorized) immediately; add unit tests for the new utility and categorization.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
operator/internal/utils/kubernetes/pod.go Adds HasAnyContainerNotStarted pod utility.
operator/internal/utils/kubernetes/pod_test.go Adds unit tests for HasAnyContainerNotStarted.
operator/internal/controller/podclique/components/pod/rollingupdate.go Extends rolling update bucketing and expands non-ready old-hash deletion behavior.
operator/internal/controller/podclique/components/pod/rollingupdate_test.go Adds unit tests covering computeUpdateWork bucketing, including startup-probe-phase pods.


@gflarity gflarity force-pushed the rolling_update_stuck_400 branch from 2f5130d to 00f79c4 Compare February 17, 2026 15:35
@gflarity
Contributor Author

/ok to test 00f79c4

oldTemplateHashUnhealthyPods []*corev1.Pod
oldTemplateHashReadyPods []*corev1.Pod
newTemplateHashReadyPods []*corev1.Pod
oldTemplateHashPendingPods []*corev1.Pod
Contributor


nit: add a line comment on each pods slice here

newTemplateHashReadyPods []*corev1.Pod // newTemplateHashReadyPods holds the ready pods with the new template hash

@@ -73,8 +75,8 @@ func (w *updateWork) getNextPodToUpdate() *corev1.Pod {
func (r _resource) processPendingUpdates(logger logr.Logger, sc *syncContext) error {
work := r.computeUpdateWork(logger, sc)
Contributor


nit: rename work to updateWork; the "update" part of "updateWork" says more about what this structure holds

)

// newTestPod creates a pod with the given template hash label and options applied.
func newTestPod(templateHash string, opts ...func(*corev1.Pod)) *corev1.Pod {
Contributor


This function (and the rest of the test pod helpers) should live in a separate common file — it looks like something we could reuse.
If not, move it after the test function (TestComputeUpdateWork), which should be the main thing in this file.

func newTestPod(templateHash string, opts ...func(*corev1.Pod)) *corev1.Pod {
pod := &corev1.Pod{
ObjectMeta: metav1.ObjectMeta{
Name: "p",
Contributor


let's make the name an input arg



Development

Successfully merging this pull request may close these issues.

Rolling Update Gets Stuck When New Update Initiated During In-Progress Update

4 participants
