fix: handle startup-probe-phase pods in rolling update categorization#435

Open
gflarity wants to merge 3 commits into ai-dynamo:main from gflarity:rolling_update_stuck_400

Conversation

@gflarity
Contributor

/kind bug

What this PR does / why we need it:

Fixes a bug where rolling updates get permanently stuck or prematurely complete when an old-hash pod is in the startup probe phase. computeUpdateWork had no category for pods with Started=false (startup probe not yet passed), causing them to silently fall through all classification branches.

This PR:

  • Adds a HasAnyContainerNotStarted utility to detect pods still in the startup probe phase
  • Adds explicit oldTemplateHashStartingPods and oldTemplateHashUncategorizedPods buckets to updateWork
  • Deletes all non-ready old-hash pods (pending, unhealthy, starting, uncategorized) immediately rather than only pending/unhealthy
  • Adds unit tests for computeUpdateWork categorization and HasAnyContainerNotStarted
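The new utility described in the first bullet can be sketched as follows. This is a minimal, self-contained approximation: the real implementation takes a `*corev1.Pod` from k8s.io/api/core/v1, and the stand-in types here mirror only the fields the check needs.

```go
package main

import "fmt"

// ContainerStatus is a stand-in for the corev1 field this utility reads.
// Started is nil until the kubelet reports a startup-probe result, then
// false until the probe passes.
type ContainerStatus struct {
	Started *bool
}

// PodStatus and Pod are stand-ins for the corresponding corev1 types.
type PodStatus struct {
	ContainerStatuses []ContainerStatus
}

type Pod struct {
	Status PodStatus
}

// HasAnyContainerNotStarted reports whether any container in the pod has not
// yet passed its startup probe, i.e. Started is nil or false.
func HasAnyContainerNotStarted(pod *Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Started == nil || !*cs.Started {
			return true
		}
	}
	return false
}

func main() {
	yes, no := true, false
	starting := &Pod{Status: PodStatus{ContainerStatuses: []ContainerStatus{{Started: &no}}}}
	ready := &Pod{Status: PodStatus{ContainerStatuses: []ContainerStatus{{Started: &yes}}}}
	fmt.Println(HasAnyContainerNotStarted(starting)) // true: startup probe not yet passed
	fmt.Println(HasAnyContainerNotStarted(ready))    // false: all containers started
}
```

Treating a nil `Started` the same as false matters because the kubelet may not have reported container status yet for a freshly scheduled pod.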

Which issue(s) this PR fixes:

Fixes #400

Special notes for your reviewer:

The root cause: when a second spec change arrives while a replacement pod from a previous change is still in its startup probe phase, the replacement pod (now old-hash) has Phase=Running, Started=false, Ready=false. It fails all existing predicates (not Pending, not Started-but-not-Ready, not Ready) and is dropped from the update work entirely. This causes either an infinite requeue loop or premature updateEndedAt marking.
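The fall-through described above, and where the new buckets catch it, can be illustrated with a simplified classifier. The bucket names come from this PR; the exact predicates, branch order, and types are assumptions for illustration, not the operator's actual code.

```go
package main

import "fmt"

// PodView is a simplified view of an old-template-hash pod's status
// (assumption: the real code inspects corev1.Pod phase and container statuses).
type PodView struct {
	Phase   string // "Pending", "Running", ...
	Started bool   // all startup probes passed
	Ready   bool   // readiness probes passing
}

// categorize sketches the post-fix branch order. Before this PR there was no
// "starting" branch, so a Running/Started=false/Ready=false pod matched no
// predicate and was silently dropped from the update work.
func categorize(p PodView) string {
	switch {
	case p.Phase == "Pending":
		return "oldTemplateHashPendingPods"
	case p.Phase != "Running":
		return "oldTemplateHashUncategorizedPods" // e.g. Unknown phase
	case !p.Started:
		return "oldTemplateHashStartingPods" // new bucket added by this PR
	case !p.Ready:
		return "oldTemplateHashUnhealthyPods" // started but not yet ready
	default:
		return "oldTemplateHashReadyPods"
	}
}

func main() {
	// The replacement pod from the previous spec change, now old-hash:
	stuck := PodView{Phase: "Running", Started: false, Ready: false}
	fmt.Println(categorize(stuck)) // oldTemplateHashStartingPods
}
```

With every old-hash pod guaranteed to land in some bucket, the controller can delete all non-ready ones instead of looping forever on a pod it cannot see.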

Does this PR introduce an API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE

copy-pr-bot bot commented Feb 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

julienmancuso previously approved these changes Feb 12, 2026

Copilot AI left a comment


Pull request overview

Fixes rolling update categorization so pods in the startup-probe phase (Running + Started=false/nil + Ready=false) are no longer silently dropped from computeUpdateWork, preventing updates from getting stuck or ending prematurely (issue #400).

Changes:

  • Add HasAnyContainerNotStarted to detect pods whose containers have not passed startup probe.
  • Expand computeUpdateWork with oldTemplateHashStartingPods and oldTemplateHashUncategorizedPods buckets, and ensure old-hash pods always land in a bucket.
  • Delete all non-ready old-hash pods (pending/unhealthy/starting/uncategorized) immediately; add unit tests for the new utility and categorization.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
operator/internal/utils/kubernetes/pod.go Adds HasAnyContainerNotStarted pod utility.
operator/internal/utils/kubernetes/pod_test.go Adds unit tests for HasAnyContainerNotStarted.
operator/internal/controller/podclique/components/pod/rollingupdate.go Extends rolling update bucketing and expands non-ready old-hash deletion behavior.
operator/internal/controller/podclique/components/pod/rollingupdate_test.go Adds unit tests covering computeUpdateWork bucketing, including startup-probe-phase pods.


@gflarity gflarity force-pushed the rolling_update_stuck_400 branch from 2f5130d to 00f79c4 Compare February 17, 2026 15:35
@gflarity
Contributor Author

/ok to test 00f79c4

oldTemplateHashUnhealthyPods []*corev1.Pod
oldTemplateHashReadyPods []*corev1.Pod
newTemplateHashReadyPods []*corev1.Pod
oldTemplateHashPendingPods []*corev1.Pod
Contributor


nit: add a line comment on each pods slice here

newTemplateHashReadyPods []*corev1.Pod // newTemplateHashReadyPods holds the ready pods with the new template hash

@@ -73,8 +75,8 @@ func (w *updateWork) getNextPodToUpdate() *corev1.Pod {
func (r _resource) processPendingUpdates(logger logr.Logger, sc *syncContext) error {
work := r.computeUpdateWork(logger, sc)
Contributor


nit: rename work to updateWork; the "update" part of "updateWork" says more about what this structure holds

)

// newTestPod creates a pod with the given template hash label and options applied.
func newTestPod(templateHash string, opts ...func(*corev1.Pod)) *corev1.Pod {
Contributor


This function (and the rest of the test pod helpers) should live in a separate common file — it looks like something we could reuse.
If not, move it after the test function (TestComputeUpdateWork), which should be the main thing in this file.

func newTestPod(templateHash string, opts ...func(*corev1.Pod)) *corev1.Pod {
pod := &corev1.Pod{
ObjectMeta: metav1.ObjectMeta{
Name: "p",
Contributor


let's make the name an input arg



Development

Successfully merging this pull request may close these issues.

Rolling Update Gets Stuck When New Update Initiated During In-Progress Update

4 participants
