fix: handle startup-probe-phase pods in rolling update categorization #435
gflarity wants to merge 3 commits into ai-dynamo:main
Conversation
Pull request overview
Fixes rolling update categorization so pods in the startup-probe phase (Running + Started=false/nil + Ready=false) are no longer silently dropped from computeUpdateWork, preventing updates from getting stuck or ending prematurely (issue #400).
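For context, a minimal sketch of the pod state being described here, using the standard `corev1` types (the package name and container name are illustrative):

```go
package example

import corev1 "k8s.io/api/core/v1"

// startupProbePhasePod builds a pod in the state described above:
// Phase=Running, Started=false, Ready=false.
func startupProbePhasePod() *corev1.Pod {
	started := false
	return &corev1.Pod{
		Status: corev1.PodStatus{
			Phase: corev1.PodRunning,
			ContainerStatuses: []corev1.ContainerStatus{
				{Name: "main", Started: &started}, // startup probe not yet passed
			},
			Conditions: []corev1.PodCondition{
				{Type: corev1.PodReady, Status: corev1.ConditionFalse},
			},
		},
	}
}
```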
Changes:
- Add `HasAnyContainerNotStarted` (sketched below) to detect pods whose containers have not passed the startup probe.
- Expand `computeUpdateWork` with `oldTemplateHashStartingPods` and `oldTemplateHashUncategorizedPods` buckets, and ensure old-hash pods always land in a bucket.
- Delete all non-ready old-hash pods (pending/unhealthy/starting/uncategorized) immediately; add unit tests for the new utility and categorization.
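A minimal sketch of what such a utility could look like, assuming it treats a nil `Started` pointer the same as `false`; the package name and the actual implementation in `operator/internal/utils/kubernetes/pod.go` may differ:

```go
package kubernetes

import corev1 "k8s.io/api/core/v1"

// HasAnyContainerNotStarted returns true if any container in the pod has not
// yet passed its startup probe, i.e. its Started field is nil or false.
func HasAnyContainerNotStarted(pod *corev1.Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Started == nil || !*cs.Started {
			return true
		}
	}
	return false
}
```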
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| operator/internal/utils/kubernetes/pod.go | Adds HasAnyContainerNotStarted pod utility. |
| operator/internal/utils/kubernetes/pod_test.go | Adds unit tests for HasAnyContainerNotStarted. |
| operator/internal/controller/podclique/components/pod/rollingupdate.go | Extends rolling update bucketing and expands non-ready old-hash deletion behavior. |
| operator/internal/controller/podclique/components/pod/rollingupdate_test.go | Adds unit tests covering computeUpdateWork bucketing, including startup-probe-phase pods. |
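To make the expanded bucketing concrete, here is a hypothetical sketch of the classification order. The bucket names come from the diff and PR description; the helper functions, phase checks, and predicate order are assumptions for illustration, not the repository's actual `computeUpdateWork` code, which appends pods to slices on its `updateWork` struct rather than returning a label:

```go
package example

import corev1 "k8s.io/api/core/v1"

// bucketFor sketches the classification order only: every old-template-hash pod
// now lands in exactly one bucket instead of silently falling through.
func bucketFor(pod *corev1.Pod) string {
	switch {
	case pod.Status.Phase == corev1.PodPending:
		return "oldTemplateHashPendingPods"
	case hasAnyContainerNotStarted(pod):
		return "oldTemplateHashStartingPods" // startup probe phase, previously dropped
	case isPodReady(pod):
		return "oldTemplateHashReadyPods"
	case pod.Status.Phase == corev1.PodRunning:
		return "oldTemplateHashUnhealthyPods" // started but not Ready
	default:
		return "oldTemplateHashUncategorizedPods" // anything else, e.g. Succeeded/Failed
	}
}

// hasAnyContainerNotStarted and isPodReady are local stand-ins for this sketch.
func hasAnyContainerNotStarted(pod *corev1.Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Started == nil || !*cs.Started {
			return true
		}
	}
	return false
}

func isPodReady(pod *corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}
```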
Force-pushed from 2f5130d to 00f79c4

/ok to test 00f79c4
oldTemplateHashUnhealthyPods []*corev1.Pod
oldTemplateHashReadyPods     []*corev1.Pod
newTemplateHashReadyPods     []*corev1.Pod
oldTemplateHashPendingPods   []*corev1.Pod
nit: add a line comment on each pods slice here, e.g.:

newTemplateHashReadyPods []*corev1.Pod // newTemplateHashReadyPods holds the ready pods that already have the new template hash
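Applied to the fields visible in this hunk plus the two new buckets described in the PR, the suggested comments might look like the sketch below; the package name, exact field set, order, and wording are assumptions:

```go
package pod

import corev1 "k8s.io/api/core/v1"

// updateWork buckets pods encountered during a rolling update by template hash
// and readiness state.
type updateWork struct {
	oldTemplateHashPendingPods       []*corev1.Pod // old-hash pods still in the Pending phase
	oldTemplateHashStartingPods      []*corev1.Pod // old-hash pods whose containers have not passed the startup probe
	oldTemplateHashUnhealthyPods     []*corev1.Pod // old-hash pods that started but are not Ready
	oldTemplateHashUncategorizedPods []*corev1.Pod // old-hash pods matching none of the other buckets
	oldTemplateHashReadyPods         []*corev1.Pod // old-hash pods that are Ready
	newTemplateHashReadyPods         []*corev1.Pod // pods already on the new template hash and Ready
}
```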
@@ -73,8 +75,8 @@ func (w *updateWork) getNextPodToUpdate() *corev1.Pod {
func (r _resource) processPendingUpdates(logger logr.Logger, sc *syncContext) error {
	work := r.computeUpdateWork(logger, sc)
nit: rename `work` to `updateWork`; I think the "update" part of "updateWork" tells more about what this structure holds
)

// newTestPod creates a pod with the given template hash label and options applied.
func newTestPod(templateHash string, opts ...func(*corev1.Pod)) *corev1.Pod {
This function (and the rest of the test pod helpers) should be in a different common file; it looks like something we could reuse.
If not, maybe move it after the test function (TestComputeUpdateWork), which should be the main thing in this file.
func newTestPod(templateHash string, opts ...func(*corev1.Pod)) *corev1.Pod {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name: "p",
let's make the name an input arg
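A sketch of the helper with the name promoted to a parameter, as suggested; the package name and label key are assumptions:

```go
package pod

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newTestPod creates a pod with the given name and template hash label,
// then applies the supplied options.
func newTestPod(name, templateHash string, opts ...func(*corev1.Pod)) *corev1.Pod {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:   name,
			Labels: map[string]string{"pod-template-hash": templateHash}, // label key assumed
		},
	}
	for _, opt := range opts {
		opt(pod)
	}
	return pod
}
```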
/kind bug
What this PR does / why we need it:
Fixes a bug where rolling updates get permanently stuck or prematurely complete when an old-hash pod is in the startup probe phase.
`computeUpdateWork` had no category for pods with `Started=false` (startup probe not yet passed), causing them to silently fall through all classification branches. This PR:
- Adds a `HasAnyContainerNotStarted` utility to detect pods still in the startup probe phase
- Adds `oldTemplateHashStartingPods` and `oldTemplateHashUncategorizedPods` buckets to `updateWork`
- Adds unit tests for `computeUpdateWork` categorization and `HasAnyContainerNotStarted`

Which issue(s) this PR fixes:
Fixes #400
Special notes for your reviewer:
The root cause: when a second spec change arrives while a replacement pod from a previous change is still in its startup probe phase, the replacement pod (now old-hash) has
`Phase=Running`, `Started=false`, `Ready=false`. It fails all existing predicates (not Pending, not Started-but-not-Ready, not Ready) and is dropped from the update work entirely. This causes either an infinite requeue loop or premature `updateEndedAt` marking.

Does this PR introduce an API change?
Additional documentation e.g., enhancement proposals, usage docs, etc.: