fix(google): Add retry and status polling logic #6379

jasonmcintosh merged 3 commits into spinnaker:release-1.36.x from
Conversation
… backend services
// Without waiting for autoscaler creation to complete, subsequent deployment steps (health checks, traffic routing)
// may execute before the autoscaler is active, leading to inconsistent behavior and potential deployment failures.
//
// This fix aligns Spinnaker GCP behavior with Spinnaker AWS behavior, where autoscaling group operations are synchronous.
This feels like a big enough change that it warrants a feature flag.
SO this is a bug, not a feature. The current implementation gets an async response. Instead of polling for the state of that response, it just accepts the response if it doesn't error and continues. Turns out... there are cases where that response is in a PENDING sort of state that causes failures UNLESS you poll for completion.
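For illustration, the fix described here amounts to replacing "accept the first async response" with "poll the operation to completion." A minimal sketch (Python for brevity; `wait_for_operation` and `get_status` are hypothetical names, not Spinnaker's actual API — the real implementation in clouddriver goes through `GoogleOperationPoller`):

```python
import time

def wait_for_operation(get_status, operation_name, timeout_s=300, interval_s=2):
    """Poll a status callable until the GCE operation reports DONE,
    instead of trusting the initial (possibly PENDING) response."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status(operation_name)
        if status == "DONE":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"operation {operation_name} still {status} after {timeout_s}s")
        time.sleep(interval_s)
```

The key difference from the old behavior is the explicit timeout: a PENDING operation now either becomes DONE or fails loudly, rather than letting later deployment steps run against a half-created autoscaler.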
The comment isn't exactly correct on the internals of where this fails...
Right, but...how many folks are depending on the current behavior? Or, what if this actually makes things worse (at least temporarily, while we work out the kinks)?
SO this LITERALLY can cause failures without this polling. See spinnaker/spinnaker#7170 for this.
Yeah, it's a tough call. I bet there are some setups that are gonna start to fail when this behavior changes -- pipelines are gonna take longer to complete than people expect... even if what people expect is "wrong." In the spirit of reducing blast radius I think a feature flag is a good idea, but I won't hold up the PR for it.
Let's look at FF. My only concern on FF is...that doing this as a flag may be MORE complicated/risky than without. That said, I understand the concerns!
Potential middle ground for a feature flag to opt-out rather than in, for people running into issues? With a request to post in Slack if needing to turn it on. If nobody messages after 1/2 releases, rip off the bandaid?
Yeah, I'm good with having the feature flag enabled by default.
@mattgogerly @dbyron-sf @jasonmcintosh thank you for reviewing the code and your input - I've added an enableAsyncOperationWait flag, enabled by default, along with a log message - I haven't found other places where we log such messages, so let me know if there's a better example
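A rough sketch of what a default-enabled opt-out flag with a log-once warning might look like (Python for illustration; `maybe_wait_for_operation` and the flag wiring are hypothetical — only the flag name `enableAsyncOperationWait` comes from the PR):

```python
import logging

log = logging.getLogger("google-deploy")
_warned_legacy_behavior = False

def maybe_wait_for_operation(operation, wait_fn, enable_async_operation_wait=True):
    """Run the new polling path unless the operator has opted out.

    enable_async_operation_wait mirrors the PR's default-enabled flag;
    when disabled, warn once and keep the legacy fire-and-forget behavior.
    """
    global _warned_legacy_behavior
    if not enable_async_operation_wait:
        if not _warned_legacy_behavior:
            log.warning("enableAsyncOperationWait is disabled; async GCE "
                        "operations will not be polled for completion")
            _warned_legacy_behavior = True
        return None
    return wait_fn(operation)
```

Defaulting the flag to on matches the "opt-out rather than opt-in" middle ground proposed above: everyone gets the fix, and anyone who hits a regression has an escape hatch while the kinks are worked out.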
  registry
)

if (operation) {
Under what circumstances is operation null? Is it worth logging a message or emitting a metric?
The previous invocation would always return null. Since polling for the state of the operation (if it's there) requires passing operation information, we have to check whether that's null before using the operation. This null check wasn't needed previously because we weren't polling for state previously.
and below where null was always returned.
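The resulting guard is trivial but load-bearing; roughly (Python sketch, `poll_if_present` is a hypothetical name):

```python
def poll_if_present(operation, poller):
    """The retried create call may still hand back None (e.g. when the
    retry helper swallowed every failure), so only poll when there is
    actually an operation to poll."""
    if operation is None:
        return None
    return poller(operation)
```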
I see....Wouldn't the world be a better place if SafeRetry.doRetry let the exception bubble up? Seems like we're effectively swallowing them which doesn't seem great.
Yes I agree ;) I think that's another area that could use some improvement ... probably separate from this.
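The suggested improvement — letting the last failure bubble up instead of returning null — could look roughly like this (Python sketch; `do_retry` here is modeled on, but is not, Spinnaker's `SafeRetry.doRetry`):

```python
def do_retry(action, max_attempts=3):
    """Retry an action; unlike a swallow-and-return-None helper,
    re-raise the final exception so callers can see why every
    attempt failed instead of getting a silent null back."""
    last_exc = None
    for _ in range(max_attempts):
        try:
            return action()
        except Exception as exc:  # real code would catch a narrower type
            last_exc = exc
    raise last_exc
```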
...ovy/com/netflix/spinnaker/clouddriver/google/deploy/handlers/BasicGoogleDeployHandler.groovy (outdated, resolved)
...ovy/com/netflix/spinnaker/clouddriver/google/deploy/handlers/BasicGoogleDeployHandler.groovy (outdated, resolved)
FYI I think I'm good with this... so if others are good, I think we can get this merged. We're fixing GCP builds in master now, which are currently a bit broken :(
* fix(google): Add retry and status polling logic (spinnaker#6379)
* chore(google): Add retry and status polling logic for autoscalers and backend services
* chore(google): Added enableAsyncOperationWait=true flag for legacy behaviour
* chore(google): Log enableAsyncOperationWait flag warning just once
* fix(builds): Fixes upload URLs for releases
* chore(accounts-api): Cherry-picks from monorepo to release-1.36.x

Co-authored-by: Maksim Rusan <[email protected]>
Co-authored-by: Jason McIntosh <[email protected]>
The fix addresses an issue we encountered where, during red-black deployments, subsequent steps could execute before autoscaler creation finished, causing inconsistent behavior and potential deployment failures.
Now the async operations use the GoogleOperationPoller, which implements proper retry logic and handles operation status polling.