fix(google): Add retry and status polling logic #6379

jasonmcintosh merged 3 commits into spinnaker:release-1.36.x from
Conversation
… backend services
// Without waiting for autoscaler creation to complete, subsequent deployment steps (health checks, traffic routing)
// may execute before the autoscaler is active, leading to inconsistent behavior and potential deployment failures.
//
// This fix aligns Spinnaker GCP behavior with Spinnaker AWS behavior, where autoscaling group operations are synchronous.
This feels like a big enough change that it warrants a feature flag.
SO this is a bug, not a feature. The current implementation gets an async response. Instead of polling for the state of that response, it just accepts the response if it doesn't error and continues. Turns out... there are cases where that response is in a PENDING sort of state that causes failures UNLESS you poll for completion.
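For illustration, the fix described here amounts to replacing "accept the first async response" with "poll the operation to completion." A minimal sketch (Python for brevity; `wait_for_operation` and `get_status` are hypothetical names, not Spinnaker's actual API — the real implementation in clouddriver goes through `GoogleOperationPoller`):

```python
import time

def wait_for_operation(get_status, operation_name, timeout_s=300, interval_s=2):
    """Poll a status callable until the GCE operation reports DONE,
    instead of trusting the initial (possibly PENDING) response."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status(operation_name)
        if status == "DONE":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"operation {operation_name} still {status} after {timeout_s}s")
        time.sleep(interval_s)
```

The key difference from the old behavior is the explicit timeout: a PENDING operation now either becomes DONE or fails loudly, rather than letting later deployment steps run against a half-created autoscaler.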
The comment isn't exactly correct on the internals of where this fails...
Right, but...how many folks are depending on the current behavior? Or, what if this actually makes things worse (at least temporarily, while we work out the kinks)?
SO this LITERALLY can cause failures without this polling. See spinnaker/spinnaker#7170 for this.
Yeah, it's a tough call. I bet there are some setups that are gonna start to fail when this behavior changes -- pipelines are gonna take longer to complete than people expect... even if what people expect is "wrong." In the spirit of reducing blast radius I think a feature flag is a good idea, but I won't hold up the PR for it.
Let's look at FF. My only concern on FF is...that doing this as a flag may be MORE complicated/risky than without. That said, I understand the concerns!
Potential middle ground for a feature flag to opt-out rather than in, for people running into issues? With a request to post in Slack if needing to turn it on. If nobody messages after 1/2 releases, rip off the bandaid?
Yeah, I'm good with having the feature flag enabled by default.
@mattgogerly @dbyron-sf @jasonmcintosh thank you for reviewing the code and your input - I've added an enableAsyncOperationWait flag, enabled by default, along with a log message - I haven't found other places where we log such messages, so let me know if there's a better example
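A rough sketch of what a default-enabled opt-out flag with a log-once warning might look like (Python for illustration; `maybe_wait_for_operation` and the flag wiring are hypothetical — only the flag name `enableAsyncOperationWait` comes from the PR):

```python
import logging

log = logging.getLogger("google-deploy")
_warned_legacy_behavior = False

def maybe_wait_for_operation(operation, wait_fn, enable_async_operation_wait=True):
    """Run the new polling path unless the operator has opted out.

    enable_async_operation_wait mirrors the PR's default-enabled flag;
    when disabled, warn once and keep the legacy fire-and-forget behavior.
    """
    global _warned_legacy_behavior
    if not enable_async_operation_wait:
        if not _warned_legacy_behavior:
            log.warning("enableAsyncOperationWait is disabled; async GCE "
                        "operations will not be polled for completion")
            _warned_legacy_behavior = True
        return None
    return wait_fn(operation)
```

Defaulting the flag to on matches the "opt-out rather than opt-in" middle ground proposed above: everyone gets the fix, and anyone who hits a regression has an escape hatch while the kinks are worked out.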
  registry
)

if (operation) {
Under what circumstances is operation null? Is it worth logging a message or emitting a metric?
The previous invocation would always return null. Since polling for the state of the operation (if it's there) requires passing operation information, we have to check whether that's null before using the operation. This null check wasn't needed previously because we weren't polling for state previously.
and below where null was always returned.
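The resulting guard is trivial but load-bearing; roughly (Python sketch, `poll_if_present` is a hypothetical name):

```python
def poll_if_present(operation, poller):
    """The retried create call may still hand back None (e.g. when the
    retry helper swallowed every failure), so only poll when there is
    actually an operation to poll."""
    if operation is None:
        return None
    return poller(operation)
```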
I see....Wouldn't the world be a better place if SafeRetry.doRetry let the exception bubble up? Seems like we're effectively swallowing them which doesn't seem great.
Yes I agree ;) I think that's another area that could use some improvement ... probably separate from this.
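The suggested improvement — letting the last failure bubble up instead of returning null — could look roughly like this (Python sketch; `do_retry` here is modeled on, but is not, Spinnaker's `SafeRetry.doRetry`):

```python
def do_retry(action, max_attempts=3):
    """Retry an action; unlike a swallow-and-return-None helper,
    re-raise the final exception so callers can see why every
    attempt failed instead of getting a silent null back."""
    last_exc = None
    for _ in range(max_attempts):
        try:
            return action()
        except Exception as exc:  # real code would catch a narrower type
            last_exc = exc
    raise last_exc
```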
...ovy/com/netflix/spinnaker/clouddriver/google/deploy/handlers/BasicGoogleDeployHandler.groovy (outdated, resolved)
...ovy/com/netflix/spinnaker/clouddriver/google/deploy/handlers/BasicGoogleDeployHandler.groovy (outdated, resolved)
FYI I think I'm good with this... so if others are good, I think we can get this merged. We're fixing GCP builds in master now, which are currently a bit broken :(
* fix(google): Add retry and status polling logic (spinnaker#6379)
* chore(google): Add retry and status polling logic for autoscalers and backend services
* chore(google): Added enableAsyncOperationWait=true flag for legacy behaviour
* chore(google): Log enableAsyncOperationWait flag warning just once
* fix(builds): Fixes upload URLs for releases
* chore(accounts-api): Cherry-picks from monorepo to release-1.36.x

Co-authored-by: Maksim Rusan <[email protected]>
Co-authored-by: Jason McIntosh <[email protected]>
The fix addresses an issue we encountered where, during red-black deployments, subsequent steps could execute before autoscaler creation finished, causing inconsistent behavior and potential deployment failures.
Now the async operations use the GoogleOperationPoller, which implements proper retry logic and handles operation status polling.