Description
Checklist:
- I've included steps to reproduce the bug.
- I've included the version of argo rollouts.
Describe the bug
When a new rollout is deployed while a canary deployment is in progress (e.g., at 80% canary weight) with traffic routing (Istio), we observe that the old canary ReplicaSet appears to be scaled down immediately, which may not be respecting scaleDownDelaySeconds. This seems to cause traffic routing errors ("UNAVAILABLE: no healthy upstream"), possibly because Istio Envoy proxies still route traffic to the old canary pods while they are being terminated.
Suspected Root Cause:
When transitioning from V1 (stable) → V2 (canary at 80%) → V3 (new canary), we suspect the controller may be:
- Updating status.canary.weights.canary.podTemplateHash to V3 in reconcileTrafficRouting() (rollout/trafficrouting.go:303-307)
- Treating V2 as no longer "referenced" because isReplicaSetReferenced() only checks current status (rollout/replicaset.go:359-361)
- Treating V2 as an "intermediate RS" and scaling to 0 immediately without delay (rollout/canary.go:236-250)
The code at rollout/canary.go:241 assumes: "It is safe to scale the intermediate RS down, since no traffic is directed to it". However, when traffic routing is configured, this assumption may not hold because:
- Istio VirtualService may still have 80% weight pointing to V2 (see the example VirtualService snapshot after this list)
- Istio Envoy proxies may need time (default 30s) to propagate the new configuration
- V2 pods could be terminated while still receiving traffic
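For illustration, here is a minimal sketch of the rollout-managed VirtualService state we suspect at the moment V2 is scaled down, assuming hypothetical host, service, and route names (this is not taken from a real cluster):

```yaml
# Hypothetical snapshot right after V3 is applied: the canary route still
# carries the 80% weight that was set while V2 was the canary, so V2 pods
# can keep receiving traffic while they are being terminated, until the
# updated weights reach the Envoy proxies.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: example-vsvc            # hypothetical name
spec:
  hosts:
  - example.example.com         # hypothetical host
  http:
  - name: primary
    route:
    - destination:
        host: example-stable    # stable Service (V1 pods)
      weight: 20
    - destination:
        host: example-canary    # canary Service (V2 pods, now terminating)
      weight: 80
```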
To Reproduce
- Create a Rollout with (see the example manifest after these steps):
  - Canary strategy with Istio traffic routing
  - Multi-step canary (e.g., 20%, 40%, 60%, 80%, 100%)
  - scaleDownDelaySeconds: 30 (or use the default)
- Deploy version V1 (fully promoted)
- Deploy version V2 and let the rollout progress to 80% canary weight
- Deploy version V3 (trigger a new rollout)
- Observe:
  - V2 pods appear to be scaled to 0 immediately
  - Istio may still try to route 80% of traffic to V2, because the traffic weight change has not propagated yet
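For reference, a minimal example Rollout for the first step; all names and the image are hypothetical placeholders, and the exact step values are just one way to reach the described state:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-rollout              # hypothetical name
spec:
  replicas: 5
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - name: example
        image: example/app:v1        # bump the tag to deploy V2 and then V3
  strategy:
    canary:
      canaryService: example-canary  # Services referenced by the Istio VirtualService
      stableService: example-stable
      scaleDownDelaySeconds: 30
      trafficRouting:
        istio:
          virtualService:
            name: example-vsvc
            routes:
            - primary
      steps:
      - setWeight: 20
      - pause: {}
      - setWeight: 40
      - pause: {}
      - setWeight: 60
      - pause: {}
      - setWeight: 80
      - pause: {}                    # apply V3 while paused here at 80% to hit the issue
```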
Expected behavior
We would expect:
- V2 ReplicaSet to remain scaled up for scaleDownDelaySeconds (default 30s)
- Traffic weight to shift to V1 (stable) first
- After the delay, V2 to scale down
- No downtime to occur
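As a point of reference for the first expectation, when the delay is honored we would expect the V2 ReplicaSet to be annotated with a scale-down deadline rather than scaled to 0 right away, roughly like this (ReplicaSet name, replica count, and timestamp are made up for illustration):

```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: example-rollout-5d8f7c9b6    # hypothetical V2 ReplicaSet
  annotations:
    argo-rollouts.argoproj.io/scale-down-deadline: "2024-01-01T00:00:30Z"
spec:
  replicas: 4                        # still scaled up until the deadline passes
```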
Comparison with Abort:
When aborting a canary (instead of deploying V3), the delay seems to work as expected:
- V2 stays as newRS (not moved to otherRSs)
- V2 pods remain for 30 seconds before scaling down
- No downtime occurs
Version
v1.7.2
Code References
The issue appears to be in rollout/canary.go:236-250 (scaleDownOldReplicaSetsForCanary):
} else {
// If we get here, we are *not* fully promoted and are in the middle of an update.
// We just encountered a scaled up ReplicaSet which is neither the stable or canary
// and doesn't yet have scale down deadline. This happens when a user changes their
// mind in the middle of an V1 -> V2 update, and then applies a V3. We are deciding
// what to do with the defunct, intermediate V2 ReplicaSet right now.
// It is safe to scale the intermediate RS down, since no traffic is directed to it.
c.log.Infof("scaling down intermediate RS '%s'", targetRS.Name)
}
// Scale down.
_, _, err = c.scaleReplicaSetAndRecordEvent(targetRS, desiredReplicaCount)

The comment "It is safe to scale the intermediate RS down, since no traffic is directed to it" may not be accurate when traffic routing is configured, as traffic could still be directed to V2 via Istio until Envoy proxies receive and apply the updated configuration.
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.