
ScaleDownDelay not applied when new rollout starts during active canary with traffic routing, causing downtime #4534

Description

@Rajin9601

Checklist:

  • I've included steps to reproduce the bug.
  • I've included the version of argo rollouts.

Describe the bug

When a new rollout is deployed while a canary deployment is in progress (e.g., at 80% canary weight) with traffic routing (Istio), the old canary ReplicaSet appears to be scaled down immediately, apparently without respecting scaleDownDelaySeconds. This seems to cause traffic routing errors ("UNAVAILABLE: no healthy upstream"), possibly because Istio Envoy proxies still route traffic to the old canary pods while they are being terminated.

Suspected Root Cause:

When transitioning from V1 (stable) → V2 (canary at 80%) → V3 (new canary), we suspect the controller may be:

  1. Updating status.canary.weights.canary.podTemplateHash to V3 in reconcileTrafficRouting() (rollout/trafficrouting.go:303-307)
  2. Treating V2 as no longer "referenced" because isReplicaSetReferenced() only checks current status (rollout/replicaset.go:359-361)
  3. Treating V2 as an "intermediate RS" and scaling to 0 immediately without delay (rollout/canary.go:236-250)

The code at rollout/canary.go:241 assumes that "It is safe to scale the intermediate RS down, since no traffic is directed to it". However, when traffic routing is configured, this assumption may not hold because:

  • Istio VirtualService may still have 80% weight pointing to V2
  • Istio Envoy proxies may need time (default 30s) to propagate the new configuration
  • V2 pods could be terminated while still receiving traffic
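
To illustrate the suspected window, here is a hypothetical VirtualService at the moment V3 is applied (resource and service names are our own illustration, not from the controller):

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app-vsvc
spec:
  hosts:
    - my-app
  http:
    - name: primary
      route:
        - destination:
            host: my-app-stable   # Service still selecting V1 pods
          weight: 20
        - destination:
            host: my-app-canary   # Service selecting V2 pods, which are being terminated
          weight: 80

Until the controller rewrites these weights and every Envoy sidecar applies the new configuration, 80% of requests are still sent toward V2 endpoints that may already be gone.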

To Reproduce

  1. Create a Rollout with (see the example manifest after these steps):

    • Canary strategy with Istio traffic routing
    • Multi-step canary (e.g., 20%, 40%, 60%, 80%, 100%)
    • scaleDownDelaySeconds: 30 (or use default)
  2. Deploy version V1 (fully promoted)

  3. Deploy version V2 and let the rollout progress to 80% canary weight

  4. Deploy version V3 (trigger a new rollout)

  5. Observe:

    • V2 pods appear to be scaled to 0 immediately
    • Istio may still route 80% of traffic to V2, because the weight change has not yet propagated to the Envoy proxies
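
A minimal Rollout manifest matching this setup could look like the following sketch (names, image, replica count, and the pause steps are assumptions):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:v1
  strategy:
    canary:
      scaleDownDelaySeconds: 30
      stableService: my-app-stable
      canaryService: my-app-canary
      trafficRouting:
        istio:
          virtualService:
            name: my-app-vsvc
            routes:
              - primary
      steps:
        - setWeight: 20
        - pause: {}
        - setWeight: 40
        - pause: {}
        - setWeight: 60
        - pause: {}
        - setWeight: 80
        - pause: {}   # V3 is applied while the rollout sits at this step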

Expected behavior

We would expect:

  • V2 ReplicaSet to remain scaled up for scaleDownDelaySeconds (default 30s)
  • Traffic weight to shift to V1 (stable) first
  • After the delay, V2 to scale down
  • No downtime to occur

Comparison with Abort:

When aborting a canary (instead of deploying V3), the delay seems to work as expected:

  • V2 stays as newRS (not moved to otherRSs)
  • V2 pods remain for 30 seconds before scaling down
  • No downtime occurs

Version

v1.7.2

Code References

The issue appears to be in rollout/canary.go:236-250 (scaleDownOldReplicaSetsForCanary):

} else {
    // If we get here, we are *not* fully promoted and are in the middle of an update.
    // We just encountered a scaled up ReplicaSet which is neither the stable or canary
    // and doesn't yet have scale down deadline. This happens when a user changes their
    // mind in the middle of an V1 -> V2 update, and then applies a V3. We are deciding
    // what to do with the defunct, intermediate V2 ReplicaSet right now.
    // It is safe to scale the intermediate RS down, since no traffic is directed to it.
    c.log.Infof("scaling down intermediate RS '%s'", targetRS.Name)
}
// Scale down.
_, _, err = c.scaleReplicaSetAndRecordEvent(targetRS, desiredReplicaCount)

The comment "It is safe to scale the intermediate RS down, since no traffic is directed to it" may not be accurate when traffic routing is configured, as traffic could still be directed to V2 via Istio until Envoy proxies receive and apply the updated configuration.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.
