Description
Checklist:
- I've included steps to reproduce the bug.
- I've included the version of argo rollouts.
Describe the bug
When a new rollout is deployed while a canary deployment is in progress (e.g., at 80% canary weight) with traffic routing (Istio), we observe that the old canary ReplicaSet appears to be scaled down immediately, which may not be respecting scaleDownDelaySeconds. This seems to cause traffic routing errors ("UNAVAILABLE: no healthy upstream"), possibly because Istio Envoy proxies still route traffic to the old canary pods while they are being terminated.
Suspected Root Cause:
When transitioning from V1 (stable) → V2 (canary at 80%) → V3 (new canary), we suspect the controller may be:
- Updating status.canary.weights.canary.podTemplateHash to V3 in reconcileTrafficRouting() (rollout/trafficrouting.go:303-307)
- Treating V2 as no longer "referenced" because isReplicaSetReferenced() only checks current status (rollout/replicaset.go:359-361)
- Treating V2 as an "intermediate RS" and scaling to 0 immediately without delay (rollout/canary.go:236-250)
The code at rollout/canary.go:241 assumes: "It is safe to scale the intermediate RS down, since no traffic is directed to it". However, when traffic routing is configured, this assumption may not hold because:
- Istio VirtualService may still have 80% weight pointing to V2 (see the example VirtualService snapshot after this list)
- Istio Envoy proxies may need time (default 30s) to propagate the new configuration
- V2 pods could be terminated while still receiving traffic
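For illustration, here is a minimal sketch of the rollout-managed VirtualService state we suspect at the moment V2 is scaled down, assuming hypothetical host, service, and route names (this is not taken from a real cluster):

```yaml
# Hypothetical snapshot right after V3 is applied: the canary route still
# carries the 80% weight that was set while V2 was the canary, so V2 pods
# can keep receiving traffic while they are being terminated, until the
# updated weights reach the Envoy proxies.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: example-vsvc            # hypothetical name
spec:
  hosts:
  - example.example.com         # hypothetical host
  http:
  - name: primary
    route:
    - destination:
        host: example-stable    # stable Service (V1 pods)
      weight: 20
    - destination:
        host: example-canary    # canary Service (V2 pods, now terminating)
      weight: 80
```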
To Reproduce
- Create a Rollout with (see the example manifest after these steps):
  - Canary strategy with Istio traffic routing
  - Multi-step canary (e.g., 20%, 40%, 60%, 80%, 100%)
  - scaleDownDelaySeconds: 30 (or use the default)
- Deploy version V1 (fully promoted)
- Deploy version V2 and let the rollout progress to 80% canary weight
- Deploy version V3 (trigger a new rollout)
- Observe:
  - V2 pods appear to be scaled to 0 immediately
  - Istio may still try to route 80% of traffic to V2, because the traffic weight change has not propagated yet
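For reference, a minimal example Rollout for the first step; all names and the image are hypothetical placeholders, and the exact step values are just one way to reach the described state:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-rollout              # hypothetical name
spec:
  replicas: 5
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - name: example
        image: example/app:v1        # bump the tag to deploy V2 and then V3
  strategy:
    canary:
      canaryService: example-canary  # Services referenced by the Istio VirtualService
      stableService: example-stable
      scaleDownDelaySeconds: 30
      trafficRouting:
        istio:
          virtualService:
            name: example-vsvc
            routes:
            - primary
      steps:
      - setWeight: 20
      - pause: {}
      - setWeight: 40
      - pause: {}
      - setWeight: 60
      - pause: {}
      - setWeight: 80
      - pause: {}                    # apply V3 while paused here at 80% to hit the issue
```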
Expected behavior
We would expect:
- V2 ReplicaSet to remain scaled up for scaleDownDelaySeconds (default 30s)
- Traffic weight to shift to V1 (stable) first
- After the delay, V2 to scale down
- No downtime to occur
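As a point of reference for the first expectation, when the delay is honored we would expect the V2 ReplicaSet to be annotated with a scale-down deadline rather than scaled to 0 right away, roughly like this (ReplicaSet name, replica count, and timestamp are made up for illustration):

```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: example-rollout-5d8f7c9b6    # hypothetical V2 ReplicaSet
  annotations:
    argo-rollouts.argoproj.io/scale-down-deadline: "2024-01-01T00:00:30Z"
spec:
  replicas: 4                        # still scaled up until the deadline passes
```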
Comparison with Abort:
When aborting a canary (instead of deploying V3), the delay seems to work as expected:
- V2 stays as newRS (not moved to otherRSs)
- V2 pods remain for 30 seconds before scaling down
- No downtime occurs
Version
v1.7.2
Code References
The issue appears to be in rollout/canary.go:236-250 (scaleDownOldReplicaSetsForCanary):
} else {
// If we get here, we are *not* fully promoted and are in the middle of an update.
// We just encountered a scaled up ReplicaSet which is neither the stable or canary
// and doesn't yet have scale down deadline. This happens when a user changes their
// mind in the middle of an V1 -> V2 update, and then applies a V3. We are deciding
// what to do with the defunct, intermediate V2 ReplicaSet right now.
// It is safe to scale the intermediate RS down, since no traffic is directed to it.
c.log.Infof("scaling down intermediate RS '%s'", targetRS.Name)
}
// Scale down.
_, _, err = c.scaleReplicaSetAndRecordEvent(targetRS, desiredReplicaCount)

The comment "It is safe to scale the intermediate RS down, since no traffic is directed to it" may not be accurate when traffic routing is configured, as traffic could still be directed to V2 via Istio until Envoy proxies receive and apply the updated configuration.
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.