Fix metadata-webhook cleanup race condition #3973

Open
cardil wants to merge 1 commit into openshift-knative:release-1.37 from cardil:bugfix/1.37/metadata-webhook-cleanup

cardil (Member) commented Feb 7, 2026

Problem

The deployment-upgrade-failure test occasionally leaves Route resources stuck with finalizers because the metadata-webhook service is deleted before the Route can be processed by the webhook.

When the webhook service is unavailable, the MutatingWebhookConfiguration still intercepts Route deletion requests, causing a timeout and leaving the Route stuck.

Root Cause

Race condition between:

  1. Test cleanup deleting the metadata-webhook Deployment/Service
  2. Knative garbage collection trying to finalize Routes
  3. Kubernetes webhook cache invalidation timing

Solution

Added a namespaceSelector to the MutatingWebhookConfiguration so that the webhook applies only to namespaces carrying the samples.knative.dev/release: devel label, which the serving-tests namespace already has. This ensures:

  1. The webhook only applies to resources in namespaces with the correct label
  2. When the namespace is being torn down, the webhook no longer blocks deletions
  3. Other namespaces are not affected by the webhook
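
As a rough illustration of the scoping described above, the fragment below sketches a MutatingWebhookConfiguration with a namespaceSelector. It is not the manifest from this PR: the metadata name, the webhook name, and the clientConfig service reference are hypothetical placeholders, and matching on the label value (rather than on the label's mere presence) is an assumption. Only the label key and value and the serving-tests namespace come from the description.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: metadata-webhook-example            # hypothetical name
webhooks:
  - name: metadata-webhook.example.com      # hypothetical name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    clientConfig:
      service:
        name: metadata-webhook              # assumed Service name
        namespace: serving-tests
    # Only requests for objects in namespaces carrying this label are sent
    # to the webhook; requests in all other namespaces bypass it entirely.
    namespaceSelector:
      matchLabels:
        samples.knative.dev/release: devel
```

A selector using matchExpressions with operator: Exists would instead match any namespace that has the label, whatever its value; which variant the actual change uses is not stated here.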

Evidence

  • The metadata-webhook is only deployed when MESH=true (see hack/lib/serverless.bash:195-198)
  • The serving-tests namespace already has the required label in 100-namespace.yaml
  • This change scopes the webhook to match its actual intended usage
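
For reference, a namespace that such a selector would match looks roughly like the sketch below. It is reconstructed only from the name and label mentioned above; the real 100-namespace.yaml may carry additional labels or annotations.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: serving-tests
  labels:
    # Label named in the PR description; a namespaceSelector keyed on it
    # will match this namespace.
    samples.knative.dev/release: devel
```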

Related

Upstream issue in knative/serving: the cleanup order in deployment_failure.go deletes the webhook before tearing down the Service (to be reported separately).


Assisted-by: 🤖 Claude Opus/Sonnet 4.5

Add namespaceSelector to the MutatingWebhookConfiguration to limit
the webhook's scope to namespaces with the samples.knative.dev/release
label. This prevents the webhook from blocking resource deletions in
other namespaces when the serving-tests namespace is torn down.

The issue occurred during upgrade test cleanup where the Route resource
for deployment-upgrade-failure could not be deleted because the webhook
service was unavailable after namespace cleanup started.

Assisted-by: 🤖 Claude Opus/Sonnet 4.5

openshift-ci bot (Contributor) commented Feb 7, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cardil
Once this PR has been reviewed and has the lgtm label, please assign aliok for approval. For more information see the Code Review Process.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

cardil (Member, Author) commented Feb 7, 2026

/cherrypick main
/cherrypick release-1.38

openshift-cherrypick-robot (Contributor) commented

@cardil: once the present PR merges, I will cherry-pick it on top of main, release-1.38 in new PRs and assign them to you.

cardil (Member, Author) commented Feb 7, 2026

/retest

openshift-ci bot (Contributor) commented Feb 7, 2026

@cardil: The following tests failed. Say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                  Commit    Details   Required   Rerun command
ci/prow/420-mesh-e2e       4d98201   link      false      /test 420-mesh-e2e
ci/prow/420-mesh-upgrade   4d98201   link      false      /test 420-mesh-upgrade
