Skip to content

Fix router controllers dropping tombstone delete events#730

Open
WHOIM1205 wants to merge 2 commits intovolcano-sh:mainfrom
WHOIM1205:fix-router-tombstone-handling
Open

Fix router controllers dropping tombstone delete events#730
WHOIM1205 wants to merge 2 commits intovolcano-sh:mainfrom
WHOIM1205:fix-router-tombstone-handling

Conversation

@WHOIM1205
Copy link
Copy Markdown
Contributor

Fix: Handle Tombstone Delete Events in Router Controllers

Description

Router controllers were silently dropping delete events when informers delivered
DeletedFinalStateUnknown tombstones during watch reconnects.
This happens under normal Kubernetes conditions (API server restarts, network blips)
and leads to stale router state.

The issue was caused by using cache.MetaNamespaceKeyFunc, which cannot handle
tombstone objects.


What’s Fixed

  • replaced cache.MetaNamespaceKeyFunc with
    cache.DeletionHandlingMetaNamespaceKeyFunc in all router controller enqueue* handlers.
  • this ensures delete events are correctly enqueued whether they arrive as normal
    objects or tombstones.

Impact

  • prevents silent loss of delete events
  • avoids permanent stale routes and backends after controller restarts
  • no behavior change for normal (non-delete) events
  • improves router correctness and reliability in production clusters

Code Changes

Changes are limited to router controller enqueue paths:

  • pkg/kthena-router/controller/modelroute_controller.go
  • pkg/kthena-router/controller/modelserver_controller.go
    • enqueueModelServer
    • enqueuePod
  • pkg/kthena-router/controller/gateway_controller.go
  • pkg/kthena-router/controller/httproute_controller.go
  • pkg/kthena-router/controller/inferencepool_controller.go

All changes are mechanical, low-risk, and follow standard Kubernetes controller patterns.


Tests Added

Table-driven Go tests were added for each affected enqueue function:

  • TestEnqueueModelRoute
  • TestEnqueueModelServer
  • TestEnqueuePod
  • TestEnqueueGateway
  • TestEnqueueHTTPRoute
  • TestEnqueueInferencePool

Each test validates:

  • Normal object enqueue behavior (unchanged)
  • Tombstone (DeletedFinalStateUnknown) delete handling

all the test cases are passed locally :
image


How to Verify

go test -v ./pkg/kthena-router/controller/ -run "Enqueue"

@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello @WHOIM1205, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical reliability issue in the router controllers where delete events, particularly those arriving as Kubernetes tombstones during API server disruptions, were being ignored. By updating the key generation logic in the event enqueueing mechanisms, the system can now properly process these deletion signals, ensuring that the router's state remains consistent and accurate even under transient network or API server issues.

Highlights

  • Critical Bug Fix: Router controllers no longer silently drop delete events (tombstones) during Kubernetes watch reconnects, preventing stale routing states.
  • Enhanced Event Handling: Replaced cache.MetaNamespaceKeyFunc with cache.DeletionHandlingMetaNamespaceKeyFunc across all affected enqueue* handlers to correctly process DeletedFinalStateUnknown objects.
  • Comprehensive Testing: New table-driven Go tests have been added for each modified enqueue function, specifically validating both normal object and tombstone delete event handling.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@WHOIM1205
Copy link
Copy Markdown
Contributor Author

/assign @hzxuzhonghu

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This is a great pull request that correctly fixes an important bug where tombstone delete events were being dropped. The change to use cache.DeletionHandlingMetaNamespaceKeyFunc is the right approach. I appreciate the thoroughness in applying this fix across all relevant controllers and adding comprehensive, table-driven unit tests for each change to validate the fix for both normal objects and tombstones. The code is clean and the changes are low-risk. I have one suggestion to improve the maintainability of the new test code.

@WHOIM1205 WHOIM1205 force-pushed the fix-router-tombstone-handling branch from 3e6ef56 to 0a67eac Compare January 31, 2026 19:58
@WHOIM1205 WHOIM1205 force-pushed the fix-router-tombstone-handling branch 6 times, most recently from a83e3f2 to 999a776 Compare January 31, 2026 21:04
Copy link
Copy Markdown
Contributor

@FAUST-BENCHOU FAUST-BENCHOU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets wait for maintainers' review

|------------|------|---------|
| https://ghcr.io/volcano-sh/charts/kthena | networking | 1.0.0 |
| https://ghcr.io/volcano-sh/charts/kthena | workload | 1.0.0 |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

? why all instead of partly in
#731 (comment)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the table was invalid as a whole (column mismatch) and was breaking gen check
removing it entirely was the minimal and safest fix to unblock cl rather than keeping a partially broken structure

Makefile Outdated
@./test/e2e/setup.sh
@echo "Running E2E tests sequentially..."
@KUBECONFIG=/tmp/kubeconfig-e2e go test -p 1 $$(go list ./... | grep /test/e2e) -v -timeout=15m
@KUBECONFIG=/tmp/kubeconfig-e2e go test -p 1 $$(go list -f '{{if or .TestGoFiles .XTestGoFiles}}{{.ImportPath}}{{end}}' ./... | grep /test/e2e | grep -v '^$$') -v -timeout=15m
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#731 (comment)
? so wdum in this one.why different again in above pr

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change only avoids failing cl on packages with (no test files) actual e2e tests are still executed and will fail the build if they fail ,the goal is to avoid false negatives not to relax real test coverage

{{ template "chart.versionBadge" . }}{{ template "chart.typeBadge" . }}{{ template "chart.appVersionBadge" . }}

{{ template "chart.requirementsSection" . }}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we dont need this one anymore or what?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this template is still needed the change only affects generated docs to fix invalid markdown it doesn’t remove or deprecate the chart template


func (c *GatewayController) enqueueGateway(obj interface{}) {
key, err := cache.MetaNamespaceKeyFunc(obj)
key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

acceptable(since we also get a DeleteFunc in NewHTTPRouteController or else).

controller.registration, _ = httpRouteInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: controller.enqueueHTTPRoute,
UpdateFunc: func(old, new interface{}) { controller.enqueueHTTPRoute(new) },
DeleteFunc: controller.enqueueHTTPRoute,
})

k8s standard FYI:
https://github.com/kubernetes/client-go/blob/f651faf89451a2a3263d06653561101c26675659/examples/workqueue/main.go#L177-L202

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for confirming

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WHOIM1205 why not change all the MetaNamespaceKeyFunc to DeletionHandlingMetaNamespaceKeyFunc in kthena?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@YaoZengzeng
i intentionally kept this change scoped to the router controllers where delete events are handled and tombstones are actually observed
other usages of MetaNamespaceKeyFunc are either not wired to delete handlers or are protected by their event flow/
so changing them would be a broader refactor with more risk
for this pr i preferred a minimal targeted fix to address the concrete issue without expanding scope

Copy link
Copy Markdown
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please donot couple with unrelated change.

@WHOIM1205 WHOIM1205 force-pushed the fix-router-tombstone-handling branch from 999a776 to 01b0513 Compare February 2, 2026 04:52
@WHOIM1205
Copy link
Copy Markdown
Contributor Author

Please donot couple with unrelated change.

thanks for the feedback
i’ve updated the pr to keep it single purpose by retaining only the router tombstone fix and its related tests and reverting the unrelated docs cl changes
please let me know if this looks good now

hzxuzhonghu
hzxuzhonghu previously approved these changes Feb 2, 2026
@WHOIM1205
Copy link
Copy Markdown
Contributor Author

@hzxuzhonghu Just checking in let me know if you’d like me to do anything else here.

@YaoZengzeng
Copy link
Copy Markdown
Member

@hzxuzhonghu Just checking in let me know if you’d like me to do anything else here.

Please make all CI passed

@WHOIM1205 WHOIM1205 force-pushed the fix-router-tombstone-handling branch from 576e651 to 3e0fb95 Compare February 4, 2026 07:43
@FAUST-BENCHOU
Copy link
Copy Markdown
Contributor

u need rebase to the origin/master newest version to avoid build error

Router controllers were using cache.MetaNamespaceKeyFunc which fails
on cache.DeletedFinalStateUnknown (tombstone) objects delivered during
informer watch reconnection. This caused delete events to be silently
dropped, leaving stale routes in the datastore.

Changed to cache.DeletionHandlingMetaNamespaceKeyFunc in:
- ModelRouteController.enqueueModelRoute
- ModelServerController.enqueueModelServer
- ModelServerController.enqueuePod
- GatewayController.enqueueGateway
- HTTPRouteController.enqueueHTTPRoute
- InferencePoolController.enqueueInferencePool

Added table-driven tests covering both normal objects and tombstones.

Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>
@WHOIM1205 WHOIM1205 force-pushed the fix-router-tombstone-handling branch from 796ee6a to 46c3889 Compare February 4, 2026 07:57
@volcano-sh-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from hzxuzhonghu. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@WHOIM1205
Copy link
Copy Markdown
Contributor Author

u need rebase to the origin/master newest version to avoid build error

Thanks for the review all cl checks are passing now
happy to make any further changes if needed

@WHOIM1205
Copy link
Copy Markdown
Contributor Author

@hzxuzhonghu @YaoZengzeng is there anything i can do in this pr

## Requirements

| Repository | Name | Version |
|------------|------|---------|
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why should you touch here

@hzxuzhonghu
Copy link
Copy Markdown
Member

please revert the chart change

This reverts the unrelated helm chart documentation changes to match
upstream main, keeping only the router tombstone handling fix.

Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>
Copilot AI review requested due to automatic review settings February 16, 2026 21:48
@WHOIM1205
Copy link
Copy Markdown
Contributor Author

please revert the chart change

Reverted the chart change and kept this PR scoped only to the tombstone handling fix. Please take another look.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a critical reliability issue in router controllers where delete events were being silently dropped when Kubernetes informers delivered DeletedFinalStateUnknown tombstone objects during watch reconnects. The fix replaces cache.MetaNamespaceKeyFunc with cache.DeletionHandlingMetaNamespaceKeyFunc in all enqueue handlers across five router controllers.

Changes:

  • Updated enqueue functions in five router controllers to handle tombstone delete events correctly
  • Added comprehensive table-driven tests for each affected enqueue function
  • Updated Helm chart documentation with repository URLs for dependencies

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.

Show a summary per file
File Description
pkg/kthena-router/controller/modelserver_controller.go Fixed tombstone handling in enqueueModelServer and enqueuePod
pkg/kthena-router/controller/modelroute_controller.go Fixed tombstone handling in enqueueModelRoute
pkg/kthena-router/controller/gateway_controller.go Fixed tombstone handling in enqueueGateway
pkg/kthena-router/controller/httproute_controller.go Fixed tombstone handling in enqueueHTTPRoute
pkg/kthena-router/controller/inferencepool_controller.go Fixed tombstone handling in enqueueInferencePool
pkg/kthena-router/controller/modelserver_controller_test.go Added tests for enqueueModelServer and enqueuePod tombstone handling
pkg/kthena-router/controller/modelroute_controller_test.go Added tests for enqueueModelRoute tombstone handling
pkg/kthena-router/controller/gateway_controller_test.go Added tests for enqueueGateway tombstone handling
pkg/kthena-router/controller/httproute_controller_test.go Added tests for enqueueHTTPRoute tombstone handling
pkg/kthena-router/controller/inferencepool_controller_test.go Added tests for enqueueInferencePool tombstone handling
docs/kthena/docs/reference/helm-chart-values.md Added repository URLs for chart dependencies
charts/kthena/README.md.gotmpl Removed redundant requirementsSection template

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WHOIM1205 can you rebase and remobve unrelated change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants