Fix router controllers dropping tombstone delete events#730
Fix router controllers dropping tombstone delete events#730WHOIM1205 wants to merge 2 commits intovolcano-sh:mainfrom
Conversation
Summary of ChangesHello @WHOIM1205, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical reliability issue in the router controllers where delete events, particularly those arriving as Kubernetes tombstones during API server disruptions, were being ignored. By updating the key generation logic in the event enqueueing mechanisms, the system can now properly process these deletion signals, ensuring that the router's state remains consistent and accurate even under transient network or API server issues. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
/assign @hzxuzhonghu |
There was a problem hiding this comment.
Code Review
This is a great pull request that correctly fixes an important bug where tombstone delete events were being dropped. The change to use cache.DeletionHandlingMetaNamespaceKeyFunc is the right approach. I appreciate the thoroughness in applying this fix across all relevant controllers and adding comprehensive, table-driven unit tests for each change to validate the fix for both normal objects and tombstones. The code is clean and the changes are low-risk. I have one suggestion to improve the maintainability of the new test code.
3e6ef56 to
0a67eac
Compare
a83e3f2 to
999a776
Compare
FAUST-BENCHOU
left a comment
There was a problem hiding this comment.
lets wait for maintainers' review
| |------------|------|---------| | ||
| | https://ghcr.io/volcano-sh/charts/kthena | networking | 1.0.0 | | ||
| | https://ghcr.io/volcano-sh/charts/kthena | workload | 1.0.0 | | ||
|
|
There was a problem hiding this comment.
the table was invalid as a whole (column mismatch) and was breaking gen check
removing it entirely was the minimal and safest fix to unblock cl rather than keeping a partially broken structure
Makefile
Outdated
| @./test/e2e/setup.sh | ||
| @echo "Running E2E tests sequentially..." | ||
| @KUBECONFIG=/tmp/kubeconfig-e2e go test -p 1 $$(go list ./... | grep /test/e2e) -v -timeout=15m | ||
| @KUBECONFIG=/tmp/kubeconfig-e2e go test -p 1 $$(go list -f '{{if or .TestGoFiles .XTestGoFiles}}{{.ImportPath}}{{end}}' ./... | grep /test/e2e | grep -v '^$$') -v -timeout=15m |
There was a problem hiding this comment.
#731 (comment)
? so wdum in this one.why different again in above pr
There was a problem hiding this comment.
this change only avoids failing cl on packages with (no test files) actual e2e tests are still executed and will fail the build if they fail ,the goal is to avoid false negatives not to relax real test coverage
| {{ template "chart.versionBadge" . }}{{ template "chart.typeBadge" . }}{{ template "chart.appVersionBadge" . }} | ||
|
|
||
| {{ template "chart.requirementsSection" . }} | ||
|
|
There was a problem hiding this comment.
we dont need this one anymore or what?
There was a problem hiding this comment.
this template is still needed the change only affects generated docs to fix invalid markdown it doesn’t remove or deprecate the chart template
|
|
||
| func (c *GatewayController) enqueueGateway(obj interface{}) { | ||
| key, err := cache.MetaNamespaceKeyFunc(obj) | ||
| key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj) |
There was a problem hiding this comment.
acceptable(since we also get a DeleteFunc in NewHTTPRouteController or else).
kthena/pkg/kthena-router/controller/httproute_controller.go
Lines 60 to 64 in a8c9193
k8s standard FYI:
https://github.com/kubernetes/client-go/blob/f651faf89451a2a3263d06653561101c26675659/examples/workqueue/main.go#L177-L202
There was a problem hiding this comment.
thanks for confirming
There was a problem hiding this comment.
@WHOIM1205 why not change all the MetaNamespaceKeyFunc to DeletionHandlingMetaNamespaceKeyFunc in kthena?
There was a problem hiding this comment.
@YaoZengzeng
i intentionally kept this change scoped to the router controllers where delete events are handled and tombstones are actually observed
other usages of MetaNamespaceKeyFunc are either not wired to delete handlers or are protected by their event flow/
so changing them would be a broader refactor with more risk
for this pr i preferred a minimal targeted fix to address the concrete issue without expanding scope
hzxuzhonghu
left a comment
There was a problem hiding this comment.
Please donot couple with unrelated change.
999a776 to
01b0513
Compare
thanks for the feedback |
|
@hzxuzhonghu Just checking in let me know if you’d like me to do anything else here. |
Please make all CI passed |
01b0513 to
b9eb97a
Compare
e7415a5 to
8ac3922
Compare
576e651 to
3e0fb95
Compare
|
u need rebase to the origin/master newest version to avoid build error |
Router controllers were using cache.MetaNamespaceKeyFunc which fails on cache.DeletedFinalStateUnknown (tombstone) objects delivered during informer watch reconnection. This caused delete events to be silently dropped, leaving stale routes in the datastore. Changed to cache.DeletionHandlingMetaNamespaceKeyFunc in: - ModelRouteController.enqueueModelRoute - ModelServerController.enqueueModelServer - ModelServerController.enqueuePod - GatewayController.enqueueGateway - HTTPRouteController.enqueueHTTPRoute - InferencePoolController.enqueueInferencePool Added table-driven tests covering both normal objects and tombstones. Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>
796ee6a to
46c3889
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Thanks for the review all cl checks are passing now |
|
@hzxuzhonghu @YaoZengzeng is there anything i can do in this pr |
| ## Requirements | ||
|
|
||
| | Repository | Name | Version | | ||
| |------------|------|---------| |
|
please revert the chart change |
This reverts the unrelated helm chart documentation changes to match upstream main, keeping only the router tombstone handling fix. Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>
Reverted the chart change and kept this PR scoped only to the tombstone handling fix. Please take another look. |
There was a problem hiding this comment.
Pull request overview
This PR fixes a critical reliability issue in router controllers where delete events were being silently dropped when Kubernetes informers delivered DeletedFinalStateUnknown tombstone objects during watch reconnects. The fix replaces cache.MetaNamespaceKeyFunc with cache.DeletionHandlingMetaNamespaceKeyFunc in all enqueue handlers across five router controllers.
Changes:
- Updated enqueue functions in five router controllers to handle tombstone delete events correctly
- Added comprehensive table-driven tests for each affected enqueue function
- Updated Helm chart documentation with repository URLs for dependencies
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/kthena-router/controller/modelserver_controller.go | Fixed tombstone handling in enqueueModelServer and enqueuePod |
| pkg/kthena-router/controller/modelroute_controller.go | Fixed tombstone handling in enqueueModelRoute |
| pkg/kthena-router/controller/gateway_controller.go | Fixed tombstone handling in enqueueGateway |
| pkg/kthena-router/controller/httproute_controller.go | Fixed tombstone handling in enqueueHTTPRoute |
| pkg/kthena-router/controller/inferencepool_controller.go | Fixed tombstone handling in enqueueInferencePool |
| pkg/kthena-router/controller/modelserver_controller_test.go | Added tests for enqueueModelServer and enqueuePod tombstone handling |
| pkg/kthena-router/controller/modelroute_controller_test.go | Added tests for enqueueModelRoute tombstone handling |
| pkg/kthena-router/controller/gateway_controller_test.go | Added tests for enqueueGateway tombstone handling |
| pkg/kthena-router/controller/httproute_controller_test.go | Added tests for enqueueHTTPRoute tombstone handling |
| pkg/kthena-router/controller/inferencepool_controller_test.go | Added tests for enqueueInferencePool tombstone handling |
| docs/kthena/docs/reference/helm-chart-values.md | Added repository URLs for chart dependencies |
| charts/kthena/README.md.gotmpl | Removed redundant requirementsSection template |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
hzxuzhonghu
left a comment
There was a problem hiding this comment.
@WHOIM1205 can you rebase and remobve unrelated change.
Fix: Handle Tombstone Delete Events in Router Controllers
Description
Router controllers were silently dropping delete events when informers delivered
DeletedFinalStateUnknowntombstones during watch reconnects.This happens under normal Kubernetes conditions (API server restarts, network blips)
and leads to stale router state.
The issue was caused by using
cache.MetaNamespaceKeyFunc, which cannot handletombstone objects.
What’s Fixed
cache.MetaNamespaceKeyFuncwithcache.DeletionHandlingMetaNamespaceKeyFuncin all router controllerenqueue*handlers.objects or tombstones.
Impact
Code Changes
Changes are limited to router controller enqueue paths:
pkg/kthena-router/controller/modelroute_controller.gopkg/kthena-router/controller/modelserver_controller.goenqueueModelServerenqueuePodpkg/kthena-router/controller/gateway_controller.gopkg/kthena-router/controller/httproute_controller.gopkg/kthena-router/controller/inferencepool_controller.goAll changes are mechanical, low-risk, and follow standard Kubernetes controller patterns.
Tests Added
Table-driven Go tests were added for each affected enqueue function:
TestEnqueueModelRouteTestEnqueueModelServerTestEnqueuePodTestEnqueueGatewayTestEnqueueHTTPRouteTestEnqueueInferencePoolEach test validates:
DeletedFinalStateUnknown) delete handlingall the test cases are passed locally :

How to Verify