fix(e2e): restore waitForWebhookReady in TestWebhook after controller-manager restart#806
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves an intermittent failure in the Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a waitForWebhookReady helper function to fix a race condition in E2E tests where the webhook server might not be ready after a controller-manager restart. The implementation is sound, but I've suggested a refactoring to use a standard Kubernetes utility function for polling, which improves code clarity and robustness.
| func waitForWebhookReady(t *testing.T, ctx context.Context, kthenaClient *clientset.Clientset, timeout time.Duration) { | ||
| t.Helper() | ||
| deadline := time.Now().Add(timeout) | ||
| probe := createValidModelBoosterForWebhookTest() | ||
| for time.Now().Before(deadline) { | ||
| _, err := kthenaClient.WorkloadV1alpha1().ModelBoosters(testNamespace).Create(ctx, probe, metav1.CreateOptions{DryRun: []string{"All"}}) | ||
| if err == nil { | ||
| return | ||
| } | ||
| if !strings.Contains(err.Error(), "connect: connection refused") { | ||
| t.Fatalf("Webhook probe failed: %v", err) | ||
| } | ||
| t.Logf("Webhook not ready, retrying: %v", err) | ||
| time.Sleep(2 * time.Second) | ||
| } | ||
| t.Fatal("Webhook did not become ready within timeout") | ||
| } |
There was a problem hiding this comment.
For better readability and to use idiomatic patterns from the Kubernetes ecosystem, consider refactoring this polling loop to use wait.PollUntilContextTimeout from the k8s.io/apimachinery/pkg/util/wait package. This function is designed for this exact purpose and makes the intent of the code clearer, while also handling timeout and cancellation more robustly.
You'll need to add "k8s.io/apimachinery/pkg/util/wait" to your imports.
| func waitForWebhookReady(t *testing.T, ctx context.Context, kthenaClient *clientset.Clientset, timeout time.Duration) { | |
| t.Helper() | |
| deadline := time.Now().Add(timeout) | |
| probe := createValidModelBoosterForWebhookTest() | |
| for time.Now().Before(deadline) { | |
| _, err := kthenaClient.WorkloadV1alpha1().ModelBoosters(testNamespace).Create(ctx, probe, metav1.CreateOptions{DryRun: []string{"All"}}) | |
| if err == nil { | |
| return | |
| } | |
| if !strings.Contains(err.Error(), "connect: connection refused") { | |
| t.Fatalf("Webhook probe failed: %v", err) | |
| } | |
| t.Logf("Webhook not ready, retrying: %v", err) | |
| time.Sleep(2 * time.Second) | |
| } | |
| t.Fatal("Webhook did not become ready within timeout") | |
| } | |
| func waitForWebhookReady(t *testing.T, ctx context.Context, kthenaClient *clientset.Clientset, timeout time.Duration) { | |
| t.Helper() | |
| probe := createValidModelBoosterForWebhookTest() | |
| err := wait.PollUntilContextTimeout(ctx, 2*time.Second, timeout, true, func(ctx context.Context) (bool, error) { | |
| _, err := kthenaClient.WorkloadV1alpha1().ModelBoosters(testNamespace).Create(ctx, probe, metav1.CreateOptions{DryRun: []string{"All"}}) | |
| if err == nil { | |
| return true, nil // Webhook is ready | |
| } | |
| if strings.Contains(err.Error(), "connect: connection refused") { | |
| t.Logf("Webhook not ready, retrying: %v", err) | |
| return false, nil // Continue polling | |
| } | |
| return false, err // Stop polling with an unexpected error | |
| }) | |
| if err != nil { | |
| t.Fatalf("Webhook did not become ready: %v", err) | |
| } | |
| } |
|
PTAL @hzxuzhonghu. The e2e were failing: https://github.com/volcano-sh/kthena/actions/runs/22984670979 |
…-manager restart Signed-off-by: Rajesh Agrawal <rajeshagrawal38000@gmail.com>
7576753 to
50b7306
Compare
|
We have met it before in fact and it should have fixed in this pr #770 , |
hzxuzhonghu
left a comment
There was a problem hiding this comment.
This approach looks directionally correct to me. It probes actual webhook behavior with a dry-run create instead of relying on pod/endpoint readiness, which matches the failure mode described in the issue much more closely.
It also lines up with the earlier history on this test: the flake was about the webhook listener not accepting connections yet, so validating the webhook path itself is the right readiness signal.
Problem
TestWebhookwas failing withconnect: connection refusedbecauseTestModelServingControllerManagerRestart(which runs before it) kills and restarts the controller-manager pod, briefly taking down the webhook server.Fix
Restored the
waitForWebhookReadyhelper that polls the webhook until it accepts connections, and added a call at the start ofTestWebhookwith a 2-minute timeout.