Skip to content

fix(e2e): restore waitForWebhookReady in TestWebhook after controller-manager restart#806

Open
agrawalcodes wants to merge 1 commit intovolcano-sh:mainfrom
agrawalcodes:fix/webhook-wait-after-restart
Open

fix(e2e): restore waitForWebhookReady in TestWebhook after controller-manager restart#806
agrawalcodes wants to merge 1 commit intovolcano-sh:mainfrom
agrawalcodes:fix/webhook-wait-after-restart

Conversation

@agrawalcodes
Copy link
Copy Markdown
Contributor

Problem

TestWebhook was failing with connect: connection refused because TestModelServingControllerManagerRestart (which runs before it) kills and restarts the controller-manager pod, briefly taking down the webhook server.

Fix

Restored the waitForWebhookReady helper that polls the webhook until it accepts connections, and added a call at the start of TestWebhook with a 2-minute timeout.

Copilot AI review requested due to automatic review settings March 12, 2026 03:44
@volcano-sh-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yaozengzeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves an intermittent failure in the TestWebhook end-to-end test. The test was prone to failing with connection refused errors because a preceding test (TestModelServingControllerManagerRestart) would restart the controller-manager, temporarily taking the webhook server offline. The fix ensures that TestWebhook waits for the webhook to be fully ready before proceeding, preventing these transient connection issues.

Highlights

  • Flaky E2E Test Fix: Addressed the TestWebhook flakiness caused by connection refused errors after the controller-manager restart by reintroducing and utilizing the waitForWebhookReady helper.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • test/e2e/controller-manager/webhook_test.go
    • Added a new helper function waitForWebhookReady to poll the webhook until it accepts connections.
    • Integrated waitForWebhookReady into TestWebhook with a 2-minute timeout to ensure webhook readiness before test execution.
    • Imported context, strings, and clientset packages to support the new functionality.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a waitForWebhookReady helper function to fix a race condition in E2E tests where the webhook server might not be ready after a controller-manager restart. The implementation is sound, but I've suggested a refactoring to use a standard Kubernetes utility function for polling, which improves code clarity and robustness.

Comment on lines +33 to +49
func waitForWebhookReady(t *testing.T, ctx context.Context, kthenaClient *clientset.Clientset, timeout time.Duration) {
t.Helper()
deadline := time.Now().Add(timeout)
probe := createValidModelBoosterForWebhookTest()
for time.Now().Before(deadline) {
_, err := kthenaClient.WorkloadV1alpha1().ModelBoosters(testNamespace).Create(ctx, probe, metav1.CreateOptions{DryRun: []string{"All"}})
if err == nil {
return
}
if !strings.Contains(err.Error(), "connect: connection refused") {
t.Fatalf("Webhook probe failed: %v", err)
}
t.Logf("Webhook not ready, retrying: %v", err)
time.Sleep(2 * time.Second)
}
t.Fatal("Webhook did not become ready within timeout")
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For better readability and to use idiomatic patterns from the Kubernetes ecosystem, consider refactoring this polling loop to use wait.PollUntilContextTimeout from the k8s.io/apimachinery/pkg/util/wait package. This function is designed for this exact purpose and makes the intent of the code clearer, while also handling timeout and cancellation more robustly.

You'll need to add "k8s.io/apimachinery/pkg/util/wait" to your imports.

Suggested change
func waitForWebhookReady(t *testing.T, ctx context.Context, kthenaClient *clientset.Clientset, timeout time.Duration) {
t.Helper()
deadline := time.Now().Add(timeout)
probe := createValidModelBoosterForWebhookTest()
for time.Now().Before(deadline) {
_, err := kthenaClient.WorkloadV1alpha1().ModelBoosters(testNamespace).Create(ctx, probe, metav1.CreateOptions{DryRun: []string{"All"}})
if err == nil {
return
}
if !strings.Contains(err.Error(), "connect: connection refused") {
t.Fatalf("Webhook probe failed: %v", err)
}
t.Logf("Webhook not ready, retrying: %v", err)
time.Sleep(2 * time.Second)
}
t.Fatal("Webhook did not become ready within timeout")
}
func waitForWebhookReady(t *testing.T, ctx context.Context, kthenaClient *clientset.Clientset, timeout time.Duration) {
t.Helper()
probe := createValidModelBoosterForWebhookTest()
err := wait.PollUntilContextTimeout(ctx, 2*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
_, err := kthenaClient.WorkloadV1alpha1().ModelBoosters(testNamespace).Create(ctx, probe, metav1.CreateOptions{DryRun: []string{"All"}})
if err == nil {
return true, nil // Webhook is ready
}
if strings.Contains(err.Error(), "connect: connection refused") {
t.Logf("Webhook not ready, retrying: %v", err)
return false, nil // Continue polling
}
return false, err // Stop polling with an unexpected error
})
if err != nil {
t.Fatalf("Webhook did not become ready: %v", err)
}
}

@agrawalcodes
Copy link
Copy Markdown
Contributor Author

…-manager restart

Signed-off-by: Rajesh Agrawal <rajeshagrawal38000@gmail.com>
@agrawalcodes agrawalcodes force-pushed the fix/webhook-wait-after-restart branch from 7576753 to 50b7306 Compare March 12, 2026 03:46
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@FAUST-BENCHOU
Copy link
Copy Markdown
Contributor

We have met it before in fact and it should have fixed in this pr #770 ,
FYI: #758
and for some reason in https://github.com/volcano-sh/kthena/pull/756/changes#diff-b8c63a21f94d65b6bd68aa43dbda9f403b138994620af22bc574c5b7692dcc3b these changes are reverted.
In fact, you can simply undo the changes here.

Copy link
Copy Markdown
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach looks directionally correct to me. It probes actual webhook behavior with a dry-run create instead of relying on pod/endpoint readiness, which matches the failure mode described in the issue much more closely.

It also lines up with the earlier history on this test: the flake was about the webhook listener not accepting connections yet, so validating the webhook path itself is the right readiness signal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants