fix(e2e): restore waitForWebhookReady in TestWebhook after controller-manager restart by agrawalcodes · Pull Request #806 · volcano-sh/kthena

agrawalcodes · 2026-03-12T03:44:01Z

Problem

TestWebhook was failing with connect: connection refused because TestModelServingControllerManagerRestart (which runs before it) kills and restarts the controller-manager pod, briefly taking down the webhook server.

Fix

Restored the waitForWebhookReady helper that polls the webhook until it accepts connections, and added a call at the start of TestWebhook with a 2-minute timeout.

volcano-sh-bot · 2026-03-12T03:44:10Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yaozengzeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

test/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gemini-code-assist · 2026-03-12T03:44:14Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves an intermittent failure in the TestWebhook end-to-end test. The test was prone to failing with connection refused errors because a preceding test (TestModelServingControllerManagerRestart) would restart the controller-manager, temporarily taking the webhook server offline. The fix ensures that TestWebhook waits for the webhook to be fully ready before proceeding, preventing these transient connection issues.

Highlights

Flaky E2E Test Fix: Addressed the TestWebhook flakiness caused by connection refused errors after the controller-manager restart by reintroducing and utilizing the waitForWebhookReady helper.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

test/e2e/controller-manager/webhook_test.go
- Added a new helper function waitForWebhookReady to poll the webhook until it accepts connections.
- Integrated waitForWebhookReady into TestWebhook with a 2-minute timeout to ensure webhook readiness before test execution.
- Imported context, strings, and clientset packages to support the new functionality.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a waitForWebhookReady helper function to fix a race condition in E2E tests where the webhook server might not be ready after a controller-manager restart. The implementation is sound, but I've suggested a refactoring to use a standard Kubernetes utility function for polling, which improves code clarity and robustness.

gemini-code-assist · 2026-03-12T03:45:12Z

test/e2e/controller-manager/webhook_test.go

+func waitForWebhookReady(t *testing.T, ctx context.Context, kthenaClient *clientset.Clientset, timeout time.Duration) {
+	t.Helper()
+	deadline := time.Now().Add(timeout)
+	probe := createValidModelBoosterForWebhookTest()
+	for time.Now().Before(deadline) {
+		_, err := kthenaClient.WorkloadV1alpha1().ModelBoosters(testNamespace).Create(ctx, probe, metav1.CreateOptions{DryRun: []string{"All"}})
+		if err == nil {
+			return
+		}
+		if !strings.Contains(err.Error(), "connect: connection refused") {
+			t.Fatalf("Webhook probe failed: %v", err)
+		}
+		t.Logf("Webhook not ready, retrying: %v", err)
+		time.Sleep(2 * time.Second)
+	}
+	t.Fatal("Webhook did not become ready within timeout")
+}


For better readability and to use idiomatic patterns from the Kubernetes ecosystem, consider refactoring this polling loop to use wait.PollUntilContextTimeout from the k8s.io/apimachinery/pkg/util/wait package. This function is designed for this exact purpose and makes the intent of the code clearer, while also handling timeout and cancellation more robustly.

You'll need to add "k8s.io/apimachinery/pkg/util/wait" to your imports.

Suggested change

func waitForWebhookReady(t *testing.T, ctx context.Context, kthenaClient *clientset.Clientset, timeout time.Duration) {

t.Helper()

deadline := time.Now().Add(timeout)

probe := createValidModelBoosterForWebhookTest()

for time.Now().Before(deadline) {

_, err := kthenaClient.WorkloadV1alpha1().ModelBoosters(testNamespace).Create(ctx, probe, metav1.CreateOptions{DryRun: []string{"All"}})

if err == nil {

return

}

if !strings.Contains(err.Error(), "connect: connection refused") {

t.Fatalf("Webhook probe failed: %v", err)

}

t.Logf("Webhook not ready, retrying: %v", err)

time.Sleep(2 * time.Second)

}

t.Fatal("Webhook did not become ready within timeout")

}

func waitForWebhookReady(t *testing.T, ctx context.Context, kthenaClient *clientset.Clientset, timeout time.Duration) {

t.Helper()

probe := createValidModelBoosterForWebhookTest()

err := wait.PollUntilContextTimeout(ctx, 2*time.Second, timeout, true, func(ctx context.Context) (bool, error) {

_, err := kthenaClient.WorkloadV1alpha1().ModelBoosters(testNamespace).Create(ctx, probe, metav1.CreateOptions{DryRun: []string{"All"}})

if err == nil {

return true, nil // Webhook is ready

}

if strings.Contains(err.Error(), "connect: connection refused") {

t.Logf("Webhook not ready, retrying: %v", err)

return false, nil // Continue polling

}

return false, err // Stop polling with an unexpected error

})

if err != nil {

t.Fatalf("Webhook did not become ready: %v", err)

}

}

agrawalcodes · 2026-03-12T03:45:24Z

PTAL @hzxuzhonghu. The e2e were failing: https://github.com/volcano-sh/kthena/actions/runs/22984670979

…-manager restart Signed-off-by: Rajesh Agrawal <rajeshagrawal38000@gmail.com>

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

FAUST-BENCHOU · 2026-03-12T06:28:23Z

We have met it before in fact and it should have fixed in this pr #770 ,
FYI: #758
and for some reason in https://github.com/volcano-sh/kthena/pull/756/changes#diff-b8c63a21f94d65b6bd68aa43dbda9f403b138994620af22bc574c5b7692dcc3b these changes are reverted.
In fact, you can simply undo the changes here.

hzxuzhonghu

This approach looks directionally correct to me. It probes actual webhook behavior with a dry-run create instead of relying on pod/endpoint readiness, which matches the failure mode described in the issue much more closely.

It also lines up with the earlier history on this test: the flake was about the webhook listener not accepting connections yet, so validating the webhook path itself is the right readiness signal.

Copilot AI review requested due to automatic review settings March 12, 2026 03:44

volcano-sh-bot requested review from YaoZengzeng and hzxuzhonghu March 12, 2026 03:44

volcano-sh-bot added the size/S label Mar 12, 2026

Copilot started reviewing on behalf of agrawalcodes March 12, 2026 03:44 View session

gemini-code-assist bot reviewed Mar 12, 2026

View reviewed changes

fix(e2e): restore waitForWebhookReady in TestWebhook after controller…

50b7306

…-manager restart Signed-off-by: Rajesh Agrawal <rajeshagrawal38000@gmail.com>

agrawalcodes force-pushed the fix/webhook-wait-after-restart branch from 7576753 to 50b7306 Compare March 12, 2026 03:46

Copilot AI reviewed Mar 12, 2026

View reviewed changes

hzxuzhonghu reviewed Mar 16, 2026

View reviewed changes

FAUST-BENCHOU mentioned this pull request Mar 16, 2026

fix wait for webhook to be ready before running webhook tests #816

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(e2e): restore waitForWebhookReady in TestWebhook after controller-manager restart#806

fix(e2e): restore waitForWebhookReady in TestWebhook after controller-manager restart#806
agrawalcodes wants to merge 1 commit intovolcano-sh:mainfrom
agrawalcodes:fix/webhook-wait-after-restart

agrawalcodes commented Mar 12, 2026

Uh oh!

volcano-sh-bot commented Mar 12, 2026

Uh oh!

gemini-code-assist bot commented Mar 12, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Mar 12, 2026

Uh oh!

agrawalcodes commented Mar 12, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

FAUST-BENCHOU commented Mar 12, 2026

Uh oh!

hzxuzhonghu left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

agrawalcodes commented Mar 12, 2026

Problem

Fix

Uh oh!

volcano-sh-bot commented Mar 12, 2026

Uh oh!

gemini-code-assist bot commented Mar 12, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

agrawalcodes commented Mar 12, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

FAUST-BENCHOU commented Mar 12, 2026

Uh oh!

hzxuzhonghu left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants