Skip to content

[WIP] feat(hypernode): extract standalone module and controller image#5137

Open
wangyang0616 wants to merge 1 commit intovolcano-sh:masterfrom
wangyang0616:feature_controller_independent
Open

[WIP] feat(hypernode): extract standalone module and controller image#5137
wangyang0616 wants to merge 1 commit intovolcano-sh:masterfrom
wangyang0616:feature_controller_independent

Conversation

@wangyang0616
Copy link
Copy Markdown
Member

What type of PR is this?

Feature

What this PR does / why we need it:

  • Extract HyperNode controller implementation into an independent Go module volcano.sh/hypernode under staging/src/volcano.sh/hypernode/.
  • Add an optional standalone binary vc-hypernode-controller to compile/deploy HyperNode without the rest of vc-controller-manager.
  • Keep existing controller-manager behavior via a thin adapter pkg/controllers/hypernode/register.go.
  • Add official OCI image build for vc-hypernode-controller and include it in CI builds and release publishing.
  • Align selector-to-node matching semantics between controlle

For details of the proposal, please see: #5136

Which issue(s) this PR fixes:

Fixes #5133

Special notes for your reviewer:

The current pull request only addresses the HyperNode controller; podgroup-related content will be updated in a separate pull request.

Does this PR introduce a user-facing change?

HyperNode controller can now be built and released as a standalone binary/image (vc-hypernode-controller) from module volcano.sh/hypernode, while keeping default controller-manager behavior unchanged.

Copilot AI review requested due to automatic review settings March 30, 2026 07:34
@volcano-sh-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from wangyang0616. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Mar 30, 2026
@wangyang0616 wangyang0616 force-pushed the feature_controller_independent branch from 519cd65 to 45433d2 Compare March 30, 2026 07:35
@wangyang0616 wangyang0616 changed the title feat(hypernode): extract standalone module and controller image [WIP] feat(hypernode): extract standalone module and controller image Mar 30, 2026
@volcano-sh-bot volcano-sh-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 30, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extracts the HyperNode controller into a standalone Go module (volcano.sh/hypernode) and adds a standalone controller binary/image (vc-hypernode-controller), while keeping the existing vc-controller-manager behavior via an adapter and aligning selector-to-node matching between controller and scheduler.

Changes:

  • Introduce staging/src/volcano.sh/hypernode module with standalone controller entrypoint, shared utilities (signals, nodematch, testutil), and documentation.
  • Wire the root repo to the new module and add a controller-manager adapter registration.
  • Add build targets and an OCI image Dockerfile for vc-hypernode-controller, plus docs updates describing deployment modes.

Reviewed changes

Copilot reviewed 39 out of 44 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
staging/src/volcano.sh/hypernode/pkg/testutil/build_hypernode.go Adds module-local HyperNode test builder helper.
staging/src/volcano.sh/hypernode/pkg/signals/signal_windows.go Windows shutdown signal list for standalone controller.
staging/src/volcano.sh/hypernode/pkg/signals/signal_posix.go POSIX shutdown signal list for standalone controller.
staging/src/volcano.sh/hypernode/pkg/signals/signal.go Adds signal-driven context cancellation helper.
staging/src/volcano.sh/hypernode/pkg/nodematch/nodematch.go Extracts shared selector→node matching logic for controller/scheduler.
staging/src/volcano.sh/hypernode/pkg/hypernode/options.go Defines controller initialization options for module consumers.
staging/src/volcano.sh/hypernode/go.sum Adds module dependency lockfile.
staging/src/volcano.sh/hypernode/go.mod Declares new nested module and dependencies/replaces.
staging/src/volcano.sh/hypernode/cmd/hypernode-controller/main.go Adds standalone controller binary entrypoint.
staging/src/volcano.sh/hypernode/cmd/hypernode-controller/app/server.go Implements standalone run loop + leader election + informers.
staging/src/volcano.sh/hypernode/cmd/hypernode-controller/app/options/options.go Adds flags and option parsing for standalone controller.
staging/src/volcano.sh/hypernode/README.md Documents module layout, build/test, and references.
pkg/scheduler/api/hyper_node_info.go Switches scheduler GetMembers to shared nodematch implementation.
pkg/controllers/hypernode/register.go Adds adapter to register module controller into controller-manager.
pkg/controllers/hypernode/hypernode_handler_test.go Updates tests to use module-local testutil and new Controller type.
pkg/controllers/hypernode/hypernode_handler.go Aligns controller member resolution with shared nodematch.
pkg/controllers/hypernode/hypernode_controller_test.go Updates controller tests for extracted module packages and Options type.
pkg/controllers/hypernode/hypernode_controller.go Refactors controller into exported Controller with NewController, Name, Initialize.
pkg/controllers/hypernode/discovery/ufm/ufm_test.go Updates imports to module-local api.
pkg/controllers/hypernode/discovery/ufm/ufm.go Updates imports to module-local api/utils.
pkg/controllers/hypernode/discovery/manager_test.go Updates imports to module-local discovery/config/api.
pkg/controllers/hypernode/discovery/manager.go Updates imports and blank imports for module discoverers.
pkg/controllers/hypernode/discovery/label/label_test.go Updates imports to module-local api/utils.
pkg/controllers/hypernode/discovery/label/label.go Updates imports to module-local api/utils.
pkg/controllers/hypernode/discovery/fake/fake.go Updates imports to module-local api.
pkg/controllers/hypernode/configmap_handler_test.go Updates imports and controller type rename.
pkg/controllers/hypernode/configmap_handler.go Updates imports and controller receiver/type rename.
pkg/controllers/hypernode/config/loader_test.go Updates imports to module-local api.
pkg/controllers/hypernode/config/loader.go Updates imports to module-local api.
pkg/controllers/hypernode/config/fake.go Updates imports to module-local api.
installer/dockerfile/hypernode-controller/Dockerfile Adds Dockerfile to build/publish standalone controller image.
installer/README.md Notes optional standalone HyperNode controller image.
go.mod Adds volcano.sh/hypernode dependency and replace to staging module.
docs/user-guide/how_to_use_hypernode_auto_discovery.md Documents bundled vs standalone controller modes and verification logs.
docs/design/hypernode-standalone-controller.zh.md Adds Chinese summary doc for standalone controller/module.
docs/design/hypernode-standalone-controller.md Adds design doc describing module layout, deploy modes, build/image.
docs/design/hyperNode-auto-discovery.md Documents new implementation location and shared node matching.
docs/design/Network Topology Aware Scheduling.md Updates overview text and links shared node matching + module location.
README.md Mentions optional standalone HyperNode controller image/binary and conflict warning.
Makefile Adds build/image targets for vc-hypernode-controller and includes it in images/unit-test/lint.
Comments suppressed due to low confidence (3)

staging/src/volcano.sh/hypernode/pkg/signals/signal.go:1

  • SetupSignalContext() will panic if called more than once because it unconditionally closes onlyOneSignalHandler without any guard, and it also stores shutdownHandler in a package-level variable. If this package is used from tests or multiple binaries/libraries, that can cause flaky panics/data races. Use a sync.Once (or similar) to enforce one-time initialization, and keep the signal channel local to the function (or protect shared state) to avoid races.
    staging/src/volcano.sh/hypernode/pkg/testutil/build_hypernode.go:1
  • Returning nil on an invalid MemberConfig.Selector makes test failures harder to diagnose (it can turn into a nil dereference far away from the cause). Prefer returning (*HyperNode, error) so tests can assert on the error, or (since this is testutil) consider panicking with a clear message to fail fast and localize the failure.
    staging/src/volcano.sh/hypernode/pkg/nodematch/nodematch.go:1
  • This introduces shared matching logic used by both controller and scheduler, but there are no unit tests for the new NodeNamesForSelector. Add focused tests covering: exact-match (including empty name), regex-match (including invalid regex), label-match (including empty selector), and combinations/overlap semantics to prevent future drift between components.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +56 to +61
if s.CaCertFile != "" && s.CertFile != "" && s.KeyFile != "" {
if err := s.ReadCAFiles(); err != nil {
fmt.Fprintf(os.Stderr, "Failed to read TLS files: %v\n", err)
os.Exit(1)
}
}
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TLS flag validation is currently skipped when only some of --ca-cert-file/--tls-cert-file/--tls-private-key-file are provided, because ReadCAFiles() is only called when all three are non-empty. This can lead to silently ignoring user TLS intent. Call ReadCAFiles() when any of the three file flags is set (let it enforce the 'all three required' rule), or move that validation into CheckOptionOrDie().

Copilot uses AI. Check for mistakes.
Comment on lines +112 to +129
func buildRESTConfig(opt *options.ServerOption) (*rest.Config, error) {
loadingRules := clientcmd.NewDefaultClientConfigLoadingRules()
if opt.KubeClientOptions.KubeConfig != "" {
loadingRules.ExplicitPath = opt.KubeClientOptions.KubeConfig
}
overrides := &clientcmd.ConfigOverrides{}
if opt.KubeClientOptions.Master != "" {
overrides.ClusterInfo.Server = opt.KubeClientOptions.Master
}
cc := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(loadingRules, overrides)
cfg, err := cc.ClientConfig()
if err != nil {
return nil, err
}
cfg.QPS = opt.KubeClientOptions.QPS
cfg.Burst = opt.KubeClientOptions.Burst
return cfg, nil
}
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buildRESTConfig uses only kubeconfig-based loading and does not fall back to in-cluster configuration when --kubeconfig/--master are unset. This will break the common 'run in cluster with ServiceAccount' deployment path described in docs. Consider using clientcmd.BuildConfigFromFlags(master, kubeconfig) (which includes in-cluster fallback) or explicitly try rest.InClusterConfig() when no explicit kubeconfig/master is provided.

Copilot uses AI. Check for mistakes.
Comment on lines +61 to +64
if !opt.LeaderElection.LeaderElect {
run(ctx)
return fmt.Errorf("finished without leader elect")
}
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Run() returns an error on normal shutdown in both leader-election-disabled mode (finished without leader elect) and leader-election-enabled mode (lost lease) because the return happens unconditionally after the run loop completes. This will surface as an 'exited with err' log even on SIGTERM/SIGINT. Return nil when the context is canceled (normal shutdown), and only return an error for actual exceptional cases (e.g., inability to start, or explicit lease-loss handling if you choose not to Fatalf).

Copilot uses AI. Check for mistakes.
Comment on lines +97 to +110
leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
Lock: rl,
LeaseDuration: opt.LeaderElection.LeaseDuration.Duration,
RenewDeadline: opt.LeaderElection.RenewDeadline.Duration,
RetryPeriod: opt.LeaderElection.RetryPeriod.Duration,
Callbacks: leaderelection.LeaderCallbacks{
OnStartedLeading: run,
OnStoppedLeading: func() {
klog.Fatalf("leaderelection lost")
},
},
})
return fmt.Errorf("lost lease")
}
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Run() returns an error on normal shutdown in both leader-election-disabled mode (finished without leader elect) and leader-election-enabled mode (lost lease) because the return happens unconditionally after the run loop completes. This will surface as an 'exited with err' log even on SIGTERM/SIGINT. Return nil when the context is canceled (normal shutdown), and only return an error for actual exceptional cases (e.g., inability to start, or explicit lease-loss handling if you choose not to Fatalf).

Copilot uses AI. Check for mistakes.
ADD . volcano
RUN cd volcano && make vc-hypernode-controller

FROM alpine:latest
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using alpine:latest makes builds non-reproducible and can introduce breaking changes or unexpected CVEs as the base image updates. Pin to a specific Alpine version (or use the repo’s standard runtime base), and consider whether the runtime image should include ca-certificates if the controller needs to make TLS connections relying on system roots.

Suggested change
FROM alpine:latest
FROM alpine:3.20.1
RUN apk add --no-cache ca-certificates

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the HyperNode controller into a standalone Go module (volcano.sh/hypernode) and introduces the vc-hypernode-controller binary for independent deployment. Key changes include updating the build system, adding a Dockerfile, and centralizing node matching logic in a shared package used by both the controller and the scheduler. I have no feedback to provide.

Extract HyperNode controller to staging module volcano.sh/hypernode with
optional vc-hypernode-controller binary and OCI image. Controller-manager
keeps behavior via pkg/controllers/hypernode/register.go.

- Add nodematch package; scheduler GetMembers delegates for consistent semantics
- Makefile: vc-hypernode-controller, image targets, unit-test and lint for submodule
- Dockerfile installer/dockerfile/hypernode-controller for release/CI
- Design docs, user guide and README updates

Signed-off-by: wangyang0616 <[email protected]>
@wangyang0616 wangyang0616 force-pushed the feature_controller_independent branch from 45433d2 to 5f9998b Compare March 30, 2026 07:46
@hzxuzhonghu
Copy link
Copy Markdown
Member

Direction looks reasonable, but I see a few merge blockers before this is ready:

  1. The standalone controller config-loading path needs to preserve the normal in-cluster deployment behavior.
  2. Normal shutdown currently appears to be returned/logged as an error.
  3. TLS file validation is incomplete when only some TLS flags are set.
  4. The image uses alpine:latest, which is too loose for a release component.
  5. Fixes #5133 is too broad for a HyperNode-only PR; this should not close the umbrella issue.

Positives:

  1. Sharing selector-to-node matching between controller and scheduler is the right design.
  2. Keeping the existing hyperNode-controller gate name preserves compatibility.
  3. The adapter approach into controller-manager is a pragmatic incremental extraction.

If the operational issues above are fixed, I think the architectural direction is reviewable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Decoupling HyperNode and PodGroup Controllers: Evolving Toward a Diversified AI Orchestration Ecosystem

4 participants