
Add Nebius AI Cloud provider for cluster-autoscaler #9522

Draft
wemoveon2 wants to merge 9 commits into kubernetes:master from wemoveon2:nebius-provider-upstream-pr

@wemoveon2 commented Apr 20, 2026

Summary

Adds a new Nebius AI Cloud provider for cluster-autoscaler, supporting MK8S (managed Kubernetes) node group autoscaling via the Nebius SDK (gosdk v0.1.0).

What's included

  • cloudprovider/nebius/ — Full CloudProvider and NodeGroup interface implementations
  • Node group discovery — Lists node groups from the Nebius MK8S API with pagination support
  • Instance caching — Maps compute instances to node groups via nebius.com/node-group-id labels
  • Scaling — Sets target size by patching FixedNodeCount on the node group
  • Node deletion — Deletes specific instances via the Nebius Compute API
  • Min/max bounds — Read from the MK8S autoscaling spec
  • Concurrency safety — Mutex-protected node group cache
  • Builder registration — Standalone build tag (nebius) and all-providers build
  • Unit tests — 31 tests covering config, refresh, pagination, scaling, deletion, node group lookup

Config

Credentials via JSON config file or env vars:

| Env var | Purpose |
|---|---|
| NEBIUS_IAM_TOKEN | Authentication |
| NEBIUS_CLUSTER_ID | MK8S cluster to manage |
| NEBIUS_PARENT_ID | Folder containing compute instances |

Known limitations

  • TemplateNodeInfo not implemented — Scale-up simulation is not yet supported. The Nebius MK8S API does not expose node templates.
  • setNodeGroupSize uses FixedNodeCount — The Nebius API uses a oneOf for size (Autoscaling or FixedNodeCount) with no desired-count field in the autoscaling spec. Setting a specific target requires switching to fixed mode.
  • ListInstances fetches all instances — The Nebius API does not support label filtering on instance listing. Client-side filtering is applied.
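
The client-side filtering mentioned in the last limitation can be pictured roughly as follows. This is a sketch under assumptions: the `instance` type and `groupByNodeGroup` helper are hypothetical, and only the `nebius.com/node-group-id` label key comes from the PR description.

```go
package main

import "fmt"

// nodeGroupIDLabel is the label key named in the PR description.
const nodeGroupIDLabel = "nebius.com/node-group-id"

// instance is a hypothetical, trimmed-down view of a compute instance.
type instance struct {
	ID     string
	Labels map[string]string
}

// groupByNodeGroup buckets instance IDs by their node group label.
// Instances without the label are skipped, since the API cannot
// filter them out server-side.
func groupByNodeGroup(instances []instance) map[string]map[string]struct{} {
	groups := make(map[string]map[string]struct{})
	for _, inst := range instances {
		groupID := inst.Labels[nodeGroupIDLabel]
		if groupID == "" {
			continue
		}
		if groups[groupID] == nil {
			groups[groupID] = make(map[string]struct{})
		}
		groups[groupID][inst.ID] = struct{}{}
	}
	return groups
}

func main() {
	groups := groupByNodeGroup([]instance{
		{ID: "i-1", Labels: map[string]string{nodeGroupIDLabel: "ng-a"}},
		{ID: "i-2", Labels: map[string]string{"team": "infra"}},
		{ID: "i-3", Labels: map[string]string{nodeGroupIDLabel: "ng-a"}},
	})
	fmt.Println("instances in ng-a:", len(groups["ng-a"]))
}
```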
Add Nebius AI Cloud (MK8S) cloud provider for cluster-autoscaler

Implements CloudProvider interface for Nebius AI Cloud managed
Kubernetes (MK8S). Supports node group discovery, instance caching,
and scaling via the Nebius SDK (gosdk v0.1.0).

Node group min/max bounds are read from the MK8S autoscaling spec.
Scaling is performed by patching FixedNodeCount on the node group.
Node-to-group membership uses nebius.com/node-group-id labels and
cached compute instance providerIDs.

Credentials via JSON config file or env vars:
- NEBIUS_IAM_TOKEN
- NEBIUS_CLUSTER_ID
- NEBIUS_PARENT_ID
@k8s-ci-robot added the do-not-merge/work-in-progress and do-not-merge/release-note-label-needed labels Apr 20, 2026
@k8s-ci-robot added the area/cluster-autoscaler and needs-triage labels Apr 20, 2026
@linux-foundation-easycla

linux-foundation-easycla Bot commented Apr 20, 2026

CLA Not Signed

@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If SIG Autoscaling contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Contributor

Welcome @wemoveon2!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the needs-ok-to-test and cncf-cla: no labels Apr 20, 2026
@k8s-ci-robot
Contributor

Hi @wemoveon2. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

@k8s-ci-robot added the size/XXL label Apr 20, 2026
- Fix copyright year to 2025
- Add sync.Mutex to Manager for concurrency safety on nodeGroups
- Change instances from []string to map[string]struct{} for O(1) lookup
- Implement DeleteNodes with specific instance deletion via Compute API
- Add DeleteInstance to nebiusAPI interface and SDK client
- Remove unused getNodeGroupForInstance (dead code)
- Fix import ordering in nebius_node_group.go (third-party before k8s)
- Document why setNodeGroupSize uses FixedNodeCount (Nebius API limitation)
- Add comment explaining ListInstances cannot filter by label
- Use getNodeGroups() accessor for thread-safe reads
- Add tests: env var fallback, Exist() true case, missing provider ID,
  delete instance errors, setNodeGroupSize API errors
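
For illustration, the mutex-guarded accessor described in this commit might look roughly like the sketch below. The `manager` and `nodeGroup` types and the `setNodeGroups` helper are simplified stand-ins rather than the PR's actual code, and returning a copy of the slice is an assumption about how callers are kept from racing with a later refresh.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeGroup is a hypothetical, trimmed-down node group record.
type nodeGroup struct {
	id string
}

// manager guards its node group list with a mutex so that Refresh-style
// writers and lookup-style readers can run concurrently.
type manager struct {
	mu         sync.Mutex
	nodeGroups []*nodeGroup
}

// setNodeGroups replaces the cached list under the lock.
func (m *manager) setNodeGroups(groups []*nodeGroup) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.nodeGroups = groups
}

// getNodeGroups returns a copy of the cached list, so callers can iterate
// over it without holding the lock.
func (m *manager) getNodeGroups() []*nodeGroup {
	m.mu.Lock()
	defer m.mu.Unlock()
	out := make([]*nodeGroup, len(m.nodeGroups))
	copy(out, m.nodeGroups)
	return out
}

func main() {
	m := &manager{}
	m.setNodeGroups([]*nodeGroup{{id: "ng-a"}, {id: "ng-b"}})

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = m.getNodeGroups() // safe concurrent read
		}()
	}
	wg.Wait()
	fmt.Println("cached node groups:", len(m.getNodeGroups()))
}
```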
@k8s-ci-robot added the release-note label and removed the do-not-merge/release-note-label-needed label Apr 20, 2026
- Add OWNERS file for code review governance
- Add README.md with setup, config, and usage docs
- Track target size in-memory on NodeGroup to prevent stale reads
  between Refresh() cycles (matching Civo/DigitalOcean pattern)
- Initialize targetSize from API status in Refresh()
- Update targetSize after IncreaseSize, DeleteNodes, DecreaseTargetSize
- TargetSize() now reads from in-memory field instead of cached proto
- Add test assertions for targetSize tracking
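
A rough sketch of the in-memory target size tracking described in this commit. The `sizeSetter` interface and `fakeAPI` type are hypothetical stand-ins for the Nebius client; only the `TargetSize` and `IncreaseSize` method shapes follow the cluster-autoscaler `NodeGroup` interface.

```go
package main

import (
	"errors"
	"fmt"
)

// sizeSetter is a hypothetical slice of the provider's API client.
type sizeSetter interface {
	setNodeGroupSize(id string, size int) error
}

// fakeAPI records the last requested size and can be told to fail.
type fakeAPI struct {
	fail bool
	last int
}

func (f *fakeAPI) setNodeGroupSize(id string, size int) error {
	if f.fail {
		return errors.New("api unavailable")
	}
	f.last = size
	return nil
}

// nodeGroup keeps targetSize in memory so reads between Refresh() cycles
// reflect scale operations that have already been issued.
type nodeGroup struct {
	id         string
	maxSize    int
	targetSize int
	api        sizeSetter
}

// TargetSize reads the in-memory value rather than a cached API object.
func (n *nodeGroup) TargetSize() (int, error) {
	return n.targetSize, nil
}

// IncreaseSize updates targetSize only after the API call succeeds.
func (n *nodeGroup) IncreaseSize(delta int) error {
	if delta <= 0 {
		return fmt.Errorf("delta must be positive, got %d", delta)
	}
	newSize := n.targetSize + delta
	if newSize > n.maxSize {
		return fmt.Errorf("size %d exceeds max %d", newSize, n.maxSize)
	}
	if err := n.api.setNodeGroupSize(n.id, newSize); err != nil {
		return err
	}
	n.targetSize = newSize
	return nil
}

func main() {
	ng := &nodeGroup{id: "ng-a", maxSize: 5, targetSize: 1, api: &fakeAPI{}}
	if err := ng.IncreaseSize(2); err != nil {
		fmt.Println("scale-up failed:", err)
	}
	size, _ := ng.TargetSize()
	fmt.Println("target size:", size)
}
```

Updating the field only after a successful API call means a failed request leaves the previously known target in place.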
@k8s-ci-robot added the do-not-merge/invalid-owners-file label Apr 20, 2026
When ListInstances fails mid-pagination, discard all collected
instances rather than proceeding with partial data. Partial data
causes some node groups to appear to have instances while others
silently appear empty, which is worse than having no instance
data at all (where the provider falls back to label-based lookup).
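
The all-or-nothing behavior described in this commit can be sketched as below. `pageFunc`, `listAllInstances`, and `twoPages` are hypothetical names; the real SDK list call and its pagination fields are not shown in this conversation.

```go
package main

import (
	"errors"
	"fmt"
)

// pageFunc fetches one page of instance IDs. It is a hypothetical stand-in
// for the SDK's list call; an empty nextToken marks the last page.
type pageFunc func(pageToken string) (ids []string, nextToken string, err error)

// listAllInstances walks every page. If any page fails, it returns nil
// instead of the IDs collected so far, so callers never see partial data.
func listAllInstances(fetch pageFunc) ([]string, error) {
	var all []string
	token := ""
	for {
		ids, next, err := fetch(token)
		if err != nil {
			return nil, fmt.Errorf("listing instances: %w", err)
		}
		all = append(all, ids...)
		if next == "" {
			return all, nil
		}
		token = next
	}
}

// twoPages simulates a two-page listing whose second page optionally fails.
func twoPages(failSecond bool) pageFunc {
	return func(token string) ([]string, string, error) {
		if token == "" {
			return []string{"i-1", "i-2"}, "page-2", nil
		}
		if failSecond {
			return nil, "", errors.New("transient API error")
		}
		return []string{"i-3"}, "", nil
	}
}

func main() {
	ids, err := listAllInstances(twoPages(true))
	fmt.Println("instances kept:", len(ids), "error:", err)
}
```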
@k8s-ci-robot removed the do-not-merge/invalid-owners-file label Apr 20, 2026
- Fix gofmt alignment issue in nebius_node_group.go (CI verify failure)
- Handle partial delete failure: if deleting instance N of M fails,
  adjust target size for the N-1 instances already deleted rather
  than leaving the node group in an inconsistent state
- Add klog.Warning when setNodeGroupSize converts a node group from
  autoscaling mode to fixed mode (Nebius API limitation)
- Add test for partial delete failure with size adjustment
deleteInstances now returns the count of successfully deleted instances
so that DeleteNodes can always update n.targetSize accurately, even on
partial failure. Previously, a partial delete failure would leave
targetSize at the pre-delete value, causing stale reads until the
next Refresh() cycle.

Also passes currentSize instead of pre-computed newTargetSize to
deleteInstances, making the size arithmetic clearer and localized.
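
A simplified sketch of the count-returning delete loop. The signature here is an assumption: the PR's `deleteInstances` reportedly takes `currentSize`, whereas this version leaves the subtraction to the caller, and the `del` callback stands in for the Compute API call.

```go
package main

import (
	"errors"
	"fmt"
)

// deleteInstances deletes ids in order and stops at the first failure.
// It always reports how many deletions succeeded, so the caller can
// adjust the target size even when an error is returned.
func deleteInstances(del func(id string) error, ids []string) (int, error) {
	for i, id := range ids {
		if err := del(id); err != nil {
			return i, fmt.Errorf("deleting instance %s: %w", id, err)
		}
	}
	return len(ids), nil
}

// failOn returns a delete function that fails for one specific ID.
func failOn(badID string) func(string) error {
	return func(id string) error {
		if id == badID {
			return errors.New("compute API error")
		}
		return nil
	}
}

func main() {
	currentSize := 5
	deleted, err := deleteInstances(failOn("i-3"), []string{"i-1", "i-2", "i-3", "i-4"})

	// The target size reflects the deletions that did happen,
	// whether or not the batch finished.
	targetSize := currentSize - deleted
	fmt.Println("deleted:", deleted, "target size:", targetSize, "error:", err)
}
```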
…pIDLabel

- Update copyright year from 2025 to 2026 in all new files
- Add prominent README section warning that the first scale operation
  permanently converts node groups from autoscaling to fixed mode
- Clarify that nodeGroupIDLabel is set by Nebius MK8S on both K8s
  node objects and compute instance metadata (same key, two contexts)
- Update TemplateNodeInfo limitation to note scale-from-zero specifically
The kubernetes/autoscaler boilerplate check requires
"Copyright The Kubernetes Authors." with no year.
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wemoveon2
Once this PR has been reviewed and has the lgtm label, please assign bigdarkclown for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
