Add Nebius AI Cloud provider for cluster-autoscaler by wemoveon2 · Pull Request #9522 · kubernetes/autoscaler

wemoveon2 · 2026-04-20T15:34:52Z

Summary

Adds a new Nebius AI Cloud provider for cluster-autoscaler, supporting MK8S (managed Kubernetes) node group autoscaling via the Nebius SDK (gosdk v0.1.0).

What's included

cloudprovider/nebius/ — Full CloudProvider and NodeGroup interface implementations
Node group discovery — Lists node groups from the Nebius MK8S API with pagination support
Instance caching — Maps compute instances to node groups via nebius.com/node-group-id labels
Scaling — Sets target size by patching FixedNodeCount on the node group
Node deletion — Deletes specific instances via the Nebius Compute API
Min/max bounds — Read from the MK8S autoscaling spec
Concurrency safety — Mutex-protected node group cache
Builder registration — Standalone build tag (nebius) and all-providers build
Unit tests — 31 tests covering config, refresh, pagination, scaling, deletion, node group lookup
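
For reviewers skimming the list above, here is a minimal sketch of the label-based instance-to-node-group mapping. The `Instance` type and both function names are hypothetical simplifications, not the provider's actual code; only the label key comes from this PR.

```go
package main

// nodeGroupIDLabel is the label key named in the PR description.
const nodeGroupIDLabel = "nebius.com/node-group-id"

// Instance is a hypothetical, trimmed-down compute instance record.
type Instance struct {
	ID     string
	Labels map[string]string
}

// groupByNodeGroup builds a map of node group ID -> set of instance IDs.
// Instances without the node group label are skipped.
func groupByNodeGroup(instances []Instance) map[string]map[string]struct{} {
	byGroup := make(map[string]map[string]struct{})
	for _, inst := range instances {
		groupID := inst.Labels[nodeGroupIDLabel]
		if groupID == "" {
			continue
		}
		if byGroup[groupID] == nil {
			byGroup[groupID] = make(map[string]struct{})
		}
		byGroup[groupID][inst.ID] = struct{}{}
	}
	return byGroup
}

// nodeGroupFor returns the ID of the node group that owns instanceID, if any.
func nodeGroupFor(byGroup map[string]map[string]struct{}, instanceID string) (string, bool) {
	for groupID, members := range byGroup {
		if _, ok := members[instanceID]; ok {
			return groupID, true
		}
	}
	return "", false
}
```

In this sketch, an instance with no labels or with an unrelated label set simply does not appear in any group.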

Config

Credentials via JSON config file or env vars:

| Env var | Purpose |
| --- | --- |
| `NEBIUS_IAM_TOKEN` | Authentication |
| `NEBIUS_CLUSTER_ID` | MK8S cluster to manage |
| `NEBIUS_PARENT_ID` | Folder containing compute instances |
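
A minimal sketch of how a config file and the env vars could be combined. The JSON field names, the `Config` type, the `loadConfig` function, the file-over-env precedence, and the rule that all three values are required are assumptions for illustration; only the env var names come from this PR.

```go
package main

import (
	"encoding/json"
	"errors"
)

// Config holds the three settings from the table above.
// The JSON field names are assumptions, not the provider's schema.
type Config struct {
	IAMToken  string `json:"iam_token"`
	ClusterID string `json:"cluster_id"`
	ParentID  string `json:"parent_id"`
}

// loadConfig parses optional JSON config bytes, then fills any field the
// file left empty from the environment via getenv.
func loadConfig(data []byte, getenv func(string) string) (*Config, error) {
	cfg := &Config{}
	if len(data) > 0 {
		if err := json.Unmarshal(data, cfg); err != nil {
			return nil, err
		}
	}
	// Fall back to env vars for anything the file did not set.
	if cfg.IAMToken == "" {
		cfg.IAMToken = getenv("NEBIUS_IAM_TOKEN")
	}
	if cfg.ClusterID == "" {
		cfg.ClusterID = getenv("NEBIUS_CLUSTER_ID")
	}
	if cfg.ParentID == "" {
		cfg.ParentID = getenv("NEBIUS_PARENT_ID")
	}
	// Assumption: all three values are mandatory.
	if cfg.IAMToken == "" || cfg.ClusterID == "" || cfg.ParentID == "" {
		return nil, errors.New("nebius: IAM token, cluster ID, and parent ID are all required")
	}
	return cfg, nil
}
```

In real use, `getenv` would be `os.Getenv` and `data` would come from reading the config file; taking them as parameters keeps the sketch easy to test.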

Known limitations

TemplateNodeInfo not implemented — Scale-up simulation is not yet supported. The Nebius MK8S API does not expose node templates.
setNodeGroupSize uses FixedNodeCount — The Nebius API uses a oneOf for size (Autoscaling or FixedNodeCount) with no desired-count field in the autoscaling spec. Setting a specific target requires switching to fixed mode.
ListInstances fetches all instances — The Nebius API does not support label filtering on instance listing. Client-side filtering is applied.
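
To make the FixedNodeCount limitation concrete, here is a sketch of a oneOf size field modeled in plain Go. All types and the `setTargetSize` function are hypothetical stand-ins and do not reproduce the gosdk types; the point is only that writing a specific count replaces the autoscaling variant.

```go
package main

// sizeSpec is a hypothetical model of a "oneOf" size field:
// exactly one variant is set at a time.
type sizeSpec interface{ isSize() }

// autoscalingSize carries only bounds; it has no desired-count field.
type autoscalingSize struct{ Min, Max int64 }

// fixedSize carries an explicit node count.
type fixedSize struct{ Count int64 }

func (autoscalingSize) isSize() {}
func (fixedSize) isSize()       {}

// nodeGroupSpec is a hypothetical, trimmed-down node group spec.
type nodeGroupSpec struct{ Size sizeSpec }

// setTargetSize replaces whichever variant is present with a fixed count.
// It reports whether the group was in autoscaling mode beforehand, so a
// caller can log a warning about the mode change.
func setTargetSize(spec *nodeGroupSpec, target int64) (wasAutoscaling bool) {
	_, wasAutoscaling = spec.Size.(autoscalingSize)
	spec.Size = fixedSize{Count: target}
	return wasAutoscaling
}
```

Because the two variants are mutually exclusive in this model, the first call on an autoscaling group drops its min/max bounds from the spec.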

Add Nebius AI Cloud (MK8S) cloud provider for cluster-autoscaler

Implements CloudProvider interface for Nebius AI Cloud managed Kubernetes (MK8S). Supports node group discovery, instance caching, and scaling via the Nebius SDK (gosdk v0.1.0).

Node group min/max bounds are read from the MK8S autoscaling spec. Scaling is performed by patching FixedNodeCount on the node group. Node-to-group membership uses nebius.com/node-group-id labels and cached compute instance providerIDs.

Credentials via JSON config file or env vars:
- NEBIUS_IAM_TOKEN
- NEBIUS_CLUSTER_ID
- NEBIUS_PARENT_ID

linux-foundation-easycla · 2026-04-20T15:35:01Z

❌ - login: @wemoveon2 / name: Alan Yu . The commit (20ed8c5, 21b5eaf, 44e6c86, 60452c9, 632d386, 7367735, 7a6580e, 8a96ea0, d774ad5) is not authorized under a signed CLA. Please click here to be authorized. For further assistance with EasyCLA, please submit a support request ticket.

k8s-ci-robot · 2026-04-20T15:35:01Z

This issue is currently awaiting triage.

If SIG Autoscaling contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2026-04-20T15:35:02Z

Welcome @wemoveon2!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2026-04-20T15:35:04Z

Hi @wemoveon2. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


- Fix copyright year to 2025
- Add sync.Mutex to Manager for concurrency safety on nodeGroups
- Change instances from []string to map[string]struct{} for O(1) lookup
- Implement DeleteNodes with specific instance deletion via Compute API
- Add DeleteInstance to nebiusAPI interface and SDK client
- Remove unused getNodeGroupForInstance (dead code)
- Fix import ordering in nebius_node_group.go (third-party before k8s)
- Document why setNodeGroupSize uses FixedNodeCount (Nebius API limitation)
- Add comment explaining ListInstances cannot filter by label
- Use getNodeGroups() accessor for thread-safe reads
- Add tests: env var fallback, Exist() true case, missing provider ID, delete instance errors, setNodeGroupSize API errors
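
A rough sketch of the mutex-guarded accessor and set-based instance lookup this commit describes. The `manager` and `nodeGroup` types are hypothetical reductions, and returning a copy from `getNodeGroups` is an assumption about how the accessor could be made safe, not a description of the actual code.

```go
package main

import "sync"

// nodeGroup is a hypothetical, trimmed-down node group record.
type nodeGroup struct {
	id        string
	instances map[string]struct{} // set of instance IDs for O(1) membership checks
}

// hasInstance reports whether id is a member of this node group.
func (g *nodeGroup) hasInstance(id string) bool {
	_, ok := g.instances[id]
	return ok
}

// manager guards its node group list with a mutex.
type manager struct {
	mu         sync.Mutex
	nodeGroups []*nodeGroup
}

// setNodeGroups swaps in a freshly built list, e.g. at the end of a refresh.
func (m *manager) setNodeGroups(groups []*nodeGroup) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.nodeGroups = groups
}

// getNodeGroups returns a copy of the slice so callers can iterate
// without holding the lock and without mutating the manager's own slice.
func (m *manager) getNodeGroups() []*nodeGroup {
	m.mu.Lock()
	defer m.mu.Unlock()
	out := make([]*nodeGroup, len(m.nodeGroups))
	copy(out, m.nodeGroups)
	return out
}
```

The copy is shallow: the slice is independent, but the `*nodeGroup` values are still shared between callers.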

- Add OWNERS file for code review governance
- Add README.md with setup, config, and usage docs
- Track target size in-memory on NodeGroup to prevent stale reads between Refresh() cycles (matching Civo/DigitalOcean pattern)
- Initialize targetSize from API status in Refresh()
- Update targetSize after IncreaseSize, DeleteNodes, DecreaseTargetSize
- TargetSize() now reads from in-memory field instead of cached proto
- Add test assertions for targetSize tracking
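
The in-memory target size tracking can be sketched roughly as follows. The `nodeGroup` type, the `setSize` hook (a stand-in for the API call), and the `refresh` method are hypothetical; the method names only loosely mirror the ones this commit mentions.

```go
package main

import "errors"

// nodeGroup is a hypothetical, trimmed-down node group.
type nodeGroup struct {
	maxSize    int
	targetSize int
	// setSize is a hypothetical hook standing in for the API call that
	// patches the node count.
	setSize func(newSize int) error
}

// TargetSize reads the in-memory value rather than a cached API object.
func (n *nodeGroup) TargetSize() int { return n.targetSize }

// refresh resets the in-memory value from the count reported by the API.
func (n *nodeGroup) refresh(apiCount int) { n.targetSize = apiCount }

// IncreaseSize calls the API and updates the in-memory target size only
// after the call succeeds.
func (n *nodeGroup) IncreaseSize(delta int) error {
	if delta <= 0 {
		return errors.New("size increase must be positive")
	}
	newSize := n.targetSize + delta
	if newSize > n.maxSize {
		return errors.New("size increase exceeds max size")
	}
	if err := n.setSize(newSize); err != nil {
		return err
	}
	n.targetSize = newSize
	return nil
}
```

With this shape, a `TargetSize()` call made right after a successful `IncreaseSize` already sees the new value, without waiting for the next refresh.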

When ListInstances fails mid-pagination, discard all collected instances rather than proceeding with partial data. Partial data causes some node groups to appear to have instances while others silently appear empty, which is worse than having no instance data at all (where the provider falls back to label-based lookup).
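
A sketch of the all-or-nothing pagination behavior described in this commit. The `page` type, `listAll`, and the in-memory `fakeFetch` helper are hypothetical; the real provider pages through the Nebius API instead.

```go
package main

import "errors"

// page is a hypothetical single page of a paginated listing.
type page struct {
	items     []string
	nextToken string
}

// listAll follows page tokens until the listing is exhausted. If any page
// fails, it returns nil and the error, discarding everything collected so
// far instead of returning partial data.
func listAll(fetch func(token string) (page, error)) ([]string, error) {
	var all []string
	token := ""
	for {
		p, err := fetch(token)
		if err != nil {
			return nil, err
		}
		all = append(all, p.items...)
		if p.nextToken == "" {
			return all, nil
		}
		token = p.nextToken
	}
}

// fakeFetch serves pages from memory and fails when asked for the page at
// index failAt (pass a negative value to never fail). It stands in for the
// real API in this sketch.
func fakeFetch(pages []page, failAt int) func(string) (page, error) {
	index := map[string]int{"": 0}
	for i, p := range pages {
		if p.nextToken != "" {
			index[p.nextToken] = i + 1
		}
	}
	return func(token string) (page, error) {
		i, ok := index[token]
		if !ok || i >= len(pages) {
			return page{}, errors.New("unknown page token")
		}
		if i == failAt {
			return page{}, errors.New("transient list failure")
		}
		return pages[i], nil
	}
}
```

Returning `nil` on failure lets the caller tell "no instance data" apart from "some instance data", which is the distinction the commit message argues for.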

- Fix gofmt alignment issue in nebius_node_group.go (CI verify failure)
- Handle partial delete failure: if deleting instance N of M fails, adjust target size for the N-1 instances already deleted rather than leaving the node group in an inconsistent state
- Add klog.Warning when setNodeGroupSize converts a node group from autoscaling mode to fixed mode (Nebius API limitation)
- Add test for partial delete failure with size adjustment

deleteInstances now returns the count of successfully deleted instances so that DeleteNodes can always update n.targetSize accurately, even on partial failure. Previously, a partial delete failure would leave targetSize at the pre-delete value, causing stale reads until the next Refresh() cycle. Also passes currentSize instead of pre-computed newTargetSize to deleteInstances, making the size arithmetic clearer and localized.
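
The count-returning delete can be sketched like this. The `instanceDeleter` interface, its method signature, and the `fakeCompute` stand-in are assumptions for illustration; the function names echo the ones in the commit message, but the bodies are not the PR's code.

```go
package main

import (
	"errors"
	"fmt"
)

// instanceDeleter is a hypothetical one-method view of the provider's API
// interface; the signature is an assumption.
type instanceDeleter interface {
	DeleteInstance(id string) error
}

// deleteInstances deletes ids in order and stops at the first failure.
// It always returns how many instances were deleted successfully.
func deleteInstances(api instanceDeleter, ids []string) (int, error) {
	for i, id := range ids {
		if err := api.DeleteInstance(id); err != nil {
			return i, fmt.Errorf("deleting instance %s: %w", id, err)
		}
	}
	return len(ids), nil
}

// deleteNodes returns the new target size, computed from the current size
// and the number of successful deletions, even on partial failure.
func deleteNodes(api instanceDeleter, currentSize int, ids []string) (int, error) {
	deleted, err := deleteInstances(api, ids)
	return currentSize - deleted, err
}

// fakeCompute is an in-memory stand-in for the Compute API that fails
// when asked to delete the instance named in failOn.
type fakeCompute struct{ failOn string }

func (f fakeCompute) DeleteInstance(id string) error {
	if id == f.failOn {
		return errors.New("delete failed")
	}
	return nil
}
```

For example, with a current size of 5 and a failure on the third of four instances, the sketch reports a new target size of 3 together with the error.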

…pIDLabel

- Update copyright year from 2025 to 2026 in all new files
- Add prominent README section warning that the first scale operation permanently converts node groups from autoscaling to fixed mode
- Clarify that nodeGroupIDLabel is set by Nebius MK8S on both K8s node objects and compute instance metadata (same key, two contexts)
- Update TemplateNodeInfo limitation to note scale-from-zero specifically

The kubernetes/autoscaler boilerplate check requires "Copyright The Kubernetes Authors." with no year.

k8s-ci-robot · 2026-04-21T21:19:55Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wemoveon2
Once this PR has been reviewed and has the lgtm label, please assign bigdarkclown for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

cluster-autoscaler/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) and do-not-merge/release-note-label-needed (indicates that a PR should not merge because it's missing one of the release note labels) labels Apr 20, 2026

k8s-ci-robot requested review from BigDarkClown and x13n April 20, 2026 15:34

k8s-ci-robot added the area/cluster-autoscaler and needs-triage (indicates an issue or PR lacks a `triage/foo` label and requires one) labels Apr 20, 2026

k8s-ci-robot added the needs-ok-to-test (indicates a PR that requires an org member to verify it is safe to test) and cncf-cla: no (indicates the PR's author has not signed the CNCF CLA) labels Apr 20, 2026

k8s-ci-robot added the size/XXL (denotes a PR that changes 1000+ lines, ignoring generated files) label Apr 20, 2026

k8s-ci-robot added the release-note (denotes a PR that will be considered when it comes time to generate release notes) label and removed the do-not-merge/release-note-label-needed (indicates that a PR should not merge because it's missing one of the release note labels) label Apr 20, 2026

k8s-ci-robot added the do-not-merge/invalid-owners-file (indicates that a PR should not merge because it has an invalid OWNERS file in it) label Apr 20, 2026

wemoveon2 added 2 commits April 20, 2026 12:27

Leave OWNERS approvers/reviewers empty for upstream maintainers to fill

21b5eaf

k8s-ci-robot removed the do-not-merge/invalid-owners-file (indicates that a PR should not merge because it has an invalid OWNERS file in it) label Apr 20, 2026

wemoveon2 added 4 commits April 20, 2026 12:34

Remove year from copyright headers per project boilerplate policy

d774ad5


wemoveon2 wants to merge 9 commits into kubernetes:master from wemoveon2:nebius-provider-upstream-pr
