Add Nebius AI Cloud provider for cluster-autoscaler #9522
wemoveon2 wants to merge 9 commits into kubernetes:master
Implements CloudProvider interface for Nebius AI Cloud managed Kubernetes (MK8S). Supports node group discovery, instance caching, and scaling via the Nebius SDK (gosdk v0.1.0).
Node group min/max bounds are read from the MK8S autoscaling spec. Scaling is performed by patching FixedNodeCount on the node group. Node-to-group membership uses nebius.com/node-group-id labels and cached compute instance providerIDs.
Credentials via JSON config file or env vars:
- NEBIUS_IAM_TOKEN
- NEBIUS_CLUSTER_ID
- NEBIUS_PARENT_ID
- Fix copyright year to 2025
- Add sync.Mutex to Manager for concurrency safety on nodeGroups
- Change instances from []string to map[string]struct{} for O(1) lookup
- Implement DeleteNodes with specific instance deletion via Compute API
- Add DeleteInstance to nebiusAPI interface and SDK client
- Remove unused getNodeGroupForInstance (dead code)
- Fix import ordering in nebius_node_group.go (third-party before k8s)
- Document why setNodeGroupSize uses FixedNodeCount (Nebius API limitation)
- Add comment explaining ListInstances cannot filter by label
- Use getNodeGroups() accessor for thread-safe reads
- Add tests: env var fallback, Exist() true case, missing provider ID,
delete instance errors, setNodeGroupSize API errors
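Two of the changes listed above, the mutex on the manager and the switch from `[]string` to `map[string]struct{}`, can be sketched together. The `manager` type, its fields, and its method names below are hypothetical stand-ins, not the PR's actual code; the instance IDs are placeholders.

```go
package main

import (
	"fmt"
	"sync"
)

// manager is a hypothetical stand-in for the provider's Manager type.
type manager struct {
	mu sync.Mutex
	// instances maps a node group ID to the set of provider IDs in that group.
	// An empty struct value costs no memory and membership tests are O(1).
	instances map[string]map[string]struct{}
}

// setInstances replaces the cached instance set for one node group.
func (m *manager) setInstances(groupID string, providerIDs []string) {
	set := make(map[string]struct{}, len(providerIDs))
	for _, id := range providerIDs {
		set[id] = struct{}{}
	}
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.instances == nil {
		m.instances = make(map[string]map[string]struct{})
	}
	m.instances[groupID] = set
}

// groupFor returns the node group that contains providerID, if any.
// The lock guards against a concurrent refresh swapping the sets.
func (m *manager) groupFor(providerID string) (string, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for groupID, set := range m.instances {
		if _, ok := set[providerID]; ok {
			return groupID, true
		}
	}
	return "", false
}

func main() {
	m := &manager{}
	m.setInstances("group-1", []string{"instance-a", "instance-b"})
	group, ok := m.groupFor("instance-b")
	fmt.Println(group, ok)
}
```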
- Add OWNERS file for code review governance
- Add README.md with setup, config, and usage docs
- Track target size in-memory on NodeGroup to prevent stale reads between Refresh() cycles (matching Civo/DigitalOcean pattern)
- Initialize targetSize from API status in Refresh()
- Update targetSize after IncreaseSize, DeleteNodes, DecreaseTargetSize
- TargetSize() now reads from in-memory field instead of cached proto
- Add test assertions for targetSize tracking
When ListInstances fails mid-pagination, discard all collected instances rather than proceeding with partial data. Partial data causes some node groups to appear to have instances while others silently appear empty, which is worse than having no instance data at all (where the provider falls back to label-based lookup).
- Fix gofmt alignment issue in nebius_node_group.go (CI verify failure)
- Handle partial delete failure: if deleting instance N of M fails, adjust target size for the N-1 instances already deleted rather than leaving the node group in an inconsistent state
- Add klog.Warning when setNodeGroupSize converts a node group from autoscaling mode to fixed mode (Nebius API limitation)
- Add test for partial delete failure with size adjustment
deleteInstances now returns the count of successfully deleted instances so that DeleteNodes can always update n.targetSize accurately, even on partial failure. Previously, a partial delete failure would leave targetSize at the pre-delete value, causing stale reads until the next Refresh() cycle. Also passes currentSize instead of pre-computed newTargetSize to deleteInstances, making the size arithmetic clearer and localized.
…pIDLabel
- Update copyright year from 2025 to 2026 in all new files
- Add prominent README section warning that the first scale operation permanently converts node groups from autoscaling to fixed mode
- Clarify that nodeGroupIDLabel is set by Nebius MK8S on both K8s node objects and compute instance metadata (same key, two contexts)
- Update TemplateNodeInfo limitation to note scale-from-zero specifically
The kubernetes/autoscaler boilerplate check requires "Copyright The Kubernetes Authors." with no year.
Summary
Adds a new Nebius AI Cloud provider for cluster-autoscaler, supporting MK8S (managed Kubernetes) node group autoscaling via the Nebius SDK (gosdk v0.1.0).
What's included
- cloudprovider/nebius/: full CloudProvider and NodeGroup interface implementations
- Node-to-group membership uses nebius.com/node-group-id labels
- Scaling is performed by patching FixedNodeCount on the node group
- (nebius) and all-providers build
Config
Credentials via JSON config file or env vars:
- NEBIUS_IAM_TOKEN
- NEBIUS_CLUSTER_ID
- NEBIUS_PARENT_ID
Known limitations