
✨ spot instance fallback: add WaitingTimeout#5868

Open
Prucek wants to merge 1 commit into kubernetes-sigs:main from Prucek:feat/spot-instance-waiting-timeout

Conversation

Member

@Prucek Prucek commented Feb 11, 2026

What type of PR is this?
/kind feature
/kind api-change

What this PR does / why we need it:
This feature adds an optional WaitingTimeout field to SpotMarketOptions. The WaitingTimeout serves as a timeout for when no Spot instances are available; once it is reached, a regular EC2 instance is acquired instead.

2nd attempt of #5062

Checklist:

  • squashed commits
  • includes documentation
  • includes emoji in title
  • adds unit tests
  • adds or updates e2e tests

Release note:

feat: add WaitingTimeout for Spot instances to fallback to a regular ec2 instance

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Feb 11, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign dlipovetsky for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 11, 2026
Member Author

Prucek commented Feb 11, 2026

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

Member Author

Prucek commented Feb 11, 2026

/test pull-cluster-api-provider-aws-e2e-blocking

Member Author

Prucek commented Feb 11, 2026

/retest

Member Author

Prucek commented Feb 11, 2026

Could someone help me understand the e2e failures, and whether they are valid or just flakes? I see the e2e jobs have a very high error/flake rate.

@k8s-ci-robot
Contributor

@Prucek: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Required | Rerun command |
|---|---|---|---|
| pull-cluster-api-provider-aws-e2e | 5ac452c | false | /test pull-cluster-api-provider-aws-e2e |
| pull-cluster-api-provider-aws-e2e-eks | 5ac452c | false | /test pull-cluster-api-provider-aws-e2e-eks |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Contributor

@Arpit529Srivastava Arpit529Srivastava left a comment


> Could someone help me understand the e2e failures? And if they are valid or just flakes? I see the e2es have a very high error/flake rate.

Hey @Prucek, I was looking into the e2e flakiness, and I agree with you: none of the failures is related to this PR.

some findings from my side

  1. Clusterctl upgrade test failure: the CAPI test framework at v1.11.4-0.20251125201037-d322ff6baa2f (pinned in go.mod) changed how it builds the local filesystem repository. The old clusterctl v1.2.0 binary expects provider labels like cluster-api, but the new framework creates core-cluster-api. This is a compatibility break in the CAPI test framework, not in this PR.

  2. I checked the GPU test code and it doesn't use SpotMarketOptions at all. The GPU tests are consistently flaky due to p3.2xlarge capacity.

  3. The EKS auth test uses cluster-template-eks-auth-bootstrap-disabled.yaml, which creates only AWSManagedCluster + AWSManagedControlPlane (0 workers, no spot config). The DumpMachines panic is a pre-existing bug at common.go:117 where namespace isn't nil-checked.

That being said, could you check whether commit 8fad5a1 (the base for this PR) passes these same tests on main? If main also fails, these are pre-existing issues. If main passes, there's something more subtle I'm missing.

Member

@damdo damdo left a comment


Hey @Prucek, I have a question about whether we could implement this with ValidUntil.

out, err = s.runInstancesWithSpotFallback(input, i.SpotMarketOptions)
if err != nil {
	return nil, errors.Wrap(err, "failed to run instance with spot fallback")
}
Member


I'd invert this condition for easier reading and control flow

if isSpotCapacityError(err) && isSpotWaitingTimeoutDefined(i) {
	out, err = s.runInstancesWithSpotFallback(input, i.SpotMarketOptions)
	if err != nil {
		return nil, errors.Wrap(err, "failed to run instance with spot fallback")
	}
} else {
	return nil, errors.Wrap(err, "failed to run instance")
}

Comment on lines +758 to +771

// runInstancesWithSpotFallback attempts to run an EC2 instance, handling spot instance capacity issues.
// If the instance is configured with spot market options and a waiting timeout, it will retry
// until either a spot instance is acquired or the timeout is reached. If the timeout is reached,
// it falls back to creating an on-demand instance.
func (s *Service) runInstancesWithSpotFallback(input *ec2.RunInstancesInput, spotOptions *infrav1.SpotMarketOptions) (*ec2.RunInstancesOutput, error) {
	timeout := spotOptions.WaitingTimeout.Duration
	deadline := time.Now().Add(timeout)

	s.scope.Info("Attempting to acquire spot instance with fallback timeout",
		"timeout", timeout.String())

	for {
		time.Sleep(spotRetryInterval)
Member


I had a look at trying to understand whether we could use the aws-sdk-go-v2 SpotMarketOptions.ValidUntil option (from here), as it would significantly simplify the reconciliation logic.

But my understanding is that to use it we'd need to set

spotOpts.SpotInstanceType = types.SpotInstanceTypePersistent                   // This is the only way to set ValidUntil for one-time requests.
spotOpts.InstanceInterruptionBehavior = types.InstanceInterruptionBehaviorStop // This is the only way to set InstanceInterruptionBehavior for one-time requests.

which might be problematic as it may conflict with "CAPA's current 1:1 Machine-to-Instance model".

Have you looked into this yourself?

Curious to see what others think of this

Member Author


I added new fields to the status. This would solve the issue of not having a 1:1 Machine-to-Instance mapping.

Comment on lines +755 to +756
// spotRetryInterval is the interval between spot instance capacity checks.
const spotRetryInterval = 30 * time.Second
Member


We should probably make this configurable

Member Author


I added the logic to the reconcile loop instead.

@Prucek Prucek force-pushed the feat/spot-instance-waiting-timeout branch from 5ac452c to cda8e89 on February 13, 2026
Member Author

Prucek commented Feb 13, 2026

@damdo Could you take another look? Thanks!


Labels

cncf-cla: yes — Indicates the PR's author has signed the CNCF CLA.
kind/api-change — Categorizes issue or PR as related to adding, removing, or otherwise changing an API.
kind/feature — Categorizes issue or PR as related to a new feature.
needs-priority
release-note — Denotes a PR that will be considered when it comes time to generate release notes.
size/L — Denotes a PR that changes 100-499 lines, ignoring generated files.
