
✨ spot instance fallback: add WaitingTimeout#5868

Open
Prucek wants to merge 1 commit into kubernetes-sigs:main from Prucek:feat/spot-instance-waiting-timeout

Conversation

Member

@Prucek Prucek commented Feb 11, 2026

What type of PR is this?
/kind feature
/kind api-change

What this PR does / why we need it:
This feature adds an optional WaitingTimeout field to SpotMarketOptions. The WaitingTimeout serves as a timeout for when no Spot instances are available; once it is reached, a regular EC2 instance is acquired instead.

2nd attempt of #5062

Checklist:

  • squashed commits
  • includes documentation
  • includes emoji in title
  • adds unit tests
  • adds or updates e2e tests

Release note:

feat: add WaitingTimeout for Spot instances to fallback to a regular ec2 instance

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Feb 11, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign dlipovetsky for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 11, 2026
Member Author

Prucek commented Feb 11, 2026

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

Member Author

Prucek commented Feb 11, 2026

/test pull-cluster-api-provider-aws-e2e-blocking

Member Author

Prucek commented Feb 11, 2026

/retest

Member Author

Prucek commented Feb 11, 2026

Could someone help me understand the e2e failures, and whether they are valid or just flakes? I see the e2e jobs have a very high error/flake rate.

@k8s-ci-robot
Contributor

@Prucek: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Required | Rerun command |
|---|---|---|---|
| pull-cluster-api-provider-aws-e2e | 5ac452c | false | /test pull-cluster-api-provider-aws-e2e |
| pull-cluster-api-provider-aws-e2e-eks | 5ac452c | false | /test pull-cluster-api-provider-aws-e2e-eks |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Contributor

@Arpit529Srivastava Arpit529Srivastava left a comment


> Could someone help me understand the e2e failures? And if they are valid or just flakes? I see the e2es have a very high error/flake rate.

Hey @Prucek, I was looking into the e2e flakiness, and I agree with you: none of the failures is related to this PR.

some findings from my side

  1. Clusterctl upgrade test failure: the CAPI test framework at v1.11.4-0.20251125201037-d322ff6baa2f (pinned in go.mod) changed how it builds the local filesystem repository. The old clusterctl v1.2.0 binary expects provider labels like cluster-api, but the new framework creates core-cluster-api. This is a compatibility break in the CAPI test framework, not in this PR.

  2. I checked the GPU test code and it doesn't use SpotMarketOptions at all. The GPU tests are consistently flaky due to p3.2xlarge capacity.

  3. The EKS auth test uses cluster-template-eks-auth-bootstrap-disabled.yaml, which creates only AWSManagedCluster + AWSManagedControlPlane (0 workers, no spot config). The DumpMachines panic is a pre-existing bug at common.go:117 where namespace isn't nil-checked.

That being said, could you check whether commit 8fad5a1 (the base for this PR) passes these same tests on main? If main also fails, these are pre-existing issues. If main passes, there's something more subtle I'm missing.

Member

@damdo damdo left a comment


Hey @Prucek, I have a question about whether we could implement this with ValidUntil.

out, err = s.runInstancesWithSpotFallback(input, i.SpotMarketOptions)
if err != nil {
	return nil, errors.Wrap(err, "failed to run instance with spot fallback")
}
Member


I'd invert this condition for easier reading and control flow

if isSpotCapacityError(err) && isSpotWaitingTimeoutDefined(i) {
	out, err = s.runInstancesWithSpotFallback(input, i.SpotMarketOptions)
	if err != nil {
		return nil, errors.Wrap(err, "failed to run instance with spot fallback")
	}
} else {
	return nil, errors.Wrap(err, "failed to run instance")
}

Comment on lines +758 to +771

// runInstancesWithSpotFallback attempts to run an EC2 instance, handling spot instance capacity issues.
// If the instance is configured with spot market options and a waiting timeout, it will retry
// until either a spot instance is acquired or the timeout is reached. If the timeout is reached,
// it falls back to creating an on-demand instance.
func (s *Service) runInstancesWithSpotFallback(input *ec2.RunInstancesInput, spotOptions *infrav1.SpotMarketOptions) (*ec2.RunInstancesOutput, error) {
	timeout := spotOptions.WaitingTimeout.Duration
	deadline := time.Now().Add(timeout)

	s.scope.Info("Attempting to acquire spot instance with fallback timeout",
		"timeout", timeout.String())

	for {
		time.Sleep(spotRetryInterval)
Member


I had a look at trying to understand whether we could use the aws-sdk-go-v2 SpotMarketOptions.ValidUntil option (from here), as it would significantly simplify the reconciliation logic.

But my understanding is that to use it we'd need to set

spotOpts.SpotInstanceType = types.SpotInstanceTypePersistent                   // This is the only way to set ValidUntil for one-time requests.
spotOpts.InstanceInterruptionBehavior = types.InstanceInterruptionBehaviorStop // This is the only way to set InstanceInterruptionBehavior for one-time requests.

which might be problematic as it may conflict with "CAPA's current 1:1 Machine-to-Instance model".

Have you looked into this yourself?

Curious to see what others think of this

Member Author


I added new fields to the status. This would solve the issue of not having a 1:1 Machine-to-Instance mapping.

Comment on lines +755 to +756
// spotRetryInterval is the interval between spot instance capacity checks.
const spotRetryInterval = 30 * time.Second
Member


We should probably make this configurable

Member Author


I added the logic to the reconcile loop instead.

@Prucek Prucek force-pushed the feat/spot-instance-waiting-timeout branch from 5ac452c to cda8e89 on February 13, 2026
Member Author

Prucek commented Feb 13, 2026

@damdo Could you take another look? Thanks!


Labels

cncf-cla: yes — Indicates the PR's author has signed the CNCF CLA.
kind/api-change — Categorizes issue or PR as related to adding, removing, or otherwise changing an API.
kind/feature — Categorizes issue or PR as related to a new feature.
needs-priority
release-note — Denotes a PR that will be considered when it comes time to generate release notes.
size/L — Denotes a PR that changes 100-499 lines, ignoring generated files.
