✨ spot instance fallback: add WaitingTimeout #5868
Prucek wants to merge 1 commit into kubernetes-sigs:main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of these files.
/test pull-cluster-api-provider-aws-e2e

/test pull-cluster-api-provider-aws-e2e-blocking

/retest

Could someone help me understand the e2e failures? Are they valid, or just flakes? I see the e2es have a very high error/flake rate.
@Prucek: The following tests failed.
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Arpit529Srivastava
left a comment
> Could someone help me understand the e2e failures? And if they are valid or just flakes? I see the e2es have a very high error/flake rate.
Hey @Prucek, I was looking into the e2e flakiness, and I agree with you: none of the failures is related to this PR.

Some findings from my side:

- Clusterctl upgrade test failure: the CAPI test framework at `v1.11.4-0.20251125201037-d322ff6baa2f` (pinned in go.mod) changed how it builds the local filesystem repository. The old clusterctl v1.2.0 binary expects provider labels like `cluster-api`, but the new framework creates `core-cluster-api`. This is a compatibility break in the CAPI test framework, not this PR.
- I checked the GPU test code and it doesn't use `SpotMarketOptions` at all. GPU tests are consistently flaky due to `p3.2xlarge` capacity.
- The EKS auth test uses cluster-template-eks-auth-bootstrap-disabled.yaml, which creates only `AWSManagedCluster` + `AWSManagedControlPlane` (0 workers, no spot config). The DumpMachines panic is a pre-existing bug at common.go:117 where `namespace` isn't nil-checked.

That being said, could you check whether commit 8fad5a1 (the base for this PR) passes these same tests on main? If main also fails, then these are pre-existing issues. If main passes, then there's something more subtle I'm missing.
```go
out, err = s.runInstancesWithSpotFallback(input, i.SpotMarketOptions)
if err != nil {
	return nil, errors.Wrap(err, "failed to run instance with spot fallback")
}
```
I'd invert this condition for easier reading and control flow:

```go
if isSpotCapacityError(err) && isSpotWaitingTimeoutDefined(i) {
	out, err = s.runInstancesWithSpotFallback(input, i.SpotMarketOptions)
	if err != nil {
		return nil, errors.Wrap(err, "failed to run instance with spot fallback")
	}
} else {
	return nil, errors.Wrap(err, "failed to run instance")
}
```
pkg/cloud/services/ec2/instances.go
Outdated
```go
// runInstancesWithSpotFallback attempts to run an EC2 instance, handling spot instance capacity issues.
// If the instance is configured with spot market options and a waiting timeout, it will retry
// until either a spot instance is acquired or the timeout is reached. If the timeout is reached,
// it falls back to creating an on-demand instance.
func (s *Service) runInstancesWithSpotFallback(input *ec2.RunInstancesInput, spotOptions *infrav1.SpotMarketOptions) (*ec2.RunInstancesOutput, error) {
	timeout := spotOptions.WaitingTimeout.Duration
	deadline := time.Now().Add(timeout)

	s.scope.Info("Attempting to acquire spot instance with fallback timeout",
		"timeout", timeout.String())

	for {
		time.Sleep(spotRetryInterval)
```
I had a look at whether we can use aws-sdk-go-v2's SpotMarketOptions.ValidUntil option (from here), as it would significantly simplify the reconciliation logic. But my understanding is that to use it we'd need to set:

```go
spotOpts.SpotInstanceType = types.SpotInstanceTypePersistent // This is the only way to set ValidUntil for one-time requests.
spotOpts.InstanceInterruptionBehavior = types.InstanceInterruptionBehaviorStop // This is the only way to set InstanceInterruptionBehavior for one-time requests.
```

which might be problematic, as it may conflict with "CAPA's current 1:1 Machine-to-Instance model".
Have you looked into this yourself?
Curious to see what others think of this.
I added new fields to the status. This would solve the issue of not having the 1:1 Machine-to-Instance model.
pkg/cloud/services/ec2/instances.go
Outdated
```go
// spotRetryInterval is the interval between spot instance capacity checks.
const spotRetryInterval = 30 * time.Second
```
We should probably make this configurable
Instead, I added the logic to the reconcile loop.
Force-pushed from 5ac452c to cda8e89
@damdo Could you take another look? Thanks!
What type of PR is this?
/kind feature
/kind api-change
What this PR does / why we need it:
This feature adds an optional WaitingTimeout to SpotMarketOptions. The WaitingTimeout serves as a timeout when no Spot instances are available; once it is reached, a regular on-demand EC2 instance is acquired instead.
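As a sketch of how this might look on an AWSMachineTemplate (the field name, placement, and values below are assumptions based on the description above, not the merged API):

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSMachineTemplate
metadata:
  name: spot-with-fallback
spec:
  template:
    spec:
      instanceType: m5.large
      spotMarketOptions:
        maxPrice: "0.05"
        # Hypothetical: wait up to 10 minutes for spot capacity,
        # then fall back to an on-demand instance.
        waitingTimeout: 10m
```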
Second attempt at #5062.
Checklist:
Release note: