Prevent phantom autoscaler history entries on failed replica updates#734
Prevent phantom autoscaler history entries on failed replica updates#734WHOIM1205 wants to merge 2 commits intovolcano-sh:mainfrom
Conversation
Signed-off-by: WHOIM1205 <[email protected]>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Summary of ChangesHello @WHOIM1205, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a critical correctness bug in the Kthena autoscaler that led to corrupted scaling history when replica updates to the Kubernetes API failed. By refactoring the autoscaler to separate scaling computation from history commitment, the system now ensures that history accurately reflects only successfully applied state. This change prevents 'phantom scale events,' restores deterministic stabilization and rate-limiting, and significantly improves the autoscaler's reliability, particularly when operating alongside other scaling mechanisms like HPA or KEDA. Additionally, error reporting for API interactions has been made more precise, aiding in debugging. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This is a high-quality pull request that addresses a critical correctness bug in the autoscaler. The separation of concerns, by moving history updates out of the Scale and Optimize functions and into explicit Commit functions, is a great architectural improvement. This change effectively prevents phantom history entries when API updates fail. The improved error propagation with more context is also a welcome change that will make debugging much easier. The new comprehensive test suite is excellent and provides strong confidence in the fix. I have a few suggestions to further improve the code, mainly around preventing potential integer overflows and improving code clarity.
Signed-off-by: WHOIM1205 <[email protected]>
FAUST-BENCHOU
left a comment
There was a problem hiding this comment.
Here's my advice.we dont need to add more struct to store history or else.we can simply change the place of AppendRecommendation and AppendCorrected
for example
kthena/pkg/autoscaler/autoscaler/optimizer.go
Lines 203 to 204 in a8c9193
kthena/pkg/autoscaler/autoscaler/scaler.go
Lines 104 to 105 in a8c9193
can be moved to doOptimize and doScale in autoscale_controller
to make sure the controller call Status.AppendRecommendation/AppendCorrected only after all API updates succeed.However, that means we need to also return recommendedInstances in func Optimize and Scale
thanks for the suggestion |
|
/assign @YaoZengzeng |
|
/assign @hzxuzhonghu |
|
@hzxuzhonghu is there anything i can change |
Fix autoscaler history corruption on failed replica updates
Summary
This PR fixes a correctness bug in the Kthena autoscaler where scaling decisions were recorded in stabilization and rate-limiting history before replica updates were successfully applied to the Kubernetes API.
When the API update failed (most commonly due to conflicts with concurrent HPA/KEDA updates), the autoscaler history was left in a corrupted state, causing phantom scale events to influence future scaling decisions.
This resulted in silent and non-deterministic autoscaling behavior.
What was wrong
The autoscaler updated its internal history inside
Scale()/Optimize(), prior to attempting the API update:This issue is especially visible when Kthena runs alongside HPA or KEDA, where update conflicts are expected.
What this PR fixes
Separates computation from side effects
Scale()andOptimize()now return results without mutating historyPrevents phantom scale events
Ensures atomic behavior for heterogeneous scaling
Fixes error masking
updateTargetReplicas()now returns real API errors instead of misleading"not supported"messagesCode changes
Changes are scoped and minimal, focused on correctness:
pkg/autoscaler/autoscaler/scaler.goScaleResultCommitScaleResult()for explicit history commitsoptimizer.goOptimizeResultpkg/autoscaler/controller/autoscale_controller.goupdateTargetReplicas()Impact
No API changes and no behavior change on successful updates — only correctness improvements on failure paths.
Test verification
Added comprehensive table-driven tests covering both success and failure paths:
TestDoScale_HistoryCommitBehaviorTestDoScale_ErrorPropagationTestDoOptimize_HistoryCommitOnlyAfterAllUpdatesSucceedScaleResult,OptimizeResult)Run only the new tests: