[Jobs] Allow custom job recovery strategy configuration#9154

Draft
Michaelvll wants to merge 2 commits into master from feature/extensible-job-recovery-strategies

Conversation

@Michaelvll
Collaborator

Summary

  • Add register_job_recovery_property() in schemas.py so additional strategy-specific fields can be registered for the job_recovery schema while keeping additionalProperties: False
  • Add set_strategy_config() on StrategyExecutor. make() now passes the remaining dict keys (after common ones like strategy, max_restarts_on_errors, recover_on_exit_codes are popped) to the executor via this method. Subclasses can override it to accept custom parameters.
  • Make _try_validate_managed_job_attributes lenient for strategy names not yet in the registry, deferring full validation to the server

Test plan

  • Existing unit tests pass — no behavior change for built-in strategies (FAILOVER/EAGER_NEXT_REGION) since set_strategy_config() is a no-op for empty config by default

🤖 Generated with Claude Code

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request enhances the job recovery system by making it more extensible and flexible. It allows for the definition and handling of custom, strategy-specific configuration parameters within the job recovery schema, and ensures that client-side validation gracefully handles strategies that might only be known to the server. This change primarily supports plugin-based extensions to job recovery mechanisms without requiring core code modifications for each new strategy.

Highlights

  • Extensible Job Recovery Configuration: Introduced register_job_recovery_property() in schemas.py to allow plugins to register additional strategy-specific fields for the job_recovery schema, enabling custom configurations while maintaining strict schema validation.
  • Strategy-Specific Configuration Handling: Added a set_strategy_config() method to StrategyExecutor. The make() method now passes any remaining dictionary keys from the job_recovery configuration (after common fields are extracted) to this new method, allowing subclasses to process custom parameters.
  • Lenient Client-Side Validation: Modified _try_validate_managed_job_attributes to be more lenient for job recovery strategy names not yet registered on the client, deferring full validation to the server to support server-side plugins.
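
The lenient client-side check in the last highlight can be pictured roughly as follows. This is only a sketch under assumptions: check_strategy_name(), its lenient flag, and the return convention are hypothetical and do not mirror the real _try_validate_managed_job_attributes; only the two built-in strategy names come from the PR description.

```python
from typing import Iterable, Optional

# Built-in strategy names mentioned in the PR description.
BUILT_IN_STRATEGIES = frozenset({'FAILOVER', 'EAGER_NEXT_REGION'})


def check_strategy_name(name: Optional[str],
                        known: Iterable[str] = BUILT_IN_STRATEGIES,
                        *,
                        lenient: bool) -> bool:
    """Return True if validated here, False if validation is deferred.

    Hypothetical helper: a client would call it with lenient=True, and a
    server (where plugins may have registered more names) with lenient=False.
    """
    if name is None or name in set(known):
        return True
    if lenient:
        # Unknown on this side; let the server make the final decision.
        return False
    raise ValueError(f'Unknown job recovery strategy: {name!r}')
```

The point of the split is that a typo is still caught, just later: the server, which sees the full registry, rejects names that no plugin has registered.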

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a mechanism to allow custom job recovery strategy configurations. The changes include adding a registration function for strategy-specific schema properties, a new method in StrategyExecutor to handle these custom configurations, and making the client-side validation of strategy names more lenient. The implementation is clean and the plugin-based approach for extending the schema is a good design choice. The changes look correct and well-integrated.

@Michaelvll Michaelvll force-pushed the feature/extensible-job-recovery-strategies branch 7 times, most recently from 402ab3e to 06b34a0 Compare March 23, 2026 20:32
Add extension points so plugins can register custom recovery strategies
with strategy-specific configuration:

1. schemas.py: Add register_job_recovery_property() for plugins to
   extend the job_recovery JSON schema with custom fields while keeping
   additionalProperties: False.

2. recovery_strategy.py: Add set_strategy_config() method on
   StrategyExecutor base class. After make() pops common keys
   (strategy, max_restarts_on_errors, recover_on_exit_codes), remaining
   dict keys are passed to the executor via set_strategy_config().
   Subclasses override to handle strategy-specific config.

3. resources.py: Make _try_validate_managed_job_attributes lenient for
   unknown strategy names, deferring validation to the server where
   plugins may have registered additional strategies.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
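
Item 2 of the commit message above could look roughly like the sketch below. It is a simplified illustration, not the code in recovery_strategy.py: the constructor arguments, the registry parameter, the WarmStandbyExecutor subclass, and its standby_count field are all hypothetical. The PR only states that the base-class default is a no-op for an empty config; rejecting leftover keys in the base class is a choice made for this sketch.

```python
from typing import Any, Dict, List, Optional, Type


class StrategyExecutor:
    """Simplified stand-in for the real base class."""

    def __init__(self,
                 max_restarts_on_errors: int = 0,
                 recover_on_exit_codes: Optional[List[int]] = None) -> None:
        self.max_restarts_on_errors = max_restarts_on_errors
        self.recover_on_exit_codes = recover_on_exit_codes or []

    def set_strategy_config(self, config: Dict[str, Any]) -> None:
        # No-op for an empty config. Raising on leftovers is an assumption.
        if config:
            raise ValueError(f'Unsupported job_recovery keys: {sorted(config)}')


class WarmStandbyExecutor(StrategyExecutor):
    """Hypothetical plugin strategy with one custom parameter."""

    def set_strategy_config(self, config: Dict[str, Any]) -> None:
        self.standby_count = int(config.get('standby_count', 1))


def make(job_recovery: Dict[str, Any],
         registry: Dict[str, Type[StrategyExecutor]]) -> StrategyExecutor:
    config = dict(job_recovery)  # copy so the caller's dict is not mutated
    name = config.pop('strategy')
    executor = registry[name](
        max_restarts_on_errors=config.pop('max_restarts_on_errors', 0),
        recover_on_exit_codes=config.pop('recover_on_exit_codes', None),
    )
    # Whatever is left after popping the common keys is strategy-specific.
    executor.set_strategy_config(config)
    return executor
```

With this shape, built-in strategies never see extra keys, and a plugin only has to override one method to receive its own parameters.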
@Michaelvll Michaelvll force-pushed the feature/extensible-job-recovery-strategies branch from 06b34a0 to e9fd20f Compare March 24, 2026 01:29
Add warm_nodes field to ProvisionConfig. When set, the K8s provisioner:
- Creates additional warm pods with skypilot.co/role=warm label
- Only waits for active pods to be Running (warm pods provision async)
- Skips Ray worker start on warm pods in instance_setup

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
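
The readiness rule in the warm_nodes commit above can be sketched as follows. The Pod dataclass and both helper functions are hypothetical stand-ins, not the Kubernetes client types or the provisioner code; only the skypilot.co/role=warm label comes from the commit message.

```python
from dataclasses import dataclass, field
from typing import Dict, List

ROLE_LABEL = 'skypilot.co/role'  # label key taken from the commit message
WARM_ROLE = 'warm'


@dataclass
class Pod:
    """Hypothetical minimal view of a pod: name, phase, and labels."""
    name: str
    phase: str
    labels: Dict[str, str] = field(default_factory=dict)


def is_warm(pod: Pod) -> bool:
    return pod.labels.get(ROLE_LABEL) == WARM_ROLE


def active_pods_ready(pods: List[Pod]) -> bool:
    """True once every non-warm pod is Running; warm pods are ignored."""
    return all(pod.phase == 'Running' for pod in pods if not is_warm(pod))
```

Under this rule a warm pod that is still Pending does not block provisioning, which matches the "warm pods provision async" behavior described in the commit.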