
Conversation

@JiangJiaWei1103
Contributor

Why are these changes needed?

The current deletionStrategy relies exclusively on the terminal states of JobStatus (SUCCEEDED or FAILED). However, there are several scenarios in which a user-deployed RayJob ends up with JobStatus == "" (JobStatusNew) while JobDeploymentStatus == "Failed". In these cases, the associated resources (e.g., RayJob, RayCluster, etc.) remain stuck and are never cleaned up, resulting in indefinite resource consumption.

Changes

  • Add the JobDeploymentStatus field to DeletionCondition
    • Currently supports Failed only
  • Enforce mutual exclusivity between JobStatus and JobDeploymentStatus within DeletionCondition (see the sketch below)
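
A minimal sketch of the new condition shape (the rule type name rayv1.DeletionRule and the constant rayv1.JobDeploymentStatusFailed are assumed here; the field names follow this PR):

failed := rayv1.JobDeploymentStatusFailed // assumed name of the "Failed" deployment status constant
rule := rayv1.DeletionRule{               // assumed element type of DeletionStrategy.DeletionRules
	Policy: rayv1.DeleteCluster,
	Condition: rayv1.DeletionCondition{
		// Exactly one of JobStatus / JobDeploymentStatus may be set; setting both is rejected by validation.
		JobDeploymentStatus: &failed,
		TTLSeconds:          60,
	},
}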

Implementation Details

To determine which field the user specifies, we use pointers instead of raw values. Both JobStatus and JobDeploymentStatus have empty strings as their zero values, which correspond to a "new" state. Using nil allows us to reliably distinguish between "unspecified" and "explicitly set," avoiding unintended ambiguity.
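
A minimal sketch of how this plays out (illustrative helper only, not part of the PR; assumes the rayv1 and fmt imports used elsewhere):

// nil means "condition not specified"; a pointer to the empty string means
// "explicitly set to the zero value" (e.g., JobStatusNew).
func conditionTrigger(cond *rayv1.DeletionCondition) string {
	switch {
	case cond.JobStatus != nil:
		return fmt.Sprintf("JobStatus %q", *cond.JobStatus)
	case cond.JobDeploymentStatus != nil:
		return fmt.Sprintf("JobDeploymentStatus %q", *cond.JobDeploymentStatus)
	default:
		return "unspecified"
	}
}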

Related issue number

Closes #4233 ([Feature] Automatic Cleanup for RayJobs that exceed activeDeadlineSeconds).

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@rueian
Collaborator

rueian commented Dec 8, 2025

The helm lint is failing.

@pickymodel

The helm lint is failing.

Will fix after getting off work, thanks for reviewing!

@Future-Outlier
Member

cc @seanlaii and @win5923 for help. Note that we need to wait until @andrewsykim is back to discuss the API change.

Collaborator

@win5923 left a comment


Hi @JiangJiaWei1103, can you also update the comment to mention that JobDeploymentStatus is also supported?

# DeletionStrategy defines the deletion policies for a RayJob.
# It allows for fine-grained control over resource cleanup after a job finishes.
# DeletionRules is a list of deletion rules, processed based on their trigger conditions.
# While the rules can be used to define a sequence, if multiple rules are overdue (e.g., due to controller downtime),
# the most impactful rule (e.g., DeleteCluster) will be executed first to prioritize resource cleanup and cost savings.
deletionStrategy:
# This sample demonstrates a staged cleanup process for a RayJob.
# Regardless of whether the job succeeds or fails, the cleanup follows these steps:
# 1. After 30 seconds, the worker pods are deleted. This allows for quick resource release while keeping the head pod for debugging.
# 2. After 60 seconds, the entire RayCluster (including the head pod) is deleted.
# 3. After 90 seconds, the RayJob custom resource itself is deleted, removing it from the Kubernetes API server.

@JiangJiaWei1103
Contributor Author

Hi @JiangJiaWei1103, can you also update the comment to mention that JobDeploymentStatus is also supported?

# DeletionStrategy defines the deletion policies for a RayJob.
# It allows for fine-grained control over resource cleanup after a job finishes.
# DeletionRules is a list of deletion rules, processed based on their trigger conditions.
# While the rules can be used to define a sequence, if multiple rules are overdue (e.g., due to controller downtime),
# the most impactful rule (e.g., DeleteCluster) will be executed first to prioritize resource cleanup and cost savings.
deletionStrategy:
# This sample demonstrates a staged cleanup process for a RayJob.
# Regardless of whether the job succeeds or fails, the cleanup follows these steps:
# 1. After 30 seconds, the worker pods are deleted. This allows for quick resource release while keeping the head pod for debugging.
# 2. After 60 seconds, the entire RayCluster (including the head pod) is deleted.
# 3. After 90 seconds, the RayJob custom resource itself is deleted, removing it from the Kubernetes API server.

Hi @win5923, nice suggestion. I'm considering adding one more sample demonstrating JobDeploymentStatus-based deletion rules, wdyt?

Comment on lines +473 to +476
// Group TTLs by condition type for cross-rule validation and uniqueness checking.
// We separate JobStatus and JobDeploymentStatus to avoid confusion.
rulesByJobStatus := make(map[rayv1.JobStatus]map[rayv1.DeletionPolicyType]int32)
rulesByJobDeploymentStatus := make(map[rayv1.JobDeploymentStatus]map[rayv1.DeletionPolicyType]int32)
Collaborator

@win5923 Dec 9, 2025


I think this is clearer, WDYT?

// validateDeletionRules validates the deletion rules in the RayJob spec.
// It performs per-rule validations, checks for uniqueness, and ensures logical TTL consistency.
// Errors are collected and returned as a single aggregated error using errors.Join for better user feedback.
func validateDeletionRules(rayJob *rayv1.RayJob) error {
	rules := rayJob.Spec.DeletionStrategy.DeletionRules
	isClusterSelectorMode := len(rayJob.Spec.ClusterSelector) != 0

	// Group TTLs by condition type for cross-rule validation and uniqueness checking.
	rulesByCondition := make(map[string]map[rayv1.DeletionPolicyType]int32)
	var errs []error

	// Single pass: Validate each rule individually and group for later consistency checks.
	for i, rule := range rules {
		// Validate and extract the condition key.
		conditionKey, err := getDeletionCondition(&rule.Condition)
		if err != nil {
			errs = append(errs, fmt.Errorf("deletionRules[%d]: %w", i, err))
			continue
		}

		// Validate TTL is non-negative.
		if rule.Condition.TTLSeconds < 0 {
			errs = append(errs, fmt.Errorf("deletionRules[%d]: TTLSeconds must be non-negative", i))
			continue
		}

		// Contextual validations based on spec.
		if isClusterSelectorMode && (rule.Policy == rayv1.DeleteCluster || rule.Policy == rayv1.DeleteWorkers) {
			errs = append(errs, fmt.Errorf("deletionRules[%d]: DeletionPolicyType '%s' not supported when ClusterSelector is set", i, rule.Policy))
			continue
		}
		if IsAutoscalingEnabled(rayJob.Spec.RayClusterSpec) && rule.Policy == rayv1.DeleteWorkers {
			// TODO (rueian): Support in future Ray versions by checking RayVersion.
			errs = append(errs, fmt.Errorf("deletionRules[%d]: DeletionPolicyType 'DeleteWorkers' not supported with autoscaling enabled", i))
			continue
		}

		// Group valid rule for consistency check
		if _, ok := rulesByCondition[conditionKey]; !ok {
			rulesByCondition[conditionKey] = make(map[rayv1.DeletionPolicyType]int32)
		}

		// Check for uniqueness of (condition, DeletionPolicyType) pair.
		if _, exists := rulesByCondition[conditionKey][rule.Policy]; exists {
			errs = append(errs, fmt.Errorf("deletionRules[%d]: duplicate rule for %s and DeletionPolicyType '%s'", i, conditionKey, rule.Policy))
			continue
		}

		rulesByCondition[conditionKey][rule.Policy] = rule.Condition.TTLSeconds
	}

	// Second pass: Validate TTL consistency for each condition group.
	for conditionKey, policyTTLs := range rulesByCondition {
		// Extract the condition type and value from the key (e.g., "JobStatus:FAILED" -> "JobStatus", "FAILED").
		parts := strings.SplitN(conditionKey, ":", 2)
		if len(parts) != 2 {
			// This should never happen due to the getDeletionCondition contract.
			errs = append(errs, fmt.Errorf("internal error: invalid condition key format: %s", conditionKey))
			continue
		}

		conditionType := parts[0]  // "JobStatus" or "JobDeploymentStatus"
		conditionValue := parts[1] // "SUCCEEDED", "FAILED", etc.

		if err := validateTTLConsistency(policyTTLs, conditionType, conditionValue); err != nil {
			errs = append(errs, err)
		}
	}

	return errstd.Join(errs...)
}

func getDeletionCondition(cond *rayv1.DeletionCondition) (string, error) {
	hasJobStatus := cond.JobStatus != nil
	hasJobDeploymentStatus := cond.JobDeploymentStatus != nil

	if hasJobStatus && hasJobDeploymentStatus {
		return "", fmt.Errorf("cannot set both JobStatus and JobDeploymentStatus at the same time")
	}
	if !hasJobStatus && !hasJobDeploymentStatus {
		return "", fmt.Errorf("exactly one of JobStatus and JobDeploymentStatus must be set")
	}

	if hasJobStatus {
		return fmt.Sprintf("JobStatus:%s", *cond.JobStatus), nil
	}
	return fmt.Sprintf("JobDeploymentStatus:%s", *cond.JobDeploymentStatus), nil
}

// validateTTLConsistency ensures TTLs follow the deletion hierarchy: Workers <= Cluster <= Self.
// (Lower TTL means deletes earlier.)
func validateTTLConsistency(policyTTLs map[rayv1.DeletionPolicyType]int32, conditionType string, conditionValue string) error {
	// Define the required deletion order. TTLs must be non-decreasing along this sequence.
	deletionOrder := []rayv1.DeletionPolicyType{
		rayv1.DeleteWorkers,
		rayv1.DeleteCluster,
		rayv1.DeleteSelf,
	}

	var prevPolicy rayv1.DeletionPolicyType
	var prevTTL int32
	var hasPrev bool

	var errs []error

	for _, policy := range deletionOrder {
		ttl, exists := policyTTLs[policy]
		if !exists {
			continue
		}

		if hasPrev && ttl < prevTTL {
			errs = append(errs, fmt.Errorf(
				"for %s '%s': %s TTL (%d) must be >= %s TTL (%d)",
				conditionType, conditionValue, policy, ttl, prevPolicy, prevTTL,
			))
		}

		prevPolicy = policy
		prevTTL = ttl
		hasPrev = true
	}

	return errstd.Join(errs...)
}
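
For instance, this version would reject a rule set where the cluster is deleted before the workers-only cleanup (a minimal illustration reusing validateTTLConsistency above):

policyTTLs := map[rayv1.DeletionPolicyType]int32{
	rayv1.DeleteWorkers: 60,
	rayv1.DeleteCluster: 30, // out of order: the cluster would be gone before the workers-only step
}
// err: "for JobStatus 'FAILED': DeleteCluster TTL (30) must be >= DeleteWorkers TTL (60)"
err := validateTTLConsistency(policyTTLs, "JobStatus", "FAILED")
_ = err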

WithSpec(rayv1ac.RayJobSpec().
WithRayClusterSpec(NewRayClusterSpec(MountConfigMap[rayv1ac.RayClusterSpecApplyConfiguration](jobs, "/home/ray/jobs"))).
WithEntrypoint("python /home/ray/jobs/long_running.py").
WithActiveDeadlineSeconds(45). // Short deadline to fail the JobDeploymentStatus while making sure the cluster is running
Collaborator

@win5923 Dec 9, 2025


Why set 45 seconds here? Is this stable?

rulesByStatus[rule.Condition.JobStatus] = policyTTLs
}
if hasJobStatus {
policyTTLs, ok := rulesByJobStatus[*rule.Condition.JobStatus]
Collaborator


Curious: I couldn't find where the values are assigned to rulesByJobStatus.
