Commit 58a3ff0

seanlaii and Copilot authored
[RayJob] Enhance RayJob DeletionStrategy to Support Multi-Stage Deletion (#4040)
* [CRD][RayJob] Define new DeletionStrategy in RayJob CRD

Signed-off-by: wei-chenglai <[email protected]>

* Add controller tests
* trigger CI
* Revert change for triggering CI
* address comment
* rename to TTLSeconds
* fix typo
* modify comment
* address comment
* remove duplicate errors pkg
* improve api doc
* add e2e tests for deletion strategy
* fix lint
* add feature gate override for e2e tests
* fix lint & fix validation error
* refactor
* trigger ci
* trigger ci
* refactor description
* improve deletion check
* remove redundant comment

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Wei-Cheng Lai <[email protected]>

---------

Signed-off-by: wei-chenglai <[email protected]>
Signed-off-by: Wei-Cheng Lai <[email protected]>
Co-authored-by: Copilot <[email protected]>
1 parent 362da3d commit 58a3ff0

21 files changed: +3780 -218 lines changed

.buildkite/build-start-operator.sh

Lines changed: 6 additions & 2 deletions
@@ -7,10 +7,14 @@
 # to kick off from the release branch so tests should match up accordingly.
 
 if [ "$IS_FROM_RAY_RELEASE_AUTOMATION" = 1 ]; then
-  helm repo update && helm install kuberay/kuberay-operator
+  helm repo update
+  echo "Installing helm chart with test override values (feature gates enabled as needed)"
+  # NOTE: The override file is CI/test-only. It is NOT part of the released chart defaults.
+  helm install kuberay-operator kuberay/kuberay-operator -f ../.buildkite/values-kuberay-operator-override.yaml
   KUBERAY_TEST_RAY_IMAGE="rayproject/ray:nightly.$(date +'%y%m%d').${RAY_NIGHTLY_COMMIT:0:6}-py39" && export KUBERAY_TEST_RAY_IMAGE
 else
   IMG=kuberay/operator:nightly make docker-image &&
   kind load docker-image kuberay/operator:nightly &&
-  IMG=kuberay/operator:nightly make deploy
+  echo "Deploying operator with test overrides (feature gates via test-overrides overlay)"
+  IMG=kuberay/operator:nightly make deploy-with-override
 fi
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+# Generic Helm values override used only in CI / e2e test environments.
+# Intent:
+# - Allow e2e tests to turn on alpha / experimental feature gates (e.g. RayJobDeletionPolicy)
+# - Provide a single place contributors can extend with additional overrides needed for tests
+# - Keep the default published Helm chart behavior unchanged for normal users
+# Scope / Safety:
+# - This file is never referenced by the base chart; it is opt‑in via buildkite or manual helm install
+# - Do NOT rename it to values.yaml or commit changes that enable unstable features by default
+# Usage examples:
+# helm install kuberay-operator kuberay/kuberay-operator -f ../.buildkite/values-kuberay-operator-override.yaml
+# (add or remove feature gates below as e2e scenarios expand)
+#
+# Current overrides: enable RayJobDeletionPolicy alpha feature gate alongside the existing status conditions gate.
+featureGates:
+- name: RayClusterStatusConditions
+  enabled: true
+- name: RayJobDeletionPolicy
+  enabled: true

docs/reference/api.md

Lines changed: 60 additions & 5 deletions
@@ -55,11 +55,28 @@ _Appears in:_
 
 
 
-#### DeletionPolicy
+#### DeletionCondition
+
+
+
+DeletionCondition specifies the trigger conditions for a deletion action.
+
+
+
+_Appears in:_
+- [DeletionRule](#deletionrule)
 
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `ttlSeconds` _integer_ | TTLSeconds is the time in seconds from when the JobStatus<br />reaches the specified terminal state to when this deletion action should be triggered.<br />The value must be a non-negative integer. | 0 | Minimum: 0 <br /> |
+
+
+#### DeletionPolicy
 
 
 
+DeletionPolicy is the legacy single-stage deletion policy.
+Deprecated: This struct is part of the legacy API. Use DeletionRule for new configurations.
 
 
 
@@ -68,7 +85,7 @@ _Appears in:_
 
 | Field | Description | Default | Validation |
 | --- | --- | --- | --- |
-| `policy` _[DeletionPolicyType](#deletionpolicytype)_ | Valid values are 'DeleteCluster', 'DeleteWorkers', 'DeleteSelf' or 'DeleteNone'. | | |
+| `policy` _[DeletionPolicyType](#deletionpolicytype)_ | Policy is the action to take when the condition is met.<br />This field is logically required when using the legacy OnSuccess/OnFailure policies.<br />It is marked as '+optional' at the API level to allow the 'deletionRules' field to be used instead. | | Enum: [DeleteCluster DeleteWorkers DeleteSelf DeleteNone] <br /> |
 
 
 #### DeletionPolicyType
@@ -81,14 +98,51 @@ _Underlying type:_ _string_
 
 _Appears in:_
 - [DeletionPolicy](#deletionpolicy)
+- [DeletionRule](#deletionrule)
+
+
+
+#### DeletionRule
 
 
 
+DeletionRule defines a single deletion action and its trigger condition.
+This is the new, recommended way to define deletion behavior.
+
+
+
+_Appears in:_
+- [DeletionStrategy](#deletionstrategy)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `policy` _[DeletionPolicyType](#deletionpolicytype)_ | Policy is the action to take when the condition is met. This field is required. | | Enum: [DeleteCluster DeleteWorkers DeleteSelf DeleteNone] <br /> |
+| `condition` _[DeletionCondition](#deletioncondition)_ | The condition under which this deletion rule is triggered. This field is required. | | |
+
+
 #### DeletionStrategy
 
 
 
+DeletionStrategy configures automated cleanup after the RayJob reaches a terminal state.
+Two mutually exclusive styles are supported:
+
+
+Legacy: provide both onSuccess and onFailure (deprecated; removal planned for 1.6.0). May be combined with shutdownAfterJobFinishes and (optionally) global TTLSecondsAfterFinished.
+Rules: provide deletionRules (non-empty list). Rules mode is incompatible with shutdownAfterJobFinishes, legacy fields, and the global TTLSecondsAfterFinished (use per‑rule condition.ttlSeconds instead).
+
+
+Semantics:
+- A non-empty deletionRules selects rules mode; empty lists are treated as unset.
+- Legacy requires both onSuccess and onFailure; specifying only one is invalid.
+- Global TTLSecondsAfterFinished > 0 requires shutdownAfterJobFinishes=true; therefore it cannot be used with rules mode or with legacy alone (no shutdown).
+- Feature gate RayJobDeletionPolicy must be enabled when this block is present.
+
 
+Validation:
+- CRD XValidations prevent mixing legacy fields with deletionRules and enforce legacy completeness.
+- Controller logic enforces rules vs shutdown exclusivity and TTL constraints.
+- onSuccess/onFailure are deprecated; migration to deletionRules is encouraged.
 
 
 
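
The Semantics and Validation notes added in this hunk can be read as a small set of checks. The following is a minimal Python sketch of those checks, not the controller's Go implementation or the CRD XValidation rules themselves. The function name `validate_deletion_strategy`, its dict-based input, and the error strings are hypothetical; only the constraints are taken from the documentation text in the diff.

```python
from typing import List, Optional


def validate_deletion_strategy(
    strategy: Optional[dict],
    shutdown_after_job_finishes: bool = False,
    ttl_seconds_after_finished: int = 0,
    feature_gate_enabled: bool = True,
) -> List[str]:
    """Return the violated constraints; an empty list means none were found.

    Hypothetical helper: names and messages are illustrative only.
    """
    errors: List[str] = []

    # Global TTL > 0 requires shutdownAfterJobFinishes=true.
    if ttl_seconds_after_finished > 0 and not shutdown_after_job_finishes:
        errors.append("ttlSecondsAfterFinished > 0 requires shutdownAfterJobFinishes=true")

    if strategy is None:  # deletionStrategy omitted: nothing else to check
        return errors

    if not feature_gate_enabled:
        errors.append("RayJobDeletionPolicy feature gate must be enabled")

    # An empty deletionRules list is treated the same as an unset field.
    rules = strategy.get("deletionRules") or []
    has_success = strategy.get("onSuccess") is not None
    has_failure = strategy.get("onFailure") is not None

    if rules:  # rules mode
        if has_success or has_failure:
            errors.append("deletionRules cannot be combined with onSuccess/onFailure")
        if shutdown_after_job_finishes:
            errors.append("deletionRules cannot be combined with shutdownAfterJobFinishes")
        if ttl_seconds_after_finished > 0:
            errors.append("deletionRules cannot be combined with ttlSecondsAfterFinished")
    elif has_success != has_failure:  # legacy mode needs both fields
        errors.append("legacy mode requires both onSuccess and onFailure")

    return errors
```

Returning a list of messages instead of raising on the first problem keeps every violated constraint visible at once, which is convenient when checking a spec by hand.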

@@ -97,8 +151,9 @@ _Appears in:_
 
 | Field | Description | Default | Validation |
 | --- | --- | --- | --- |
-| `onSuccess` _[DeletionPolicy](#deletionpolicy)_ | | | |
-| `onFailure` _[DeletionPolicy](#deletionpolicy)_ | | | |
+| `onSuccess` _[DeletionPolicy](#deletionpolicy)_ | OnSuccess is the deletion policy for a successful RayJob.<br />Deprecated: Use `deletionRules` instead for more flexible, multi-stage deletion strategies.<br />This field will be removed in release 1.6.0. | | |
+| `onFailure` _[DeletionPolicy](#deletionpolicy)_ | OnFailure is the deletion policy for a failed RayJob.<br />Deprecated: Use `deletionRules` instead for more flexible, multi-stage deletion strategies.<br />This field will be removed in release 1.6.0. | | |
+| `deletionRules` _[DeletionRule](#deletionrule) array_ | DeletionRules is a list of deletion rules, processed based on their trigger conditions.<br />While the rules can be used to define a sequence, if multiple rules are overdue (e.g., due to controller downtime),<br />the most impactful rule (e.g., DeleteSelf) will be executed first to prioritize resource cleanup. | | MinItems: 1 <br /> |
 
 
 
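
The `deletionRules` description says that when several rules are overdue, the most impactful one runs first. A minimal Python sketch of that selection follows, assuming each rule becomes due `ttlSeconds` after the job reached its terminal state. The `IMPACT` ranking and the name `pick_overdue_rule` are hypothetical: the documentation only names DeleteSelf as an example of the most impactful policy, so the relative order of the other policies is an assumption.

```python
from datetime import datetime, timedelta, timezone

# ASSUMPTION: relative impact of each policy. The source only states that
# DeleteSelf is an example of the most impactful rule.
IMPACT = {"DeleteNone": 0, "DeleteWorkers": 1, "DeleteCluster": 2, "DeleteSelf": 3}


def pick_overdue_rule(rules, finished_at, now):
    """Return the most impactful rule whose TTL has elapsed, or None.

    A rule is due once `finished_at + ttlSeconds <= now`; ttlSeconds
    defaults to 0, matching the documented default.
    """
    overdue = [
        rule
        for rule in rules
        if finished_at + timedelta(seconds=rule["condition"].get("ttlSeconds", 0)) <= now
    ]
    if not overdue:
        return None
    return max(overdue, key=lambda rule: IMPACT[rule["policy"]])


# Example multi-stage configuration: workers first, then cluster, then the RayJob.
EXAMPLE_RULES = [
    {"policy": "DeleteWorkers", "condition": {"ttlSeconds": 30}},
    {"policy": "DeleteCluster", "condition": {"ttlSeconds": 60}},
    {"policy": "DeleteSelf", "condition": {"ttlSeconds": 300}},
]
EXAMPLE_FINISHED_AT = datetime(2025, 1, 1, tzinfo=timezone.utc)
```

With `EXAMPLE_RULES`, a reconcile 45 seconds after completion selects only DeleteWorkers, while a reconcile delayed by an hour finds all three rules overdue and selects DeleteSelf.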

@@ -242,7 +297,7 @@ _Appears in:_
 | `clusterSelector` _object (keys:string, values:string)_ | clusterSelector is used to select running rayclusters by labels | | |
 | `submitterConfig` _[SubmitterConfig](#submitterconfig)_ | Configurations of submitter k8s job. | | |
 | `managedBy` _string_ | ManagedBy is an optional configuration for the controller or entity that manages a RayJob.<br />The value must be either 'ray.io/kuberay-operator' or 'kueue.x-k8s.io/multikueue'.<br />The kuberay-operator reconciles a RayJob which doesn't have this field at all or<br />the field value is the reserved string 'ray.io/kuberay-operator',<br />but delegates reconciling the RayJob with 'kueue.x-k8s.io/multikueue' to the Kueue.<br />The field is immutable. | | |
-| `deletionStrategy` _[DeletionStrategy](#deletionstrategy)_ | DeletionStrategy indicates what resources of the RayJob and how they are deleted upon job completion.<br />If unset, deletion policy is based on 'spec.shutdownAfterJobFinishes'.<br />This field requires the RayJobDeletionPolicy feature gate to be enabled. | | |
+| `deletionStrategy` _[DeletionStrategy](#deletionstrategy)_ | DeletionStrategy automates post-completion cleanup.<br />Choose one style or omit:<br /> - Legacy: both onSuccess & onFailure (deprecated; may combine with shutdownAfterJobFinishes and TTLSecondsAfterFinished).<br /> - Rules: deletionRules (non-empty) — incompatible with shutdownAfterJobFinishes, legacy fields, and global TTLSecondsAfterFinished (use per-rule condition.ttlSeconds).<br />Global TTLSecondsAfterFinished > 0 requires shutdownAfterJobFinishes=true.<br />Feature gate RayJobDeletionPolicy must be enabled when this field is set. | | |
 | `entrypoint` _string_ | Entrypoint represents the command to start execution. | | |
 | `runtimeEnvYAML` _string_ | RuntimeEnvYAML represents the runtime environment configuration<br />provided as a multi-line YAML string. | | |
 | `jobId` _string_ | If jobId is not set, a new jobId will be auto-generated. | | |

helm-chart/kuberay-operator/crds/ray.io_rayjobs.yaml

Lines changed: 49 additions & 17 deletions
Some generated files are not rendered by default.

ray-operator/Makefile

Lines changed: 17 additions & 1 deletion
@@ -67,7 +67,6 @@ test: ENVTEST_K8S_VERSION ?= 1.24.2
 test: manifests fmt vet envtest ## Run tests.
 	KUBEBUILDER_ASSETS="$(shell $(ENVTEST) use $(ENVTEST_K8S_VERSION) --bin-dir $(LOCALBIN) -p path)" go test $(WHAT) -coverprofile cover.out
 
-# You can use `go test -timeout 30m -v ./test/e2e/rayjob_test.go ./test/e2e/support.go` if you only want to run tests in `rayjob_test.go`.
 test-e2e: WHAT ?= ./test/e2e
 test-e2e: manifests fmt vet ## Run e2e tests.
 	go test -timeout 30m -v $(WHAT)
@@ -88,6 +87,14 @@ test-sampleyaml: WHAT ?= ./test/sampleyaml
 test-sampleyaml: manifests fmt vet
 	go test -timeout 30m -v $(WHAT)
 
+test-e2e-rayjob: WHAT ?= ./test/e2erayjob
+test-e2e-rayjob: manifests fmt vet ## Run e2e tests.
+	go test -timeout 30m -v $(WHAT)
+
+test-e2e-rayservice: WHAT ?= ./test/e2erayservice
+test-e2e-rayservice: manifests fmt vet ## Run e2e tests.
+	go test -timeout 30m -v $(WHAT)
+
 sync: helm api-docs
 	./hack/update-codegen.sh
 
@@ -136,6 +143,15 @@ deploy: manifests kustomize ## Deploy controller to the K8s cluster specified in
 	cd config/default && $(KUSTOMIZE) edit set image kuberay/operator=${IMG}
 	$(KUSTOMIZE) build config/default | kubectl apply --server-side=true -f -
 
+# NOTE FOR CONTRIBUTORS:
+# deploy-with-override is an e2e/CI-only deployment path. It applies a Kustomize overlay that
+# enables test-only feature gates (e.g. RayJobDeletionPolicy) without changing the default
+# behavior of the base Helm chart or the standard 'make deploy'. Add additional test overrides
+# to the overlay (config/overlays/rayjob-deletion-policy) rather than modifying the base.
+deploy-with-override: manifests kustomize ## Deploy controller with test-only feature gate overrides (does NOT affect default chart).
+	cd config/default && $(KUSTOMIZE) edit set image kuberay/operator=${IMG}
+	$(KUSTOMIZE) build config/overlays/test-overrides | kubectl apply --server-side=true -f -
+
 undeploy: ## Undeploy controller from the K8s cluster specified in ~/.kube/config.
 	$(KUSTOMIZE) build config/default | kubectl delete -f -
 
