Conversation
Co-authored-by: JesseStutler <38534065+JesseStutler@users.noreply.github.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: (no approvers listed). The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing
Pull request overview
Adds missing end-user documentation for the HyperJob CRD in volcano-global, including a user guide and example manifests to help users apply, monitor, and troubleshoot multi-cluster AI training jobs.
Changes:
- Added a HyperJob user guide with prerequisites, API/status explanations, and usage/troubleshooting commands.
- Added three annotated HyperJob example YAML manifests covering job splitting, heterogeneous clusters, and cluster-affinity scenarios.
- Documented child resource naming conventions and linked to the controller design doc and related references.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| docs/user-guide/hyperjob-user-guide.md | New end-user guide describing HyperJob usage, API fields, status semantics, and troubleshooting. |
| docs/deploy/example/hyperjob/llm-training.yaml | Example manifest for splitting a large GPU training job across clusters. |
| docs/deploy/example/hyperjob/heterogeneous-training.yaml | Example manifest showing accelerator/cluster targeting with clusterNames + node affinity. |
| docs/deploy/example/hyperjob/cluster-affinity-training.yaml | Example manifest for multi-cluster placement and restart policy behavior (wording needs alignment with current controller behavior). |
```yaml
# A team runs a fault-tolerant training job across three clusters with a minimum
# availability requirement. The job uses clusterNames to pin each replica to a
# specific cluster, while minAvailable allows the HyperJob to continue even if
# one replica fails.
#
# The controller creates three VCJobs (resilient-training-trainer-0, resilient-training-trainer-1,
# resilient-training-trainer-2), each scheduled to their respective preferred cluster.
# The HyperJob is considered healthy as long as at least 2 out of 3 jobs are running.
```
The header comment says clusterNames pins each replica to a specific cluster and that each VCJob will be scheduled to its “respective” preferred cluster. In practice the controller sets clusterAffinity.clusterNames for every generated PropagationPolicy, which only restricts eligible clusters; it does not guarantee a 1:1 mapping between replicas and clusters or that replicas land on distinct clusters. Please adjust the wording to avoid over-promising placement behavior.
```yaml
pending: <int>
running: <int>
succeeded: <int>
failed: <int>
terminating: <int>
unknown: <int>
```
pending/running/succeeded/failed/... here are aggregated from each child VCJob’s status.{pending,running,...} fields (i.e., task/pod counts in Volcano JobStatus), not counts of child VCJobs. Consider clarifying this in the snippet/description to prevent users from interpreting running: 256 as “256 VCJobs”.
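To make the aggregation semantics concrete, a sketch of how the counters might read for a real HyperJob (the numbers are illustrative, not taken from a real cluster):

```yaml
# Illustrative only: these counters aggregate task/pod counts across all
# child VCJobs' JobStatus fields, NOT the number of VCJobs themselves.
status:
  running: 256     # e.g. 2 child VCJobs x 128 running worker pods each
  pending: 0
  succeeded: 0
  failed: 0
  terminating: 0
  unknown: 0
```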
```yaml
minAvailable: <int>      # Minimum number of VCJobs that must be running (optional)
replicatedJobs:          # List of job templates
- name: <string>         # Unique name for this job group within the HyperJob
  replicas: <int>        # Number of VCJobs to create from this template (default: 1)
  clusterNames:          # Preferred cluster names for scheduling (optional)
  - <cluster-name>
  templateSpec:          # Volcano Job spec template
    <VCJob spec fields>
```

### Field Descriptions

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `spec.minAvailable` | int32 | No | Minimum number of VCJobs that must be running. Reserved for future fault-tolerance use; currently not enforced by the controller. |
| `spec.replicatedJobs` | []ReplicatedJob | Yes | List of job templates. Each entry defines a group of identical VCJobs. |
| `replicatedJobs[].name` | string | Yes | Unique identifier for this replicated job within the HyperJob. |
| `replicatedJobs[].replicas` | int32 | No | Number of VCJobs to create from this template. Defaults to 1. |
| `replicatedJobs[].clusterNames` | []string | No | Preferred clusters for scheduling. The controller attempts to place replicas on these clusters first. |
| `replicatedJobs[].templateSpec` | JobSpec | Yes | The Volcano Job specification used as a template for creating child VCJobs. |
The API reference/field table claims to describe HyperJobSpec/ReplicatedJob, but it omits fields that exist in the shipped CRD (e.g., spec.maxDomains, spec.plugins, and replicatedJobs[].splitPolicy). Either document these fields or explicitly state the table is a partial subset so readers don’t assume they don’t exist.
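One way to address this would be to extend the spec snippet with the omitted fields. A sketch of where they would sit (the placement follows the review comment; the types and inline descriptions are assumptions and should be verified against the shipped CRD schema):

```yaml
# Fields present in the shipped CRD but missing from the table above.
# Types/semantics here are assumptions; confirm against the CRD schema.
spec:
  maxDomains: <int>      # per the review comment; semantics not documented here
  plugins: {}            # HyperJob-level plugin configuration (assumed shape)
  replicatedJobs:
  - name: <string>
    splitPolicy: {}      # controls how a job is split across clusters (assumed shape)
```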
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `replicatedJobs[].replicas` | int32 | No | Number of VCJobs to create from this template. Defaults to 1. |
| `replicatedJobs[].clusterNames` | []string | No | Preferred clusters for scheduling. The controller attempts to place replicas on these clusters first. |
| `replicatedJobs[].templateSpec` | JobSpec | Yes | The Volcano Job specification used as a template for creating child VCJobs. |
clusterNames is described here as a preference list ("attempts to place replicas on these clusters first"), but the controller generates a Karmada clusterAffinity.clusterNames, which is a hard constraint/restriction to those clusters (not an ordered preference). Update the description to reflect the actual behavior.
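For reference, a minimal sketch of the Karmada PropagationPolicy the controller is described as generating (resource names here are hypothetical; the point is that `placement.clusterAffinity.clusterNames` restricts eligible clusters rather than ranking them):

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: my-hyperjob-trainer-0      # hypothetical generated name
spec:
  resourceSelectors:
  - apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    name: my-hyperjob-trainer-0    # the child VCJob this policy propagates
  placement:
    clusterAffinity:
      clusterNames:                # hard restriction: only these clusters are eligible,
      - cluster-a                  # with no ordered preference among them
      - cluster-b
```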
```yaml
conditions:              # Set only when ALL child VCJobs are in terminal states
- type: Completed        # All child VCJobs completed successfully
  status: "True"
- type: Failed           # All child VCJobs finished but at least one failed
  status: "True"
replicatedJobsStatus:    # Per-replicated-job status
```
This status example shows both Completed and Failed conditions set to True at the same time. The controller sets a single terminal condition (either Completed or Failed) once all child VCJobs are finished. Adjust the example to show only one terminal condition being present.
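A corrected sketch along the lines this comment suggests, showing only one terminal condition present at a time:

```yaml
conditions:              # Set only when ALL child VCJobs are in terminal states
- type: Completed        # all child VCJobs succeeded; a failed run would instead
  status: "True"         # carry a single condition of type Failed
```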
```yaml
- name: <replicatedjob-name>
  jobStates:
    <vcjob-name>: <phase>    # Phase of each individual VCJob
  pending: <int>
```
The jobStates map values are not just a phase string. In the CRD/controller they are a JobState object (e.g., phase, reason, message, lastTransitionTime). Consider updating the snippet to reflect the actual structure so users know what to expect in status.replicatedJobsStatus[].jobStates.
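Per this comment, each `jobStates` entry is an object rather than a bare phase string. A sketch of the shape (the job name and field values are illustrative):

```yaml
jobStates:
  my-hyperjob-trainer-0:      # hypothetical child VCJob name
    phase: Running            # JobState object, not a plain string
    reason: ""
    message: ""
    lastTransitionTime: "2024-01-01T00:00:00Z"   # illustrative timestamp
```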
> Run a fault-tolerant training job across three clusters with `minAvailable: 2`,
> so the HyperJob remains healthy even if one cluster becomes unavailable.
This case describes spec.minAvailable as enabling fault-tolerance/health semantics ("remains healthy even if one cluster becomes unavailable"), but HyperJob.spec.minAvailable is currently not enforced by the controller. Please reword this section to avoid implying it affects scheduling or job continuity today (or clearly mark it as future/placeholder behavior).
```yaml
# resilient-training-trainer-2), each scheduled to their respective preferred cluster.
# The HyperJob is considered healthy as long as at least 2 out of 3 jobs are running.
```
This example text implies spec.minAvailable: 2 makes the HyperJob “considered healthy” while 2/3 jobs are running, but HyperJob.spec.minAvailable is currently not enforced by the controller. Consider removing/softening this claim or adding a note that it’s reserved for future fault-tolerance behavior.
```yaml
minAvailable: <int>      # Minimum number of VCJobs that must be running (optional)
replicatedJobs:          # List of job templates
- name: <string>         # Unique name for this job group within the HyperJob
  replicas: <int>        # Number of VCJobs to create from this template (default: 1)
```
The inline comment in the API YAML (minAvailable: ... # Minimum number of VCJobs that must be running) reads like an enforced guarantee, but later the guide notes this field is currently not enforced by the controller. Consider aligning the comment here with that reality (e.g., mark it as reserved/future behavior).
No user-facing documentation existed for HyperJob despite the CRD and controller design doc being present. This adds a user guide and annotated example YAMLs covering the primary use cases.
User guide (`docs/user-guide/hyperjob-user-guide.md`):
- API reference for the `HyperJobSpec`/`ReplicatedJob` fields
- Explanation of how `replicatedJobsStatus` aggregates child job state
- Child resource naming convention: `{hyperjob-name}-{replicatedjob-name}-{index}`
- `kubectl` commands for apply / monitor / delete

Example manifests (`docs/deploy/example/hyperjob/`):
- `llm-training.yaml`: large-scale job splitting (`replicas: 2`, 128 workers each)
- `heterogeneous-training.yaml`: `clusterNames` pinning
- `cluster-affinity-training.yaml`: `minAvailable: 2` and `PodEvicted → RestartJob`

Example (large-scale job splitting):