Add HyperJob user guide and example manifests by Copilot · Pull Request #63 · volcano-sh/volcano-global

Copilot · 2026-03-05T02:46:40Z

No user-facing documentation existed for HyperJob despite the CRD and controller design doc being present. This adds a user guide and annotated example YAMLs covering the primary use cases.

User guide (`docs/user-guide/hyperjob-user-guide.md`)

- Prerequisites and CRD installation
- Full API reference table for `HyperJobSpec` / `ReplicatedJob` fields
- Status semantics: when conditions are and aren't set, and what `replicatedJobsStatus` aggregates
- Child resource naming convention: `{hyperjob-name}-{replicatedjob-name}-{index}`
- Three annotated use cases with inline YAML (see below)
- `kubectl` commands for apply / monitor / delete
- Troubleshooting section (missing CRD, VCJobs not scheduled, stuck conditions)

Example manifests (`docs/deploy/example/hyperjob/`)

| File | Use case |
|------|----------|
| `llm-training.yaml` | Split 256-GPU job across two clusters (`replicas: 2`, 128 workers each) |
| `heterogeneous-training.yaml` | Ascend 910B + 910C clusters with node affinity and `clusterNames` pinning |
| `cluster-affinity-training.yaml` | 3-cluster fault-tolerant job with `minAvailable: 2` and `PodEvicted → RestartJob` |

Example — large-scale job splitting:

apiVersion: training.volcano.sh/v1alpha1
kind: HyperJob
metadata:
  name: llm-training
spec:
  minAvailable: 2
  replicatedJobs:
  - name: trainer
    replicas: 2          # creates llm-training-trainer-0 and llm-training-trainer-1
    templateSpec:
      tasks:
      - name: worker
        replicas: 128
        template:
          spec:
            containers:
            - name: trainer
              image: training-image:v1
              resources:
                requests:
                  nvidia.com/gpu: "1"

Original prompt

This section details the original issue you should resolve

<issue_title>Add user guide and examples for HyperJob</issue_title>
<issue_description>Currently, there are no user guides or practical examples for HyperJob in the docs directory. While the repository contains the CRD definition (docs/deploy/training.volcano.sh_hyperjobs.yaml) and the controller design document (docs/proposals/hyperjob-controller-design.md), users need clear guidance on how to actually use this feature.

We should add:

A comprehensive user guide for HyperJob in volcano-global.

Example YAML manifests demonstrating different HyperJob configurations.

The documentation and examples can be based on the use cases described in the HyperJob API proposal from the main volcano repository.</issue_description>

Comments on the Issue (you are @copilot in this section)

Fixes #62 (Add user guide and examples for HyperJob)

Co-authored-by: JesseStutler <38534065+JesseStutler@users.noreply.github.com>

volcano-sh-bot · 2026-03-05T02:49:45Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jessestutler. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot

Pull request overview

Adds missing end-user documentation for the HyperJob CRD in volcano-global, including a user guide and example manifests to help users apply, monitor, and troubleshoot multi-cluster AI training jobs.

Changes:

- Added a HyperJob user guide with prerequisites, API/status explanations, and usage/troubleshooting commands.
- Added three annotated HyperJob example YAML manifests covering job splitting, heterogeneous clusters, and cluster-affinity scenarios.
- Documented child resource naming conventions and linked to the controller design doc and related references.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.

| File | Description |
|------|-------------|
| `docs/user-guide/hyperjob-user-guide.md` | New end-user guide describing HyperJob usage, API fields, status semantics, and troubleshooting. |
| `docs/deploy/example/hyperjob/llm-training.yaml` | Example manifest for splitting a large GPU training job across clusters. |
| `docs/deploy/example/hyperjob/heterogeneous-training.yaml` | Example manifest showing accelerator/cluster targeting with `clusterNames` + node affinity. |
| `docs/deploy/example/hyperjob/cluster-affinity-training.yaml` | Example manifest for multi-cluster placement and restart policy behavior (wording needs alignment with current controller behavior). |

Copilot · 2026-03-10T09:07:15Z

docs/deploy/example/hyperjob/cluster-affinity-training.yaml

+# A team runs a fault-tolerant training job across three clusters with a minimum
+# availability requirement. The job uses clusterNames to pin each replica to a
+# specific cluster, while minAvailable allows the HyperJob to continue even if
+# one replica fails.
+#
+# The controller creates three VCJobs (resilient-training-trainer-0, resilient-training-trainer-1,
+# resilient-training-trainer-2), each scheduled to their respective preferred cluster.
+# The HyperJob is considered healthy as long as at least 2 out of 3 jobs are running.


The header comment says clusterNames pins each replica to a specific cluster and that each VCJob will be scheduled to its “respective” preferred cluster. In practice the controller sets clusterAffinity.clusterNames for every generated PropagationPolicy, which only restricts eligible clusters; it does not guarantee a 1:1 mapping between replicas and clusters or that replicas land on distinct clusters. Please adjust the wording to avoid over-promising placement behavior.
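
One possible rewording of the header, offered as a sketch rather than final text: describe `clusterNames` as restricting which clusters are eligible, and drop the claim of a one-to-one mapping. The cluster names `cluster-a`, `cluster-b`, and `cluster-c` are placeholders, not names taken from the PR.

```yaml
# A team runs a training job across three clusters. clusterNames restricts the
# clusters that are eligible to run the child VCJobs; it does not map each
# replica to a specific cluster, and replicas may land on the same cluster.
#
# The controller creates three VCJobs (resilient-training-trainer-0,
# resilient-training-trainer-1, resilient-training-trainer-2).
spec:
  replicatedJobs:
  - name: trainer
    replicas: 3
    clusterNames:        # placeholder cluster names
    - cluster-a
    - cluster-b
    - cluster-c
```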

Copilot · 2026-03-10T09:07:15Z

docs/user-guide/hyperjob-user-guide.md

+    pending: <int>
+    running: <int>
+    succeeded: <int>
+    failed: <int>
+    terminating: <int>
+    unknown: <int>
+```


pending/running/succeeded/failed/... here are aggregated from each child VCJob’s status.{pending,running,...} fields (i.e., task/pod counts in Volcano JobStatus), not counts of child VCJobs. Consider clarifying this in the snippet/description to prevent users from interpreting running: 256 as “256 VCJobs”.
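
A possible annotation for the snippet, assuming the `llm-training` example from the PR description (2 child VCJobs with 128 workers each) at a moment when every worker pod is running. The values are illustrative only:

```yaml
status:
  replicatedJobsStatus:
  - name: trainer
    # The counts below are summed over the child VCJobs' status fields,
    # so they count tasks/pods, not VCJobs.
    pending: 0
    running: 256         # 2 VCJobs x 128 worker pods, not 256 VCJobs
    succeeded: 0
    failed: 0
    terminating: 0
    unknown: 0
```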

Copilot · 2026-03-10T09:07:15Z

docs/user-guide/hyperjob-user-guide.md

+  minAvailable: <int>          # Minimum number of VCJobs that must be running (optional)
+  replicatedJobs:              # List of job templates
+  - name: <string>             # Unique name for this job group within the HyperJob
+    replicas: <int>            # Number of VCJobs to create from this template (default: 1)
+    clusterNames:              # Preferred cluster names for scheduling (optional)
+    - <cluster-name>
+    templateSpec:              # Volcano Job spec template
+      <VCJob spec fields>
+```
+
+### Field Descriptions
+
+| Field | Type | Required | Description |
+|-------|------|----------|-------------|
+| `spec.minAvailable` | int32 | No | Minimum number of VCJobs that must be running. Reserved for future fault-tolerance use; currently not enforced by the controller. |
+| `spec.replicatedJobs` | []ReplicatedJob | Yes | List of job templates. Each entry defines a group of identical VCJobs. |
+| `replicatedJobs[].name` | string | Yes | Unique identifier for this replicated job within the HyperJob. |
+| `replicatedJobs[].replicas` | int32 | No | Number of VCJobs to create from this template. Defaults to 1. |
+| `replicatedJobs[].clusterNames` | []string | No | Preferred clusters for scheduling. The controller attempts to place replicas on these clusters first. |
+| `replicatedJobs[].templateSpec` | JobSpec | Yes | The Volcano Job specification used as a template for creating child VCJobs. |


The API reference/field table claims to describe HyperJobSpec/ReplicatedJob, but it omits fields that exist in the shipped CRD (e.g., spec.maxDomains, spec.plugins, and replicatedJobs[].splitPolicy). Either document these fields or explicitly state the table is a partial subset so readers don’t assume they don’t exist.

Copilot · 2026-03-10T09:07:15Z

docs/user-guide/hyperjob-user-guide.md

+| `replicatedJobs[].replicas` | int32 | No | Number of VCJobs to create from this template. Defaults to 1. |
+| `replicatedJobs[].clusterNames` | []string | No | Preferred clusters for scheduling. The controller attempts to place replicas on these clusters first. |
+| `replicatedJobs[].templateSpec` | JobSpec | Yes | The Volcano Job specification used as a template for creating child VCJobs. |


clusterNames is described here as a preference list ("attempts to place replicas on these clusters first"), but the controller generates a Karmada clusterAffinity.clusterNames, which is a hard constraint/restriction to those clusters (not an ordered preference). Update the description to reflect the actual behavior.
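
To illustrate why this is a restriction rather than a preference, here is a sketch of the kind of Karmada PropagationPolicy the comment describes. Only `placement.clusterAffinity.clusterNames` comes from the review comment; the policy name, the resource selector, and the cluster names are assumptions for illustration and may not match what the controller actually generates.

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: llm-training-trainer-0        # hypothetical policy name
spec:
  resourceSelectors:                  # assumed selector for the child VCJob
  - apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    name: llm-training-trainer-0
  placement:
    clusterAffinity:
      clusterNames:                   # only these clusters are eligible
      - cluster-a                     # placeholder cluster names
      - cluster-b
```

In Karmada, `clusterNames` under `clusterAffinity` lists the candidate clusters, so a cluster outside the list should not be selected. The field description in the guide could therefore say "restricts scheduling to these clusters" instead of "preferred clusters".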

Copilot · 2026-03-10T09:07:16Z

docs/user-guide/hyperjob-user-guide.md

+  conditions:                   # Set only when ALL child VCJobs are in terminal states
+  - type: Completed             # All child VCJobs completed successfully
+    status: "True"
+  - type: Failed                # All child VCJobs finished but at least one failed
+    status: "True"
+  replicatedJobsStatus:         # Per-replicated-job status


This status example shows both Completed and Failed conditions set to True at the same time. The controller sets a single terminal condition (either Completed or Failed) once all child VCJobs are finished. Adjust the example to show only one terminal condition being present.
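
A corrected example might show a single terminal condition, for instance the success case. This is a sketch based on the behavior described above, not output captured from a running controller:

```yaml
status:
  conditions:                   # Set only when ALL child VCJobs are in terminal states
  - type: Completed             # All child VCJobs completed successfully
    status: "True"
  # If at least one child VCJob had failed, the single condition would instead be:
  # - type: Failed
  #   status: "True"
```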

Copilot · 2026-03-10T09:07:16Z

docs/user-guide/hyperjob-user-guide.md

+  - name: <replicatedjob-name>
+    jobStates:
+      <vcjob-name>: <phase>     # Phase of each individual VCJob
+    pending: <int>


The jobStates map values are not just a phase string. In the CRD/controller they are a JobState object (e.g., phase, reason, message, lastTransitionTime). Consider updating the snippet to reflect the actual structure so users know what to expect in status.replicatedJobsStatus[].jobStates.
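
A sketch of what the updated snippet could look like, using the field names listed in this comment. The VCJob name follows the `llm-training` example, the values are illustrative, and the exact field set should be checked against the shipped CRD:

```yaml
status:
  replicatedJobsStatus:
  - name: trainer
    jobStates:
      llm-training-trainer-0:                      # child VCJob name
        phase: Running                             # illustrative values below
        reason: ""
        message: ""
        lastTransitionTime: "2026-01-01T00:00:00Z"
```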

Copilot · 2026-03-10T09:07:16Z

docs/user-guide/hyperjob-user-guide.md

+Run a fault-tolerant training job across three clusters with `minAvailable: 2`,
+so the HyperJob remains healthy even if one cluster becomes unavailable.
+


This case describes spec.minAvailable as enabling fault-tolerance/health semantics ("remains healthy even if one cluster becomes unavailable"), but HyperJob.spec.minAvailable is currently not enforced by the controller. Please reword this section to avoid implying it affects scheduling or job continuity today (or clearly mark it as future/placeholder behavior).

Copilot · 2026-03-10T09:07:16Z

docs/deploy/example/hyperjob/cluster-affinity-training.yaml

+# resilient-training-trainer-2), each scheduled to their respective preferred cluster.
+# The HyperJob is considered healthy as long as at least 2 out of 3 jobs are running.


This example text implies spec.minAvailable: 2 makes the HyperJob “considered healthy” while 2/3 jobs are running, but HyperJob.spec.minAvailable is currently not enforced by the controller. Consider removing/softening this claim or adding a note that it’s reserved for future fault-tolerance behavior.

Copilot · 2026-03-10T09:07:16Z

docs/user-guide/hyperjob-user-guide.md

+  minAvailable: <int>          # Minimum number of VCJobs that must be running (optional)
+  replicatedJobs:              # List of job templates
+  - name: <string>             # Unique name for this job group within the HyperJob
+    replicas: <int>            # Number of VCJobs to create from this template (default: 1)


The inline comment in the API YAML (minAvailable: ... # Minimum number of VCJobs that must be running) reads like an enforced guarantee, but later the guide notes this field is currently not enforced by the controller. Consider aligning the comment here with that reality (e.g., mark it as reserved/future behavior).
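
One way to align the inline comment with the field table, reusing the table's own wording ("reserved for future fault-tolerance use; currently not enforced by the controller"). This is a suggested phrasing only:

```yaml
spec:
  minAvailable: <int>          # Reserved for future fault-tolerance use; currently not enforced (optional)
  replicatedJobs:              # List of job templates
  - name: <string>             # Unique name for this job group within the HyperJob
    replicas: <int>            # Number of VCJobs to create from this template (default: 1)
```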

Initial plan

48bb2b8

Copilot AI assigned Copilot and JesseStutler Mar 5, 2026

volcano-sh-bot added the do-not-merge/work-in-progress label Mar 5, 2026

Copilot started work on behalf of JesseStutler March 5, 2026 02:46 View session

volcano-sh-bot added the size/XS label Mar 5, 2026

Add user guide and examples for HyperJob

eba682a

Co-authored-by: JesseStutler <38534065+JesseStutler@users.noreply.github.com>

volcano-sh-bot added size/XL and removed size/XS labels Mar 5, 2026

Copilot AI changed the title ~~[WIP] Add user guide and examples for HyperJob~~ Add HyperJob user guide and example manifests Mar 5, 2026

Copilot finished work on behalf of JesseStutler March 5, 2026 02:50

JesseStutler marked this pull request as ready for review March 10, 2026 09:01

Copilot AI review requested due to automatic review settings March 10, 2026 09:01

volcano-sh-bot removed the do-not-merge/work-in-progress label Mar 10, 2026

volcano-sh-bot requested review from JesseStutler and Monokaix March 10, 2026 09:01

Copilot started reviewing on behalf of JesseStutler March 10, 2026 09:02 View session

Copilot AI reviewed Mar 10, 2026

Copilot wants to merge 2 commits into `main` from `copilot/add-user-guides-hyperjob`


4 participants
