
Add HyperJob user guide and example manifests #63

Open
Copilot wants to merge 2 commits into main from copilot/add-user-guides-hyperjob


Copilot AI commented Mar 5, 2026

No user-facing documentation existed for HyperJob despite the CRD and controller design doc being present. This adds a user guide and annotated example YAMLs covering the primary use cases.

User guide (docs/user-guide/hyperjob-user-guide.md)

  • Prerequisites and CRD installation
  • Full API reference table for HyperJobSpec / ReplicatedJob fields
  • Status semantics — when conditions are/aren't set, what replicatedJobsStatus aggregates
  • Child resource naming convention: {hyperjob-name}-{replicatedjob-name}-{index}
  • Three annotated use cases with inline YAML (see below)
  • kubectl commands for apply / monitor / delete
  • Troubleshooting section (missing CRD, VCJobs not scheduled, stuck conditions)

Example manifests (docs/deploy/example/hyperjob/)

| File | Use case |
|------|----------|
| llm-training.yaml | Split 256-GPU job across two clusters (replicas: 2, 128 workers each) |
| heterogeneous-training.yaml | Ascend 910B + 910C clusters with node affinity and clusterNames pinning |
| cluster-affinity-training.yaml | 3-cluster fault-tolerant job with minAvailable: 2 and PodEvicted → RestartJob |

Example — large-scale job splitting:

apiVersion: training.volcano.sh/v1alpha1
kind: HyperJob
metadata:
  name: llm-training
spec:
  minAvailable: 2
  replicatedJobs:
  - name: trainer
    replicas: 2          # creates llm-training-trainer-0 and llm-training-trainer-1
    templateSpec:
      tasks:
      - name: worker
        replicas: 128
        template:
          spec:
            containers:
            - name: trainer
              image: training-image:v1
              resources:
                requests:
                  nvidia.com/gpu: "1"
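
For comparison, the fault-tolerant row in the table combines clusterNames with a Volcano job policy. The following is a minimal sketch of what such a manifest could look like, not the contents of cluster-affinity-training.yaml: the cluster names (cluster-a, cluster-b, cluster-c), the worker count, and the image are placeholders chosen for illustration, and it assumes templateSpec accepts the standard Volcano Job `policies` field.

```yaml
apiVersion: training.volcano.sh/v1alpha1
kind: HyperJob
metadata:
  name: resilient-training
spec:
  minAvailable: 2            # reviewers note the controller does not enforce this yet
  replicatedJobs:
  - name: trainer
    replicas: 3              # creates resilient-training-trainer-0 through -2
    clusterNames:            # hypothetical cluster names
    - cluster-a
    - cluster-b
    - cluster-c
    templateSpec:
      policies:
      - event: PodEvicted    # Volcano job lifecycle policy
        action: RestartJob
      tasks:
      - name: worker
        replicas: 4          # placeholder worker count
        template:
          spec:
            containers:
            - name: trainer
              image: training-image:v1
```
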
Original prompt

This section details the original issue you should resolve.

<issue_title>Add user guide and examples for HyperJob</issue_title>
<issue_description>Currently, there are no user guides or practical examples for HyperJob in the docs directory. While the repository contains the CRD definition (docs/deploy/training.volcano.sh_hyperjobs.yaml) and the controller design document (docs/proposals/hyperjob-controller-design.md), users need clear guidance on how to actually use this feature.

We should add:

  1. A comprehensive user guide for HyperJob in volcano-global.
  2. Example YAML manifests demonstrating different HyperJob configurations.

The documentation and examples can be based on the use cases described in the HyperJob API proposal from the main volcano repository.</issue_description>

Comments on the Issue (you are @copilot in this section)



Co-authored-by: JesseStutler <38534065+JesseStutler@users.noreply.github.com>
@volcano-sh-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jessestutler. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot AI changed the title from "[WIP] Add user guide and examples for HyperJob" to "Add HyperJob user guide and example manifests" on Mar 5, 2026
JesseStutler marked this pull request as ready for review on March 10, 2026 09:01
Copilot AI review requested due to automatic review settings March 10, 2026 09:01

Copilot AI left a comment

Pull request overview

Adds missing end-user documentation for the HyperJob CRD in volcano-global, including a user guide and example manifests to help users apply, monitor, and troubleshoot multi-cluster AI training jobs.

Changes:

  • Added a HyperJob user guide with prerequisites, API/status explanations, and usage/troubleshooting commands.
  • Added three annotated HyperJob example YAML manifests covering job splitting, heterogeneous clusters, and cluster-affinity scenarios.
  • Documented child resource naming conventions and linked to the controller design doc and related references.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.

| File | Description |
|------|-------------|
| docs/user-guide/hyperjob-user-guide.md | New end-user guide describing HyperJob usage, API fields, status semantics, and troubleshooting. |
| docs/deploy/example/hyperjob/llm-training.yaml | Example manifest for splitting a large GPU training job across clusters. |
| docs/deploy/example/hyperjob/heterogeneous-training.yaml | Example manifest showing accelerator/cluster targeting with clusterNames + node affinity. |
| docs/deploy/example/hyperjob/cluster-affinity-training.yaml | Example manifest for multi-cluster placement and restart policy behavior (wording needs alignment with current controller behavior). |


Comment on lines +3 to +10
# A team runs a fault-tolerant training job across three clusters with a minimum
# availability requirement. The job uses clusterNames to pin each replica to a
# specific cluster, while minAvailable allows the HyperJob to continue even if
# one replica fails.
#
# The controller creates three VCJobs (resilient-training-trainer-0, resilient-training-trainer-1,
# resilient-training-trainer-2), each scheduled to their respective preferred cluster.
# The HyperJob is considered healthy as long as at least 2 out of 3 jobs are running.

Copilot AI Mar 10, 2026

The header comment says clusterNames pins each replica to a specific cluster and that each VCJob will be scheduled to its “respective” preferred cluster. In practice the controller sets clusterAffinity.clusterNames for every generated PropagationPolicy, which only restricts eligible clusters; it does not guarantee a 1:1 mapping between replicas and clusters or that replicas land on distinct clusters. Please adjust the wording to avoid over-promising placement behavior.
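
One possible rewording of the header comment, offered as a sketch rather than the wording the PR adopted. It keeps the three child VCJob names from the original comment and drops the one-to-one placement claim:

```yaml
# A team runs a training job across three clusters. clusterNames restricts the
# set of clusters each child VCJob is eligible for; it does not map replicas to
# clusters one-to-one or guarantee that replicas land on distinct clusters.
#
# The controller creates three VCJobs (resilient-training-trainer-0,
# resilient-training-trainer-1, resilient-training-trainer-2).
```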

Comment on lines +79 to +85
pending: <int>
running: <int>
succeeded: <int>
failed: <int>
terminating: <int>
unknown: <int>
```

Copilot AI Mar 10, 2026

pending/running/succeeded/failed/... here are aggregated from each child VCJob’s status.{pending,running,...} fields (i.e., task/pod counts in Volcano JobStatus), not counts of child VCJobs. Consider clarifying this in the snippet/description to prevent users from interpreting running: 256 as “256 VCJobs”.
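
To make the distinction concrete, here is a hedged sketch of what the aggregated status might look like for the llm-training example from the PR description (2 child VCJobs, 128 workers each), assuming every worker pod is running. The values are illustrative, not captured from a real cluster:

```yaml
status:
  replicatedJobsStatus:
  - name: trainer
    pending: 0
    running: 256      # 2 child VCJobs x 128 running worker pods, not 256 VCJobs
    succeeded: 0
    failed: 0
    terminating: 0
    unknown: 0
```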

Comment on lines +41 to +60
  minAvailable: <int> # Minimum number of VCJobs that must be running (optional)
  replicatedJobs: # List of job templates
  - name: <string> # Unique name for this job group within the HyperJob
    replicas: <int> # Number of VCJobs to create from this template (default: 1)
    clusterNames: # Preferred cluster names for scheduling (optional)
    - <cluster-name>
    templateSpec: # Volcano Job spec template
      <VCJob spec fields>
```

### Field Descriptions

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `spec.minAvailable` | int32 | No | Minimum number of VCJobs that must be running. Reserved for future fault-tolerance use; currently not enforced by the controller. |
| `spec.replicatedJobs` | []ReplicatedJob | Yes | List of job templates. Each entry defines a group of identical VCJobs. |
| `replicatedJobs[].name` | string | Yes | Unique identifier for this replicated job within the HyperJob. |
| `replicatedJobs[].replicas` | int32 | No | Number of VCJobs to create from this template. Defaults to 1. |
| `replicatedJobs[].clusterNames` | []string | No | Preferred clusters for scheduling. The controller attempts to place replicas on these clusters first. |
| `replicatedJobs[].templateSpec` | JobSpec | Yes | The Volcano Job specification used as a template for creating child VCJobs. |

Copilot AI Mar 10, 2026

The API reference/field table claims to describe HyperJobSpec/ReplicatedJob, but it omits fields that exist in the shipped CRD (e.g., spec.maxDomains, spec.plugins, and replicatedJobs[].splitPolicy). Either document these fields or explicitly state the table is a partial subset so readers don’t assume they don’t exist.

Comment on lines +58 to +60
| `replicatedJobs[].replicas` | int32 | No | Number of VCJobs to create from this template. Defaults to 1. |
| `replicatedJobs[].clusterNames` | []string | No | Preferred clusters for scheduling. The controller attempts to place replicas on these clusters first. |
| `replicatedJobs[].templateSpec` | JobSpec | Yes | The Volcano Job specification used as a template for creating child VCJobs. |

Copilot AI Mar 10, 2026

clusterNames is described here as a preference list ("attempts to place replicas on these clusters first"), but the controller generates a Karmada clusterAffinity.clusterNames, which is a hard constraint/restriction to those clusters (not an ordered preference). Update the description to reflect the actual behavior.
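
For readers unfamiliar with Karmada, the sketch below shows roughly what a PropagationPolicy with clusterAffinity.clusterNames looks like. The field layout follows the Karmada policy API; the policy name and the cluster names are assumptions for illustration, and the policy the HyperJob controller actually generates may differ in naming and in additional fields:

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: llm-training-trainer-0        # assumed name; the controller's naming may differ
spec:
  resourceSelectors:
  - apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    name: llm-training-trainer-0      # the child VCJob being propagated
  placement:
    clusterAffinity:
      clusterNames:                   # restriction: only these clusters are eligible
      - cluster-a                     # hypothetical cluster names
      - cluster-b
```

The list carries no ordering or preference semantics; any listed cluster may be chosen, and no unlisted cluster will be.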

Comment on lines +70 to +75
  conditions: # Set only when ALL child VCJobs are in terminal states
  - type: Completed # All child VCJobs completed successfully
    status: "True"
  - type: Failed # All child VCJobs finished but at least one failed
    status: "True"
  replicatedJobsStatus: # Per-replicated-job status

Copilot AI Mar 10, 2026

This status example shows both Completed and Failed conditions set to True at the same time. The controller sets a single terminal condition (either Completed or Failed) once all child VCJobs are finished. Adjust the example to show only one terminal condition being present.
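
A sketch of how the snippet could present the two outcomes separately, assuming (as described above) that exactly one terminal condition is set. The two YAML documents are alternatives, not a single status:

```yaml
# Case 1: every child VCJob completed successfully
status:
  conditions:
  - type: Completed
    status: "True"
---
# Case 2: all child VCJobs finished and at least one failed
status:
  conditions:
  - type: Failed
    status: "True"
```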

Comment on lines +76 to +79
  - name: <replicatedjob-name>
    jobStates:
      <vcjob-name>: <phase> # Phase of each individual VCJob
    pending: <int>

Copilot AI Mar 10, 2026

The jobStates map values are not just a phase string. In the CRD/controller they are a JobState object (e.g., phase, reason, message, lastTransitionTime). Consider updating the snippet to reflect the actual structure so users know what to expect in status.replicatedJobsStatus[].jobStates.
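
A hedged sketch of the nested structure, using only the field names listed in this comment. The VCJob name comes from the llm-training example, `Running` is used as a representative phase, and the remaining values are left as placeholders because the exact CRD schema is not reproduced here:

```yaml
status:
  replicatedJobsStatus:
  - name: trainer
    jobStates:
      llm-training-trainer-0:           # key: child VCJob name
        phase: Running                  # representative phase value
        reason: <string>
        message: <string>
        lastTransitionTime: <timestamp>
```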

Comment on lines +232 to +234
Run a fault-tolerant training job across three clusters with `minAvailable: 2`,
so the HyperJob remains healthy even if one cluster becomes unavailable.


Copilot AI Mar 10, 2026

This case describes spec.minAvailable as enabling fault-tolerance/health semantics ("remains healthy even if one cluster becomes unavailable"), but HyperJob.spec.minAvailable is currently not enforced by the controller. Please reword this section to avoid implying it affects scheduling or job continuity today (or clearly mark it as future/placeholder behavior).

Comment on lines +9 to +10
# resilient-training-trainer-2), each scheduled to their respective preferred cluster.
# The HyperJob is considered healthy as long as at least 2 out of 3 jobs are running.

Copilot AI Mar 10, 2026

This example text implies spec.minAvailable: 2 makes the HyperJob “considered healthy” while 2/3 jobs are running, but HyperJob.spec.minAvailable is currently not enforced by the controller. Consider removing/softening this claim or adding a note that it’s reserved for future fault-tolerance behavior.

Comment on lines +41 to +44
  minAvailable: <int> # Minimum number of VCJobs that must be running (optional)
  replicatedJobs: # List of job templates
  - name: <string> # Unique name for this job group within the HyperJob
    replicas: <int> # Number of VCJobs to create from this template (default: 1)

Copilot AI Mar 10, 2026

The inline comment in the API YAML (minAvailable: ... # Minimum number of VCJobs that must be running) reads like an enforced guarantee, but later the guide notes this field is currently not enforced by the controller. Consider aligning the comment here with that reality (e.g., mark it as reserved/future behavior).
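
One way the inline comment could be aligned with the field table, shown as a sketch. The wording borrows from the guide's own field description; the enclosing `spec:` key is assumed from context, since the quoted lines do not include it:

```yaml
spec:
  minAvailable: <int>      # Reserved for future fault-tolerance use; currently not enforced (optional)
  replicatedJobs:          # List of job templates
  - name: <string>         # Unique name for this job group within the HyperJob
    replicas: <int>        # Number of VCJobs to create from this template (default: 1)
```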


Development

Successfully merging this pull request may close these issues.

Add user guide and examples for HyperJob

4 participants