@rawkode rawkode commented Jan 16, 2026

Using CUE, we now provide a schema for manifest validation and generation based on GPU type.

This can be utilised via the `nccl.nu` script, which supports running `sync` against a kubeconfig to create or verify GPU information from a remote cluster.

You can also use `generate` with flags to get a dynamic Kubernetes manifest for running the NCCL tests.

Copilot AI review requested due to automatic review settings January 16, 2026 15:58

Copilot AI left a comment


Pull request overview

This PR introduces CUE-based schema validation and manifest generation for NCCL tests. It adds a Nushell CLI script (nccl.nu) that can sync GPU configurations from a Kubernetes cluster and generate dynamic Kubernetes manifests for running NCCL tests based on GPU type and scale parameters.

Changes:

  • Added CUE schema for GPU configuration validation
  • Created Nushell CLI script for cluster synchronization and manifest generation
  • Added GPU configuration files for multiple GPU types (A100, H100, A40, GB200, RTX P6000)

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| `schema/gpu.cue` | Defines the GPU configuration schema with validation constraints |
| `utils/new_gpu.cue` | Provides template for generating new GPU configurations from cluster data |
| `gpus/*.cue` | Configuration files for specific GPU types with hardware specifications |
| `generate.cue` | Main manifest generation logic with conditional feature flags |
| `nccl.nu` | Nushell CLI script for sync and generate commands |
| `cue.mod/module.cue` | CUE module definition with Kubernetes schema dependencies |
| `README.md` | Documentation for the new manifest generation feature |



rawkode commented Jan 16, 2026

Example of a sync:

❯ ./nccl.nu sync
Fetching GPU nodes from cluster...
Reading GPU configs from CUE...

╭───┬──────────────────┬─────────┬─────────────────╮
│ # │     GPU Type     │ Cluster │     Config      │
├───┼──────────────────┼─────────┼─────────────────┤
│ 0 │ A40              │ ✗       │ ✓ (a40)         │
│ 1 │ gb200-4x         │ ✗       │ ✓ (gb200)       │
│ 2 │ gd-8xa100-i128   │ ✗       │ ✓ (a100)        │
│ 3 │ gd-8xh100ib-i128 │ ✗       │ ✓ (h100)        │
│ 4 │ rtxp6000-8x      │ ✗       │ ✓ (rtxp6000_8x) │
╰───┴──────────────────┴─────────┴─────────────────╯


All cluster GPU types have configs ✓

nccl-tests feat/cue-nu*
❯ ./nccl.nu sync --kubeconfig=$HOME/Downloads/kubeconfig-bd
Fetching GPU nodes from cluster...
Reading GPU configs from CUE...

╭───┬──────────────────┬─────────┬─────────────────╮
│ # │     GPU Type     │ Cluster │     Config      │
├───┼──────────────────┼─────────┼─────────────────┤
│ 0 │ A40              │ ✗       │ ✓ (a40)         │
│ 1 │ gb200-4x         │ ✗       │ ✓ (gb200)       │
│ 2 │ gd-8xa100-i128   │ ✗       │ ✓ (a100)        │
│ 3 │ gd-8xh100ib-i128 │ ✗       │ ✓ (h100)        │
│ 4 │ rtxp6000-8x      │ ✓       │ ✓ (rtxp6000_8x) │
╰───┴──────────────────┴─────────┴─────────────────╯


All cluster GPU types have configs ✓

nccl-tests feat/cue-nu*
❯ rm gpus/rtxp6000_8x.cue

nccl-tests feat/cue-nu*
❯ ./nccl.nu sync --kubeconfig=$HOME/Downloads/kubeconfig-bd
Fetching GPU nodes from cluster...
Reading GPU configs from CUE...

╭───┬──────────────────┬─────────┬───────────╮
│ # │     GPU Type     │ Cluster │  Config   │
├───┼──────────────────┼─────────┼───────────┤
│ 0 │ A40              │ ✗       │ ✓ (a40)   │
│ 1 │ gb200-4x         │ ✗       │ ✓ (gb200) │
│ 2 │ gd-8xa100-i128   │ ✗       │ ✓ (a100)  │
│ 3 │ gd-8xh100ib-i128 │ ✗       │ ✓ (h100)  │
│ 4 │ rtxp6000-8x      │ ✓       │ ✗         │
╰───┴──────────────────┴─────────┴───────────╯


1 GPU type(s) in cluster without config:

Create config for rtxp6000-8x? [Y/n] y
Created gpus/rtxp6000_8x.cue
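
For readers unfamiliar with the tooling, the comparison that `sync` reports above can be sketched roughly as follows. This is a minimal Python illustration, not the actual `nccl.nu` implementation: the names `SyncRow`, `compare`, `missing_configs` and `config_filename` are hypothetical, and the file naming rule (lowercase, dashes replaced by underscores) is an assumption inferred only from `rtxp6000-8x` becoming `gpus/rtxp6000_8x.cue` in the output.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SyncRow:
    """One row of the sync table (hypothetical structure)."""
    gpu_type: str
    in_cluster: bool
    config_name: Optional[str]  # None when no config covers this GPU type


def config_filename(gpu_type: str) -> str:
    # Assumed naming rule: lowercase the GPU type and replace dashes with underscores.
    return f"gpus/{gpu_type.lower().replace('-', '_')}.cue"


def compare(cluster_types: set[str], configs: dict[str, str]) -> list[SyncRow]:
    # `configs` maps a GPU type label to the name of the config that covers it.
    all_types = sorted(cluster_types | set(configs), key=str.lower)
    return [SyncRow(t, t in cluster_types, configs.get(t)) for t in all_types]


def missing_configs(rows: list[SyncRow]) -> list[str]:
    # GPU types present in the cluster that have no config yet.
    return [r.gpu_type for r in rows if r.in_cluster and r.config_name is None]


if __name__ == "__main__":
    configs = {
        "A40": "a40",
        "gb200-4x": "gb200",
        "gd-8xa100-i128": "a100",
        "gd-8xh100ib-i128": "h100",
    }
    rows = compare({"rtxp6000-8x"}, configs)
    for gpu_type in missing_configs(rows):
        print(f"Would create {config_filename(gpu_type)}")
```

The sketch only flags GPU types that exist in the cluster but lack a config, which matches the prompt shown in the last run; configs without a matching cluster node are listed but not treated as errors.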

@tammersaleh

I have basically zero experience in this codebase, but is this the first time we're introducing both CUE and nushell?
