DeviceHealthCheck: Design Proposal for Taint Key/Value Schema #905

@guptaNswati

KEP Constraints

  • The recommendation is to use the driver name as the domain in the key to avoid conflicts, though this is not a strict requirement.
  • Values should not require parsing. It's better to use different keys with simple values than one key with a complex value.
  • A single device can have at most 16 taints.
  • Drivers may publish “informational” taints (often Effect=None) to report degradation without affecting scheduling; admins/controllers can then apply NoSchedule/NoExecute via DeviceTaintRule.

DeviceTaint API

type DeviceTaint struct {
    Key       string            // must be a label name
    Value     string            // must be a label value (optional)
    Effect    DeviceTaintEffect // None | NoSchedule | NoExecute
    TimeAdded *metav1.Time      // timestamp for when taint was added (optional)
}
  • Toleration matching is exact on key (not prefix-based).
  • key: "" with operator: Exists matches all keys.
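
The two matching rules above can be sketched in Go as follows. This is a minimal illustration, not the scheduler's actual implementation: the Toleration type here is a hypothetical, simplified stand-in (the real API type carries more fields), effect matching is ignored, and TimeAdded is omitted from DeviceTaint for brevity.

```go
package main

// DeviceTaintEffect and DeviceTaint mirror the struct quoted above
// (TimeAdded is left out to keep the sketch short).
type DeviceTaintEffect string

const (
	EffectNone       DeviceTaintEffect = "None"
	EffectNoSchedule DeviceTaintEffect = "NoSchedule"
	EffectNoExecute  DeviceTaintEffect = "NoExecute"
)

type DeviceTaint struct {
	Key    string
	Value  string
	Effect DeviceTaintEffect
}

// Toleration is a hypothetical, simplified stand-in for the real
// toleration type; only the fields needed for key/value matching exist.
type Toleration struct {
	Key      string
	Operator string // "Exists" or "Equal"
	Value    string
}

// Tolerates reports whether tol matches taint on key and value.
// Keys are compared exactly (no prefix matching), and an empty key
// combined with Exists matches every taint.
func Tolerates(tol Toleration, taint DeviceTaint) bool {
	switch tol.Operator {
	case "Exists":
		return tol.Key == "" || tol.Key == taint.Key
	case "Equal":
		return tol.Key == taint.Key && tol.Value == taint.Value
	default:
		// Unknown operators never match in this sketch.
		return false
	}
}
```

Under these rules, a toleration whose key is only a prefix of the taint key does not match, which is the property the options below have to work around.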

Current Health Events (NVML)

| Event | Current Behavior | Scope |
|---|---|---|
| XID critical error (non-skipped) | Mark device unhealthy | Single device |
| ERROR_GPU_IS_LOST | Mark all devices unhealthy | All devices |
| UUID lookup failure | Mark all devices unhealthy | All devices |
| Device handle / registration failure at startup | Mark parent + MIG children unhealthy | Per-GPU |

Option A: One key per health dimension (KEP-aligned)

Each distinct failure gets its own key under the gpu.nvidia.com domain. Values are simple descriptors, not encoded data.

gpu.nvidia.com/xid:        "48"  (Effect: None)
gpu.nvidia.com/gpu-lost:   ""    (Effect: None)
gpu.nvidia.com/unmonitored: ""   (Effect: None)

Future sources add new keys:

gpu.nvidia.com/memory:     "retired"   (Effect: None)
gpu.nvidia.com/nvlink:     "degraded"  (Effect: None)
gpu.nvidia.com/thermal:    "throttled" (Effect: None)

Toleration Examples:

# Tolerate all driver-set health taints (wildcard)
- key: ""
  operator: Exists

# Tolerate XID errors specifically
- key: "gpu.nvidia.com/xid"
  operator: Exists

# Tolerate only a specific XID
- key: "gpu.nvidia.com/xid"
  operator: Equal
  value: "48"

| Pros | Cons |
|---|---|
| Directly follows KEP guidance: "different keys with simple values" | "Tolerate all health taints" requires key: "" wildcard, which also matches admin-set DeviceTaintRule taints |
| Each key is self-documenting | More keys to document |
| Values are simple, no parsing needed | |
| Each health dimension evolves independently | |
| Follows recommended <domain>/<descriptive-name> pattern | |

Option B: Single key, categorized value

One key for all driver-set health taints. The value encodes the failure category.

gpu.nvidia.com/health: "xid"         (Effect: None)
gpu.nvidia.com/health: "gpu-lost"    (Effect: None)
gpu.nvidia.com/health: "unmonitored" (Effect: None)

Toleration Examples:

# Tolerate all health taints (one key)
- key: "gpu.nvidia.com/health"
  operator: Exists

# Tolerate XID errors only
- key: "gpu.nvidia.com/health"
  operator: Equal
  value: "xid"

| Pros | Cons |
|---|---|
| Exists on one key tolerates all health issues without matching admin taints | Goes against KEP guidance ("don't use one key with complex value") |
| Simplest initial implementation | Value vocabulary grows as sources are added; boundary between "xid" and "memory" can blur |
| Uses only 1 taint slot per category | If a device has both "xid" and "gpu-lost", it needs two taints with the same key (valid per KEP, but unusual) |
| | Embedding XID number in value (e.g., xid-48) requires parsing, which violates KEP guidance |

Option C: Hybrid — health prefix + dimension suffix

Use gpu.nvidia.com/health as a conceptual namespace, but split into separate keys per dimension.

gpu.nvidia.com/health.xid:         "48" (Effect: None)
gpu.nvidia.com/health.gpu-lost:    ""   (Effect: None)
gpu.nvidia.com/health.unmonitored: ""   (Effect: None)

| Pros | Cons |
|---|---|
| Visually groups all health-related taints | Key matching is exact: key: "gpu.nvidia.com/health" does NOT match gpu.nvidia.com/health.xid |
| | The grouping is cosmetic, not functional. No toleration can target the "health" prefix |
| | Essentially Option A with longer key names and a false sense of hierarchy |

Recommendation: Option A

Option A is the best fit for three reasons:

  1. It's what the KEP recommends. The guidance is explicit: "different keys with simple values." Following the upstream convention means we don't have to justify deviations and we align with how the broader ecosystem will evolve.
  2. It is the most future-proof. New health sources are added as new keys, so existing tolerations remain valid.
  3. The “tolerate all” concern is solvable with this option too. In Option B, key: "gpu.nvidia.com/health" with Exists gives a convenient wildcard, but the KEP also provides key: "" + operator: Exists, which matches all taints. If users want to tolerate only GPU health taints, they can explicitly list the relevant keys. That’s actually safer, because it makes them intentional about what conditions they’re tolerating.

Effect Strategy

Per the KEP's "Degraded Devices" user story, the recommended pattern is:

  • Driver sets Effect: None — informational, no scheduling or eviction impact.
  • Admin/controller escalates via DeviceTaintRule to NoSchedule or NoExecute.

This separates health reporting (driver's job) from scheduling policy (admin's job). However, for gpu-lost events where the GPU is physically gone, waiting for admin action is impractical.

Proposed default effects:

| Key | Default Effect | Rationale |
|---|---|---|
| gpu.nvidia.com/xid | None | Informational. Many XIDs are recoverable. Admin escalates if needed. |
| gpu.nvidia.com/gpu-lost | NoExecute | Immediate eviction. The GPU is physically gone. Running workloads will hang/crash, so we must evict them immediately rather than just blocking new schedules. |
| gpu.nvidia.com/unmonitored | None | GPU may be healthy; we just can't watch it. Informational only. |
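
As a rough illustration, the proposed defaults could live in a single lookup table in the driver. The sketch below assumes the Option A keys; the names defaultEffects and NewHealthTaint are hypothetical, and the NoExecute entry for gpu-lost reflects the proposal above, which is still an open question.

```go
package main

type DeviceTaintEffect string

const (
	EffectNone       DeviceTaintEffect = "None"
	EffectNoSchedule DeviceTaintEffect = "NoSchedule"
	EffectNoExecute  DeviceTaintEffect = "NoExecute"
)

// DeviceTaint mirrors the API struct quoted earlier, minus TimeAdded.
type DeviceTaint struct {
	Key    string
	Value  string
	Effect DeviceTaintEffect
}

// Option A taint keys proposed in this issue.
const (
	KeyXID         = "gpu.nvidia.com/xid"
	KeyGPULost     = "gpu.nvidia.com/gpu-lost"
	KeyUnmonitored = "gpu.nvidia.com/unmonitored"
)

// defaultEffects encodes the proposed default effect per key.
// The gpu-lost entry is the proposal under discussion, not a settled choice.
var defaultEffects = map[string]DeviceTaintEffect{
	KeyXID:         EffectNone,
	KeyGPULost:     EffectNoExecute,
	KeyUnmonitored: EffectNone,
}

// NewHealthTaint (hypothetical helper) builds a taint for key using the
// proposed default effect. It returns false for keys not in the table,
// so an unknown key is never published by accident.
func NewHealthTaint(key, value string) (DeviceTaint, bool) {
	effect, ok := defaultEffects[key]
	if !ok {
		return DeviceTaint{}, false
	}
	return DeviceTaint{Key: key, Value: value, Effect: effect}, true
}
```

Keeping the defaults in one table means that changing the answer to the gpu-lost question is a one-line edit.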

Open question: Should we do NoExecute for gpu-lost to ensure immediate eviction of hanging pods, or choose NoSchedule and force admins to escalate via DeviceTaintRule?


Taint Accumulation

With Option A, accumulation is natural: each key represents a different condition, and the taints coexist independently. When the same key is reported more than once, we keep the most severe effect, using the ordering None < NoSchedule < NoExecute.
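
The severity ordering could be expressed as a small comparison helper. This is a sketch only: the function names are hypothetical, and treating an unrecognized effect as the lowest severity is an assumption, not something the KEP specifies.

```go
package main

type DeviceTaintEffect string

const (
	EffectNone       DeviceTaintEffect = "None"
	EffectNoSchedule DeviceTaintEffect = "NoSchedule"
	EffectNoExecute  DeviceTaintEffect = "NoExecute"
)

// severity ranks effects as None < NoSchedule < NoExecute.
// Unrecognized effects rank with None (an assumption of this sketch).
func severity(e DeviceTaintEffect) int {
	switch e {
	case EffectNoExecute:
		return 2
	case EffectNoSchedule:
		return 1
	default:
		return 0
	}
}

// MaxEffect returns the more severe of two effects; on a tie it keeps a.
func MaxEffect(a, b DeviceTaintEffect) DeviceTaintEffect {
	if severity(b) > severity(a) {
		return b
	}
	return a
}
```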

XID Value Encoding

With Option A, the value on gpu.nvidia.com/xid can be the XID number as a string (e.g., "48"). This is simple, requires no parsing for equality matching, and lets users write tolerations for specific XIDs:

- key: "gpu.nvidia.com/xid"
  operator: Equal
  value: "48"

If a device experiences multiple different XIDs, each subsequent XID overwrites the value, and the taint keeps the highest-severity effect seen so far. Only the most recent XID is therefore visible in the value. If we need to preserve the full XID history, that belongs in events, not in the taint.
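
One way to sketch this update rule, together with the 16-taint cap from the KEP constraints, is an upsert over a device's taint list. The helper name UpsertTaint and the error value are hypothetical, and rejecting a new key once the cap is reached is an assumption about how the driver might behave.

```go
package main

import "errors"

type DeviceTaintEffect string

const (
	EffectNone       DeviceTaintEffect = "None"
	EffectNoSchedule DeviceTaintEffect = "NoSchedule"
	EffectNoExecute  DeviceTaintEffect = "NoExecute"
)

// DeviceTaint mirrors the API struct quoted earlier, minus TimeAdded.
type DeviceTaint struct {
	Key    string
	Value  string
	Effect DeviceTaintEffect
}

// MaxTaintsPerDevice is the per-device limit stated in the KEP constraints.
const MaxTaintsPerDevice = 16

// ErrTooManyTaints is a hypothetical error for a full taint list.
var ErrTooManyTaints = errors.New("device already has the maximum number of taints")

// rank orders effects as None < NoSchedule < NoExecute.
var rank = map[DeviceTaintEffect]int{
	EffectNone:       0,
	EffectNoSchedule: 1,
	EffectNoExecute:  2,
}

// UpsertTaint adds t, or updates the existing taint with the same key.
// On update the value is overwritten (most recent wins) and the effect
// is only ever raised, never lowered. The input slice is modified in place.
func UpsertTaint(taints []DeviceTaint, t DeviceTaint) ([]DeviceTaint, error) {
	for i := range taints {
		if taints[i].Key == t.Key {
			taints[i].Value = t.Value
			if rank[t.Effect] > rank[taints[i].Effect] {
				taints[i].Effect = t.Effect
			}
			return taints, nil
		}
	}
	if len(taints) >= MaxTaintsPerDevice {
		return taints, ErrTooManyTaints
	}
	return append(taints, t), nil
}
```

For example, upserting XID "48" with NoSchedule and then XID "79" with None leaves a single gpu.nvidia.com/xid taint whose value is "79" and whose effect is still NoSchedule.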
