KEP Constraints
- Recommendation is to use the driver name as the domain in the key to avoid conflicts. Not a strict requirement though.
- Values should not require parsing. It's better to use different keys with simple values than one key with a complex value.
- A single device can have at most 16 taints.
- Drivers may publish “informational” taints (often Effect=None) to report degradation without affecting scheduling; admins/controllers can then apply NoSchedule/NoExecute via DeviceTaintRule.
DeviceTaint API
```go
type DeviceTaint struct {
	Key       string            // must be a label name
	Value     string            // must be a label value (optional)
	Effect    DeviceTaintEffect // None | NoSchedule | NoExecute
	TimeAdded *metav1.Time      // timestamp for when the taint was added (optional)
}
```
- Toleration matching is exact on key (not prefix-based).
- `key: ""` with `operator: Exists` matches all keys.
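The exact-match semantics can be sketched as a small helper. This is illustrative only; `Toleration` here is a simplified stand-in, not the KEP's actual type or the scheduler's real matching code:

```go
package main

import "fmt"

// Toleration is a simplified stand-in for the KEP's toleration type,
// covering only the fields relevant to matching.
type Toleration struct {
	Key      string // "" acts as a wildcard when Operator is "Exists"
	Operator string // "Exists" or "Equal"
	Value    string
}

// Tolerates reports whether this toleration matches a taint with the
// given key and value. Keys compare exactly; there is no prefix matching.
func (t Toleration) Tolerates(taintKey, taintValue string) bool {
	// An empty key with Exists matches every taint.
	if t.Key == "" {
		return t.Operator == "Exists"
	}
	if t.Key != taintKey {
		return false
	}
	if t.Operator == "Exists" {
		return true
	}
	return t.Value == taintValue
}

func main() {
	wildcard := Toleration{Key: "", Operator: "Exists"}
	xidOnly := Toleration{Key: "gpu.nvidia.com/xid", Operator: "Equal", Value: "48"}

	fmt.Println(wildcard.Tolerates("gpu.nvidia.com/gpu-lost", "")) // true
	fmt.Println(xidOnly.Tolerates("gpu.nvidia.com/xid", "48"))     // true
	fmt.Println(xidOnly.Tolerates("gpu.nvidia.com/xid", "79"))     // false
}
```

Note in particular that a key like `gpu.nvidia.com/health` does not match `gpu.nvidia.com/health.xid` under these rules, which matters for Option C below.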
Current Health Events (NVML)
| Event | Current Behavior | Scope |
|---|---|---|
| XID critical error (non-skipped) | Mark device unhealthy | Single device |
| ERROR_GPU_IS_LOST | Mark all devices unhealthy | All devices |
| UUID lookup failure | Mark all devices unhealthy | All devices |
| Device handle / registration failure at startup | Mark parent + MIG children unhealthy | Per-GPU |
Option A: One key per health dimension (KEP-aligned)
Each distinct failure gets its own key under the gpu.nvidia.com domain. Values are simple descriptors, not encoded data.
```
gpu.nvidia.com/xid: "48"           (Effect: None)
gpu.nvidia.com/gpu-lost: ""        (Effect: None)
gpu.nvidia.com/unmonitored: ""     (Effect: None)
```

Future sources add new keys:

```
gpu.nvidia.com/memory: "retired"    (Effect: None)
gpu.nvidia.com/nvlink: "degraded"   (Effect: None)
gpu.nvidia.com/thermal: "throttled" (Effect: None)
```

Toleration Examples:
```yaml
# Tolerate all driver-set health taints (wildcard)
- key: ""
  operator: Exists

# Tolerate XID errors specifically
- key: "gpu.nvidia.com/xid"
  operator: Exists

# Tolerate only a specific XID
- key: "gpu.nvidia.com/xid"
  operator: Equal
  value: "48"
```

| Pros | Cons |
|---|---|
| Directly follows KEP guidance: "different keys with simple values" | "Tolerate all health taints" requires the `key: ""` wildcard, which also matches admin-set DeviceTaintRule taints |
| Each key is self-documenting | More keys to document |
| Values are simple, no parsing needed | |
| Each health dimension evolves independently | |
| Follows the recommended `<domain>/<descriptive-name>` pattern | |
Option B: Single key, categorized value
One key for all driver-set health taints. The value encodes the failure category.
```
gpu.nvidia.com/health: "xid"          (Effect: None)
gpu.nvidia.com/health: "gpu-lost"     (Effect: None)
gpu.nvidia.com/health: "unmonitored"  (Effect: None)
```

Toleration Examples:

```yaml
# Tolerate all health taints (one key)
- key: "gpu.nvidia.com/health"
  operator: Exists

# Tolerate XID errors only
- key: "gpu.nvidia.com/health"
  operator: Equal
  value: "xid"
```

| Pros | Cons |
|---|---|
| `Exists` on one key tolerates all health issues without matching admin taints | Goes against KEP guidance ("don't use one key with a complex value") |
| Simplest initial implementation | Value vocabulary grows as sources are added; the boundary between "xid" and "memory" can blur |
| Uses only 1 taint slot per category | If a device has both "xid" and "gpu-lost", it needs two taints with the same key (valid per the KEP, but unusual) |
| | Embedding the XID number in the value (e.g., `xid-48`) requires parsing, which violates KEP guidance |
Option C: Hybrid — health prefix + dimension suffix
Use gpu.nvidia.com/health as a conceptual namespace, but split into separate keys per dimension.
```
gpu.nvidia.com/health.xid: "48"          (Effect: None)
gpu.nvidia.com/health.gpu-lost: ""       (Effect: None)
gpu.nvidia.com/health.unmonitored: ""    (Effect: None)
```

| Pros | Cons |
|---|---|
| Visually groups all health-related taints | Key matching is exact: `key: "gpu.nvidia.com/health"` does NOT match `gpu.nvidia.com/health.xid` |
| | The grouping is cosmetic, not functional; no toleration can target the "health" prefix |
| | Essentially Option A with longer key names and a false sense of hierarchy |
Recommendation: Option A
Option A is the best fit for three reasons:
- It's what the KEP recommends. The guidance is explicit: "different keys with simple values." Following the upstream convention means we don't have to justify deviations and we align with how the broader ecosystem will evolve.
- Most future-proof. It is easily extensible to new health sources by adding new keys, and existing tolerations remain valid.
- The "tolerate all" concern is also solvable with this option. While in Option B `key: "gpu.nvidia.com/health"` with `Exists` gives a convenient wildcard, the KEP also provides `key: ""` with `operator: Exists`, which matches all taints. If users want to tolerate only GPU health taints, they can explicitly list the relevant keys. That is actually safer, because it makes them intentional about which conditions they are tolerating.
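Explicitly listing the keys could look like this in a claim. This is a sketch only: the exact placement and name of the tolerations field follow the KEP-5055 alpha API and may differ in released versions, and `gpu-claim` is a hypothetical name:

```yaml
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaim
metadata:
  name: gpu-claim
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.nvidia.com
      tolerations:
      - key: gpu.nvidia.com/xid
        operator: Exists
      - key: gpu.nvidia.com/gpu-lost
        operator: Exists
      - key: gpu.nvidia.com/unmonitored
        operator: Exists
```

The point is that the claim tolerates exactly the driver's health keys and nothing else, so an admin-set DeviceTaintRule taint still takes effect.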
Effect Strategy
Per the KEP's "Degraded Devices" user story, the recommended pattern is:
- Driver sets `Effect: None`: informational, with no scheduling or eviction impact.
- Admin/controller escalates to `NoSchedule` or `NoExecute` via `DeviceTaintRule`.
This separates health reporting (driver's job) from scheduling policy (admin's job). However, for gpu-lost events where the GPU is physically gone, waiting for admin action is impractical.
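An admin escalation via `DeviceTaintRule` might look like this. This is a sketch: field names follow the KEP-5055 alpha API and may differ in released versions, and the rule name and device name are hypothetical:

```yaml
apiVersion: resource.k8s.io/v1alpha3
kind: DeviceTaintRule
metadata:
  name: escalate-xid-48   # hypothetical rule name
spec:
  deviceSelector:
    driver: gpu.nvidia.com
    device: gpu-0          # hypothetical device name
  taint:
    key: gpu.nvidia.com/xid
    value: "48"
    effect: NoSchedule
```

The driver's informational taint stays in place; the rule adds the scheduling impact, and deleting the rule removes it again.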
Proposed default effects:
| Key | Default Effect | Rationale |
|---|---|---|
| `gpu.nvidia.com/xid` | `None` | Informational. Many XIDs are recoverable. Admin escalates if needed. |
| `gpu.nvidia.com/gpu-lost` | `NoExecute` | Immediate eviction. The GPU is physically gone. Running workloads will hang/crash, so we must evict them immediately rather than just blocking new schedules. |
| `gpu.nvidia.com/unmonitored` | `None` | GPU may be healthy; we just can't watch it. Informational only. |
Open question: Should we use `NoExecute` for `gpu-lost` to ensure immediate eviction of hanging pods, or choose `NoSchedule` and force admins to escalate via `DeviceTaintRule`?
Taint Accumulation
With Option A, accumulation is natural — each key represents a different condition and they coexist independently. Within the same key, we apply a severity logic: None < NoSchedule < NoExecute.
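The severity ordering within a key can be sketched as a small helper. This is illustrative, not driver code; only `DeviceTaintEffect` and its values come from the KEP's API:

```go
package main

import "fmt"

// DeviceTaintEffect mirrors the KEP's effect values.
type DeviceTaintEffect string

const (
	EffectNone       DeviceTaintEffect = "None"
	EffectNoSchedule DeviceTaintEffect = "NoSchedule"
	EffectNoExecute  DeviceTaintEffect = "NoExecute"
)

// severity ranks effects: None < NoSchedule < NoExecute.
func severity(e DeviceTaintEffect) int {
	switch e {
	case EffectNoExecute:
		return 2
	case EffectNoSchedule:
		return 1
	default:
		return 0
	}
}

// maxEffect returns the more severe of two effects, so updating an
// existing taint never downgrades its scheduling impact.
func maxEffect(a, b DeviceTaintEffect) DeviceTaintEffect {
	if severity(b) > severity(a) {
		return b
	}
	return a
}

func main() {
	fmt.Println(maxEffect(EffectNone, EffectNoSchedule)) // NoSchedule
	fmt.Println(maxEffect(EffectNoExecute, EffectNone))  // NoExecute
}
```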
XID Value Encoding
With Option A, the value on gpu.nvidia.com/xid can be the XID number as a string (e.g., "48"). This is simple, requires no parsing for equality matching, and lets users write tolerations for specific XIDs:
```yaml
- key: "gpu.nvidia.com/xid"
  operator: Equal
  value: "48"
```

If a device experiences multiple different XIDs, each subsequent XID updates the value (keeping the highest-severity effect). Only the most recent XID is visible in the value. If we need to preserve the full XID history, that belongs in events, not in the taint.