DeviceHealthCheck: Design Proposal for Taint Key/Value Schema

### KEP Constraints

- Keys should use the driver name as the domain to avoid conflicts. This is a recommendation, not a strict requirement.
- Values should not require parsing. It's better to use different keys with simple values than one key with a complex value.
- A single device can have at most 16 taints.
- Drivers may publish “informational” taints (often Effect=None) to report degradation without affecting scheduling; admins/controllers can then apply NoSchedule/NoExecute via DeviceTaintRule.

**DeviceTaint API**
```go
type DeviceTaint struct {
    Key       string            // must be a label name
    Value     string            // must be a label value (optional)
    Effect    DeviceTaintEffect // None | NoSchedule | NoExecute
    TimeAdded *metav1.Time      // timestamp for when taint was added (optional)
}
```

* **Toleration matching is exact on key** (not prefix-based). 
* `key: ""` with `operator: Exists` matches **all** keys.
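
The two matching rules above can be sketched in Go as follows. This is a minimal illustration, not the scheduler's implementation: `Taint`, `Toleration`, and `tolerates` are simplified, hypothetical stand-ins for the real API types, and effect matching is deliberately left out.

```go
package main

import "fmt"

// Taint and Toleration are simplified stand-ins for the DRA API types.
// Only the fields needed for key/value matching are modeled (assumption).
type Taint struct {
	Key   string
	Value string
}

type Toleration struct {
	Key      string
	Operator string // "Exists" or "Equal"
	Value    string
}

// tolerates reports whether tol matches taint: keys are compared exactly,
// and an empty key combined with Exists matches every taint.
func tolerates(tol Toleration, taint Taint) bool {
	switch tol.Operator {
	case "Exists":
		return tol.Key == "" || tol.Key == taint.Key
	case "Equal":
		return tol.Key == taint.Key && tol.Value == taint.Value
	default:
		return false
	}
}

func main() {
	xid := Taint{Key: "gpu.nvidia.com/xid", Value: "48"}
	// Wildcard: empty key with Exists.
	fmt.Println(tolerates(Toleration{Key: "", Operator: "Exists"}, xid))
	// Exact key match.
	fmt.Println(tolerates(Toleration{Key: "gpu.nvidia.com/xid", Operator: "Exists"}, xid))
	// A bare domain is only a prefix of the key, so it does not match.
	fmt.Println(tolerates(Toleration{Key: "gpu.nvidia.com", Operator: "Exists"}, xid))
}
```

The last call is the case that matters for the schema choice: a shared prefix between keys carries no meaning for matching.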

### Current Health Events (NVML)

| Event | Current Behavior | Scope |
| :--- | :--- | :--- |
| **XID critical error (non-skipped)** | Mark device unhealthy | Single device |
| **ERROR_GPU_IS_LOST** | Mark all devices unhealthy | All devices |
| **UUID lookup failure** | Mark all devices unhealthy | All devices |
| **Device handle / registration failure at startup** | Mark parent + MIG children unhealthy | Per-GPU |

---

### Option A: One key per health dimension (KEP-aligned)

Each distinct failure gets its own key under the `gpu.nvidia.com` domain. Values are simple descriptors, not encoded data.

```yaml
gpu.nvidia.com/xid:         "48"  # Effect: None
gpu.nvidia.com/gpu-lost:    ""    # Effect: None
gpu.nvidia.com/unmonitored: ""    # Effect: None
```

Future sources add new keys:
```yaml
gpu.nvidia.com/memory:     "retired"    # Effect: None
gpu.nvidia.com/nvlink:     "degraded"   # Effect: None
gpu.nvidia.com/thermal:    "throttled"  # Effect: None
```
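
As a rough sketch of how a driver might build the three current Option A taints, the Go snippet below uses one small constructor per key. The `DeviceTaint` struct here is a trimmed, hypothetical copy of the API type, the constructor names are illustrative, and every taint uses `Effect: None` as in the listing above.

```go
package main

import (
	"fmt"
	"strconv"
)

// DeviceTaint is a trimmed stand-in for the API type (assumption):
// Effect is a plain string and TimeAdded is omitted.
type DeviceTaint struct {
	Key    string
	Value  string
	Effect string
}

// Keys proposed under Option A.
const (
	keyXID         = "gpu.nvidia.com/xid"
	keyGPULost     = "gpu.nvidia.com/gpu-lost"
	keyUnmonitored = "gpu.nvidia.com/unmonitored"
)

// xidTaint stores the XID number as a plain decimal string, so a
// toleration can compare it with Equal without any parsing.
func xidTaint(xid uint64) DeviceTaint {
	return DeviceTaint{Key: keyXID, Value: strconv.FormatUint(xid, 10), Effect: "None"}
}

// gpuLostTaint and unmonitoredTaint carry no value; the key says it all.
func gpuLostTaint() DeviceTaint {
	return DeviceTaint{Key: keyGPULost, Effect: "None"}
}

func unmonitoredTaint() DeviceTaint {
	return DeviceTaint{Key: keyUnmonitored, Effect: "None"}
}

func main() {
	for _, t := range []DeviceTaint{xidTaint(48), gpuLostTaint(), unmonitoredTaint()} {
		fmt.Printf("%s=%q (Effect: %s)\n", t.Key, t.Value, t.Effect)
	}
}
```

A future health source would add one more constant and constructor without touching the existing ones.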

**Toleration Examples:**
```yaml
# Tolerate all driver-set health taints (wildcard)
- key: ""
  operator: Exists

# Tolerate XID errors specifically
- key: "gpu.nvidia.com/xid"
  operator: Exists

# Tolerate only a specific XID
- key: "gpu.nvidia.com/xid"
  operator: Equal
  value: "48"
```

| Pros | Cons |
| :--- | :--- |
| Directly follows KEP guidance: "different keys with simple values" | "Tolerate all health taints" requires `key: ""` wildcard, which also matches admin-set `DeviceTaintRule` taints |
| Each key is self-documenting | More keys to document |
| Values are simple, no parsing needed | |
| Each health dimension evolves independently | |
| Follows recommended `<domain>/<descriptive-name>` pattern | |

---

### Option B: Single key, categorized value

One key for all driver-set health taints. The value encodes the failure category.

```yaml
gpu.nvidia.com/health: "xid"          # Effect: None
gpu.nvidia.com/health: "gpu-lost"     # Effect: None
gpu.nvidia.com/health: "unmonitored"  # Effect: None
```

**Toleration Examples:**
```yaml
# Tolerate all health taints (one key)
- key: "gpu.nvidia.com/health"
  operator: Exists

# Tolerate XID errors only
- key: "gpu.nvidia.com/health"
  operator: Equal
  value: "xid"
```

| Pros | Cons |
| :--- | :--- |
| `Exists` on one key tolerates all health issues without matching admin taints | Goes against KEP guidance ("don't use one key with complex value") |
| Simplest initial implementation | Value vocabulary grows as sources are added; boundary between "xid" and "memory" can blur |
| Uses only 1 taint slot per category | If a device has both "xid" and "gpu-lost", it needs two taints with the same key (valid per KEP, but unusual) |
| | Embedding XID number in value (e.g., `xid-48`) requires parsing — violates KEP guidance |

---

### Option C: Hybrid — health prefix + dimension suffix

Use `gpu.nvidia.com/health` as a conceptual namespace, but split into separate keys per dimension.

```yaml
gpu.nvidia.com/health.xid:         "48"  # Effect: None
gpu.nvidia.com/health.gpu-lost:    ""    # Effect: None
gpu.nvidia.com/health.unmonitored: ""    # Effect: None
```

| Pros | Cons |
| :--- | :--- |
| Visually groups all health-related taints | Key matching is exact — `key: "gpu.nvidia.com/health"` does NOT match `gpu.nvidia.com/health.xid` |
| | The grouping is cosmetic, not functional. No toleration can target the "health" prefix |
| | Essentially Option A with longer key names and a false sense of hierarchy |

---

### Recommendation: Option A

Option A is the best fit for three reasons:

1. **It's what the KEP recommends.** The guidance is explicit: "different keys with simple values." Following the upstream convention means we don't have to justify deviations and we align with how the broader ecosystem will evolve.
2. **It is the most future-proof.** New health sources are supported by adding new keys; existing tolerations remain valid.
3. **The "tolerate all" concern is solvable.** Option B's `key: "gpu.nvidia.com/health"` with `Exists` is a convenient wildcard, but the KEP already provides `key: ""` + `operator: Exists`, which matches all taints. Users who want to tolerate only GPU health taints can list the relevant keys explicitly. That is actually safer, because it makes them intentional about which conditions they tolerate.

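
Listing the relevant keys explicitly is cheap to automate. The sketch below, with a hypothetical `tolerateKeys` helper and a simplified `Toleration` type (both assumptions, not part of any existing API), expands a list of keys into one `Exists` toleration each and prints them in the YAML shape used in this proposal.

```go
package main

import "fmt"

// Toleration is a simplified stand-in for the API type (assumption).
type Toleration struct {
	Key      string
	Operator string
}

// tolerateKeys returns one Exists toleration per key, in the given order.
func tolerateKeys(keys ...string) []Toleration {
	tols := make([]Toleration, 0, len(keys))
	for _, k := range keys {
		tols = append(tols, Toleration{Key: k, Operator: "Exists"})
	}
	return tols
}

func main() {
	// Tolerate XID and monitoring gaps, but deliberately not gpu-lost.
	for _, t := range tolerateKeys("gpu.nvidia.com/xid", "gpu.nvidia.com/unmonitored") {
		fmt.Printf("- key: %q\n  operator: %s\n", t.Key, t.Operator)
	}
}
```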
---

### Effect Strategy

Per the KEP's "Degraded Devices" user story, the recommended pattern is:
* **Driver** sets `Effect: None` — informational, no scheduling or eviction impact.
* **Admin/controller** escalates via `DeviceTaintRule` to `NoSchedule` or `NoExecute`.

This separates health reporting (driver's job) from scheduling policy (admin's job). However, for `gpu-lost` events where the GPU is physically gone, waiting for admin action is impractical, so the driver proposes a stronger default effect for that key.

**Proposed default effects:**

| Key | Default Effect | Rationale |
| :--- | :--- | :--- |
| `gpu.nvidia.com/xid` | `None` | Informational. Many XIDs are recoverable. Admin escalates if needed. |
| `gpu.nvidia.com/gpu-lost` | `NoExecute` | **Immediate Eviction.** The GPU is physically gone. Running workloads will hang/crash, so we must evict them immediately rather than just blocking new schedules. |
| `gpu.nvidia.com/unmonitored` | `None` | GPU may be healthy — we just can't watch it. Informational only. |

> Open question: Should we do `NoExecute` for `gpu-lost` to ensure immediate eviction of hanging pods, or choose `NoSchedule` and force admins to escalate via `DeviceTaintRule`?

---

### Taint Accumulation

With Option A, accumulation is natural: each key represents a different condition, and the keys coexist independently. Within the same key, the taint keeps the most severe effect reported so far, ordered `None` < `NoSchedule` < `NoExecute`.
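
One way to express this accumulation rule is a fold over reported conditions that keeps one effect per key. The Go sketch below is illustrative only: `observation`, `moreSevere`, and `accumulate` are hypothetical names, and effects are modeled as plain strings.

```go
package main

import "fmt"

// severity ranks effects as None < NoSchedule < NoExecute.
// Unknown effect strings rank 0, the same as None (assumption).
var severity = map[string]int{"None": 0, "NoSchedule": 1, "NoExecute": 2}

// moreSevere returns whichever of a and b ranks higher; ties keep a.
func moreSevere(a, b string) string {
	if severity[b] > severity[a] {
		return b
	}
	return a
}

// observation is one reported condition: a taint key and its effect.
type observation struct {
	Key    string
	Effect string
}

// accumulate keeps one effect per key. Different keys coexist;
// repeated keys keep the most severe effect seen.
func accumulate(obs []observation) map[string]string {
	out := make(map[string]string)
	for _, o := range obs {
		if cur, ok := out[o.Key]; ok {
			out[o.Key] = moreSevere(cur, o.Effect)
		} else {
			out[o.Key] = o.Effect
		}
	}
	return out
}

func main() {
	effects := accumulate([]observation{
		{"gpu.nvidia.com/xid", "None"},
		{"gpu.nvidia.com/unmonitored", "None"},
		{"gpu.nvidia.com/xid", "NoSchedule"},
		{"gpu.nvidia.com/xid", "None"}, // does not downgrade the key
	})
	fmt.Println(effects)
}
```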

### XID Value Encoding

With Option A, the value on `gpu.nvidia.com/xid` can be the XID number as a string (e.g., `"48"`). This is simple, requires no parsing for equality matching, and lets users write tolerations for specific XIDs:

```yaml
- key: "gpu.nvidia.com/xid"
  operator: Equal
  value: "48"
```

If a device experiences several different XIDs, each new XID overwrites the value, while the taint keeps the highest-severity effect seen so far. Only the most recent XID is therefore visible in the value. If we need to preserve the full XID history, that belongs in events, not in the taint.

