Karpenter does not delete broken nodes when there is no OS-level failure #2967

@wanddynosios

Description

Observed Behavior:

We recently had a kubelet fail due to memory issues. The node (EKS) was correctly marked as NotReady by Kubernetes and all its pods transitioned to the Terminating state. Replacement pods were scheduled, but pods with RWO (ReadWriteOnce) volumes attached could not start on new nodes because the volumes were never released from the broken node.

My understanding is that because the underlying OS kept running, AWS did not detect an issue with the EC2 instance, so Karpenter never received a signal to delete the NodeClaim (which showed no indication that anything was wrong). The cluster remained stuck in this state indefinitely without manual intervention.

Expected Behavior:

Node-level health issues (e.g. NotReady status) should be propagated to the NodeClaim, so that broken nodes are terminated and their resources (volumes etc.) are released — even when there is no underlying hardware or OS failure.

Reproduction Steps:

  1. Create an EKS cluster managed by Karpenter
  2. Deploy workloads with RWO volumes bound to pods
  3. Simulate a kubelet failure on one of the nodes (e.g. systemctl stop kubelet)
  4. Observe that:
    • The node transitions to NotReady
    • Pods enter Terminating state but are never fully evicted
    • The NodeClaim is not deleted by Karpenter
    • RWO volumes remain bound to the broken node, blocking replacement pods
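The steps above can be sketched as a short session against a test cluster (the node name is a placeholder; exact timings depend on the controller manager's node-monitor-grace-period):

```shell
# On the target node: stop the kubelet to simulate the failure.
# The EC2 instance and its OS keep running, so AWS-level health
# checks stay green and Karpenter sees nothing wrong.
sudo systemctl stop kubelet

# From a workstation: the node transitions to NotReady once the
# node-monitor-grace-period expires.
kubectl get nodes

# Pods on the node are marked Terminating but never finish evicting,
# and the NodeClaim backing the node is not deleted.
kubectl get pods -o wide --field-selector spec.nodeName=<node-name>
kubectl get nodeclaims
```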

NodePool Configuration:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-cpu
          operator: Gt
          values: ["4"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h
      startupTaints:
        - key: node.cilium.io/agent-not-ready
          value: "true"
          effect: NoExecute
  limits:
    cpu: 500
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 15m
    budgets:
      - nodes: "10%"
        reasons:
          - "Empty"
      - nodes: "10%"
        reasons:
          - "Drifted"
          - "Underutilized" #...

Note that we do not have terminationGracePeriod set; however, I think involuntary shutdowns should be handled separately from graceful termination. I am also aware of Karpenter's node-repair efforts, but there will always be failures a node cannot recover from, so I believe Karpenter should be able to handle those as well.
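For reference, Karpenter's v1 API does allow bounding how long a drain may take via terminationGracePeriod on the NodePool template; a minimal sketch of where it would go in the configuration above (the 30m value is an arbitrary example). Note this only caps drains Karpenter has already initiated and does not by itself cause a NodeClaim for a NotReady node to be deleted:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      # Upper bound on how long Karpenter waits for a node to drain once
      # it has decided to terminate it; pods still running after this
      # deadline are deleted forcefully. (30m is an arbitrary example.)
      terminationGracePeriod: 30m
      # ... requirements, nodeClassRef, etc. as above
```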

Versions:

  • Chart Version: 1.6.5
  • Kubernetes Version: v1.33.8-eks-f69f56f

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Metadata

    Labels: kind/bug, needs-priority, triage/needs-information