Unable to avoid unhealthy backend / 502s on rolling deployments #1718

@rocketraman

Description

I have a GCE ingress in front of an HPA-managed deployment (at this time, with a single replica).

On a rolling deployment, I sometimes run into the backend being marked as unhealthy, which results in 502 errors, usually for about 15-20 seconds.

According to the pod events, the neg-readiness-reflector appears to set the cloud.google.com/load-balancer-neg-ready condition to True before the pod is actually ready:

Normal   LoadBalancerNegNotReady            18m                neg-readiness-reflector                Waiting for pod to become healthy in at least one of the NEG(s): [k8s1-600f13cf-default-my-svc-8080-f82bf741]
Normal   LoadBalancerNegWithoutHealthCheck  16m                neg-readiness-reflector                Pod is in NEG "Key{\"k8s1-600f13cf-default-my-svc-8080-f82bf741\", zone: \"europe-west1-c\"}". NEG is not attached to any Backend Service with health checking. Marking condition "cloud.google.com/load-balancer-neg-ready" to True.
Warning  Unhealthy                          16m                kubelet                                Readiness probe failed: Get "http://10.129.128.130:8080/healthz": dial tcp 10.129.128.130:8080: connect: connection refused

While in this state, the previous pod terminates, but the load balancer does not route requests to the new pod, resulting in 502s.
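
For context, pods behind container-native load balancing carry a readiness gate injected by the NEG controller, so a pod should only become Ready once the load balancer condition is set. A sketch of how it shows up in the pod spec (this is added automatically, not something I define myself):

  spec:
    readinessGates:
      # Injected by the NEG controller; the neg-readiness-reflector
      # is responsible for flipping this condition to True
      - conditionType: cloud.google.com/load-balancer-neg-ready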

I have set a deployment strategy that should not allow this, but I suspect the NEG condition being marked Ready prematurely is subverting it:

  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0

My deployment also defines a readiness probe, as can be seen in the kubelet event above.
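
For reference, the probe looks roughly like this (the path and port match the kubelet event above; the timing values are illustrative, not my exact settings):

  readinessProbe:
    httpGet:
      path: /healthz
      port: 8080
    periodSeconds: 5
    failureThreshold: 3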

I also have a health check configured for the backend:

apiVersion: v1
kind: Service
metadata:
  name: my-svc
  labels:
    app.kubernetes.io/name: mysvc
  annotations:
    cloud.google.com/backend-config: '{"ports": {"8080":"my-backendconfig"}}'
    cloud.google.com/neg: '{"ingress": true}'
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: mysvc
  ports:
    - port: 8080
      protocol: TCP
      targetPort: 8080
---
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: my-backendconfig
spec:
  timeoutSec: 45
  connectionDraining:
    drainingTimeoutSec: 0
  healthCheck:
    checkIntervalSec: 5
    timeoutSec: 5
    healthyThreshold: 1
    unhealthyThreshold: 2
    type: HTTP
    requestPath: /healthz
    port: 8080

I found this Stack Overflow question, in which the user works around the issue by delaying pod shutdown with a sleep in a lifecycle.preStop hook, but that seems more like a hack than a proper solution: https://stackoverflow.com/questions/71127572/neg-is-not-attached-to-any-backendservice-with-health-checking.
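
A minimal sketch of that workaround on the deployment's container spec (the sleep duration is an arbitrary guess at how long load balancer programming takes, and it assumes a sleep binary exists in the image):

  lifecycle:
    preStop:
      exec:
        # Keep the old pod alive and serving while the load balancer
        # attaches and health-checks the new endpoint
        command: ["sleep", "15"]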
