Description
I have a GCE ingress in front of an HPA-managed deployment (at this time, with a single replica).
On a rolling deployment, I sometimes run into the backend being marked as unhealthy, resulting in 502 errors, usually for about 15-20 seconds.
According to the pod events, the neg-readiness-reflector appears to set the cloud.google.com/load-balancer-neg-ready condition to True before the pod is actually ready:
Normal LoadBalancerNegNotReady 18m neg-readiness-reflector Waiting for pod to become healthy in at least one of the NEG(s): [k8s1-600f13cf-default-my-svc-8080-f82bf741]
Normal LoadBalancerNegWithoutHealthCheck 16m neg-readiness-reflector Pod is in NEG "Key{\"k8s1-600f13cf-default-my-svc-8080-f82bf741\", zone: \"europe-west1-c\"}". NEG is not attached to any Backend Service with health checking. Marking condition "cloud.google.com/load-balancer-neg-ready" to True.
Warning Unhealthy 16m kubelet Readiness probe failed: Get "http://10.129.128.130:8080/healthz": dial tcp 10.129.128.130:8080: connect: connection refused
While in this state, the previous pod terminates, but the load balancer does not route requests to the new pod, resulting in 502s.
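For context, my understanding is that GKE injects a readiness gate into pods behind NEG-backed Services, and it is the neg-readiness-reflector that flips that condition; once it is True the pod counts as Ready, so the rollout proceeds and the old pod is allowed to terminate. A rough sketch of what that gate looks like in the pod spec (not copied from my cluster):

  # Readiness gate added by the NEG controller for container-native load balancing
  # (sketch). The neg-readiness-reflector sets this condition; once it is True the
  # pod is considered Ready, so maxUnavailable: 0 no longer holds the rollout back.
  spec:
    readinessGates:
      - conditionType: cloud.google.com/load-balancer-neg-ready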
I do have a deployment strategy set that should prevent this, but I suspect the NEG readiness condition being set to True prematurely is subverting it:
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
My deployment also defines a readiness probe, as can be seen in the events above.
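For reference, a minimal sketch of how those pieces fit together in the Deployment (the name, image, and probe timings are assumptions for illustration, not my exact manifest):

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: my-app                        # assumed name
  spec:
    replicas: 1
    strategy:
      rollingUpdate:
        maxSurge: 1
        maxUnavailable: 0
    selector:
      matchLabels:
        app.kubernetes.io/name: mysvc
    template:
      metadata:
        labels:
          app.kubernetes.io/name: mysvc
      spec:
        containers:
          - name: my-app                # assumed container name
            image: example/my-app:latest   # placeholder image
            ports:
              - containerPort: 8080
            readinessProbe:             # probe path matches the events above;
              httpGet:                  # timings are assumptions
                path: /healthz
                port: 8080
              periodSeconds: 5
              failureThreshold: 3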
I also have a health check configured for the backend via a BackendConfig:
  apiVersion: v1
  kind: Service
  metadata:
    name: my-svc
    labels:
      app.kubernetes.io/name: mysvc
    annotations:
      cloud.google.com/backend-config: '{"ports": {"8080":"my-backendconfig"}}'
      cloud.google.com/neg: '{"ingress": true}'
  spec:
    type: ClusterIP
    selector:
      app.kubernetes.io/name: mysvc
    ports:
      - port: 8080
        protocol: TCP
        targetPort: 8080
  ---
  apiVersion: cloud.google.com/v1
  kind: BackendConfig
  metadata:
    name: my-backendconfig
  spec:
    timeoutSec: 45
    connectionDraining:
      drainingTimeoutSec: 0
    healthCheck:
      checkIntervalSec: 5
      timeoutSec: 5
      healthyThreshold: 1
      unhealthyThreshold: 2
      type: HTTP
      requestPath: /healthz
      port: 8080
I found this Stack Overflow question in which the user works around the issue by delaying pod shutdown with a sleep in lifecycle.preStop, but that seems more like a hack than a proper solution to this issue: https://stackoverflow.com/questions/71127572/neg-is-not-attached-to-any-backendservice-with-health-checking.
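For completeness, that workaround boils down to something like the following (container name, sleep duration, and grace period are assumptions; the idea is just to keep the old pod serving until the load balancer has programmed the new endpoint):

  # Workaround sketch: delay container shutdown so the old endpoint keeps serving
  # while the new NEG endpoint becomes healthy. Values are illustrative only.
  spec:
    terminationGracePeriodSeconds: 60          # must cover the preStop sleep
    containers:
      - name: my-app                           # assumed container name
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 30"]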