feat: add application-level keepalive to prevent ALB idle connection timeouts #75
AWS ALBs do not forward HTTP/2 PING frames, causing idle gRPC connections to be closed. This adds a background loop that periodically calls the existing Hello RPC as application-level traffic to keep the connection alive through L7 load balancers.

Signed-off-by: joshvanl <[email protected]>
Pull request overview
Adds an application-level keepalive mechanism to the gRPC worker client to mitigate AWS ALB idle timeout behavior (ALB not forwarding HTTP/2 PING frames), helping long-lived idle connections stay open.
Changes:
- Introduces a background keepalive loop that periodically calls the existing Hello RPC.
- Starts/stops the keepalive loop alongside the background work-item processor lifecycle.
- Restarts the keepalive loop after successfully reconnecting the work-item stream.
```go
case <-ctx.Done():
	return
case <-ticker.C:
	_, err := c.client.Hello(ctx, &emptypb.Empty{})
```
The keepalive RPC uses the long-lived listener context directly. If the Hello call gets stuck (e.g., transport stalls without returning an error), this goroutine can block indefinitely and stop sending keepalive traffic. Consider wrapping each Hello call in a short per-request timeout context (derived from the loop ctx) so failures unblock reliably and the keepalive cadence remains consistent.
Suggested change:

```diff
-_, err := c.client.Hello(ctx, &emptypb.Empty{})
+reqCtx, reqCancel := context.WithTimeout(ctx, 5*time.Second)
+_, err := c.client.Hello(reqCtx, &emptypb.Empty{})
+reqCancel()
```
```go
if s, ok := status.FromError(err); ok && s.Code() == codes.Unimplemented {
	c.logger.Info("server does not support keepalive Hello RPC; disabling application-level keepalive")
	return
}
```
When the keepalive loop is stopped via cancelKeepalive() (e.g., during reconnect or shutdown), the in-flight Hello call can return a context-canceled error and will currently be logged as "keepalive failed". Since cancellation is an expected path here, it would be better to suppress logging for ctx cancellation / DeadlineExceeded (or check ctx.Err() before logging) to avoid noisy logs during normal shutdown/reconnect cycles.
Suggested change:

```diff
-if s, ok := status.FromError(err); ok && s.Code() == codes.Unimplemented {
-	c.logger.Info("server does not support keepalive Hello RPC; disabling application-level keepalive")
-	return
-}
+// Suppress expected cancellation/deadline errors during shutdown/reconnect.
+if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
+	continue
+}
+if s, ok := status.FromError(err); ok {
+	if s.Code() == codes.Canceled || s.Code() == codes.DeadlineExceeded {
+		continue
+	}
+	if s.Code() == codes.Unimplemented {
+		c.logger.Info("server does not support keepalive Hello RPC; disabling application-level keepalive")
+		return
+	}
+}
```
```go
func (c *TaskHubGrpcClient) startKeepaliveLoop(ctx context.Context) context.CancelFunc {
	ctx, cancel := context.WithCancel(ctx)
	go func() {
		ticker := time.NewTicker(30 * time.Second)
```
The keepalive interval is currently hard-coded to 30s. To make this easier to tune for different L7 load balancer idle timeouts (and easier to test without long sleeps), consider extracting this into a named constant or making it configurable via client options.
Suggested change:

```diff
+const defaultKeepaliveInterval = 30 * time.Second
+
 func (c *TaskHubGrpcClient) startKeepaliveLoop(ctx context.Context) context.CancelFunc {
 	ctx, cancel := context.WithCancel(ctx)
 	go func() {
-		ticker := time.NewTicker(30 * time.Second)
+		ticker := time.NewTicker(defaultKeepaliveInterval)
```