
feat: add application-level keepalive to prevent ALB idle connection timeouts#75

Closed
JoshVanL wants to merge 1 commit into dapr:main from JoshVanL:grpc-keepalive-hello

Conversation

@JoshVanL

AWS ALBs do not forward HTTP/2 PING frames, causing idle gRPC connections to be closed. This adds a background loop that periodically calls the existing Hello RPC as application-level traffic to keep the connection alive through L7 load balancers.

…timeouts

AWS ALBs do not forward HTTP/2 PING frames, causing idle gRPC connections
to be closed. This adds a background loop that periodically calls the
existing Hello RPC as application-level traffic to keep the connection
alive through L7 load balancers.

Signed-off-by: joshvanl <[email protected]>
@JoshVanL JoshVanL marked this pull request as ready for review March 11, 2026 17:51
@JoshVanL JoshVanL requested a review from a team as a code owner March 11, 2026 17:51
Copilot AI review requested due to automatic review settings March 11, 2026 17:51
@JoshVanL JoshVanL closed this Mar 11, 2026

Copilot AI left a comment


Pull request overview

Adds an application-level keepalive mechanism to the gRPC worker client to mitigate AWS ALB idle timeout behavior (ALB not forwarding HTTP/2 PING frames), helping long-lived idle connections stay open.

Changes:

  • Introduces a background keepalive loop that periodically calls the existing Hello RPC.
  • Starts/stops the keepalive loop alongside the background work-item processor lifecycle.
  • Restarts the keepalive loop after successfully reconnecting the work-item stream.


case <-ctx.Done():
    return
case <-ticker.C:
    _, err := c.client.Hello(ctx, &emptypb.Empty{})

Copilot AI Mar 11, 2026


The keepalive RPC uses the long-lived listener context directly. If the Hello call gets stuck (e.g., transport stalls without returning an error), this goroutine can block indefinitely and stop sending keepalive traffic. Consider wrapping each Hello call in a short per-request timeout context (derived from the loop ctx) so failures unblock reliably and the keepalive cadence remains consistent.

Suggested change
- _, err := c.client.Hello(ctx, &emptypb.Empty{})
+ reqCtx, reqCancel := context.WithTimeout(ctx, 5*time.Second)
+ _, err := c.client.Hello(reqCtx, &emptypb.Empty{})
+ reqCancel()

Comment on lines +40 to +43
if s, ok := status.FromError(err); ok && s.Code() == codes.Unimplemented {
    c.logger.Info("server does not support keepalive Hello RPC; disabling application-level keepalive")
    return
}

Copilot AI Mar 11, 2026


When the keepalive loop is stopped via cancelKeepalive() (e.g., during reconnect or shutdown), the in-flight Hello call can return a context-canceled error and will currently be logged as "keepalive failed". Since cancellation is an expected path here, it would be better to suppress logging for ctx cancellation / DeadlineExceeded (or check ctx.Err() before logging) to avoid noisy logs during normal shutdown/reconnect cycles.

Suggested change
- if s, ok := status.FromError(err); ok && s.Code() == codes.Unimplemented {
-     c.logger.Info("server does not support keepalive Hello RPC; disabling application-level keepalive")
-     return
- }
+ // Suppress expected cancellation/deadline errors during shutdown/reconnect.
+ if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
+     continue
+ }
+ if s, ok := status.FromError(err); ok {
+     if s.Code() == codes.Canceled || s.Code() == codes.DeadlineExceeded {
+         continue
+     }
+     if s.Code() == codes.Unimplemented {
+         c.logger.Info("server does not support keepalive Hello RPC; disabling application-level keepalive")
+         return
+     }
+ }

Comment on lines +28 to +31
func (c *TaskHubGrpcClient) startKeepaliveLoop(ctx context.Context) context.CancelFunc {
    ctx, cancel := context.WithCancel(ctx)
    go func() {
        ticker := time.NewTicker(30 * time.Second)

Copilot AI Mar 11, 2026


The keepalive interval is currently hard-coded to 30s. To make this easier to tune for different L7 load balancer idle timeouts (and easier to test without long sleeps), consider extracting this into a named constant or making it configurable via client options.

Suggested change
- func (c *TaskHubGrpcClient) startKeepaliveLoop(ctx context.Context) context.CancelFunc {
-     ctx, cancel := context.WithCancel(ctx)
-     go func() {
-         ticker := time.NewTicker(30 * time.Second)
+ const defaultKeepaliveInterval = 30 * time.Second
+
+ func (c *TaskHubGrpcClient) startKeepaliveLoop(ctx context.Context) context.CancelFunc {
+     ctx, cancel := context.WithCancel(ctx)
+     go func() {
+         ticker := time.NewTicker(defaultKeepaliveInterval)
