feat: add application-level keepalive to prevent ALB idle connection timeouts #75
AWS ALBs do not forward HTTP/2 PING frames, causing idle gRPC connections to be closed. This adds a background loop that periodically calls the existing Hello RPC as application-level traffic to keep the connection alive through L7 load balancers.

Signed-off-by: joshvanl <[email protected]>
Pull request overview
Adds an application-level keepalive mechanism to the gRPC worker client to mitigate AWS ALB idle timeout behavior (ALB not forwarding HTTP/2 PING frames), helping long-lived idle connections stay open.
Changes:
- Introduces a background keepalive loop that periodically calls the existing Hello RPC.
- Starts/stops the keepalive loop alongside the background work-item processor lifecycle.
- Restarts the keepalive loop after successfully reconnecting the work-item stream.
```go
case <-ctx.Done():
	return
case <-ticker.C:
	_, err := c.client.Hello(ctx, &emptypb.Empty{})
```
The keepalive RPC uses the long-lived listener context directly. If the Hello call gets stuck (e.g., transport stalls without returning an error), this goroutine can block indefinitely and stop sending keepalive traffic. Consider wrapping each Hello call in a short per-request timeout context (derived from the loop ctx) so failures unblock reliably and the keepalive cadence remains consistent.
Suggested change:

```diff
-_, err := c.client.Hello(ctx, &emptypb.Empty{})
+reqCtx, reqCancel := context.WithTimeout(ctx, 5*time.Second)
+_, err := c.client.Hello(reqCtx, &emptypb.Empty{})
+reqCancel()
```
```go
if s, ok := status.FromError(err); ok && s.Code() == codes.Unimplemented {
	c.logger.Info("server does not support keepalive Hello RPC; disabling application-level keepalive")
	return
}
```
When the keepalive loop is stopped via cancelKeepalive() (e.g., during reconnect or shutdown), the in-flight Hello call can return a context-canceled error and will currently be logged as "keepalive failed". Since cancellation is an expected path here, it would be better to suppress logging for ctx cancellation / DeadlineExceeded (or check ctx.Err() before logging) to avoid noisy logs during normal shutdown/reconnect cycles.
Suggested change:

```diff
-if s, ok := status.FromError(err); ok && s.Code() == codes.Unimplemented {
-	c.logger.Info("server does not support keepalive Hello RPC; disabling application-level keepalive")
-	return
-}
+// Suppress expected cancellation/deadline errors during shutdown/reconnect.
+if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
+	continue
+}
+if s, ok := status.FromError(err); ok {
+	if s.Code() == codes.Canceled || s.Code() == codes.DeadlineExceeded {
+		continue
+	}
+	if s.Code() == codes.Unimplemented {
+		c.logger.Info("server does not support keepalive Hello RPC; disabling application-level keepalive")
+		return
+	}
+}
```
```go
func (c *TaskHubGrpcClient) startKeepaliveLoop(ctx context.Context) context.CancelFunc {
	ctx, cancel := context.WithCancel(ctx)
	go func() {
		ticker := time.NewTicker(30 * time.Second)
```
The keepalive interval is currently hard-coded to 30s. To make this easier to tune for different L7 load balancer idle timeouts (and easier to test without long sleeps), consider extracting this into a named constant or making it configurable via client options.
Suggested change:

```diff
+const defaultKeepaliveInterval = 30 * time.Second
+
 func (c *TaskHubGrpcClient) startKeepaliveLoop(ctx context.Context) context.CancelFunc {
 	ctx, cancel := context.WithCancel(ctx)
 	go func() {
-		ticker := time.NewTicker(30 * time.Second)
+		ticker := time.NewTicker(defaultKeepaliveInterval)
```