Skip to content

fix: ensure closed shards with pending data are still claimed#644

Closed
matus-tomlein wants to merge 2 commits intowarpstreamlabs:mainfrom
matus-tomlein:fix/kinesis-shard-cleanup-race-condition
Closed

fix: ensure closed shards with pending data are still claimed#644
matus-tomlein wants to merge 2 commits intowarpstreamlabs:mainfrom
matus-tomlein:fix/kinesis-shard-cleanup-race-condition

Conversation

@matus-tomlein
Copy link

@matus-tomlein matus-tomlein commented Jan 8, 2026

Summary

Fixes #643

Closed shards (with EndingSequenceNumber) were being skipped during rebalancing, even if they still had unprocessed records. This caused data loss when a consumer restarted before fully draining a closed shard.

The Problem

When a shard is closed, the old logic would skip it entirely during rebalancing:

for _, s := range allShards {
    if !isShardFinished(s) {  // ← Closed shards skipped!
        unclaimedShards[*s.ShardId] = ""
    }
}

This caused a race condition:

  1. Consumer is reading from Shard-0
  2. Kinesis scales down → Shard-0 gets CLOSED (EndingSequenceNumber set)
  3. Consumer hasn't finished draining Shard-0 yet (still has records)
  4. Rebalance runs:
    • isShardFinished(Shard-0) = true
    • Shard-0 is NOT added to unclaimedShards
    • No consumer gets spawned for it
  5. Consumer pod restarts (or crashes)
  6. Shard-0 is orphaned — nobody will ever finish consuming it

The Fix

Now we also check if a closed shard has a checkpoint in DynamoDB. If a checkpoint exists, the shard hasn't been fully consumed yet (since checkpointer.Delete() removes it upon completion), so we include it in unclaimedShards:

if !isShardFinished(s) || shardsWithCheckpoints[shardID] {
    unclaimedShards[shardID] = ""
}

Test plan

Added unit tests for isShardFinished() covering open, closed, and edge cases

Copilot AI review requested due to automatic review settings January 8, 2026 11:03
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request fixes a race condition where DynamoDB lease table entries for closed Kinesis shards are not cleaned up when Bento pods are terminated during Kubernetes scaling events, leading to orphaned entries and false positive latency alerts.

  • Adds conditional cleanup logic during pod shutdown to delete checkpoints for finished shards instead of saving them
  • Implements periodic background cleanup that removes DynamoDB entries for shards that have been closed by Kinesis

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@matus-tomlein matus-tomlein force-pushed the fix/kinesis-shard-cleanup-race-condition branch from f14390b to 44d706e Compare January 28, 2026 13:04
@matus-tomlein matus-tomlein changed the title fix: prevent orphaned DynamoDB entries when Kinesis shards close during pod termination fix: ensure closed shards with pending data are still claimed Jan 28, 2026
@matus-tomlein matus-tomlein requested a review from Copilot January 28, 2026 13:09
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…namodb and use pagination rather than the scan query
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@matus-tomlein
Copy link
Author

@jem-davies @gregfurman We have simplified the PR and addressed all the comments above – could you please give this another look?

@jem-davies
Copy link
Collaborator

Closed as included in #680

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

aws_kinesis: DynamoDB entries for closed shards not cleaned up during pod termination

3 participants

Comments