fix: ensure closed shards with pending data are still claimed#644
fix: ensure closed shards with pending data are still claimed#644matus-tomlein wants to merge 2 commits intowarpstreamlabs:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This pull request fixes a race condition where DynamoDB lease table entries for closed Kinesis shards are not cleaned up when Bento pods are terminated during Kubernetes scaling events, leading to orphaned entries and false positive latency alerts.
- Adds conditional cleanup logic during pod shutdown to delete checkpoints for finished shards instead of saving them
- Implements periodic background cleanup that removes DynamoDB entries for shards that have been closed by Kinesis
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
f14390b to
44d706e
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…namodb and use pagination rather than the scan query
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@jem-davies @gregfurman We have simplified the PR and addressed all the comments above – could you please give this another look? |
|
Closed as included in #680 |
Summary
Fixes #643
Closed shards (with
EndingSequenceNumber) were being skipped during rebalancing, even if they still had unprocessed records. This caused data loss when a consumer restarted before fully draining a closed shard.The Problem
When a shard is closed, the old logic would skip it entirely during rebalancing:
This caused a race condition:
The Fix
Now we also check if a closed shard has a checkpoint in DynamoDB. If a checkpoint exists, the shard hasn't been fully consumed yet (since checkpointer.Delete() removes it upon completion), so we include it in unclaimedShards:
Test plan
Added unit tests for
isShardFinished()covering open, closed, and edge cases