fix: ensure closed shards with pending data are still claimed by matus-tomlein · Pull Request #644 · warpstreamlabs/bento

matus-tomlein · 2026-01-08T11:03:32Z

Summary

Fixes #643

Closed shards (with EndingSequenceNumber) were being skipped during rebalancing, even if they still had unprocessed records. This caused data loss when a consumer restarted before fully draining a closed shard.

The Problem

When a shard is closed, the old logic would skip it entirely during rebalancing:

for _, s := range allShards {
    if !isShardFinished(s) {  // ← Closed shards skipped!
        unclaimedShards[*s.ShardId] = ""
    }
}

This caused a race condition:

1. Consumer is reading from Shard-0
2. Kinesis scales down → Shard-0 gets CLOSED (EndingSequenceNumber set)
3. Consumer hasn't finished draining Shard-0 yet (still has records)
4. Rebalance runs:
   - isShardFinished(Shard-0) = true
   - Shard-0 is NOT added to unclaimedShards
   - No consumer gets spawned for it
5. Consumer pod restarts (or crashes)
6. Shard-0 is orphaned: nobody will ever finish consuming it

The Fix

Now we also check if a closed shard has a checkpoint in DynamoDB. If a checkpoint exists, the shard hasn't been fully consumed yet (since checkpointer.Delete() removes it upon completion), so we include it in unclaimedShards:

if !isShardFinished(s) || shardsWithCheckpoints[shardID] {
    unclaimedShards[shardID] = ""
}
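To make the fixed condition concrete, here is a minimal, self-contained sketch of the selection step. It is not the code from the PR: `sequenceNumberRange` and `shard` are hypothetical, trimmed-down stand-ins for the Kinesis SDK types, the body of `isShardFinished` is an assumption (closed means an ending sequence number is present), and `selectUnclaimed` is an illustrative name.

```go
package main

// sequenceNumberRange and shard are hypothetical stand-ins for the Kinesis
// SDK types; only the fields this sketch needs are modelled.
type sequenceNumberRange struct {
	EndingSequenceNumber *string
}

type shard struct {
	ShardId             *string
	SequenceNumberRange *sequenceNumberRange
}

// isShardFinished is an assumed implementation: a shard is treated as closed
// once it carries an ending sequence number.
func isShardFinished(s shard) bool {
	return s.SequenceNumberRange != nil &&
		s.SequenceNumberRange.EndingSequenceNumber != nil
}

// selectUnclaimed applies the fixed condition: a shard is a candidate for
// claiming if it is still open, or if it is closed but a checkpoint for it
// still exists (meaning it has not been fully drained).
func selectUnclaimed(allShards []shard, shardsWithCheckpoints map[string]bool) map[string]string {
	unclaimedShards := make(map[string]string)
	for _, s := range allShards {
		if s.ShardId == nil {
			continue // defensive: ignore entries without an ID
		}
		shardID := *s.ShardId
		if !isShardFinished(s) || shardsWithCheckpoints[shardID] {
			unclaimedShards[shardID] = ""
		}
	}
	return unclaimedShards
}
```

Under these assumptions, a closed shard with a remaining checkpoint is selected, a closed shard without one is skipped, and open shards are selected as before.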

Test plan

Added unit tests for isShardFinished() covering open, closed, and edge cases
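The tests themselves are not shown on this page. As a rough illustration of what table-driven coverage of open, closed, and edge cases could look like, the sketch below uses hypothetical stand-in types and an assumed `isShardFinished` body; the "missing range" edge case is also an assumption. In a real test file the loop would live in a `testing.T` test rather than a helper that returns failures.

```go
package main

// Hypothetical stand-ins for the Kinesis SDK types used by the real code.
type sequenceNumberRange struct {
	EndingSequenceNumber *string
}

type shard struct {
	SequenceNumberRange *sequenceNumberRange
}

// isShardFinished is an assumed implementation, not the one from the PR.
func isShardFinished(s shard) bool {
	return s.SequenceNumberRange != nil &&
		s.SequenceNumberRange.EndingSequenceNumber != nil
}

func strPtr(v string) *string { return &v }

// shardCase is one row of the table: an input shard and the expected result.
type shardCase struct {
	name  string
	shard shard
	want  bool
}

var cases = []shardCase{
	{"open shard", shard{SequenceNumberRange: &sequenceNumberRange{}}, false},
	{"closed shard", shard{SequenceNumberRange: &sequenceNumberRange{EndingSequenceNumber: strPtr("100")}}, true},
	{"missing range", shard{}, false},
}

// failedCases returns the names of rows whose result differs from want.
func failedCases() []string {
	var failed []string
	for _, c := range cases {
		if got := isShardFinished(c.shard); got != c.want {
			failed = append(failed, c.name)
		}
	}
	return failed
}
```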

Copilot

Pull request overview

This pull request fixes a race condition where DynamoDB lease table entries for closed Kinesis shards are not cleaned up when Bento pods are terminated during Kubernetes scaling events, leading to orphaned entries and false positive latency alerts.

- Adds conditional cleanup logic during pod shutdown to delete checkpoints for finished shards instead of saving them
- Implements periodic background cleanup that removes DynamoDB entries for shards that have been closed by Kinesis


internal/impl/aws/input_kinesis.go

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.


internal/impl/aws/input_kinesis.go

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.


internal/impl/aws/input_kinesis_checkpointer.go

internal/impl/aws/input_kinesis.go


Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.


matus-tomlein · 2026-01-28T14:32:23Z

@jem-davies @gregfurman We have simplified the PR and addressed all the comments above – could you please give this another look?

jem-davies · 2026-02-03T13:55:24Z

Closed as included in #680

Copilot AI review requested due to automatic review settings January 8, 2026 11:03

matus-tomlein requested review from gregfurman and jem-davies as code owners January 8, 2026 11:03

Copilot started reviewing on behalf of matus-tomlein January 8, 2026 11:04

Copilot AI reviewed Jan 8, 2026

internal/impl/aws/input_kinesis.go (outdated, resolved)

internal/impl/aws/input_kinesis.go (outdated, resolved)

matus-tomlein requested a review from Copilot January 8, 2026 12:05

Copilot started reviewing on behalf of matus-tomlein January 8, 2026 12:05

Copilot AI reviewed Jan 8, 2026

internal/impl/aws/input_kinesis.go (outdated, resolved)

internal/impl/aws/input_kinesis.go (resolved)

internal/impl/aws/input_kinesis.go (outdated, resolved)

Copilot AI reviewed Jan 8, 2026

fix unclaimed shards not being completed

44d706e

matus-tomlein force-pushed the fix/kinesis-shard-cleanup-race-condition branch from f14390b to 44d706e January 28, 2026 13:04

matus-tomlein changed the title ~~fix: prevent orphaned DynamoDB entries when Kinesis shards close during pod termination~~ fix: ensure closed shards with pending data are still claimed Jan 28, 2026

matus-tomlein requested a review from Copilot January 28, 2026 13:09

Copilot started reviewing on behalf of matus-tomlein January 28, 2026 13:10

Copilot AI reviewed Jan 28, 2026

internal/impl/aws/input_kinesis_checkpointer.go (outdated, resolved)

internal/impl/aws/input_kinesis.go (resolved)

Use a single query for both retrieving claims and checkpoints from dynamodb and use pagination rather than the scan query

d5dda26

matus-tomlein requested a review from Copilot January 28, 2026 14:20

Copilot started reviewing on behalf of matus-tomlein January 28, 2026 14:21

Copilot AI reviewed Jan 28, 2026

jem-davies self-assigned this Jan 30, 2026

jem-davies mentioned this pull request Feb 3, 2026

Add support for AWS Kinesis Enhanced Fan-Out as an input #680

Open

jem-davies closed this Feb 3, 2026

matus-tomlein wants to merge 2 commits into warpstreamlabs:main from matus-tomlein:fix/kinesis-shard-cleanup-race-condition


3 participants
