dekaf: add missing timeout to fetch_partitions() in task manager loop#2706
Open
dekaf: add missing timeout to fetch_partitions() in task manager loop#2706
fetch_partitions() in task manager loop#2706Conversation
Member
|
I'm not sure how I feel about this. It does what it says, but I'm pretty uncomfortable with continuing to layer patches on top of |
1799b33 to
3e057cb
Compare
`fetch_partitions()` calls `journal_client.list()`. This was the only network call in the task manager loop not wrapped with a timeout. If the RPC hangs for whatever reason, the task manager loop blocks permanently, the published access token expires, and all sessions get "Access token has expired and the task manager has been unable to refresh it." indefinitely.
7ee707d to
86c5520
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The Dekaf task manager is responsible for fetching and handing out fresh information about tasks that Dekaf is reading from. In order to do this, it needs to call out to various places:
/authorize/dekaf,/authorize/task, and Gazette for journal info. That journal gRPC listing was the only network call in the task manager loop not wrapped in a timeout. If the RPC hangs for whatever reason, the task manager loop blocks permanently, the cached access token expires, and all sessions receive "Access token has expired and the task manager has been unable to refresh it." indefinitely until the Dekaf process is restarted.This wraps
fetch_partitions()intokio::time::timeout(timeout, ...), matching the pattern used byget_or_refresh_dekaf_auth()andget_or_refresh_journal_client().Incident context
Recently, we had an incident where a Gazette broker went down for reasons we are still investigating. After the dataplane recovered, 3 of 4 dekaf instances remained permanently stuck. Restarting all dekaf instances immediately resolved the issue.
The stuck instances had zero "Successful task manager run" log entries after the outage began, while the remaining healthy instance continued logging successful runs every 30 seconds. Additionally, sessions were unable to fetch tokens from the task manager due to
Access token has expired and the task manager has been unable to refresh it.. This is consistent with the task manager loop being blocked on a hungjournal_client.list()call insidefetch_partitions().