dekaf: add missing timeout to fetch_partitions() in task manager loop#2706

Open
jshearer wants to merge 2 commits into master from dekaf/fetch-partitions-timeout

Conversation

@jshearer
Contributor

Summary

The Dekaf task manager fetches and hands out fresh information about the tasks that Dekaf reads from. To do this, it calls out to several places: /authorize/dekaf, /authorize/task, and Gazette for journal info. The Gazette journal listing, a gRPC call, was the only network call in the task manager loop not wrapped in a timeout. If that RPC hangs for whatever reason, the task manager loop blocks permanently, the cached access token expires, and all sessions receive "Access token has expired and the task manager has been unable to refresh it." indefinitely, until the Dekaf process is restarted.

This wraps fetch_partitions() in tokio::time::timeout(timeout, ...), matching the pattern used by get_or_refresh_dekaf_auth() and get_or_refresh_journal_client().

Incident context

Recently, we had an incident in which a Gazette broker went down for reasons we are still investigating. After the dataplane recovered, 3 of 4 Dekaf instances remained permanently stuck. Restarting all Dekaf instances immediately resolved the issue.

The stuck instances had zero "Successful task manager run" log entries after the outage began, while the remaining healthy instance continued logging successful runs every 30 seconds. Additionally, sessions on the stuck instances were unable to fetch tokens from the task manager and failed with "Access token has expired and the task manager has been unable to refresh it." This is consistent with the task manager loop being blocked on a hung journal_client.list() call inside fetch_partitions().

@jshearer jshearer requested review from a team February 23, 2026 19:02
@jgraettinger
Member

I'm not sure how I feel about this. It does what it says, but I'm pretty uncomfortable with continuing to layer patches on top of task_manager.rs. I really think we need to tackle a refactor into the tokens crate (as started here)

@jshearer jshearer force-pushed the dekaf/fetch-partitions-timeout branch from 1799b33 to 3e057cb Compare March 2, 2026 20:02
jshearer added 2 commits March 3, 2026 13:51
`fetch_partitions()` calls `journal_client.list()`. This was the only network call in the task manager loop not wrapped with a timeout. If the RPC hangs for whatever reason, the task manager loop blocks permanently, the published access token expires, and all sessions get "Access token has expired and the task manager has been unable to refresh it." indefinitely.
@jshearer jshearer force-pushed the dekaf/fetch-partitions-timeout branch 3 times, most recently from 7ee707d to 86c5520 Compare March 3, 2026 20:19
2 participants