Fix orphaned in_progress jobs after worker crash/restart #59

Open
Manuscrit wants to merge 1 commit into longtermrisk:v0.9 from slacki-ai:fix/worker-orphaned-jobs-cleanup

Conversation

@Manuscrit
Collaborator

When a worker pod crashes (OOM, SIGKILL, power loss), the atexit shutdown handler never fires, leaving jobs stuck in in_progress status. If the pod restarts quickly with the same worker_id, the cluster manager's unresponsive-worker cleanup never triggers either (the worker keeps pinging). This causes orphaned jobs to accumulate — each crash leaves one more zombie in_progress job that no worker is executing.

Fix: on startup, before entering the job loop, revert any in_progress jobs assigned to this worker_id back to pending. A freshly started worker process cannot be executing anything, so any such jobs are guaranteed orphans from a previous lifetime.
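
A minimal sketch of the startup cleanup, assuming a SQL-backed `jobs` table with `status` and `worker_id` columns; the table name, column names, and the `revert_orphaned_jobs` helper are illustrative, and the actual job store and schema in this repo may differ.

```python
import sqlite3


def revert_orphaned_jobs(conn: sqlite3.Connection, worker_id: str) -> int:
    """Revert in_progress jobs assigned to this worker back to pending.

    Called once at startup, before entering the job loop. A freshly
    started worker process cannot be executing anything, so any
    in_progress job still assigned to this worker_id is an orphan
    from a previous lifetime.
    """
    cur = conn.execute(
        "UPDATE jobs SET status = 'pending', worker_id = NULL "
        "WHERE status = 'in_progress' AND worker_id = ?",
        (worker_id,),
    )
    conn.commit()
    # rowcount is the number of orphaned jobs recovered
    return cur.rowcount
```

In this sketch, a worker would call the helper with its own worker_id exactly once at startup; running it any later would risk reverting a job the worker has already picked up in the current lifetime.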

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>