
Race condition causes instances to get stuck in pending_delete state #699

@benoit-nexthop

Description

Summary

Instance managers can fail to complete deletion operations due to a race condition between the provider worker and instance manager goroutines. This leaves instances permanently stuck in pending_delete state, requiring manual intervention (pod restart) to recover.

Note: this bug report was drafted with Augment; however, I can vouch for it, having personally combed through the code and confirmed that this is a very likely explanation for the leaked runners in pending_delete that we see build up over time.

Affected Versions

Likely affects all versions after the sync.Map refactoring (commits 928bcae and b959c7f, March 20, 2026)

Note: All line numbers in this report reference origin/main (commit fe10bdb as of April 2026)

Impact

  • Runner instances accumulate in pending_delete state over time
  • Instances stuck despite VMs being successfully deleted in the underlying provider
  • Growth rate: 2-3 stuck instances per day (out of roughly 60 runners active at any given time)
  • Pod restart required to trigger cleanup

Root Cause

The issue stems from incomplete error handling left behind by the sync.Map refactoring. While an instance manager is processing a deletion, the provider worker can remove it from the p.runners map mid-flight, causing the manager's subsequent SetInstanceStatus() calls to fail.

Detailed Sequence

  1. Instance manager processes pending_delete (instance_manager.go:328-352)
  2. Sets status to deleting (line 329): SetInstanceStatus(..., InstanceDeleting, ...)
  3. Successfully deletes VM from provider (line 336): handleDeleteInstanceInProvider()
  4. Meanwhile: Database watcher sends an update event with status deleted
  5. Provider worker receives event (provider.go:268-320)
  6. Worker calls stopAndDeleteInstance() (line 307) which:
    • Stops the instance manager (line 262)
    • Removes it from p.runners (line 264): p.runners.Delete(instance.Name)
  7. Instance manager tries to finalize (instance_manager.go:347):
    • Calls SetInstanceStatus(..., InstanceDeleted, ...)
    • This fails with ErrNotFound because manager was removed from map
  8. Error is ignored (line 348) and manager exits
  9. Database still shows deleting, not deleted
  10. No manager exists to retry - instance is orphaned

Alternative Failure Path

If deletion fails and the manager tries to revert to pending_delete (line 340):

if err := i.helper.SetInstanceStatus(i.instance.Name, commonParams.InstancePendingDelete, []byte(err.Error()), true); err != nil {
    return fmt.Errorf("setting instance status to error: %w", err)
}

This call can also fail with ErrNotFound, and there is no handling for that case: the function simply returns the error and the manager exits, leaving the instance stuck.

Code Locations

provider_helper.go (lines 59-81)

func (p *Provider) SetInstanceStatus(instanceName string, status commonParams.InstanceStatus, providerFault []byte, force bool) error {
    if _, ok := p.runners.Load(instanceName); !ok {  // ← Fails if manager was removed
        return errors.ErrNotFound
    }
    // ...
}

provider.go (lines 252-266)

func (p *Provider) stopAndDeleteInstance(instance params.Instance) error {
    if instance.Status != commonParams.InstanceDeleted {
        return nil
    }
    val, ok := p.runners.Load(instance.Name)
    if !ok {
        return nil
    }
    existingInstance := val.(*instanceManager)
    if err := existingInstance.Stop(); err != nil {
        return fmt.Errorf("failed to stop instance manager: %w", err)
    }
    p.runners.Delete(instance.Name)  // ← Removes manager from map
    return nil
}

instance_manager.go (lines 328-352)

case commonParams.InstancePendingDelete, commonParams.InstancePendingForceDelete:
    prevStatus := i.instance.Status
    if err := i.helper.SetInstanceStatus(i.instance.Name, commonParams.InstanceDeleting, nil, true); err != nil {
        if errors.Is(err, runnerErrors.ErrNotFound) {
            return nil  // Manager was removed - silently exit
        }
        return fmt.Errorf("setting instance status to deleting: %w", err)
    }

    if err := i.handleDeleteInstanceInProvider(i.instance); err != nil {
        slog.ErrorContext(i.ctx, "deleting instance in provider", "error", err)
        if prevStatus == commonParams.InstancePendingDelete {
            i.incrementBackOff()
            // ↓ This can fail with ErrNotFound, no handling!
            if err := i.helper.SetInstanceStatus(i.instance.Name, commonParams.InstancePendingDelete, []byte(err.Error()), true); err != nil {
                return fmt.Errorf("setting instance status to error: %w", err)
            }
            return fmt.Errorf("error removing instance. Will retry: %w", err)
        }
    }
    // ↓ This can fail with ErrNotFound, error is ignored
    if err := i.helper.SetInstanceStatus(i.instance.Name, commonParams.InstanceDeleted, nil, false); err != nil {
        if !errors.Is(err, runnerErrors.ErrNotFound) {
            return fmt.Errorf("setting instance status to deleted: %w", err)
        }
    }
    return ErrInstanceDeleted

Why Pod Restart Fixes It

When GARM restarts:

  1. loadAllRunners() (provider.go:85-137) is called
  2. It loads all instances from the database except those in deleting or deleted state
  3. Creates fresh instance managers for instances in pending_delete
  4. These new managers successfully complete the deletion
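
The recovery hinges on that startup filter. A minimal sketch of the selection logic, assuming the status names from this report (shouldReload is a hypothetical helper, not the actual loadAllRunners code):

```go
package main

import "fmt"

// shouldReload reports whether startup should create a fresh instance
// manager for a database row. Rows already in "deleting" or "deleted"
// are skipped; anything parked in "pending_delete" gets a new manager,
// which is why a restart unsticks the orphaned instances.
func shouldReload(status string) bool {
	switch status {
	case "deleting", "deleted":
		return false
	default:
		return true
	}
}

func main() {
	fmt.Println(shouldReload("pending_delete")) // true: manager recreated, deletion retried
	fmt.Println(shouldReload("deleted"))        // false: row skipped at startup
}
```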

Proposed Fixes

Option 1: Don't Remove Manager Until Database Confirms Deletion

Modify stopAndDeleteInstance() to only stop the manager, not remove it:

func (p *Provider) stopAndDeleteInstance(instance params.Instance) error {
    if instance.Status != commonParams.InstanceDeleted {
        return nil
    }
    val, ok := p.runners.Load(instance.Name)
    if ok {
        existingInstance := val.(*instanceManager)
        if err := existingInstance.Stop(); err != nil {
            return fmt.Errorf("failed to stop instance manager: %w", err)
        }
        // Don't delete from map yet - let it happen naturally when manager exits
    }
    return nil
}

Option 2: Use Database Directly for Status Updates During Deletion

Modify the instance manager to bypass the helper and write directly to the database during the critical deletion path:

if err := i.store.UpdateInstance(i.ctx, i.instance.Name, params.UpdateInstanceParams{
    Status: commonParams.InstanceDeleted,
}); err != nil {
    return fmt.Errorf("setting instance status to deleted: %w", err)
}

Option 3: Implement Proper Retry Logic

Add retry logic when SetInstanceStatus() fails with ErrNotFound:

const maxRetries = 3
var lastErr error
for retry := 0; retry < maxRetries; retry++ {
    lastErr = i.helper.SetInstanceStatus(...)
    if lastErr == nil {
        break
    }
    if !errors.Is(lastErr, runnerErrors.ErrNotFound) {
        return lastErr
    }
    time.Sleep(100 * time.Millisecond)
}
if lastErr != nil {
    // Don't give up silently once retries are exhausted.
    return fmt.Errorf("setting instance status after retries: %w", lastErr)
}

Reproduction

This is a timing-dependent race condition that occurs more frequently under:

  • High instance churn (many creates/deletes)
  • Database watcher latency causing events to arrive out of order
  • Provider operations completing very quickly
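
Though timing-dependent in production, the core interleaving can be stressed in isolation. The harness below is illustrative Go, not GARM code (raceOnce is a hypothetical helper): it races a map delete against the membership check and reports how often the "manager" side loses.

```go
package main

import (
	"fmt"
	"sync"
)

// raceOnce races a "provider worker" goroutine that deletes the manager
// from the map against an "instance manager" goroutine that checks
// membership before finalizing. It returns true when the manager loses
// the race, i.e. the case that strands an instance in pending_delete.
func raceOnce() bool {
	var runners sync.Map
	runners.Store("runner-1", struct{}{})

	lost := make(chan bool, 1)
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // provider worker reacting to the watcher event
		defer wg.Done()
		runners.Delete("runner-1")
	}()
	go func() { // instance manager finalizing the deletion
		defer wg.Done()
		_, ok := runners.Load("runner-1")
		lost <- !ok
	}()
	wg.Wait()
	return <-lost
}

func main() {
	stuck := 0
	for i := 0; i < 10000; i++ {
		if raceOnce() {
			stuck++
		}
	}
	fmt.Printf("manager lost the race in %d of 10000 iterations\n", stuck)
}
```

The exact loss rate depends on scheduler timing, which matches the observation that only a few instances per day get stuck despite heavy churn.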

Workaround

If instances get stuck in pending_delete:

  1. Restart the GARM pod to reload instance managers
  2. Fresh managers will complete the deletions
