
Race condition causes instances to get stuck in pending_delete state #699

@benoit-nexthop

Description

Summary

Instance managers can fail to complete deletion operations due to a race condition between the provider worker and instance manager goroutines. This leaves instances permanently stuck in pending_delete state, requiring manual intervention (pod restart) to recover.

Note: this bug report was drafted with Augment; however, I can vouch for it, having personally combed through the code and confirmed that this is a very likely explanation for the leaked runners in pending_delete that we see build up over time.

Affected Versions

Likely affects all versions after the sync.Map refactoring (commits 928bcae and b959c7f, March 20, 2026)

Note: All line numbers in this report reference origin/main (commit fe10bdb as of April 2026)

Impact

  • Runner instances accumulate in pending_delete state over time
  • Instances stuck despite VMs being successfully deleted in the underlying provider
  • Growth rate: 2-3 stuck instances per day (out of roughly 60 runners active at any given time)
  • Pod restart required to trigger cleanup

Root Cause

The issue stems from incomplete error handling left behind by the sync.Map refactoring. While an instance manager is processing a deletion, the provider worker can remove it from the p.runners map mid-flight, causing the manager's subsequent SetInstanceStatus() calls to fail.

Detailed Sequence

  1. Instance manager processes pending_delete (instance_manager.go:328-352)
  2. Sets status to deleting (line 329): SetInstanceStatus(..., InstanceDeleting, ...)
  3. Successfully deletes VM from provider (line 336): handleDeleteInstanceInProvider()
  4. Meanwhile: Database watcher sends an update event with status deleted
  5. Provider worker receives event (provider.go:268-320)
  6. Worker calls stopAndDeleteInstance() (line 307) which:
    • Stops the instance manager (line 262)
    • Removes it from p.runners (line 264): p.runners.Delete(instance.Name)
  7. Instance manager tries to finalize (instance_manager.go:347):
    • Calls SetInstanceStatus(..., InstanceDeleted, ...)
    • This fails with ErrNotFound because manager was removed from map
  8. Error is ignored (line 348) and manager exits
  9. Database still shows deleting, not deleted
  10. No manager exists to retry - instance is orphaned

Alternative Failure Path

If deletion fails and the manager tries to revert to pending_delete (line 340):

if err := i.helper.SetInstanceStatus(i.instance.Name, commonParams.InstancePendingDelete, []byte(err.Error()), true); err != nil {
    return fmt.Errorf("setting instance status to error: %w", err)
}

This call can also fail with ErrNotFound, and there is no handling for that case: the function simply returns the error and the manager exits, leaving the instance stuck.

Code Locations

provider_helper.go (lines 59-81)

func (p *Provider) SetInstanceStatus(instanceName string, status commonParams.InstanceStatus, providerFault []byte, force bool) error {
    if _, ok := p.runners.Load(instanceName); !ok {  // ← Fails if manager was removed
        return errors.ErrNotFound
    }
    // ...
}

provider.go (lines 252-266)

func (p *Provider) stopAndDeleteInstance(instance params.Instance) error {
    if instance.Status != commonParams.InstanceDeleted {
        return nil
    }
    val, ok := p.runners.Load(instance.Name)
    if !ok {
        return nil
    }
    existingInstance := val.(*instanceManager)
    if err := existingInstance.Stop(); err != nil {
        return fmt.Errorf("failed to stop instance manager: %w", err)
    }
    p.runners.Delete(instance.Name)  // ← Removes manager from map
    return nil
}

instance_manager.go (lines 328-352)

case commonParams.InstancePendingDelete, commonParams.InstancePendingForceDelete:
    prevStatus := i.instance.Status
    if err := i.helper.SetInstanceStatus(i.instance.Name, commonParams.InstanceDeleting, nil, true); err != nil {
        if errors.Is(err, runnerErrors.ErrNotFound) {
            return nil  // Manager was removed - silently exit
        }
        return fmt.Errorf("setting instance status to deleting: %w", err)
    }

    if err := i.handleDeleteInstanceInProvider(i.instance); err != nil {
        slog.ErrorContext(i.ctx, "deleting instance in provider", "error", err)
        if prevStatus == commonParams.InstancePendingDelete {
            i.incrementBackOff()
            // ↓ This can fail with ErrNotFound, no handling!
            if err := i.helper.SetInstanceStatus(i.instance.Name, commonParams.InstancePendingDelete, []byte(err.Error()), true); err != nil {
                return fmt.Errorf("setting instance status to error: %w", err)
            }
            return fmt.Errorf("error removing instance. Will retry: %w", err)
        }
    }
    // ↓ This can fail with ErrNotFound, error is ignored
    if err := i.helper.SetInstanceStatus(i.instance.Name, commonParams.InstanceDeleted, nil, false); err != nil {
        if !errors.Is(err, runnerErrors.ErrNotFound) {
            return fmt.Errorf("setting instance status to deleted: %w", err)
        }
    }
    return ErrInstanceDeleted

Why Pod Restart Fixes It

When GARM restarts:

  1. loadAllRunners() (provider.go:85-137) is called
  2. It loads all instances from the database except those in deleting or deleted state
  3. Creates fresh instance managers for instances in pending_delete
  4. These new managers successfully complete the deletion
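
The recovery hinges on that startup filter. A minimal sketch of the selection logic, assuming the status names from this report (shouldReload is a hypothetical helper, not the actual loadAllRunners code):

```go
package main

import "fmt"

// shouldReload reports whether startup should create a fresh instance
// manager for a database row. Rows already in "deleting" or "deleted"
// are skipped; anything parked in "pending_delete" gets a new manager,
// which is why a restart unsticks the orphaned instances.
func shouldReload(status string) bool {
	switch status {
	case "deleting", "deleted":
		return false
	default:
		return true
	}
}

func main() {
	fmt.Println(shouldReload("pending_delete")) // true: manager recreated, deletion retried
	fmt.Println(shouldReload("deleted"))        // false: row skipped at startup
}
```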

Proposed Fixes

Option 1: Don't Remove Manager Until Database Confirms Deletion

Modify stopAndDeleteInstance() to only stop the manager, not remove it:

func (p *Provider) stopAndDeleteInstance(instance params.Instance) error {
    if instance.Status != commonParams.InstanceDeleted {
        return nil
    }
    val, ok := p.runners.Load(instance.Name)
    if ok {
        existingInstance := val.(*instanceManager)
        if err := existingInstance.Stop(); err != nil {
            return fmt.Errorf("failed to stop instance manager: %w", err)
        }
        // Don't delete from map yet - let it happen naturally when manager exits
    }
    return nil
}

Option 2: Use Database Directly for Status Updates During Deletion

Modify the instance manager to bypass the helper and write directly to the database during the critical deletion path:

if err := i.store.UpdateInstance(i.ctx, i.instance.Name, params.UpdateInstanceParams{
    Status: commonParams.InstanceDeleted,
}); err != nil {
    return fmt.Errorf("setting instance status to deleted: %w", err)
}

Option 3: Implement Proper Retry Logic

Add retry logic when SetInstanceStatus() fails with ErrNotFound:

const maxRetries = 3
var lastErr error
for retry := 0; retry < maxRetries; retry++ {
    lastErr = i.helper.SetInstanceStatus(...)
    if lastErr == nil {
        break
    }
    if !errors.Is(lastErr, runnerErrors.ErrNotFound) {
        return lastErr
    }
    time.Sleep(100 * time.Millisecond)
}
if lastErr != nil {
    // Don't give up silently once retries are exhausted.
    return fmt.Errorf("setting instance status after retries: %w", lastErr)
}

Reproduction

This is a timing-dependent race condition that occurs more frequently under:

  • High instance churn (many creates/deletes)
  • Database watcher latency causing events to arrive out of order
  • Provider operations completing very quickly
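
Though timing-dependent in production, the core interleaving can be stressed in isolation. The harness below is illustrative Go, not GARM code (raceOnce is a hypothetical helper): it races a map delete against the membership check and reports how often the "manager" side loses.

```go
package main

import (
	"fmt"
	"sync"
)

// raceOnce races a "provider worker" goroutine that deletes the manager
// from the map against an "instance manager" goroutine that checks
// membership before finalizing. It returns true when the manager loses
// the race, i.e. the case that strands an instance in pending_delete.
func raceOnce() bool {
	var runners sync.Map
	runners.Store("runner-1", struct{}{})

	lost := make(chan bool, 1)
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // provider worker reacting to the watcher event
		defer wg.Done()
		runners.Delete("runner-1")
	}()
	go func() { // instance manager finalizing the deletion
		defer wg.Done()
		_, ok := runners.Load("runner-1")
		lost <- !ok
	}()
	wg.Wait()
	return <-lost
}

func main() {
	stuck := 0
	for i := 0; i < 10000; i++ {
		if raceOnce() {
			stuck++
		}
	}
	fmt.Printf("manager lost the race in %d of 10000 iterations\n", stuck)
}
```

The exact loss rate depends on scheduler timing, which matches the observation that only a few instances per day get stuck despite heavy churn.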

Workaround

If instances get stuck in pending_delete:

  1. Restart the GARM pod to reload instance managers
  2. Fresh managers will complete the deletions
