# Race condition causes instances to get stuck in `pending_delete` state #699
## Summary
Instance managers can fail to complete deletion operations due to a race condition between the provider worker and instance manager goroutines. This leaves instances permanently stuck in `pending_delete` state, requiring manual intervention (a pod restart) to recover.
Note: this bug report was drafted with Augment; however, I can vouch for it, having personally combed through the code and confirmed that this is a very likely explanation for the leaked runners in `pending_delete` that we see build up over time.
## Affected Versions

- Likely affects all versions after the sync.Map refactoring (commits 928bcae and b959c7f, March 20, 2026)
- Note: all line numbers in this report reference origin/main (commit fe10bdb as of April 2026)
## Impact

- Runner instances accumulate in `pending_delete` state over time
- Instances remain stuck despite their VMs having been successfully deleted in the underlying provider
- Growth rate: 2-3 stuck instances per day (out of ~60 runners typically active at any given time)
- A pod restart is required to trigger cleanup
## Root Cause

The issue stems from incomplete error handling after the sync.Map refactoring. When an instance manager is processing a deletion, it can be removed from the `p.runners` map by the provider worker while still executing, causing subsequent `SetInstanceStatus()` calls to fail.
## Detailed Sequence

1. The instance manager processes `pending_delete` (instance_manager.go:328-352)
2. It sets the status to `deleting` (line 329): `SetInstanceStatus(..., InstanceDeleting, ...)`
3. It successfully deletes the VM from the provider (line 336): `handleDeleteInstanceInProvider()`
4. Meanwhile, the database watcher sends an update event with status `deleted`
5. The provider worker receives the event (provider.go:268-320)
6. The worker calls `stopAndDeleteInstance()` (line 307), which:
   - Stops the instance manager (line 262)
   - Removes it from `p.runners` (line 264): `p.runners.Delete(instance.Name)`
7. The instance manager tries to finalize (instance_manager.go:347):
   - It calls `SetInstanceStatus(..., InstanceDeleted, ...)`
   - This fails with `ErrNotFound` because the manager was removed from the map
8. The error is ignored (line 348) and the manager exits
9. The database still shows `deleting`, not `deleted`
10. No manager exists to retry, so the instance is orphaned
## Alternative Failure Path

If deletion fails and the manager tries to revert to `pending_delete` (line 340):

```go
if err := i.helper.SetInstanceStatus(i.instance.Name, commonParams.InstancePendingDelete, []byte(err.Error()), true); err != nil {
	return fmt.Errorf("setting instance status to error: %w", err)
}
```

This call can also fail with `ErrNotFound`, and there is no handling for that case: the function simply returns an error and the manager exits, leaving the instance stuck.
## Code Locations

provider_helper.go (lines 59-81):

```go
func (p *Provider) SetInstanceStatus(instanceName string, status commonParams.InstanceStatus, providerFault []byte, force bool) error {
	if _, ok := p.runners.Load(instanceName); !ok { // ← Fails if manager was removed
		return errors.ErrNotFound
	}
	// ...
}
```

provider.go (lines 252-266):

```go
func (p *Provider) stopAndDeleteInstance(instance params.Instance) error {
	if instance.Status != commonParams.InstanceDeleted {
		return nil
	}
	val, ok := p.runners.Load(instance.Name)
	if !ok {
		return nil
	}
	existingInstance := val.(*instanceManager)
	if err := existingInstance.Stop(); err != nil {
		return fmt.Errorf("failed to stop instance manager: %w", err)
	}
	p.runners.Delete(instance.Name) // ← Removes manager from map
	return nil
}
```

instance_manager.go (lines 328-352):
```go
case commonParams.InstancePendingDelete, commonParams.InstancePendingForceDelete:
	prevStatus := i.instance.Status
	if err := i.helper.SetInstanceStatus(i.instance.Name, commonParams.InstanceDeleting, nil, true); err != nil {
		if errors.Is(err, runnerErrors.ErrNotFound) {
			return nil // Manager was removed - silently exit
		}
		return fmt.Errorf("setting instance status to deleting: %w", err)
	}
	if err := i.handleDeleteInstanceInProvider(i.instance); err != nil {
		slog.ErrorContext(i.ctx, "deleting instance in provider", "error", err)
		if prevStatus == commonParams.InstancePendingDelete {
			i.incrementBackOff()
			// ↓ This can fail with ErrNotFound, no handling!
			if err := i.helper.SetInstanceStatus(i.instance.Name, commonParams.InstancePendingDelete, []byte(err.Error()), true); err != nil {
				return fmt.Errorf("setting instance status to error: %w", err)
			}
			return fmt.Errorf("error removing instance. Will retry: %w", err)
		}
	}
	// ↓ This can fail with ErrNotFound, error is ignored
	if err := i.helper.SetInstanceStatus(i.instance.Name, commonParams.InstanceDeleted, nil, false); err != nil {
		if !errors.Is(err, runnerErrors.ErrNotFound) {
			return fmt.Errorf("setting instance status to deleted: %w", err)
		}
	}
	return ErrInstanceDeleted
```

## Why Pod Restart Fixes It
When GARM restarts:
1. `loadAllRunners()` (provider.go:85-137) is called
2. It loads all instances from the database except those in `deleting` or `deleted` state
3. Fresh instance managers are created for instances in `pending_delete`
4. These new managers successfully complete the deletion
## Proposed Fixes

### Option 1: Don't Remove the Manager Until the Database Confirms Deletion

Modify `stopAndDeleteInstance()` to only stop the manager, not remove it:
```go
func (p *Provider) stopAndDeleteInstance(instance params.Instance) error {
	if instance.Status != commonParams.InstanceDeleted {
		return nil
	}
	val, ok := p.runners.Load(instance.Name)
	if ok {
		existingInstance := val.(*instanceManager)
		if err := existingInstance.Stop(); err != nil {
			return fmt.Errorf("failed to stop instance manager: %w", err)
		}
		// Don't delete from map yet - let it happen naturally when manager exits
	}
	return nil
}
```

### Option 2: Use the Database Directly for Status Updates During Deletion
Modify the instance manager to bypass the helper and write directly to the database during the critical deletion path:
```go
if err := i.store.UpdateInstance(i.ctx, i.instance.Name, params.UpdateInstanceParams{
	Status: commonParams.InstanceDeleted,
}); err != nil {
	return fmt.Errorf("setting instance status to deleted: %w", err)
}
```

### Option 3: Implement Proper Retry Logic
Add retry logic for when `SetInstanceStatus()` fails with `ErrNotFound`:
```go
const maxRetries = 3
for retry := 0; retry < maxRetries; retry++ {
	if err := i.helper.SetInstanceStatus(...); err != nil {
		if errors.Is(err, runnerErrors.ErrNotFound) {
			time.Sleep(100 * time.Millisecond)
			continue
		}
		return err
	}
	break
}
```

## Reproduction
This is a timing-dependent race condition that occurs more frequently under:
- High instance churn (many creates/deletes)
- Database watcher latency causing events to arrive out of order
- Provider operations completing very quickly
## Workaround

If instances get stuck in `pending_delete`:

1. Restart the GARM pod to reload the instance managers
2. The fresh managers will complete the deletions