ScaleSet workers fail to start after pod restart due to stale github sessions #700

@benoit-nexthop

Summary

When GARM restarts abnormally (crash, SIGQUIT or SIGKILL, etc.), scaleset workers fail to reconnect to GitHub with the error "The actions runner scaleset already has an active session." This prevents scaleset workers from starting, which causes complete failure of the scaleset - GARM cannot receive job notifications from GitHub, cannot provision new runners, and cannot clean up deleted instances. The scaleset becomes essentially non-functional until manual intervention.

Note: this bug report was written using Augment; however, I can vouch for its accuracy, as we've run into this multiple times by now, and it's clear the code lacks a retry mechanism.

Affected Versions

Likely affects all GARM versions using GitHub ScaleSets.

Impact

Production Observations:

After GARM pod restart:

  • ScaleSet workers failed to start with session conflict errors
  • GARM completely stopped processing GitHub jobs - no new runners were provisioned
  • Job backlog accumulated while scalesets were non-functional
  • 25 runner instances stuck in deleted state because cleanup requires running scaleset workers
  • Complete service outage until manual intervention (toggle scalesets) restored functionality

Root Cause

GARM does not (or sometimes cannot) properly clean up GitHub ScaleSet sessions when the pod terminates abnormally. When GARM restarts, it attempts to create new message sessions with GitHub, but GitHub still has the old sessions active, resulting in conflicts.

Current Behavior

  1. GARM pod terminates (crash, restart, SIGQUIT or SIGKILL, etc.)
  2. GitHub retains the old message session for the scaleset
  3. GARM restarts and tries to create new scaleset workers
  4. Workers fail immediately when trying to create message sessions:
    ERROR: failed to handle scale set create operation
    error: error starting scale set worker: error starting listener: creating message session: 
    failed to execute request to https://broker.actions.githubusercontent.com/rest/_apis/runtime/runnerscalesets/29/sessions
    conflict: "The actions runner scaleset myscalesetname already has an active session."
  5. Scaleset workers never start - they're in a failed state
  6. Complete outage - no job notifications are received, no runners are provisioned, all GitHub Actions workflows depending on this scaleset queue indefinitely, and instances in deleted state linger in database

Code Analysis

The scaleset worker creates the message session during listener startup, but there is no retry logic for session conflicts: the first conflict error is propagated up and the worker gives up.

Logs Showing the Issue

{"time":"2026-04-03T13:21:11.526284017Z","level":"INFO","msg":"starting consumer","consumer_id":"scaleset-worker-myscalesetname-2"}

{"time":"2026-04-03T13:21:12.218341942Z","level":"ERROR","msg":"failed to handle scale set create operation",
 "error":"error starting scale set worker: error starting listener: creating message session: 
 failed to execute request to https://broker.actions.githubusercontent.com/rest/_apis/runtime/runnerscalesets/29/sessions?api-version=6.0-preview: 
 conflict while calling https://broker.actions.githubusercontent.com/rest/_apis/runtime/runnerscalesets/29/sessions?api-version=6.0-preview: 
 \"{\\\"message\\\":\\\"The actions runner scaleset myscalesetname already has an active session.\\\"}\""}

The worker starts a consumer but immediately fails and exits - it doesn't retry.

Consequences

Critical Service Failure

  • ScaleSet workers don't run after pod restart
  • Complete loss of GitHub job processing - GARM cannot receive job notifications via message sessions
  • No new runners provisioned - workflows queue indefinitely in GitHub
  • No instance cleanup - deleted instances accumulate in database
  • Effectively a complete outage for that scaleset until manual intervention

No Automatic Recovery

GARM does not retry or recover from session conflicts automatically. Once a scaleset worker fails to start due to a session conflict, it remains in a failed state indefinitely. Manual intervention is required to restore service - either:

  1. Restart the entire GARM pod (after allowing stale sessions to timeout)
  2. Toggle the scaleset disabled/enabled via API (forces fresh session creation)

Additionally, the scaleset worker is responsible for cleaning up instances in deleted state from the database (scaleset.go:636-646):

func (w *Worker) handleInstanceCleanup(instance params.Instance) error {
    if instance.Status == commonParams.InstanceDeleted {
        if err := w.store.DeleteInstanceByName(w.ctx, instance.Name); err != nil {
            if !errors.Is(err, runnerErrors.ErrNotFound) {
                return fmt.Errorf("deleting instance %s: %w", instance.ID, err)
            }
        }
        delete(w.runners, instance.ID)
    }
    return nil
}

Without running scaleset workers, deleted instances accumulate in the database (though this is cleared on the next successful restart).

Proposed Fix

Implement retry logic with exponential backoff when encountering session conflicts:

func (w *Worker) startListener() error {
    const maxRetries = 10
    baseDelay := 5 * time.Second
    
    for attempt := 0; attempt < maxRetries; attempt++ {
        session, err := w.createMessageSession()
        if err != nil {
            if isSessionConflictError(err) {
                delay := baseDelay * time.Duration(1<<attempt) // Exponential backoff
                if delay > 5*time.Minute {
                    delay = 5 * time.Minute // Cap at 5 minutes
                }
                
                slog.WarnContext(w.ctx, "session conflict detected, retrying",
                    "attempt", attempt+1,
                    "max_retries", maxRetries,
                    "retry_delay", delay,
                    "error", err)
                
                select {
                case <-time.After(delay):
                    continue
                case <-w.ctx.Done():
                    return w.ctx.Err()
                }
            }
            return err // Other errors fail immediately
        }
        
        // Success
        w.session = session
        return nil
    }
    
    return fmt.Errorf("failed to create message session after %d attempts", maxRetries)
}

func isSessionConflictError(err error) bool {
    return strings.Contains(err.Error(), "already has an active session")
}

Why This Works

GitHub's stale sessions eventually time out (typically within a few minutes). By retrying with exponential backoff:

  1. We give GitHub time to clean up the stale session
  2. We avoid hammering the GitHub API
  3. We eventually succeed when the old session expires
  4. The scaleset worker can start normally and resume operations

Alternative: Proactive Session Cleanup

Implement graceful shutdown that explicitly closes GitHub sessions:

func (w *Worker) Stop() error {
    // Close GitHub session before stopping
    if w.session != nil {
        if err := w.session.Close(); err != nil {
            slog.WarnContext(w.ctx, "failed to close session", "error", err)
        }
    }
    
    // Continue with normal shutdown
    // ...
}

However, this doesn't help with crashes or SIGKILL, so the retry logic is still necessary.

Reproduction

  1. Start GARM with active scalesets
  2. Send SIGKILL or force restart the pod
  3. Check GARM logs - observe scaleset workers fail to start with session conflict errors:
    ERROR: failed to handle scale set create operation
    error: error starting scale set worker: creating message session:
    "The actions runner scaleset already has an active session."
    
  4. Verify scaleset worker goroutines are not running:
    curl "https://$GARM_URL/debug/pprof/goroutine?debug=2" | grep "scaleset-worker"
    # Should return nothing - workers failed to start
  5. Observe on GitHub Actions that:
    • New workflow jobs queue indefinitely
    • No runners are being provisioned
    • Jobs remain in "Queued" state waiting for runners that never appear

Current Workaround

Manual intervention required - toggle scalesets via API to force fresh session creation:

# Disable scalesets
curl -X PUT "https://$GARM_URL/api/v1/scalesets/$ID" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"enabled": false}'

# Wait a moment for cleanup
sleep 3

# Re-enable scalesets
curl -X PUT "https://$GARM_URL/api/v1/scalesets/$ID" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"enabled": true}'

This forces GARM to:

  1. Stop the old (failed) workers
  2. Wait for GitHub to clean up stale sessions
  3. Create fresh workers with new sessions
  4. Resume normal operations including instance cleanup
