ScaleSet workers fail to start after pod restart due to stale github sessions #700

@benoit-nexthop

Summary

When GARM restarts abnormally (crash, SIGQUIT or SIGKILL, etc.), scaleset workers fail to reconnect to GitHub with the error "The actions runner scaleset already has an active session." This prevents scaleset workers from starting, which causes complete failure of the scaleset - GARM cannot receive job notifications from GitHub, cannot provision new runners, and cannot clean up deleted instances. The scaleset becomes essentially non-functional until manual intervention.

Note: this bug report was written using Augment; however, I can vouch for its accuracy, as we've run into this multiple times by now, and it's clear the code lacks a retry mechanism.

Affected Versions

Likely affects all GARM versions using GitHub ScaleSets.

Impact

Production Observations:

After GARM pod restart:

  • ScaleSet workers failed to start with session conflict errors
  • GARM completely stopped processing GitHub jobs - no new runners were provisioned
  • Job backlog accumulated while scalesets were non-functional
  • 25 runner instances stuck in deleted state because cleanup requires running scaleset workers
  • Complete service outage until manual intervention (toggle scalesets) restored functionality

Root Cause

GARM does not (or sometimes cannot) properly clean up GitHub ScaleSet sessions when the pod terminates abnormally. When GARM restarts, it attempts to create new message sessions with GitHub, but GitHub still has the old sessions active, resulting in conflicts.

Current Behavior

  1. GARM pod terminates (crash, restart, SIGQUIT or SIGKILL, etc.)
  2. GitHub retains the old message session for the scaleset
  3. GARM restarts and tries to create new scaleset workers
  4. Workers fail immediately when trying to create message sessions:
    ERROR: failed to handle scale set create operation
    error: error starting scale set worker: error starting listener: creating message session: 
    failed to execute request to https://broker.actions.githubusercontent.com/rest/_apis/runtime/runnerscalesets/29/sessions
    conflict: "The actions runner scaleset myscalesetname already has an active session."
  5. Scaleset workers never start - they're in a failed state
  6. Complete outage - no job notifications are received, no runners are provisioned, all GitHub Actions workflows depending on this scaleset queue indefinitely, and instances in deleted state linger in database

Code Analysis

The scaleset worker creates the message session during listener startup, but there is no retry logic for session conflicts: the first conflict error is propagated up and the worker gives up.

Logs Showing the Issue

{"time":"2026-04-03T13:21:11.526284017Z","level":"INFO","msg":"starting consumer","consumer_id":"scaleset-worker-myscalesetname-2"}

{"time":"2026-04-03T13:21:12.218341942Z","level":"ERROR","msg":"failed to handle scale set create operation",
 "error":"error starting scale set worker: error starting listener: creating message session: 
 failed to execute request to https://broker.actions.githubusercontent.com/rest/_apis/runtime/runnerscalesets/29/sessions?api-version=6.0-preview: 
 conflict while calling https://broker.actions.githubusercontent.com/rest/_apis/runtime/runnerscalesets/29/sessions?api-version=6.0-preview: 
 \"{\\\"message\\\":\\\"The actions runner scaleset myscalesetname already has an active session.\\\"}\""}

The worker starts a consumer but immediately fails and exits - it doesn't retry.

Consequences

Critical Service Failure

  • ScaleSet workers don't run after pod restart
  • Complete loss of GitHub job processing - GARM cannot receive job notifications via message sessions
  • No new runners provisioned - workflows queue indefinitely in GitHub
  • No instance cleanup - deleted instances accumulate in database
  • Effectively a complete outage for that scaleset until manual intervention

No Automatic Recovery

GARM does not retry or recover from session conflicts automatically. Once a scaleset worker fails to start due to a session conflict, it remains in a failed state indefinitely. Manual intervention is required to restore service - either:

  1. Restart the entire GARM pod (after allowing stale sessions to timeout)
  2. Toggle the scaleset disabled/enabled via API (forces fresh session creation)

Additionally, the scaleset worker is responsible for cleaning up instances in deleted state from the database (scaleset.go:636-646):

func (w *Worker) handleInstanceCleanup(instance params.Instance) error {
    if instance.Status == commonParams.InstanceDeleted {
        if err := w.store.DeleteInstanceByName(w.ctx, instance.Name); err != nil {
            if !errors.Is(err, runnerErrors.ErrNotFound) {
                return fmt.Errorf("deleting instance %s: %w", instance.ID, err)
            }
        }
        delete(w.runners, instance.ID)
    }
    return nil
}

Without running scaleset workers, deleted instances accumulate in the database (though this is cleared on the next successful restart).

Proposed Fix

Implement retry logic with exponential backoff when encountering session conflicts:

func (w *Worker) startListener() error {
    const maxRetries = 10
    baseDelay := 5 * time.Second
    
    for attempt := 0; attempt < maxRetries; attempt++ {
        session, err := w.createMessageSession()
        if err != nil {
            if isSessionConflictError(err) {
                delay := baseDelay * time.Duration(1<<attempt) // Exponential backoff
                if delay > 5*time.Minute {
                    delay = 5 * time.Minute // Cap at 5 minutes
                }
                
                slog.WarnContext(w.ctx, "session conflict detected, retrying",
                    "attempt", attempt+1,
                    "max_retries", maxRetries,
                    "retry_delay", delay,
                    "error", err)
                
                select {
                case <-time.After(delay):
                    continue
                case <-w.ctx.Done():
                    return w.ctx.Err()
                }
            }
            return err // Other errors fail immediately
        }
        
        // Success
        w.session = session
        return nil
    }
    
    return fmt.Errorf("failed to create message session after %d attempts", maxRetries)
}

func isSessionConflictError(err error) bool {
    return strings.Contains(err.Error(), "already has an active session")
}

Why This Works

GitHub's stale sessions eventually time out (typically within a few minutes). By retrying with exponential backoff:

  1. We give GitHub time to clean up the stale session
  2. We avoid hammering the GitHub API
  3. We eventually succeed when the old session expires
  4. The scaleset worker can start normally and resume operations

Alternative: Proactive Session Cleanup

Implement graceful shutdown that explicitly closes GitHub sessions:

func (w *Worker) Stop() error {
    // Close GitHub session before stopping
    if w.session != nil {
        if err := w.session.Close(); err != nil {
            slog.WarnContext(w.ctx, "failed to close session", "error", err)
        }
    }
    
    // Continue with normal shutdown
    // ...
}

However, this doesn't help with crashes or SIGKILL, so the retry logic is still necessary.

Reproduction

  1. Start GARM with active scalesets
  2. Send SIGKILL or force restart the pod
  3. Check GARM logs - observe scaleset workers fail to start with session conflict errors:
    ERROR: failed to handle scale set create operation
    error: error starting scale set worker: creating message session:
    "The actions runner scaleset already has an active session."
    
  4. Verify scaleset worker goroutines are not running:
    curl "https://$GARM_URL/debug/pprof/goroutine?debug=2" | grep "scaleset-worker"
    # Should return nothing - workers failed to start
  5. Observe on GitHub Actions that:
    • New workflow jobs queue indefinitely
    • No runners are being provisioned
    • Jobs remain in "Queued" state waiting for runners that never appear

Current Workaround

Manual intervention required - toggle scalesets via API to force fresh session creation:

# Disable scalesets
curl -X PUT "https://$GARM_URL/api/v1/scalesets/$ID" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"enabled": false}'

# Wait a moment for cleanup
sleep 3

# Re-enable scalesets
curl -X PUT "https://$GARM_URL/api/v1/scalesets/$ID" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"enabled": true}'

This forces GARM to:

  1. Stop the old (failed) workers
  2. Wait for GitHub to clean up stale sessions
  3. Create fresh workers with new sessions
  4. Resume normal operations including instance cleanup
