
fix(health): unauthenticated liveness probe for authorization_code gateways + POST ping for StreamableHTTP #4496

Open
ecthelion77 wants to merge 1 commit into IBM:main from forterro:fix/streamablehttp-health-check-liveness

Conversation

@ecthelion77
Contributor

Fixes #4495

Summary

The gateway health check marks all authorization_code OAuth gateways as unreachable after pod restarts because the system account (PLATFORM_ADMIN_EMAIL) has no stored OAuth tokens. This PR introduces an unauthenticated liveness probe: when no token is available, the health check proceeds without auth. An HTTP 401/403 response proves the server is alive, while timeouts, DNS failures, and connection errors indicate a real outage.

Additionally, this PR replaces the full MCP SDK ClientSession.initialize() call used for StreamableHTTP health checks with a lightweight JSON-RPC POST initialize request. After initialize, the SDK opens a GET SSE stream, and servers that don't support server-initiated messages answer that GET with 405 (the MCP spec says GET is optional).

Changes

mcpgateway/services/gateway_service.py: _check_single_gateway_health()

1. Unauthenticated liveness probe (unauthenticated_probe flag)

For authorization_code gateways, the health check now:

  • Tries to use a stored token if available (existing behavior)
  • If no token exists (typical for system accounts): proceeds without auth instead of calling _handle_gateway_failure()
  • Sets unauthenticated_probe = True so that HTTP 401/403 responses are treated as "server alive" for both SSE and StreamableHTTP transports
  • Timeouts, DNS failures, and connection errors still trigger _handle_gateway_failure as before

Previously, the code returned early via _handle_gateway_failure() when user_email was missing or when no stored token was found, so authorization_code gateways could never pass health checks under a system account.
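
The decision rule described in this section can be sketched as a small pure function. This is an illustration only, not the code in gateway_service.py: the names ProbeResult and classify_probe are hypothetical, and treating every non-2xx status other than 401/403 as a failure is an assumption rather than something this PR states.

```python
from enum import Enum
from typing import Optional


class ProbeResult(Enum):
    """Hypothetical outcome type for this sketch; not part of the PR."""

    ALIVE = "alive"
    FAILED = "failed"


def classify_probe(status_code: Optional[int], unauthenticated_probe: bool) -> ProbeResult:
    """Classify one health-check attempt.

    status_code is None when no HTTP response arrived at all
    (timeout, DNS failure, connection error).
    """
    if status_code is None:
        # No response: a real outage, regardless of auth mode.
        return ProbeResult.FAILED
    if 200 <= status_code < 300:
        return ProbeResult.ALIVE
    if unauthenticated_probe and status_code in (401, 403):
        # The server answered and rejected the missing credentials,
        # which is enough to show that it is up.
        return ProbeResult.ALIVE
    # Assumption: any other status counts as a failure in this sketch.
    return ProbeResult.FAILED
```

With a stored token (unauthenticated_probe is False), a 401/403 still counts as a failure in this sketch, since it would point at a rejected credential rather than a deliberately absent one.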

2. Lightweight StreamableHTTP health check

Replaced the full SDK client session:

# Before (causes 405 on servers without server-initiated messages):
async with streamablehttp_client(url=...) as (read, write, _):
    async with ClientSession(read, write) as session:
        response = await session.initialize()

# After (lightweight POST only):
response = await client.post(url, json=init_payload, headers=..., timeout=...)

The MCP SDK ClientSession.initialize() opens a GET SSE stream after the initialize handshake. Servers that don't support server-initiated messages (M365 MCP, Kubernetes MCP, GitHub MCP) return 405 on GET, causing a false health failure. A successful POST initialize is sufficient proof of health per the MCP spec.
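
For readers unfamiliar with the wire format, a POST-only initialize probe might look roughly like the sketch below. It uses the standard library's urllib instead of the HTTP client used in the PR, and the helper names, the protocolVersion value, and the clientInfo fields are assumptions chosen for illustration; the actual init_payload in the PR may differ.

```python
import json
import urllib.error
import urllib.request
from typing import Optional

# Streamable HTTP clients advertise both JSON and SSE response types on POST.
HEADERS = {
    "Content-Type": "application/json",
    "Accept": "application/json, text/event-stream",
}


def build_initialize_payload(request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 'initialize' request body (hypothetical helper)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            "protocolVersion": "2025-03-26",  # assumption: pick what the server supports
            "capabilities": {},
            "clientInfo": {"name": "health-check-sketch", "version": "0.0.0"},
        },
    }


def post_initialize(url: str, timeout: float = 5.0) -> Optional[int]:
    """POST the initialize request and return the HTTP status code.

    Returns None when no HTTP response was received (timeout, DNS failure,
    connection error). No GET SSE stream is opened afterwards.
    """
    body = json.dumps(build_initialize_payload()).encode("utf-8")
    request = urllib.request.Request(url, data=body, headers=HEADERS, method="POST")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses still prove that something answered.
        return exc.code
    except OSError:
        # URLError and timeouts are OSError subclasses: nothing answered.
        return None
```

Because the probe stops after the POST, a server that rejects GET with 405 never sees the request that caused the false failure.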

3. Observability

Added health.probe_type: unauthenticated_liveness span attribute when probing without auth, for tracing visibility.

Testing

Validated in production with 13 gateways (8 authorization_code + 5 non-OAuth). After deployment:

  • All 13 gateways report reachable=true
  • Previously, all 8 authorization_code gateways were stuck at reachable=false after every pod restart
  • StreamableHTTP gateways (M365, Kubernetes, GitHub) no longer fail with 405

Signed-off-by: Olivier Gintrand [email protected]

Replace the full MCP SDK client session.initialize() with a lightweight
JSON-RPC POST for StreamableHTTP health checks. The SDK opens a GET SSE
stream after initialize which returns 405 on servers that don't support
server-initiated messages (M365, Kubernetes MCP, GitHub MCP).

For authorization_code OAuth gateways, the health check no longer marks
the gateway as failed when the system account has no stored token.
Instead, it proceeds without auth — a 401/403 proves the server is
alive, while timeouts/DNS/connection errors indicate a real outage.

Signed-off-by: Olivier Gintrand <[email protected]>

Development

Successfully merging this pull request may close these issues.

[BUG]: Health check marks authorization_code OAuth gateways unreachable when PLATFORM_ADMIN_EMAIL has no stored token