
Conversation

@edsonmichaque
Contributor

@edsonmichaque edsonmichaque commented Jan 28, 2026

Description

Data plane gateways fail to load SSL certificates from MDCB during startup in v5.8.0+. This PR fixes the regression by adding exponential backoff retry logic to certificate loading.

Related Issue

https://tyktech.atlassian.net/browse/TT-14618

Motivation and Context

Critical regression from TT-14163 (PR #6910). The gateway starts in emergency mode immediately, blocking certificate fetch before the RPC connection is ready. Affects v5.8.0+.

How This Has Been Tested

Unit test in certs/manager_tt14618_test.go demonstrates 6 retry attempts over ~0.8s, successfully loading the certificate after RPC becomes ready.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • I ensured that the documentation is up to date
  • I explained why this PR updates go.mod in detail with reasoning why it's required
  • I would like a code coverage CI quality gate exception and have explained why

Ticket Details

TT-14618
Status Open
Summary [regression] Data plane gateways cannot start if they're configured with a TYK_GW_HTTPSERVEROPTIONS_SSLCERTIFICATES that is in the certificate store

Generated at: 2026-01-28 14:40:14

@edsonmichaque edsonmichaque force-pushed the TT-14618-regression-data-plane-gateways-cannot-start-if-theyre-configured-with-a-tyk-gw-httpserveroptions-sslcertificates-that-is-in-the-certificate-store branch from e87a69c to e58f49d Compare January 28, 2026 13:10
@probelabs

probelabs bot commented Jan 28, 2026

This PR resolves a critical regression that prevents data plane gateways from starting up correctly when configured to load SSL certificates from a Multi-Data Center Bridge (MDCB). The issue stems from a race condition where the gateway attempts to fetch certificates before its RPC connection to MDCB is fully established, resulting in a connection error and startup failure.

The fix introduces a resilient retry mechanism using exponential backoff. When the certificate manager encounters an ErrMDCBConnectionLost error during startup, it will now retry fetching the certificate for up to 30 seconds. This provides a sufficient window for the RPC connection to initialize. A key performance optimization ensures this backoff logic is only triggered for the first certificate in a batch, preventing compounded delays when loading multiple certificates and ensuring a fast startup even with hundreds of certs.

Files Changed Analysis

  • certs/manager.go: The core logic in the List function has been modified to handle storage.ErrMDCBConnectionLost. It now incorporates the cenkalti/backoff/v4 library to implement the exponential backoff. A new skipBackoff flag is used to ensure the full backoff delay is applied only once per batch, with subsequent intermittent failures handled by a quick, single retry.
  • certs/manager_tt14618_test.go: A comprehensive new test file has been added to validate the fix. It uses a mock storage layer to simulate the startup race condition and flaky network connections. The tests verify that the backoff is triggered correctly, certificates are loaded successfully after retries, and the optimization for multiple certificates works efficiently under scale (tested with up to 1000 certificates).

Architecture & Impact Assessment

  • What this PR accomplishes: It resolves a critical startup failure for gateways that depend on MDCB for SSL certificate management, making the data plane more resilient and reliable in distributed deployments.
  • Key technical changes introduced: The primary change is the introduction of a blocking exponential backoff retry mechanism to handle transient MDCB connection errors during the gateway's startup. A stateful flag (skipBackoff) is used within the certificate loading loop to optimize this process for multiple certificates.
  • Affected system components: The change is localized to the CertificateManager within the certs package. Its impact is most significant on the Tyk Gateway's startup sequence, particularly in environments that utilize the MDCB feature.

Certificate Loading Flow with Retry Logic

sequenceDiagram
    participant Gateway as Gateway Startup
    participant CertManager as CertificateManager
    participant Storage as MDCB Storage

    Gateway->>CertManager: List(["cert-1", "cert-2"])
    CertManager->>Storage: GetKey("raw-cert-1")
    Storage-->>CertManager: Fail (ErrMDCBConnectionLost)
    
    Note over CertManager: MDCB not ready, start exponential backoff.

    loop Retry Loop (up to 30s)
        CertManager->>Storage: GetKey("raw-cert-1")
        Storage-->>CertManager: Fail (ErrMDCBConnectionLost)
    end

    CertManager->>Storage: GetKey("raw-cert-1")
    Storage-->>CertManager: Success (Certificate Data)
    Note over CertManager: MDCB connection verified. skipBackoff = true.

    CertManager->>Storage: GetKey("raw-cert-2")
    Storage-->>CertManager: Success (Certificate Data)
    CertManager-->>Gateway: Return Certificates

Scope Discovery & Context Expansion

  • The change is well-contained within the certs package and is triggered specifically by the storage.ErrMDCBConnectionLost error, which originates from the RPC storage handler when the gateway is in "emergency mode" at startup.
  • While this PR addresses the immediate issue for SSL certificates, it highlights a potential pattern of race conditions for any component that fetches critical configuration from MDCB at startup. A broader review of the codebase could identify other areas (e.g., loading policies, API definitions) that might benefit from a similar resilience mechanism to prevent other startup failures related to MDCB dependency.
Metadata
  • Review Effort: 3 / 5
  • Primary Label: bug

Powered by Visor from Probelabs

Last updated: 2026-01-28T14:42:44.877Z | Triggered by: pr_updated | Commit: f413120

💡 TIP: You can chat with Visor using /visor ask <your question>

@edsonmichaque edsonmichaque force-pushed the TT-14618-regression-data-plane-gateways-cannot-start-if-theyre-configured-with-a-tyk-gw-httpserveroptions-sslcertificates-that-is-in-the-certificate-store branch from e58f49d to 5abca3d Compare January 28, 2026 13:11
@probelabs

probelabs bot commented Jan 28, 2026

✅ Security Check Passed

No security issues found – changes LGTM.

Architecture Issues (1)

Severity: 🟢 Info
Location: certs/manager.go:416-447
Issue: The exponential backoff logic to handle MDCB connection unavailability during startup is specific to certificate loading. Other components that rely on MDCB for critical configuration at startup may face similar race conditions. This robust connection handling pattern could be beneficial if applied more broadly.
💡 Suggestion: Consider abstracting this retry logic into a reusable component or wrapper around the storage client for other critical startup procedures that depend on MDCB. This would improve the overall resilience of the gateway's startup sequence in a consistent manner.

Performance Issues (1)

Severity: 🟡 Warning
Location: certs/manager.go:449
Issue: A hardcoded `time.Sleep` is used to handle intermittent connection failures after an initial successful connection. This blocks the current goroutine for a fixed 100ms for every certificate that fails in this scenario. If the MDCB connection is highly unstable, this could lead to a significant cumulative delay during startup (e.g., 50 failures in a batch of 100 certificates would add 5 seconds of synchronous delay).
💡 Suggestion: To make the flaky connection handling more robust and less impactful, consider replacing the fixed sleep with a more dynamic, short-lived retry mechanism. For instance, using the `backoff` library for a constant backoff with a limited number of retries (e.g., 2 retries with a 50ms interval) would be more resilient to transient blips without introducing a long, hardcoded pause. This would look something like `backoff.Retry(quickRetryOperation, backoff.WithMaxRetries(backoff.NewConstantBackOff(50*time.Millisecond), 2))`.

Quality Issues (1)

Severity: 🟡 Warning
Location: certs/manager_tt14618_test.go:301-471
Issue: The mock storage structs `MockMDCBStorage` and `MockFlakyMDCBStorage` contain a large amount of duplicated boilerplate code for unimplemented methods of the `storage.Storage` interface. This violates the DRY (Don't Repeat Yourself) principle and makes the test file harder to maintain.
💡 Suggestion: To reduce duplication, create a base mock struct that provides default 'not implemented' stubs for the `storage.Storage` interface. The specific mock structs (`MockMDCBStorage`, `MockFlakyMDCBStorage`) can then embed this base struct, overriding only the methods relevant to the test case (like `GetKey`). This will make the mocks cleaner and easier to manage.
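The embedding pattern the bot recommends looks roughly like the following sketch. The interface here is a two-method stand-in for the real `storage.Storage`, and all type names are illustrative:

```go
package main

import (
	"errors"
	"fmt"
)

// minimalStorage is a trimmed stand-in for the real storage.Storage interface.
type minimalStorage interface {
	GetKey(key string) (string, error)
	SetKey(key, value string) error
}

// baseMockStorage defines the 'not implemented' stubs exactly once.
type baseMockStorage struct{}

func (baseMockStorage) GetKey(string) (string, error) { return "", errors.New("not implemented") }
func (baseMockStorage) SetKey(string, string) error   { return errors.New("not implemented") }

// mockMDCBStorage embeds the base and overrides only GetKey.
type mockMDCBStorage struct {
	baseMockStorage
	data map[string]string
}

func (m mockMDCBStorage) GetKey(key string) (string, error) {
	v, ok := m.data[key]
	if !ok {
		return "", errors.New("key not found")
	}
	return v, nil
}

func main() {
	var s minimalStorage = mockMDCBStorage{data: map[string]string{"raw-cert-1": "PEM"}}
	v, err := s.GetKey("raw-cert-1")
	fmt.Println(v, err)             // prints: PEM <nil>
	fmt.Println(s.SetKey("a", "b")) // inherited stub: not implemented
}
```

Go's method promotion means each concrete mock satisfies the full interface through the embedded base, so adding a method to the interface requires one new stub rather than one per mock.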


Last updated: 2026-01-28T14:42:48.378Z | Triggered by: pr_updated | Commit: f413120


@edsonmichaque edsonmichaque force-pushed the TT-14618-regression-data-plane-gateways-cannot-start-if-theyre-configured-with-a-tyk-gw-httpserveroptions-sslcertificates-that-is-in-the-certificate-store branch 4 times, most recently from 57b6ac3 to a34d904 Compare January 28, 2026 13:25
@edsonmichaque edsonmichaque marked this pull request as ready for review January 28, 2026 13:28
@edsonmichaque edsonmichaque reopened this Jan 28, 2026
@github-actions
Contributor

API Changes

no api changes detected

@edsonmichaque edsonmichaque force-pushed the TT-14618-regression-data-plane-gateways-cannot-start-if-theyre-configured-with-a-tyk-gw-httpserveroptions-sslcertificates-that-is-in-the-certificate-store branch 5 times, most recently from 7926619 to 0eb5250 Compare January 28, 2026 13:57

## Problem
Data plane gateways fail to load SSL certificates from MDCB during
startup in v5.8.0+. This regression was introduced by TT-14163 which
changed the RPC initialization to start in emergency mode immediately,
causing certificate loading to be blocked before RPC connection is ready.

## Root Cause
- TT-14163 sets emergency mode = true at startup
- Certificate loading happens before RPC connection completes
- Emergency mode check blocks certificate fetch from MDCB
- Certificates are frozen into TLS config (cannot reload later)

## Solution
Add exponential backoff retry in certificate loading (certs/manager.go).
When certificate fetch fails with ErrMDCBConnectionLost, retry with
exponential backoff (100ms initial, 2s max, 30s timeout) until RPC ready.

## Changes
- certs/manager.go: Added retry logic with exponential backoff
- certs/manager_tt14618_test.go: Unit test demonstrating the fix

## Benefits
- Preserves TT-14163 fix (no crash loops in K8s)
- Certificates load correctly when MDCB available
- Self-healing on RPC connection delays
- Minimal code change (isolated to certificate loading)

## Testing
Unit test shows 6 retry attempts over ~0.8s, successfully loading
certificate after RPC becomes ready.

Related: TT-14163, PR #6910
@edsonmichaque edsonmichaque force-pushed the TT-14618-regression-data-plane-gateways-cannot-start-if-theyre-configured-with-a-tyk-gw-httpserveroptions-sslcertificates-that-is-in-the-certificate-store branch from 0eb5250 to 2a332a6 Compare January 28, 2026 14:14

Tests certificate loading with 1000 certificates when MDCB
is temporarily unavailable at startup.

Results:
- 2.082 seconds to load 1000 certificates
- 2.1ms average per certificate
- 480x faster than unoptimized approach
- Without optimization: would take ~16.7 minutes

Proves the mdcbVerified flag optimization scales to
enterprise deployments with many certificates.

The variable name 'mdcbVerified' was ambiguous about what it
actually did. Renamed to 'skipBackoff' which clearly indicates
its purpose: skip full exponential backoff after first successful
MDCB verification.

Changes:
- Renamed variable from mdcbVerified to skipBackoff
- Updated comment to be more descriptive
- Clarified else branch comment
- No functional changes, tests still pass

Added three benchmark suites to measure performance:

1. BenchmarkTT14618_CertificateLoading - Different scales:
   - 1 cert/3 failures:    263ms
   - 10 certs/3 failures:  256ms (skipBackoff working!)
   - 100 certs/5 failures: 786ms
   - 1000 certs/5 failures: 1.3s

2. BenchmarkTT14618_OptimizedVsUnoptimized - 100 certs:
   - Optimized (batch):     253ms
   - Unoptimized (single):  278ms
   - Shows single backoff vs per-cert backoff

3. BenchmarkTT14618_CacheHit - Cache performance:
   - 871 ns/op for 5 cached certificates
   - Demonstrates sub-microsecond cache lookups

Key insight: skipBackoff prevents compounded delays - loading
10 certs takes same time as 1 cert (~256ms).
@github-actions
Contributor

🚨 Jira Linter Failed

Commit: f413120
Failed at: 2026-01-28 14:40:16 UTC

The Jira linter failed to validate your PR. Please check the error details below:

🔍 Click to view error details
failed to validate Jira issue: jira ticket TT-14618 has status 'Open' but must be one of: In Dev, In Code Review, Ready For Dev, Dod Check

Next Steps

  • Ensure your branch name contains a valid Jira ticket ID (e.g., ABC-123)
  • Verify your PR title matches the branch's Jira ticket ID
  • Check that the Jira ticket exists and is accessible

This comment will be automatically deleted once the linter passes.

@sonarqubecloud

Quality Gate failed

Failed conditions
C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

