
bug: DAPI nodes get banned during SPV transient when quorum key not yet available #832

@lklimek

Summary

When the SPV client is in a transient state where the masternode/quorum cache hasn't caught up (e.g. during mid-session wallet import, cold start, network switch, or wake-from-sleep), any DAPI-backed backend task can cause all connected DAPI nodes to be banned in rapid succession. The wallet's DAPI connectivity becomes unusable until bans expire or SPV catches up and proofs can be re-verified.

Impact

  • User-observable: "Platform unavailable" / repeated DAPI timeouts for seconds to minutes after an SPV state transition.
  • Amplified by recent mid-session wallet import handling (PR #830's programmatic SpvManager::restart()) because the restart deliberately re-initialises SPV state while the UI stays up and user actions can continue to fire DAPI traffic.
  • Not specific to PR #830 ("fix(spv): atomically extend BIP44 receive pool on wallet import"). The bug fires for any SPV transient that triggers proof verification while the quorum cache is empty.

Root cause

The ban happens upstream in rs-dapi-client, not in DET. The chain:

  1. An AppContext::run_backend_task variant that touches Platform dispatches an SDK call.
  2. The SDK receives a proof-carrying response from a DAPI node. To verify the proof it calls SpvProvider::get_quorum_public_key(...) (src/context_provider_spv.rs:89).
  3. SpvProvider delegates to SpvManager::get_quorum_public_key(...) (src/spv/manager.rs:987). If the requested quorum isn't in the in-memory masternode/quorum cache, it returns an error.
  4. Today that error is ContextProviderError::Generic(_) (src/context_provider_spv.rs:107), which bubbles up as drive_proof_verifier::Error::ContextProviderError(_) and then dash_sdk::error::Error::Proof(_).
  5. rs-sdk's impl CanRetry for Error (platform/packages/rs-sdk/src/error.rs:266-272) classifies Error::Proof(_) as retryable.
  6. update_address_ban_status in platform/packages/rs-dapi-client/src/dapi_client.rs:186-218 treats retryable-but-failed responses as bad-node signals and calls AddressList::ban().

The SDK cannot currently distinguish "remote returned a bad proof" (node is misbehaving, ban is correct) from "my local context couldn't verify the proof because my quorum cache isn't ready yet" (node is fine, retry later). Both paths take the same Error::Proof branch.
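The collapse described above can be modelled in a few lines. This is a minimal sketch with simplified stand-in types: the enum shapes and the function names wrap_context_error, can_retry and should_ban are assumptions made for illustration, not the real rs-sdk or rs-dapi-client signatures.

```rust
// Simplified stand-ins for the real types. Names mirror the issue text,
// but the shapes are assumptions made for illustration only.
#[derive(Debug)]
enum ContextProviderError {
    Generic(String),
}

#[derive(Debug)]
enum SdkError {
    // Both a bad remote proof and a local context failure end up here.
    Proof(String),
}

/// Hypothetical model of step 4: the context error is folded into Proof.
fn wrap_context_error(err: ContextProviderError) -> SdkError {
    match err {
        ContextProviderError::Generic(msg) => SdkError::Proof(msg),
    }
}

/// Hypothetical model of step 5: every Proof error is classified as retryable.
fn can_retry(err: &SdkError) -> bool {
    matches!(err, SdkError::Proof(_))
}

/// Hypothetical model of step 6: a retryable failure is read as a bad-node signal.
fn should_ban(err: &SdkError) -> bool {
    can_retry(err)
}
```

In this model a "quorum not cached" failure and an invalid proof from a misbehaving node produce the same ban decision, because the information needed to tell them apart is lost at the wrapping step.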

Reproduction

  1. Testnet. Launch DET after SPV data has been cleared, so SPV starts syncing from checkpoint.
  2. Import any wallet via GUI or MCP core_wallet_import.
  3. While SPV is still reconciling quorums for recent heights, trigger any DAPI-backed action (viewing identities, opening Platform Info, balance refresh, etc.).
  4. Observe in logs: banned events for each connected DAPI node.
  5. After a short delay (SPV catches up), DAPI requests start working again.

Why existing safeguards don't cover this

  • MCP tools have mcp::resolve::ensure_spv_synced at src/mcp/resolve.rs:117-137 which polls ConnectionStatus::overall_state() == Synced with a 10-minute ceiling before dispatching any wallet-facing MCP tool. This works: MCP paths don't exhibit the ban behaviour.
  • GUI backend-task dispatch has no equivalent gate. AppContext::run_backend_task (src/backend_task/mod.rs:409) goes straight to the SDK. Every GUI action that spawns a DAPI call during a transient contributes to the ban wave.

Proposed fix

Two-part, small surface area:

1. Primary fix — gate DAPI-using backend tasks on SPV readiness (~1 day)

  • Hoist ensure_spv_synced logic out of src/mcp/resolve.rs into a shared helper (e.g. AppContext::await_spv_ready(timeout: Duration) -> Result<(), TaskError>).
  • Call it from AppContext::run_backend_task for every DAPI-using variant before the SDK invocation (audit and list variants in PR body).
  • On timeout, return a new TaskError::SpvNotReady variant with a user-facing message per CLAUDE.md error-messaging rules, e.g. "Background sync is catching up. Try again in a moment." — the existing MessageBanner surfaces this automatically.
  • Non-DAPI variants (SPV-only, local DB reads) continue to bypass the gate.
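The shared helper could look roughly like the sketch below. It is deliberately synchronous and std-only so it stands alone; the real helper would presumably be async and read the state from ConnectionStatus::overall_state(). The OverallState enum, the closure parameter and the poll_interval argument are assumptions introduced here, and only the TaskError::SpvNotReady name and its message come from the proposal above.

```rust
use std::fmt;
use std::thread::sleep;
use std::time::{Duration, Instant};

// Stand-in for the readiness signal; the real enum lives in DET and may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OverallState {
    Syncing,
    Synced,
}

#[derive(Debug, PartialEq, Eq)]
enum TaskError {
    SpvNotReady,
}

impl fmt::Display for TaskError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            TaskError::SpvNotReady => {
                write!(f, "Background sync is catching up. Try again in a moment.")
            }
        }
    }
}

/// Polls `overall_state` until it reports `Synced` or `timeout` elapses.
/// The closure stands in for whatever call returns the current SPV state.
fn await_spv_ready(
    mut overall_state: impl FnMut() -> OverallState,
    timeout: Duration,
    poll_interval: Duration,
) -> Result<(), TaskError> {
    let deadline = Instant::now() + timeout;
    loop {
        if overall_state() == OverallState::Synced {
            return Ok(());
        }
        if Instant::now() >= deadline {
            return Err(TaskError::SpvNotReady);
        }
        sleep(poll_interval);
    }
}
```

A DAPI-using variant would call the helper before its SDK invocation and return the error unchanged on timeout, so the banner shows the message from the Display implementation.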

2. Secondary fix — distinct error variant for future SDK classification (~30 min)

  • At src/context_provider_spv.rs:107, return ContextProviderError::InvalidQuorum(_) (or a new purpose-fit variant upstream, if one is appropriate) instead of Generic(_).
  • No immediate client behaviour change — rs-sdk's CanRetry still treats it as retryable. But the typed discriminator lets a future upstream refinement distinguish "retry without banning" from "ban and retry" without string-matching.
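As a rough illustration of why the typed variant helps, the sketch below uses a local stand-in enum rather than the real ContextProviderError. The helper names map_missing_quorum and is_local_readiness_failure are hypothetical, and treating InvalidQuorum as meaning "quorum not cached yet" is an assumption that a purpose-fit upstream variant would make more precise.

```rust
// Local stand-in; the real ContextProviderError has more variants.
#[derive(Debug, PartialEq, Eq)]
enum ContextProviderError {
    Generic(String),
    InvalidQuorum(String),
}

/// Hypothetical mapping at the error site: a cache miss becomes a typed
/// variant instead of an opaque Generic string.
fn map_missing_quorum(quorum_hash_hex: &str) -> ContextProviderError {
    ContextProviderError::InvalidQuorum(format!(
        "quorum {quorum_hash_hex} is not in the local cache yet"
    ))
}

/// A future classifier can match on the variant rather than on message text.
fn is_local_readiness_failure(err: &ContextProviderError) -> bool {
    matches!(err, ContextProviderError::InvalidQuorum(_))
}
```

Matching on the variant keeps the classification stable even if the human-readable message changes later.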

Rejected alternatives (for the record)

  • Pause the SDK during SPV restart — no public pause API in rs-sdk; multi-day build for no additional ban-reduction benefit over gating at the task level.
  • DAPI request queueing shim in DET — duplicates AddressList ordering/retry logic, adds a second source of truth for node health. Poor observability.
  • Proof verification bypass (see #283, "Optional proof verification bypass mode (feature request)"). That issue requests a user-facing dev-mode toggle and has a different scope: it assumes Core is offline and the user explicitly opts out, whereas this issue is about transient readiness during normal operation.

Upstream follow-up

A clean long-term fix should happen in platform/packages/rs-sdk:

  • Error::Proof carrying a ContextProviderError::InvalidQuorum variant should be classified as retry-without-ban rather than retry-with-ban-on-failure. rs-dapi-client::update_address_ban_status would honour that classification.
  • Consider a new CanRetry return value (e.g. RetryPreservingAddress) distinct from Retry for this case.
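One possible shape for that classification is sketched below with stand-in types. RetryDecision, the simplified SdkError variants, classify, and the HashSet used in place of AddressList are all assumptions for illustration; the real CanRetry trait and update_address_ban_status have different signatures.

```rust
use std::collections::HashSet;

// Hypothetical three-state result replacing a plain retryable flag.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RetryDecision {
    NoRetry,
    Retry,                  // retry, and ban the address that failed
    RetryPreservingAddress, // retry, but keep the address in rotation
}

// Simplified stand-in for the SDK error, flattened for brevity.
#[derive(Debug)]
enum SdkError {
    ProofInvalidQuorum, // local context could not supply the quorum key
    ProofOther,         // the remote proof itself failed verification
    Config,             // not retryable at all
}

fn classify(err: &SdkError) -> RetryDecision {
    match err {
        SdkError::ProofInvalidQuorum => RetryDecision::RetryPreservingAddress,
        SdkError::ProofOther => RetryDecision::Retry,
        SdkError::Config => RetryDecision::NoRetry,
    }
}

/// Stand-in for the ban bookkeeping: only a plain `Retry` bans the node.
fn update_address_ban_status(
    banned: &mut HashSet<String>,
    address: &str,
    decision: RetryDecision,
) {
    if decision == RetryDecision::Retry {
        banned.insert(address.to_string());
    }
}
```

Under this sketch a node that served a proof the client could not yet verify stays in rotation, while a node that served a proof failing verification for other reasons is still banned.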

Will file a separate upstream issue against dashpay/platform if maintainers agree the DET-side gate is insufficient on its own.

Related

Relevant paths

| File | Role |
| --- | --- |
| src/context_provider_spv.rs:89-108 | Error mapping site (fix location for part 2) |
| src/spv/manager.rs:987-1046 | get_quorum_public_key: returns error when quorum not cached |
| src/context/connection_status.rs:279-330 | overall_state(): readiness signal |
| src/mcp/resolve.rs:107-137 | The gate to hoist |
| src/backend_task/mod.rs:409-507 | Dispatch site to add the gate (fix location for part 1) |
| src/backend_task/error.rs | Add TaskError::SpvNotReady variant here |
| src/sdk_wrapper.rs:14-20 | Current ban_failed_address: true SDK configuration |

🤖 Co-authored by Claudius the Magnificent AI Agent
