fix(tx-generator): pre-submit chain-tip probe to prevent duplicate-submit after reconnect #111

@paolino

Summary

After #110 closed the stale-UTxO window post-reconnect, Antithesis still surfaces both tx_generator_refill_submit_rejected and tx_generator_transact_submit_rejected Always-assertions. The relay's rejection reason — captured verbatim from the report — is not a stale-input rejection; it's a duplicate-submit-after-reconnect:

ConwayMempoolFailure "All inputs are spent. Transaction has probably already been included"

Source: https://cardano.antithesis.com/report/tilehuSggX4cnuy5qyXfwpqI/2ZUJSYUipLqm3Dlbo9R3rjrS7i7dYJ_mc8FwikFqYLg.html — example vtime=182.072s, id=tx_generator_refill_submit_rejected.

Root cause

The race, mapped to the daemon's submit path in lib/Cardano/Node/Client/TxGenerator/Daemon.hs:

| Step | Code | What happens |
|------|------|--------------|
| 1 | queryUTxOs provider faucetAddr (refill) / queryUTxOs provider srcAddr (transact) | LSQ returns UTxO X, which is current under the indexer's view and also unspent on the relay's chain at this moment |
| 2 | refillTx / transactTx | Daemon builds Tx1 with X as input |
| 3 | submit submitter signed (i.e. submitTxN2C via the LTxS channel) | Wire write to relay succeeds; relay accepts Tx1 into its mempool |
| 4 | Bearer dies before MsgAcceptTx round-trips back | BlockedIndefinitelyOnSTM → caught in LocalTxSubmission.submitTxN2C, re-raised as ConnectionLost |
| 5 | Daemon arm's E.handle ConnectionLost returns RefillFail/TransactFail IndexNotReady (today) | Composer treats the tick as not-applicable, retries on the next tick |
| 6 | Supervisor reconnects, rsIndexFresh clears, then flips true on the next chain-sync block | Tx1 may or may not have been included in the new chain head by then |
| 7 | Composer fires another refill/transact | Daemon arm runs queryUTxOs again, gets X back (the indexer's UTxO view either hasn't observed Tx1's effect yet, or rolled back through it) |
| 8 | Daemon builds Tx2 with the same input X | Different TxId from Tx1 because of seed/randomness, but same input |
| 9 | Submit Tx2 | Relay's chain has Tx1 included → X is spent → ConwayMempoolFailure "All inputs are spent..." |

The freshness gate (#109/#110) helps at step 7 only when the indexer can re-sync within one block of the prior submission landing. Under aggressive fault injection (4093 disconnect/reconnect cycles in 1h, per the 685fa5e run), bursts of reconnects within a single block window defeat the gate.

This is the standard on-chain-tx-submission idempotency problem: from the daemon's local perspective, a submit that elicited ConnectionLost is indeterminate — it might have landed, or it might not.
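
The race can be reproduced in a toy model. The sketch below is not daemon code: TxIn, Tx, Relay, SubmitResult, relaySubmit and raceOutcome are all hypothetical stand-ins, and the relay is reduced to a set of unspent inputs. It only illustrates why a Tx2 rebuilt from a stale indexer view is rejected once Tx1 has landed.

```haskell
import Data.Set (Set)
import qualified Data.Set as Set

-- Hypothetical stand-ins; the real daemon uses the ledger's TxIn and Tx types.
newtype TxIn = TxIn Int
    deriving (Eq, Ord, Show)

data Tx = Tx
    { txLabel :: String
    , txInputs :: Set TxIn
    }
    deriving (Eq, Show)

-- | Toy relay: just the set of inputs that are still unspent.
newtype Relay = Relay {relayUnspent :: Set TxIn}
    deriving (Eq, Show)

data SubmitResult = Accepted | Rejected
    deriving (Eq, Show)

-- | Accept a transaction only if every input is still unspent,
-- and mark those inputs as spent on acceptance.
relaySubmit :: Tx -> Relay -> (SubmitResult, Relay)
relaySubmit tx (Relay unspent)
    | txInputs tx `Set.isSubsetOf` unspent =
        (Accepted, Relay (unspent `Set.difference` txInputs tx))
    | otherwise = (Rejected, Relay unspent)

-- | Steps 1-9 in miniature: Tx1 lands but its reply is lost, the
-- indexer view is stale, and Tx2 is rebuilt on the same input X.
raceOutcome :: (SubmitResult, SubmitResult)
raceOutcome =
    let x = TxIn 0
        relay0 = Relay (Set.singleton x)
        staleIndexerView = Set.singleton x -- has not observed Tx1 yet
        tx1 = Tx "tx1" (Set.singleton x)
        (r1, relay1) = relaySubmit tx1 relay0 -- lands; daemon never sees r1
        tx2 = Tx "tx2" staleIndexerView -- same input, different tx
        (r2, _) = relaySubmit tx2 relay1
     in (r1, r2)
```

In this model raceOutcome evaluates to (Accepted, Rejected): the first result is the one the daemon never observes because of ConnectionLost, and the second is the rejection Antithesis reports.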

Recommended fix (single, minimal-scope path)

Pre-submit chain-tip query: before calling submit submitter signed in both runRefillArm and runTransactArm, verify that the chosen input(s) are still unspent against the relay's current chain tip via LSQ.

Why this is the right first step

  • Closes the dominant window: between the prior submit's landing and the next arm's tx-build, the daemon now sees a freshly-landed Tx1's effect on chain and refuses to spend X again.
  • Cheap: one extra GetUTxOByTxIn LSQ round-trip per submit attempt. LSQ is already in our N2C plumbing.
  • Fail-safe: if the input isn't on the relay's current view, treat it as IndexNotReady and let the composer retry — same wire-stable response we already use.
  • Doesn't require remembering in-flight txs across reconnects, which would otherwise need new persistence.
  • Bisect-safe: pure addition; no existing callers change behavior under the happy path.
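
The fail-safe behaviour described above can be sketched as a small combinator. This is an illustration under assumptions, not the daemon's actual code: ArmResult and guardedSubmit are hypothetical names, the probe stands in for whatever LSQ check is used, and the real arms return RefillFail/TransactFail IndexNotReady rather than a bare constructor.

```haskell
import Data.Set (Set)
import qualified Data.Set as Set

-- Hypothetical stand-in for the ledger's TxIn.
newtype TxIn = TxIn Int
    deriving (Eq, Ord, Show)

-- | Hypothetical outcome type for this sketch only.
data ArmResult = Submitted | IndexNotReady
    deriving (Eq, Show)

-- | Run the pre-submit probe first and submit only when it reports
-- that every input is still unspent. When the probe says no, the
-- submit action is never executed.
guardedSubmit ::
    Monad m =>
    -- | probe: are all of these inputs still unspent?
    (Set TxIn -> m Bool) ->
    -- | the existing submit action
    m () ->
    Set TxIn ->
    m ArmResult
guardedSubmit probe submitAction inputs = do
    unspent <- probe inputs
    if unspent
        then Submitted <$ submitAction
        else pure IndexNotReady

-- | Example input set for trying the guard in GHCi.
exampleInputs :: Set TxIn
exampleInputs = Set.fromList [TxIn 0, TxIn 1]
```

Because the guard wraps the existing submit action rather than replacing it, the happy path is unchanged apart from the extra round-trip, which matches the bisect-safe point above.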

Specific changes

  1. New helper in lib/Cardano/Node/Client/TxGenerator/Selection.hs (or a dedicated Submit.hs if it grows):

    -- | Verify each input is still unspent at the relay's current
    -- volatile tip. Single LSQ round-trip via @GetUTxOByTxIn@.
    verifyInputsUnspent ::
        Provider IO ->
        Set TxIn ->
        IO Bool

    Returns False if any input is missing from the tip's UTxO set.

  2. Wire into lib/Cardano/Node/Client/TxGenerator/Daemon.hs in two sites:

    • buildSignSubmit (refill path, around the existing submit submitter signed call) — guards the single faucet input.
    • The transact path's submit site — guards the K source inputs.

    When verifyInputsUnspent returns False, return RefillFail/TransactFail IndexNotReady without incrementing the next-HD-index. This keeps the same retry semantics as the existing IndexNotReady paths.

  3. Tests:

    • test/Cardano/Node/Client/E2E/TxGeneratorSubmitIdempotenceSpec.hs — boot devnet via withRestartableCardanoNode, drive a refill, restart relay, drive a SECOND refill that would otherwise re-submit the same input, assert no ApplyTxErr carrying "already been included", assert daemon process stays alive.
    • Unit test in test/Cardano/Node/Client/TxGenerator/SelectionSpec.hs — pure check that verifyInputsUnspent correctly returns False when the LSQ stub omits a queried input.
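
A minimal sketch of how the helper and its stubbed unit check might fit together is shown below. It assumes a simplified provider: the Provider record here, its queryUTxOByTxIn field, the TxOut type and stubProvider are all hypothetical and do not reflect the real Provider IO interface. The helper is also generalised over the monad so the stub can run without a node.

```haskell
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map
import Data.Set (Set)
import qualified Data.Set as Set

-- Hypothetical stand-ins for the ledger types.
newtype TxIn = TxIn Int
    deriving (Eq, Ord, Show)

newtype TxOut = TxOut Integer
    deriving (Eq, Show)

-- | Hypothetical slice of the provider: a single GetUTxOByTxIn-style
-- query that returns only the queried inputs that are still unspent.
newtype Provider m = Provider
    { queryUTxOByTxIn :: Set TxIn -> m (Map TxIn TxOut)
    }

-- | True only if every queried input appears in the returned UTxO.
-- An empty input set is vacuously unspent.
verifyInputsUnspent :: Functor m => Provider m -> Set TxIn -> m Bool
verifyInputsUnspent provider inputs =
    allPresent <$> queryUTxOByTxIn provider inputs
  where
    allPresent utxo = inputs `Set.isSubsetOf` Map.keysSet utxo

-- | Stub provider answering from a fixed UTxO map, restricted to the
-- queried inputs, so spent or unknown inputs are simply omitted.
stubProvider :: Applicative m => Map TxIn TxOut -> Provider m
stubProvider utxo =
    Provider (\ins -> pure (Map.restrictKeys utxo ins))
```

With a stub holding only TxIn 0, querying {TxIn 0} yields True and querying {TxIn 0, TxIn 1} yields False, which is the case the SelectionSpec unit test is meant to pin down.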

Out of scope for this issue

Acceptance

A single Antithesis 1h cardano_node_tx_generator run on the downstream bump PR (cardano-foundation/cardano-node-antithesis#98), against a pin that includes this fix, shows:

  • 0 tx_generator_refill_submit_rejected Always-assertion failures.
  • 0 tx_generator_transact_submit_rejected Always-assertion failures.
  • The supervisor still triggers ≥3000 Disconnected/Reconnecting events (i.e. fault injection wasn't softened — we eliminated the false positive at the daemon side, not by reducing chaos).

Plus the new E2E spec passes locally:

nix develop -c cabal test e2e-tests \
  --test-options='--match "tx-generator submit idempotence"'

Related
