
[new-protocol] cert2 persistence, versioning, transaction submission#4230

Open
imabdulbasit wants to merge 14 commits into ab/4222-proposal-fetching from ab/4222-cert2-persistence

Conversation


@imabdulbasit imabdulbasit commented Apr 27, 2026

  • Adds Cert2 persistence
  • Adds NewProtocolStorage
  • Updates node event consumers/persistence to receive CoordinatorEvent
  • Adds protocol version v7 and v8
  • Routes new-protocol request_proposal through client_api.


@gemini-code-assist (bot) left a comment


Code Review

This pull request integrates the new protocol into the Espresso node by introducing a unified CoordinatorEvent system and implementing persistence for new protocol-specific data such as cert2. Key updates include modifying the ConsensusHandle and SequencerContext to manage the new coordinator, updating the query service to handle new decide events, and adding database migrations for the new certificate table. Feedback includes addressing critical issues where DA and quorum proposals are signed with empty byte arrays in the storage layer, resulting in invalid signatures. Additionally, an empty TODO comment should be addressed, and a previously removed explanatory comment regarding database optimization in the SQL persistence layer should be restored for better code maintainability.

epoch: Some(epoch),
epoch_transition_indicator: EpochTransitionIndicator::NotInTransition,
};
let Ok(signature) = T::SignatureKey::sign(&private_key, &[]) else {

high

The DA proposal is being signed with an empty byte array (&[]) instead of the actual proposal data. This will result in an invalid signature if it's ever verified (e.g., during node recovery or state synchronization). The signature should be generated over the serialized data or the appropriate hash.

next_drb_result: proposal.next_drb_result,
state_cert: proposal.state_cert,
};
let Ok(signature) = T::SignatureKey::sign(&private_key, &[]) else {

high

The quorum proposal is being signed with an empty byte array (&[]) instead of the actual proposal data. Similar to the DA proposal signature, this will lead to invalid data in storage that cannot be verified later.

event_sender: async_broadcast::Sender<CoordinatorEvent<T>>,
) {
coordinator.start().await;
//TODO:

medium

This TODO comment is empty and should either be completed with the intended task or removed if it's no longer relevant.

Comment thread crates/espresso/node/src/persistence/sql.rs
@github-actions

Unable to build without Cargo.lock file.

This means that after this change 3rd party projects may have
difficulties using crates in this repo as a dependency. If this
isn't easy to fix please open an issue so we can fix it later.

For the first failing build see: https://github.com/EspressoSystems/espresso-network/actions/runs/25020413220

To reproduce locally run

cargo generate-lockfile
cargo check --all-targets

This PR can still be merged.

@imabdulbasit imabdulbasit changed the title Ab/4222 cert2 persistence [new-protocol] cert2 persistence, versioning, transaction submission Apr 27, 2026
@imabdulbasit imabdulbasit marked this pull request as ready for review April 27, 2026 22:02
@claude

claude Bot commented Apr 27, 2026

Claude finished @imabdulbasit's task in 2m 57s.


Code Review: cert2 persistence, versioning, transaction submission

  • Read existing review comments and CLAUDE.md
  • Analyze PR diff and changed files
  • Review storage layer (new-protocol storage, persistence)
  • Review consensus handle and context changes
  • Review versioning (v0_7, v0_8) and header/state changes
  • Review coordinator and consensus changes
  • Review migrations and query service updates
  • Post detailed review feedback

Summary

This is a well-structured PR that integrates several key pieces of the new protocol into the Espresso node. The CoordinatorEvent abstraction that unifies legacy HotShot events with new protocol events is a clean design. The cert2 persistence is implemented across all three storage backends (SQL, filesystem, no-storage) as required. The versioning additions (v0_7, v0_8) follow the established pattern correctly.

Key findings

Issues requiring attention:

  1. Empty-byte signature in storage (storage.rs:97,153) — DA and quorum proposals are signed with &[] instead of actual data. Gemini flagged this too. If intentional (signatures never verified from storage), it needs a comment. If not, it's a correctness bug for recovery/sync paths.

  2. Fee charging disabled for new protocol (state.rs:1043-1050) — charge_fees is skipped for >= NEW_PROTOCOL_VERSION with a TODO comment about "builder is unfunded error". This is a significant behavioral change that allows free block submission. Needs clarification on whether this is temporary or by design.

  3. Query API won't update under new protocol (api/update.rs:33-39) — ApiEventConsumer ignores all non-legacy events, including NewDecide. When the new protocol is active, the query service will stop receiving decide events.

  4. Panic in consensus hot path (consensus.rs:929) — expect("cert1 must exist if cert2 exists") in maybe_decide will crash the coordinator task if violated. Should be a graceful error since the coordinator runs in a spawned task.

Minor observations:

  1. Infinite retry loops in storage tasks (storage.rs) — no max retry count or cancellation mechanism. Tasks spin forever at 300ms if storage is permanently unavailable, relying solely on GC to abort them.

  2. File extension mismatch (fs.rs:992,1006) — cert2 files use .txt extension for bincode binary data. Should be .bin for consistency.

  3. Redundant version constants (versions/lib.rs:21) — NEW_PROTOCOL_VERSION and CLIQUENET_VERSION are both version(0, 8). Worth clarifying the relationship.

  4. Context event handling improvement (context.rs) — The refactor to use tokio::join! for parallel persistence/signer/streamer processing is a nice improvement over the previous sequential approach.

What looks good

  • All three persistence backends (SQL, FS, no-storage) are updated consistently for cert2
  • DB migrations are clean and correct for both Postgres and SQLite
  • The ConsensusHandle adapter pattern cleanly dispatches between legacy and new protocol
  • Header versioning for V7/V8 correctly reuses V6 field layout
  • GC in coordinator properly includes the new decided_cert2 table and cert2 storage cleanup
  • The NewDecideEvent struct carries all necessary data (leaves, cert1, cert2, vid_shares) for downstream consumers

Comment on lines +79 to +173
pub fn append_da(
&mut self,
view_number: ViewNumber,
epoch: EpochNumber,
block_payload: T::BlockPayload,
metadata: <T::BlockPayload as BlockPayload<T>>::Metadata,
vid_commit: VidCommitment,
) {
let storage = self.storage.clone();
let private_key = self.private_key.clone();
let handle = spawn(async move {
let data = DaProposal2 {
encoded_transactions: block_payload.encode(),
metadata,
view_number,
epoch: Some(epoch),
epoch_transition_indicator: EpochTransitionIndicator::NotInTransition,
};
let Ok(signature) = T::SignatureKey::sign(&private_key, &[]) else {
error!("failed to sign DA proposal for storage");
return;
};
let proposal = SignedProposal {
data,
signature,
_pd: PhantomData,
};
loop {
match storage.append_da2(&proposal, vid_commit).await {
Ok(()) => return,
Err(err) => {
warn!(%err, "failed to append DA proposal, retrying");
sleep(RETRY_DELAY).await;
},
}
}
});
self.handles.entry(view_number).or_default().push(handle);
}

pub fn append_cert2(&mut self, view: ViewNumber, cert2: Certificate2<T>) {
let storage = self.storage.clone();
let handle = spawn(async move {
loop {
match storage.append_cert2(view, cert2.clone()).await {
Ok(()) => return,
Err(err) => {
warn!(%err, %view, "failed to append cert2, retrying");
sleep(RETRY_DELAY).await;
},
}
}
});
self.handles.entry(view).or_default().push(handle);
}

pub fn append_proposal(&mut self, proposal: Proposal<T>) {
let view = proposal.view_number;
let storage = self.storage.clone();
let private_key = self.private_key.clone();
let handle = spawn(async move {
let data = QuorumProposal2 {
block_header: proposal.block_header,
view_number: proposal.view_number,
epoch: Some(proposal.epoch),
justify_qc: proposal.justify_qc,
next_epoch_justify_qc: None,
upgrade_certificate: proposal.upgrade_certificate,
view_change_evidence: proposal
.view_change_evidence
.map(ViewChangeEvidence2::Timeout),
next_drb_result: proposal.next_drb_result,
state_cert: proposal.state_cert,
};
let Ok(signature) = T::SignatureKey::sign(&private_key, &[]) else {
error!("failed to sign quorum proposal for storage");
return;
};
let signed = SignedProposal {
data,
signature,
_pd: PhantomData,
};
loop {
match storage.append_proposal2(&signed).await {
Ok(()) => return,
Err(err) => {
warn!(%err, "failed to append proposal, retrying");
sleep(RETRY_DELAY).await;
},
}
}
});
self.handles.entry(view).or_default().push(handle);
}

The append_da and append_proposal methods both sign with &[] instead of the actual data. I see Gemini also flagged this. To add context on why this matters here specifically:

The append_vid method already does the right thing — it calls vid_share.to_proposal(&private_key), which signs over the actual VID share data. The DA and quorum proposal paths should follow the same pattern.

If these signatures are truly never verified (i.e. the storage layer only cares about having some Proposal wrapper to satisfy the type), this should be documented explicitly with a comment explaining the design choice. Otherwise, any future code path that loads these proposals from storage and attempts signature verification will silently fail.

Is this intentional or a known shortcut for now?
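
The concern above can be made concrete with a toy sketch. The types below (toy_sign, toy_verify) are stand-ins invented for illustration, not the real T::SignatureKey API from hotshot; they only demonstrate why a signature produced over &[] can never verify against the stored proposal bytes later:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy stand-ins for a keyed sign/verify pair. The real implementation
// lives in hotshot's SignatureKey trait; this only models the mismatch.
fn toy_sign(key: u64, msg: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    msg.hash(&mut h);
    h.finish()
}

fn toy_verify(key: u64, msg: &[u8], sig: u64) -> bool {
    toy_sign(key, msg) == sig
}

fn main() {
    let key = 42;
    let proposal_bytes = b"serialized DaProposal2"; // pretend bincode output

    // Current pattern: sign over empty bytes. Verification against the
    // actual proposal data fails.
    let bad_sig = toy_sign(key, &[]);
    assert!(!toy_verify(key, proposal_bytes, bad_sig));

    // Suggested pattern: sign over the serialized proposal data.
    let good_sig = toy_sign(key, proposal_bytes);
    assert!(toy_verify(key, proposal_bytes, good_sig));
}
```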

sleep(RETRY_DELAY).await;
},
}
}

The infinite retry loop here (and in append_da, append_proposal, append_vid) will never terminate if the storage is permanently unavailable. These retry tasks are spawned and tracked via JoinHandle, but the only way they stop is via gc() calling handle.abort().

Consider adding either:

  • A max retry count with a final error! log on exhaustion
  • A CancellationToken or watch channel to signal shutdown

Currently if gc never runs for a given view (e.g. due to a bug or lack of checkpoints), these tasks will spin forever at 300ms intervals.
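
A minimal sketch of the bounded-retry idea, using std threads and an AtomicBool shutdown flag for illustration; the actual code would presumably use tokio sleeps and a tokio_util CancellationToken, and the names here (append_with_retry, MAX_RETRIES) are hypothetical:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

const MAX_RETRIES: u32 = 5;
const RETRY_DELAY: Duration = Duration::from_millis(10);

// Retry a storage operation a bounded number of times, bailing out
// early if shutdown has been requested.
fn append_with_retry(
    shutdown: &AtomicBool,
    mut op: impl FnMut() -> Result<(), String>,
) -> Result<(), String> {
    for attempt in 1..=MAX_RETRIES {
        if shutdown.load(Ordering::Relaxed) {
            return Err("shutdown requested".into());
        }
        match op() {
            Ok(()) => return Ok(()),
            Err(err) => {
                eprintln!("attempt {attempt} failed: {err}, retrying");
                thread::sleep(RETRY_DELAY);
            },
        }
    }
    // Exhausted: surface a final error instead of spinning forever.
    Err(format!("gave up after {MAX_RETRIES} attempts"))
}

fn main() {
    let shutdown = AtomicBool::new(false);
    let mut calls = 0;
    // Storage stub that fails twice, then succeeds.
    let result = append_with_retry(&shutdown, || {
        calls += 1;
        if calls < 3 { Err("storage unavailable".into()) } else { Ok(()) }
    });
    assert!(result.is_ok());
    assert_eq!(calls, 3);
}
```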

Comment on lines +1043 to +1050
// TODO(abdul): builder is unfunded error
if version < NEW_PROTOCOL_VERSION {
validated_state.charge_fees(
&mut delta,
proposed_header.fee_info(),
chain_config.fee_recipient,
)?;
}

Skipping fee charging for the new protocol version is a significant behavioral change. The TODO(abdul): builder is unfunded error comment suggests this is a temporary workaround.

A few concerns:

  1. This means blocks in >= NEW_PROTOCOL_VERSION won't charge builder fees at all, which could allow free block submission.
  2. If this is intentional because the new protocol handles fees differently, the comment should say so.
  3. If temporary, consider a more targeted fix (e.g. funding the builder in tests) rather than disabling fee validation in production state transitions.
Suggested change
// TODO(abdul): builder is unfunded error
if version < NEW_PROTOCOL_VERSION {
validated_state.charge_fees(
&mut delta,
proposed_header.fee_info(),
chain_config.fee_recipient,
)?;
}
// The new protocol does not use the legacy builder fee mechanism.
// Fees are handled differently in the new protocol.
if version < NEW_PROTOCOL_VERSION {
validated_state.charge_fees(
&mut delta,
proposed_header.fee_info(),
chain_config.fee_recipient,
)?;
}

Comment on lines +33 to 39
async fn handle_event(&self, event: &CoordinatorEvent<SeqTypes>) -> anyhow::Result<()> {
let CoordinatorEvent::LegacyEvent(hotshot_event) = event else {
return Ok(());
};
if let Err(height) = self.inner.update(hotshot_event).await {
bail!("failed to update API state after {height}: {hotshot_event:?}",);
}

This event consumer currently silently ignores all non-legacy events (including NewDecide). When the new protocol is active, decide events will come through as CoordinatorEvent::NewDecide, meaning the query API will stop getting updates for decided blocks.

Is this intentional for now? If so, a comment explaining that new-protocol decide handling will be added later would help prevent confusion.
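
One possible shape for handling both variants, sketched with placeholder types (the real CoordinatorEvent and NewDecideEvent live in the node crates; the String payloads here are purely illustrative):

```rust
// Placeholder event type modeling the two variants discussed above.
enum CoordinatorEvent {
    LegacyEvent(String),       // stands in for the hotshot Event
    NewDecide { height: u64 }, // stands in for NewDecideEvent
}

// Dispatch both variants into the same update path, rather than
// silently ignoring non-legacy events.
fn handle_event(event: &CoordinatorEvent, updates: &mut Vec<String>) {
    match event {
        CoordinatorEvent::LegacyEvent(e) => {
            updates.push(format!("legacy: {e}"))
        },
        CoordinatorEvent::NewDecide { height } => {
            updates.push(format!("decide at height {height}"))
        },
    }
}

fn main() {
    let mut updates = Vec::new();
    handle_event(&CoordinatorEvent::LegacyEvent("vote".into()), &mut updates);
    handle_event(&CoordinatorEvent::NewDecide { height: 7 }, &mut updates);
    assert_eq!(updates.len(), 2);
}
```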

pub const DA_UPGRADE_VERSION: Version = version(0, 6);
pub const VID2_UPGRADE_VERSION: Version = version(0, 7);
pub const CLIQUENET_VERSION: Version = version(0, 8);
pub const NEW_PROTOCOL_VERSION: Version = version(0, 8);

NEW_PROTOCOL_VERSION and CLIQUENET_VERSION are both version(0, 8). Are these intentionally the same? If so, one of them may be redundant. If they're expected to diverge later (e.g. cliquenet ships at 0.8 but new protocol activates at a later version), a comment clarifying the relationship would help.

|_| Ok(true),
|mut file| {
let bytes = bincode::serialize(&cert2).context("serialize cert2")?;
file.write_all(&bytes)?;

Nit: using .txt extension for bincode-serialized binary data is misleading. Consider .bin or no extension, consistent with how other persistence files are stored in this backend.

Suggested change
file.write_all(&bytes)?;
let file_path = dir_path.join(view.u64().to_string()).with_extension("bin");

return Ok(None);
}
let bytes = fs::read(&file_path).context("read cert2")?;
Ok(Some(

The load_cert2 path should match the extension used in append_cert2.

Suggested change
Ok(Some(
let file_path = dir_path.join(view.u64().to_string()).with_extension("bin");

//TODO:
let mut cur_view = ViewNumber::new(0);
loop {
match coordinator.next_consensus_input().await {

The maybe_decide method in consensus.rs:929 has expect("cert1 must exist if cert2 exists") — this is a strong invariant. If it's violated (e.g. due to a network partition or gc race), the coordinator panics. Consider returning an error or logging and skipping instead, since the coordinator runs in a spawned task and a panic there would silently kill consensus.
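
The log-and-skip alternative could look roughly like this, with &str placeholders standing in for the real Certificate types and the function name reused only for illustration:

```rust
// Sketch: return a recoverable error when the invariant is violated,
// instead of panicking via expect() inside the spawned coordinator task.
fn maybe_decide(
    cert1: Option<&str>,
    cert2: Option<&str>,
) -> Result<Option<String>, String> {
    match (cert1, cert2) {
        (Some(c1), Some(c2)) => Ok(Some(format!("decide with {c1} + {c2}"))),
        // Invariant violated: surface an error the caller can log and
        // skip, rather than killing the task.
        (None, Some(_)) => Err("cert2 present without cert1; skipping decide".into()),
        _ => Ok(None), // nothing to decide yet
    }
}

fn main() {
    assert!(maybe_decide(Some("c1"), Some("c2")).unwrap().is_some());
    assert!(maybe_decide(None, Some("c2")).is_err());
    assert!(maybe_decide(None, None).unwrap().is_none());
}
```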

@github-actions

Nextest failures (4) in this run

Test Attempts Time (s) Main history
hotshot-new-protocol::tests::failures::ten_nodes_one_down_with_epochs_memory 3 56.25 passing
hotshot-new-protocol::tests::integration::test_leader_proposes_with_computed_drb_in_epoch3 3 16.77 passing
hotshot-new-protocol::tests::integration::test_node_votes_with_computed_drb_in_epoch3 3 37.82 passing
hotshot-new-protocol::tests::memory_network::five_nodes_decide_same_chain_over_memory_network 3 61.95 flaky 3/3

See the step summary for flaky tests and slowest tests.
