[new-protocol] cert2 persistence, versioning, transaction submission #4230

imabdulbasit wants to merge 14 commits into ab/4222-proposal-fetching
Conversation
Code Review
This pull request integrates the new protocol into the Espresso node by introducing a unified CoordinatorEvent system and implementing persistence for new protocol-specific data such as cert2. Key updates include modifying the ConsensusHandle and SequencerContext to manage the new coordinator, updating the query service to handle new decide events, and adding database migrations for the new certificate table. Feedback includes addressing critical issues where DA and quorum proposals are signed with empty byte arrays in the storage layer, resulting in invalid signatures. Additionally, an empty TODO comment should be addressed, and a previously removed explanatory comment regarding database optimization in the SQL persistence layer should be restored for better code maintainability.
```rust
    epoch: Some(epoch),
    epoch_transition_indicator: EpochTransitionIndicator::NotInTransition,
};
let Ok(signature) = T::SignatureKey::sign(&private_key, &[]) else {
```
The DA proposal is being signed with an empty byte array (&[]) instead of the actual proposal data. This will result in an invalid signature if it's ever verified (e.g., during node recovery or state synchronization). The signature should be generated over the serialized data or the appropriate hash.
```rust
    next_drb_result: proposal.next_drb_result,
    state_cert: proposal.state_cert,
};
let Ok(signature) = T::SignatureKey::sign(&private_key, &[]) else {
```
```rust
    event_sender: async_broadcast::Sender<CoordinatorEvent<T>>,
) {
    coordinator.start().await;
    //TODO:
```
Unable to build without Cargo.lock file. This means that after this change 3rd party projects may have build issues.

For the first failing build see: https://github.com/EspressoSystems/espresso-network/actions/runs/25020413220

This PR can still be merged.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Claude finished @imabdulbasit's task in 2m 57s

Code Review: cert2 persistence, versioning, transaction submission
Summary

This is a well-structured PR that integrates several key pieces of the new protocol into the Espresso node.

Key findings

Issues requiring attention, minor observations, and what looks good are covered in the inline comments below.
```rust
pub fn append_da(
    &mut self,
    view_number: ViewNumber,
    epoch: EpochNumber,
    block_payload: T::BlockPayload,
    metadata: <T::BlockPayload as BlockPayload<T>>::Metadata,
    vid_commit: VidCommitment,
) {
    let storage = self.storage.clone();
    let private_key = self.private_key.clone();
    let handle = spawn(async move {
        let data = DaProposal2 {
            encoded_transactions: block_payload.encode(),
            metadata,
            view_number,
            epoch: Some(epoch),
            epoch_transition_indicator: EpochTransitionIndicator::NotInTransition,
        };
        let Ok(signature) = T::SignatureKey::sign(&private_key, &[]) else {
            error!("failed to sign DA proposal for storage");
            return;
        };
        let proposal = SignedProposal {
            data,
            signature,
            _pd: PhantomData,
        };
        loop {
            match storage.append_da2(&proposal, vid_commit).await {
                Ok(()) => return,
                Err(err) => {
                    warn!(%err, "failed to append DA proposal, retrying");
                    sleep(RETRY_DELAY).await;
                },
            }
        }
    });
    self.handles.entry(view_number).or_default().push(handle);
}

pub fn append_cert2(&mut self, view: ViewNumber, cert2: Certificate2<T>) {
    let storage = self.storage.clone();
    let handle = spawn(async move {
        loop {
            match storage.append_cert2(view, cert2.clone()).await {
                Ok(()) => return,
                Err(err) => {
                    warn!(%err, %view, "failed to append cert2, retrying");
                    sleep(RETRY_DELAY).await;
                },
            }
        }
    });
    self.handles.entry(view).or_default().push(handle);
}

pub fn append_proposal(&mut self, proposal: Proposal<T>) {
    let view = proposal.view_number;
    let storage = self.storage.clone();
    let private_key = self.private_key.clone();
    let handle = spawn(async move {
        let data = QuorumProposal2 {
            block_header: proposal.block_header,
            view_number: proposal.view_number,
            epoch: Some(proposal.epoch),
            justify_qc: proposal.justify_qc,
            next_epoch_justify_qc: None,
            upgrade_certificate: proposal.upgrade_certificate,
            view_change_evidence: proposal
                .view_change_evidence
                .map(ViewChangeEvidence2::Timeout),
            next_drb_result: proposal.next_drb_result,
            state_cert: proposal.state_cert,
        };
        let Ok(signature) = T::SignatureKey::sign(&private_key, &[]) else {
            error!("failed to sign quorum proposal for storage");
            return;
        };
        let signed = SignedProposal {
            data,
            signature,
            _pd: PhantomData,
        };
        loop {
            match storage.append_proposal2(&signed).await {
                Ok(()) => return,
                Err(err) => {
                    warn!(%err, "failed to append proposal, retrying");
                    sleep(RETRY_DELAY).await;
                },
            }
        }
    });
    self.handles.entry(view).or_default().push(handle);
}
```
The append_da and append_proposal methods both sign with &[] instead of the actual data. I see Gemini also flagged this. To add context on why this matters here specifically:
The append_vid method already does the right thing — it calls vid_share.to_proposal(&private_key), which signs over the actual VID share data. The DA and quorum proposal paths should follow the same pattern.
If these signatures are truly never verified (i.e. the storage layer only cares about having some Proposal wrapper to satisfy the type), this should be documented explicitly with a comment explaining the design choice. Otherwise, any future code path that loads these proposals from storage and attempts signature verification will silently fail.
Is this intentional or a known shortcut for now?
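To make the failure mode concrete, here is a minimal, self-contained sketch: a toy `PrivateKey` with a hash-based `sign`/`verify` stands in for `T::SignatureKey` (these names and the example message bytes are stand-ins, not the real hotshot types). It shows why a signature made over `&[]` fails any later verification against the actual proposal bytes:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy key and hash-based "signature" standing in for T::SignatureKey;
// only the choice of message bytes matters here.
struct PrivateKey(u64);

fn sign(key: &PrivateKey, message: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    key.0.hash(&mut h);
    message.hash(&mut h);
    h.finish()
}

fn verify(key: &PrivateKey, message: &[u8], sig: u64) -> bool {
    sign(key, message) == sig
}

fn main() {
    let key = PrivateKey(42);
    let proposal_bytes = b"serialized DaProposal2 contents"; // stand-in for the real encoding

    // Current pattern: sign over an empty slice.
    let bad_sig = sign(&key, &[]);
    // Suggested pattern: sign over the actual serialized proposal.
    let good_sig = sign(&key, proposal_bytes);

    // Verification against the real bytes rejects the empty-message signature.
    assert!(!verify(&key, proposal_bytes, bad_sig));
    assert!(verify(&key, proposal_bytes, good_sig));
    println!("empty-message signature rejected; real signature accepted");
}
```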
```rust
            sleep(RETRY_DELAY).await;
        },
    }
}
```
The infinite retry loop here (and in append_da, append_proposal, append_vid) will never terminate if the storage is permanently unavailable. These retry tasks are spawned and tracked via JoinHandle, but the only way they stop is via gc() calling handle.abort().
Consider adding either:
- A max retry count with a final `error!` log on exhaustion
- A `CancellationToken` or `watch` channel to signal shutdown
Currently if gc never runs for a given view (e.g. due to a bug or lack of checkpoints), these tasks will spin forever at 300ms intervals.
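A sketch of the bounded-retry idea, with a hypothetical `MAX_RETRIES` cap (not from the PR). The real code would keep the `storage.append_*` call and `sleep(RETRY_DELAY).await` between attempts; here a synchronous closure stands in for the store:

```rust
use std::time::Duration;

const RETRY_DELAY: Duration = Duration::from_millis(300);
const MAX_RETRIES: u32 = 10; // hypothetical cap, not from the PR

// Bounded retry loop: give up with a final error after MAX_RETRIES attempts
// instead of spinning forever when storage is permanently unavailable.
fn append_with_retry(mut attempt_store: impl FnMut() -> Result<(), String>) -> Result<(), String> {
    for attempt in 1..=MAX_RETRIES {
        match attempt_store() {
            Ok(()) => return Ok(()),
            Err(err) => {
                // In the real code: sleep(RETRY_DELAY).await before the next try.
                eprintln!("attempt {attempt} failed: {err}; retrying after {RETRY_DELAY:?}");
            },
        }
    }
    Err(format!("giving up after {MAX_RETRIES} attempts"))
}

fn main() {
    // A store that succeeds on the third attempt.
    let mut calls = 0;
    let result = append_with_retry(|| {
        calls += 1;
        if calls < 3 {
            Err("storage unavailable".to_string())
        } else {
            Ok(())
        }
    });
    assert!(result.is_ok());
    assert_eq!(calls, 3);
    println!("succeeded after {calls} attempts");
}
```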
```rust
// TODO(abdul): builder is unfunded error
if version < NEW_PROTOCOL_VERSION {
    validated_state.charge_fees(
        &mut delta,
        proposed_header.fee_info(),
        chain_config.fee_recipient,
    )?;
}
```
Skipping fee charging for the new protocol version is a significant behavioral change. The TODO(abdul): builder is unfunded error comment suggests this is a temporary workaround.
A few concerns:
- This means blocks in `>= NEW_PROTOCOL_VERSION` won't charge builder fees at all, which could allow free block submission.
- If this is intentional because the new protocol handles fees differently, the comment should say so.
- If temporary, consider a more targeted fix (e.g. funding the builder in tests) rather than disabling fee validation in production state transitions.
Suggested change:

```rust
// The new protocol does not use the legacy builder fee mechanism.
// Fees are handled differently in the new protocol.
if version < NEW_PROTOCOL_VERSION {
    validated_state.charge_fees(
        &mut delta,
        proposed_header.fee_info(),
        chain_config.fee_recipient,
    )?;
}
```
```rust
async fn handle_event(&self, event: &CoordinatorEvent<SeqTypes>) -> anyhow::Result<()> {
    let CoordinatorEvent::LegacyEvent(hotshot_event) = event else {
        return Ok(());
    };
    if let Err(height) = self.inner.update(hotshot_event).await {
        bail!("failed to update API state after {height}: {hotshot_event:?}",);
    }
```
This event consumer currently silently ignores all non-legacy events (including NewDecide). When the new protocol is active, decide events will come through as CoordinatorEvent::NewDecide, meaning the query API will stop getting updates for decided blocks.
Is this intentional for now? If so, a comment explaining that new-protocol decide handling will be added later would help prevent confusion.
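One possible shape for the fix, sketched with a toy `CoordinatorEvent` mirroring the two variants; the `NewDecide` payload and the handler's return type are assumptions, not the PR's actual API:

```rust
// Toy mirror of the two CoordinatorEvent variants; the real enum lives in this PR.
enum CoordinatorEvent {
    LegacyEvent(String),
    NewDecide { height: u64 },
}

// Hypothetical handler: route NewDecide to the query API instead of dropping it
// with a bare `return Ok(())`.
fn handle_event(event: &CoordinatorEvent) -> String {
    match event {
        CoordinatorEvent::LegacyEvent(e) => {
            format!("updated API state from legacy event: {e}")
        },
        CoordinatorEvent::NewDecide { height } => {
            format!("recorded new-protocol decide at height {height}")
        },
    }
}

fn main() {
    let legacy = CoordinatorEvent::LegacyEvent("vote".to_string());
    let decide = CoordinatorEvent::NewDecide { height: 7 };
    println!("{}", handle_event(&legacy));
    println!("{}", handle_event(&decide));
}
```

Exhaustive matching also means the compiler flags any future `CoordinatorEvent` variant that the query service forgets to handle.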
```rust
pub const DA_UPGRADE_VERSION: Version = version(0, 6);
pub const VID2_UPGRADE_VERSION: Version = version(0, 7);
pub const CLIQUENET_VERSION: Version = version(0, 8);
pub const NEW_PROTOCOL_VERSION: Version = version(0, 8);
```
NEW_PROTOCOL_VERSION and CLIQUENET_VERSION are both version(0, 8). Are these intentionally the same? If so, one of them may be redundant. If they're expected to diverge later (e.g. cliquenet ships at 0.8 but new protocol activates at a later version), a comment clarifying the relationship would help.
```rust
|_| Ok(true),
|mut file| {
    let bytes = bincode::serialize(&cert2).context("serialize cert2")?;
    file.write_all(&bytes)?;
```
Nit: using .txt extension for bincode-serialized binary data is misleading. Consider .bin or no extension, consistent with how other persistence files are stored in this backend.
Suggested change:

```rust
let file_path = dir_path.join(view.u64().to_string()).with_extension("bin");
```
```rust
    return Ok(None);
}
let bytes = fs::read(&file_path).context("read cert2")?;
Ok(Some(
```
The load_cert2 path should match the extension used in append_cert2.
Suggested change:

```rust
let file_path = dir_path.join(view.u64().to_string()).with_extension("bin");
```
```rust
//TODO:
let mut cur_view = ViewNumber::new(0);
loop {
    match coordinator.next_consensus_input().await {
```
The maybe_decide method in consensus.rs:929 has expect("cert1 must exist if cert2 exists") — this is a strong invariant. If it's violated (e.g. due to a network partition or gc race), the coordinator panics. Consider returning an error or logging and skipping instead, since the coordinator runs in a spawned task and a panic there would silently kill consensus.
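A sketch of the suggested log-and-skip behavior, using toy certificate types in place of the real ones; the point is replacing the panicking `expect` with a `let ... else` that logs and returns early:

```rust
// Toy stand-ins for the real certificate types; only the control flow matters.
struct Certs {
    cert1: Option<String>,
    cert2: Option<String>,
}

// Hypothetical rewrite of expect("cert1 must exist if cert2 exists"): log and
// skip instead of panicking, since a panic inside the spawned coordinator task
// would silently kill consensus.
fn maybe_decide(certs: &Certs) -> Option<(String, String)> {
    let cert2 = certs.cert2.clone()?;
    let Some(cert1) = certs.cert1.clone() else {
        eprintln!("invariant violated: cert2 present without cert1; skipping decide");
        return None;
    };
    Some((cert1, cert2))
}

fn main() {
    // Invariant violated: previously a panic, now a skipped decide.
    let broken = Certs { cert1: None, cert2: Some("c2".to_string()) };
    assert!(maybe_decide(&broken).is_none());

    let ok = Certs { cert1: Some("c1".to_string()), cert2: Some("c2".to_string()) };
    assert_eq!(maybe_decide(&ok), Some(("c1".to_string(), "c2".to_string())));
    println!("handled missing cert1 without panicking");
}
```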
Nextest failures (4) in this run

See the step summary for flaky tests and slowest tests.