
go-ethereum: add freezer safety margin to prevent data loss after unclean shutdown #4506

Draft
joshuacolvin0 wants to merge 1 commit into master from freezer-safety-margin

Conversation


joshuacolvin0 (Member) commented Mar 15, 2026

pulls in OffchainLabs/go-ethereum#637
fixes NIT-4663

After an unclean shutdown, repair() may truncate the freezer head to
restore cross-table consistency. Previously, blocks were deleted from
the key-value store immediately after freezing, so truncated blocks
could end up missing from both stores — making the node unable to
start (especially for L2 nodes that cannot re-sync pruned blocks from
peers).

Introduce a safety margin (freezerCleanupMargin = freezerBatchLimit)
that retains the most recently frozen blocks in the key-value store.
Since freezeRange reads via nofreezedb (which bypasses the ancient
store), retained blocks can be re-frozen after repair() truncation.
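The margin arithmetic is worth pinning down. Below is a minimal sketch, assuming `frozen` is the freezer head (count of frozen blocks) and `cleanupTail` is the persisted progress marker; the constant 30000 matches upstream go-ethereum's freezerBatchLimit, but the helper name `cleanupWindow` and the exact boundary handling are illustrative, not the code in this diff:

```go
package main

import "fmt"

const (
	freezerBatchLimit    = 30000 // upstream go-ethereum's freezer batch size
	freezerCleanupMargin = freezerBatchLimit
)

// cleanupWindow returns the half-open range [start, limit) of block
// numbers whose key-value copies are safe to delete: everything already
// frozen, minus the margin of most recently frozen blocks, capped to one
// batch of work per freeze cycle.
func cleanupWindow(frozen, cleanupTail uint64) (start, limit uint64) {
	if frozen <= freezerCleanupMargin {
		return cleanupTail, cleanupTail // whole freezer still inside the margin
	}
	limit = frozen - freezerCleanupMargin
	if limit < cleanupTail {
		limit = cleanupTail // nothing new to clean yet
	}
	if limit-cleanupTail > freezerBatchLimit {
		limit = cleanupTail + freezerBatchLimit // bound per-cycle work
	}
	return cleanupTail, limit
}

func main() {
	// With 90000 blocks frozen and cleanup caught up to block 30000, one
	// cycle may delete blocks [30000, 60000): the margin keeps the newest
	// 30000 frozen blocks in the key-value store for potential re-freezing.
	start, limit := cleanupWindow(90000, 30000)
	fmt.Println(start, limit) // 30000 60000
}
```

The final cap in the sketch is the per-cycle bound called out in the key-changes list below, so cleanup can never stall the freeze loop.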

Key changes:

  • Add cleanupMargin field on chainFreezer with persisted cleanup tail
    (freezerCleanupTailKey) so progress resumes across restarts
  • Replace immediate post-freeze deletion with incremental cleanup over
    [cleanupStart, cleanupLimit) using Has()+Get() to distinguish missing
    keys from I/O errors, with backoff on failure (see the sketch after
    this list)
  • Add startup validation in Open(): detect unrecoverable data gaps
    where the freezer has been truncated below the cleanup tail
  • Handle upgrade path (skip-ahead when no tail but frozen >
    FullImmutabilityThreshold) and fresh installs (clean from block 1)
  • Cap per-cycle cleanup to freezerBatchLimit to prevent stalling
  • Bound dangling side chain chase to freezerBatchLimit iterations
  • Add ReadFreezerCleanupTail/WriteFreezerCleanupTail accessors and a
    strict variant for startup/runtime error propagation
  • Surface cleanup tail in ReadChainMetadata diagnostics
  • Add comprehensive test suite (21 tests) covering margin behavior,
    crash recovery, side chain cleanup, boundary conditions, corruption
    detection, upgrade path, and regression guard
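On the Has()+Get() point above: with a leveldb-backed store, a bare Get() that fails does not say whether the key is absent or the read hit an I/O error, and mistaking a disk error for "already deleted" would let the persisted tail advance past blocks that may still be needed for re-freezing. A minimal sketch of the distinction, assuming go-ethereum's ethdb.KeyValueReader interface; `readRetained` is a hypothetical helper name, not the function used in the diff:

```go
package main

import (
	"fmt"

	"github.com/ethereum/go-ethereum/core/rawdb"
	"github.com/ethereum/go-ethereum/ethdb"
)

// readRetained reads a retained entry during incremental cleanup,
// separating "key genuinely missing" (fine: a previous partial run
// already deleted it) from an I/O error (back off and retry next cycle,
// without advancing the persisted cleanup tail).
func readRetained(db ethdb.KeyValueReader, key []byte) (data []byte, missing bool, err error) {
	ok, err := db.Has(key)
	if err != nil {
		return nil, false, fmt.Errorf("cleanup read %x: %w", key, err) // I/O error
	}
	if !ok {
		return nil, true, nil // key really is gone
	}
	data, err = db.Get(key) // present but unreadable counts as I/O failure
	return data, false, err
}

func main() {
	db := rawdb.NewMemoryDatabase()
	_ = db.Put([]byte("h1"), []byte{0x01})

	if data, missing, err := readRetained(db, []byte("h1")); err == nil && !missing {
		fmt.Printf("present: %x\n", data)
	}
	if _, missing, _ := readRetained(db, []byte("h2")); missing {
		fmt.Println("h2 already cleaned; safe to advance the tail past it")
	}
}
```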

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


codecov Bot commented Mar 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 32.17%. Comparing base (d230899) to head (4e09ccd).
⚠️ Report is 18 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4506      +/-   ##
==========================================
- Coverage   32.66%   32.17%   -0.50%     
==========================================
  Files         495      495              
  Lines       58724    58756      +32     
==========================================
- Hits        19185    18903     -282     
- Misses      36165    36471     +306     
- Partials     3374     3382       +8     


github-actions bot (Contributor) commented Mar 15, 2026

❌ 13 Tests Failed:

Tests completed  Failed  Passed  Skipped
4449             13      4436    0
Top 3 failed tests, by shortest run time:
TestDataStreaming_PositiveScenario/Many_senders,_long_messages
Stack Traces | 0.050s run time
... [CONTENT TRUNCATED: Keeping last 20 lines]
        	/home/runner/work/nitro/nitro/util/testhelpers/testhelpers.go:29 +0x9f
        github.com/offchainlabs/nitro/daprovider/data_streaming.testBasic.func1()
        	/home/runner/work/nitro/nitro/daprovider/data_streaming/protocol_test.go:230 +0x19b
        created by github.com/offchainlabs/nitro/daprovider/data_streaming.testBasic in goroutine 109
        	/home/runner/work/nitro/nitro/daprovider/data_streaming/protocol_test.go:223 +0x85
        
    protocol_test.go:230: [] too much time has elapsed since request was signed
INFO [03-16|20:48:31.030] rpc response                             method=datastreaming_start logId=5  err="too much time has elapsed since request was signed" result={} attempt=0 args="[\"0x69b86c9e\", \"0x2c\", \"0xd9\", \"0x2513\", \"0xa\", \"0x68ad1c688fccb1c7cc218b2a023c2571375ea10ff2508fa7098e42ca311656f4636e903c225e2d96eee92afdee4b310f8f5aa448bcf3bdef5a6dd09d63da348d00\"]" errorData=null
    protocol_test.go:230: goroutine 261 [running]:
        runtime/debug.Stack()
        	/opt/hostedtoolcache/go/1.25.7/x64/src/runtime/debug/stack.go:26 +0x5e
        github.com/offchainlabs/nitro/util/testhelpers.RequireImpl({0x1622190, 0xc000559dc0}, {0x16089e0, 0xc000541320}, {0x0, 0x0, 0x0})
        	/home/runner/work/nitro/nitro/util/testhelpers/testhelpers.go:29 +0x9f
        github.com/offchainlabs/nitro/daprovider/data_streaming.testBasic.func1()
        	/home/runner/work/nitro/nitro/daprovider/data_streaming/protocol_test.go:230 +0x19b
        created by github.com/offchainlabs/nitro/daprovider/data_streaming.testBasic in goroutine 109
        	/home/runner/work/nitro/nitro/daprovider/data_streaming/protocol_test.go:223 +0x85
        
    protocol_test.go:230: [] too much time has elapsed since request was signed
--- FAIL: TestDataStreaming_PositiveScenario/Many_senders,_long_messages (0.05s)
TestDataStreaming_PositiveScenario
Stack Traces | 0.080s run time
=== RUN   TestDataStreaming_PositiveScenario
--- FAIL: TestDataStreaming_PositiveScenario (0.08s)
TestRedisProduceComplex/two_producers,_some_consumers_killed,_others_should_take_over_their_work,_some_invalid_entries,_unequal_number_of_requests_from_producers
Stack Traces | 3.160s run time
... [CONTENT TRUNCATED: Keeping last 20 lines]
DEBUG[03-16|20:48:56.630] redis producer: check responses starting
DEBUG[03-16|20:48:56.630] checkResponses                           responded=0   errored=0  checked=0
DEBUG[03-16|20:48:56.632] checkResponses                           responded=48  errored=6  checked=80
DEBUG[03-16|20:48:56.635] redis producer: check responses starting
DEBUG[03-16|20:48:56.635] checkResponses                           responded=0   errored=0  checked=0
DEBUG[03-16|20:48:56.638] redis producer: check responses starting
DEBUG[03-16|20:48:56.641] redis producer: check responses starting
DEBUG[03-16|20:48:56.641] checkResponses                           responded=0   errored=0  checked=0
DEBUG[03-16|20:48:56.647] redis producer: check responses starting
DEBUG[03-16|20:48:56.647] checkResponses                           responded=0   errored=0  checked=0
DEBUG[03-16|20:48:56.654] redis producer: check responses starting
DEBUG[03-16|20:48:56.654] checkResponses                           responded=0   errored=0  checked=0
DEBUG[03-16|20:48:56.659] redis producer: check responses starting
DEBUG[03-16|20:48:56.660] checkResponses                           responded=0   errored=0  checked=0
DEBUG[03-16|20:48:56.663] consumer returned error                  error="invalid request: 1773694134540-0" msgId=1773694134540-0
DEBUG[03-16|20:48:56.670] redis producer: check responses starting
DEBUG[03-16|20:48:56.670] checkResponses                           responded=0   errored=0  checked=0
DEBUG[03-16|20:48:56.673] checkResponses                           responded=25  errored=1  checked=26
ERROR[03-16|20:48:56.676] Error from XpendingExt in getting PEL for auto claim err="NOGROUP No such key 'stream:08d87e96-a77e-446d-8870-3d5b75b8c2b6' or consumer group 'stream:08d87e96-a77e-446d-8870-3d5b75b8c2b6'" pendingLen=0
--- FAIL: TestRedisProduceComplex/two_producers,_some_consumers_killed,_others_should_take_over_their_work,_some_invalid_entries,_unequal_number_of_requests_from_producers (3.16s)


joshuacolvin0 marked this pull request as draft on March 16, 2026 03:55
joshuacolvin0 force-pushed the freezer-safety-margin branch from 01645ff to 91a9166 on March 16, 2026 20:33
