
[Fix][Zeta] Guard state cleanup races after node failure #10687

Open

zhangshenghang wants to merge 18 commits into apache:dev from zhangshenghang:fix/zeta-state-cleanup-convergence

Conversation

@zhangshenghang (Member) commented Apr 1, 2026

Purpose of this pull request

This PR fixes an engine-side terminal-state convergence bug after worker node failure.

When a worker goes offline, the engine can start cleaning distributed state from the running job state maps before all asynchronous task/pipeline/job callbacks have finished. In the current code path, PhysicalVertex, SubPlan, and PhysicalPlan can observe missing state entries and throw NullPointerException, which interrupts terminal-state convergence and may leave the job hanging in an intermediate state.

This PR changes the cleanup strategy so that it no longer relies on local fallback state (a rough sketch follows the list):

  • keep terminal job/pipeline/task state in distributed maps for a short cleanup delay window
  • remove runningJobInfoIMap immediately so terminal jobs are not restored on master switch
  • delay physical removal of distributed state maps until late callbacks have time to drain
  • treat already-cleaned state as a no-op defensive path instead of rebuilding distributed state
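
A minimal sketch of the idea (plain maps stand in for the Hazelcast IMaps; the delay value, executor wiring, and per-job key handling are illustrative, not the exact code):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Minimal sketch of the delayed-cleanup idea; plain maps stand in for
// the Hazelcast IMaps, and the names below are illustrative.
class JobStateCleanupSketch {
    final Map<Long, Object> runningJobInfoIMap = new ConcurrentHashMap<>();
    final Map<Long, Object> runningJobStateIMap = new ConcurrentHashMap<>();
    final ScheduledExecutorService monitorService =
            Executors.newSingleThreadScheduledExecutor();
    final long cleanupDelayMillis = 60_000L; // assumed retention window

    void scheduleRemoveJobStateMaps(long jobId) {
        // Remove the restore record right away so a master switch does
        // not try to resubmit a job that already finished...
        runningJobInfoIMap.remove(jobId);
        // ...but keep the terminal state entries as tombstones for a
        // short window so late callbacks still see a valid end state.
        monitorService.schedule(
                () -> removeJobStateMaps(jobId),
                cleanupDelayMillis,
                TimeUnit.MILLISECONDS);
    }

    void removeJobStateMaps(long jobId) {
        // Physical removal; a key that is already gone is a no-op,
        // never a reason to rebuild distributed state.
        runningJobStateIMap.remove(jobId);
    }
}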

It also adds:

  • targeted regression tests for terminal tombstone behavior
  • a delay-based cleanup regression test
  • an engine E2E scenario for the BATCH + no checkpoint + job.retry.times=0 no-restore path

Does this PR introduce any user-facing change?

No user-facing API/config change in normal operation. This improves failure handling so jobs are less likely to hang in an intermediate state after node failure.

How was this patch tested?

Verified locally:

  • ./mvnw -nsu -pl seatunnel-engine/seatunnel-engine-common spotless:check
  • ./mvnw -nsu -pl seatunnel-engine/seatunnel-engine-server spotless:check
  • ./mvnw -nsu -pl seatunnel-e2e/seatunnel-engine-e2e/connector-seatunnel-e2e-base spotless:check

Additional notes:

  • Added targeted regression tests: StateTransitionCleanupTest, JobStateCleanupDelayTest
  • Added engine E2E coverage: ClusterFailureNoRestoreIT
  • Full Maven test/compile validation in this checkout is currently blocked by unrelated upstream build issues in other modules: seatunnel-engine-server references missing types in the current checkout, and reactor builds are blocked by seatunnel-config-shade compilation issues. This PR therefore remains a draft.

@zhangshenghang zhangshenghang marked this pull request as ready for review April 1, 2026 09:52
@DanielLeens left a comment

I agree with the tombstone approach for late callbacks, but I think the delayed cleanup currently loses its cleanup owner across a second master failover.

scheduleRemoveJobStateMaps() removes runningJobInfoIMap immediately and then schedules removeJobStateMaps() only in the local monitorService. If that master dies during the delay window, the scheduled task disappears with it. When the next master restores, there is no runningJobInfoIMap entry left to rediscover this job, and restoreJobFromMasterActiveSwitch() just returns for terminal states.

That leaves the terminal entries in runningJobStateIMap / runningJobStateTimestampsIMap orphaned permanently. I think the delayed-cleanup intent needs to be persisted in distributed state (or another recoverable cleanup record); otherwise this closes the race only as long as the same master survives until the timer fires.
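
To make the hole concrete, the restore side is roughly this shape (a simplified sketch with stand-in types, not the actual code):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of why the next master cannot rediscover the job: restore
// only iterates runningJobInfoIMap, which was already cleared up front.
class MasterSwitchRestoreSketch {
    final Map<Long, Object> runningJobInfoIMap = new ConcurrentHashMap<>();

    void restoreAllRunningJobFromMasterNodeSwitch() {
        // A terminal job whose info entry was removed before the local
        // timer fired is invisible here, so its state-map tombstones
        // are never cleaned up by the new master.
        runningJobInfoIMap.forEach(this::restoreJobFromMasterActiveSwitch);
    }

    void restoreJobFromMasterActiveSwitch(Long jobId, Object jobInfo) {
        // returns early for terminal states, as noted above
    }
}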

@DanielLeens left a comment

Thanks for the update. I re-reviewed the latest HEAD locally, and I still see the same failover hole during the delayed-cleanup window.

cleanJob() still calls scheduleRemoveJobStateMaps() (JobMaster.java:778-782), and that method still removes runningJobInfoIMap immediately (JobMaster.java:644-647) before scheduling the delayed cleanup on the local monitorService (JobMaster.java:658-672). But master-switch restore only scans runningJobInfoIMap.entrySet() (CoordinatorService.java:636-642, 665-681).

So if the active master dies before the delayed task fires, the next master has no distributed record left to rediscover this terminal job, and the remaining state maps are still orphaned. The new end-state guard in restoreJobFromMasterActiveSwitch() does not close that gap, because it only runs for jobs that still have a runningJobInfoIMap entry.

The new JobStateCleanupDelayTest currently asserts that runningJobInfoIMap is already null immediately after terminal completion, which effectively codifies the same gap instead of covering the second-master-failover case.

I think the delayed-cleanup intent still needs to be persisted in recoverable distributed state, or runningJobInfoIMap needs to remain until delayed cleanup actually executes, before this can merge.

@DanielLeens left a comment

Thanks for the update. I pulled the latest HEAD locally and re-reviewed the delayed-cleanup path.

I still see the same blocking failover hole during the cleanup-delay window. cleanJob() still calls scheduleRemoveJobStateMaps(), and that method still removes runningJobInfoIMap immediately before scheduling the delayed cleanup only on the local monitorService. But master-switch restore still discovers jobs only by scanning runningJobInfoIMap. So if the active master dies before the delayed task fires, the next master has no distributed record left to rediscover this terminal job, and the remaining state maps are still orphaned.

The new end-state guard in restoreJobFromMasterActiveSwitch() does not close that gap because it only runs for jobs that still have a runningJobInfoIMap entry. The new JobStateCleanupDelayTest currently asserts that runningJobInfoIMap is already null during the delay window, which codifies the same hole instead of covering the second-master-failover case.

I think the delayed cleanup intent still needs to be persisted in recoverable distributed state, or runningJobInfoIMap needs to stay until the delayed cleanup actually executes. After that, this will be much closer.

@DanielLeens left a comment

I pulled the latest HEAD locally again and I still see the same blocking failover hole during the cleanup-delay window.

JobMaster.scheduleRemoveJobStateMaps() still removes runningJobInfoIMap immediately before scheduling the delayed cleanup only on the local monitorService. But master-switch restore still discovers jobs by scanning runningJobInfoIMap in CoordinatorService.restoreAllRunningJobFromMasterNodeSwitch(). So if the active master dies before the delayed task fires, the next master still has no distributed record left to rediscover this terminal job, and the remaining state maps can still be orphaned.

The new stateCleanupDelayMillis=0 test config and the late-checkpoint guard do not close that gap, because they do not persist the delayed-cleanup intent across a second master failover.

I still think this needs one of these two directions before merge:

  • keep runningJobInfoIMap until the delayed cleanup actually executes, or
  • persist the delayed-cleanup intent in recoverable distributed state.

After that, I am happy to re-review.

@DanielLeens left a comment

I pulled the latest HEAD locally again and re-checked the terminal-cleanup / master-switch path.

The previous failover hole looks closed now: JobMaster.scheduleRemoveJobStateMaps() persists a JobCleanupRecord in IMAP_PENDING_JOB_CLEANUP, CoordinatorService.restoreJobFromMasterActiveSwitch() reschedules terminal cleanup instead of dropping the job blindly, and the REST / overview paths filter delayed-cleanup tombstones so finished jobs are not shown as running. With the new unit / E2E coverage around delayed cleanup and no-restore cluster failure, I do not see the previous blocker in the current revision.
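
For reference, the restore-side flow now looks roughly like this (a sketch: only the names above come from the actual code; the record fields and keying here are simplified assumptions):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the restore-side rescheduling; record fields are assumed.
class RestoreRescheduleSketch {
    static class JobCleanupRecord { long jobId; long cleanupAtMillis; }

    // Stand-in for the Hazelcast map behind IMAP_PENDING_JOB_CLEANUP.
    final Map<Long, JobCleanupRecord> pendingJobCleanupIMap = new ConcurrentHashMap<>();

    void restoreJobFromMasterActiveSwitch(long jobId) {
        JobCleanupRecord record = pendingJobCleanupIMap.get(jobId);
        if (record != null) {
            // The new master re-arms the delayed cleanup instead of
            // dropping the terminal job, closing the second-failover gap.
            schedulePendingJobCleanup(jobId, record);
            return;
        }
        // ...otherwise fall through to the normal restore path...
    }

    void schedulePendingJobCleanup(long jobId, JobCleanupRecord record) {
        // re-arm a timer for record.cleanupAtMillis (omitted here)
    }
}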

@DanielLeens left a comment

Thanks for the latest update. I re-reviewed the current head locally as seatunnel-review-10687 at commit 17750abb9, comparing it with upstream/dev.

What This PR Fixes

  • User pain: after terminal job/pipeline state cleanup, late asynchronous callbacks or active-master switch recovery can observe missing runtime state and either report misleading errors or leave stale job metadata behind.
  • Fix approach: the PR delays terminal job-state cleanup, records pending cleanup metadata in Hazelcast, reschedules cleanup after master failover, and prevents a new non-savepoint submission from reusing a job id while the old terminal state is still waiting for cleanup.
  • One-line value: terminal state becomes a short-lived tombstone instead of disappearing immediately, which makes late callbacks and master failover safer.

Core Logic Review

Key changed files and methods:

  • CoordinatorService.java: schedulePendingJobCleanup(...), processPendingJobCleanup(...), restoreAllRunningJobFromMasterNodeSwitch(...), and the submitJob(...) pending-cleanup guard.
  • JobMaster.java: createJobCleanupRecord() and scheduleRemoveJobStateMaps().
  • JobCleanupRecord.java: distributed cleanup metadata for job-level state keys.
  • JobInfoService.java: hides terminal jobs that are retained only as cleanup tombstones.

Important before/after point:

// Before: terminal state could be removed immediately, so late callbacks saw null state.
removeJobStateMaps();
// After: terminal state is retained and removed later by a cleanup record.
pendingJobCleanupIMap.put(jobId, cleanupRecord);
coordinatorService.schedulePendingJobCleanup(jobId, cleanupRecord);

The normal Zeta lifecycle does hit this change:

Job reaches terminal state
  -> PhysicalPlan.addPipelineEndCallback()
      -> JobMaster.initStateFuture() completion handler
          -> JobMaster.cleanJob()
              -> createJobCleanupRecord()
              -> pendingJobCleanupIMap.put(jobId, record)
              -> CoordinatorService.schedulePendingJobCleanup(jobId, record)

Delayed cleanup
  -> CoordinatorService.processPendingJobCleanup(jobId, record)
      -> verifies initializationTimestamp still matches current JobInfo
      -> verifies current job state is terminal
      -> removes runningJobInfoIMap
      -> removes recorded state/timestamp keys

Active-master switch before cleanup fires
  -> CoordinatorService.restoreAllRunningJobFromMasterNodeSwitch()
      -> terminal-state zombie jobs are pre-filtered
      -> restoreJobFromMasterActiveSwitch()
          -> reschedules pending cleanup if a cleanup record exists
          -> otherwise removes stale runningJobInfoIMap
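
A minimal sketch of the ownership guard in that delayed-cleanup step (the JobInfo/record shapes are simplified assumptions; only the method name and the timestamp/terminal checks come from the diff):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the delayed-cleanup ownership guard; shapes simplified.
class ProcessPendingCleanupSketch {
    static class JobInfo { long initializationTimestamp; }
    static class JobCleanupRecord { long initializationTimestamp; }

    final Map<Long, JobInfo> runningJobInfoIMap = new ConcurrentHashMap<>();

    void processPendingJobCleanup(long jobId, JobCleanupRecord record) {
        JobInfo current = runningJobInfoIMap.get(jobId);
        // If the job id was reused by a newer submission, the timestamps
        // differ and this stale cleanup must not touch the new job.
        if (current != null
                && current.initializationTimestamp != record.initializationTimestamp) {
            return;
        }
        if (!isTerminal(jobId)) {
            return; // never clean up a job that is still running
        }
        runningJobInfoIMap.remove(jobId);
        // ...then remove the state/timestamp keys listed in the record.
    }

    boolean isTerminal(long jobId) {
        return true; // placeholder for the real terminal-state check
    }
}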

Local static verification:

  • git diff --stat upstream/dev...seatunnel-review-10687: 25 files changed, +1272/-45.
  • git diff --name-status upstream/dev...seatunnel-review-10687: Zeta coordinator/job-master cleanup code, serialization hook, REST visibility, tests, and one E2E are touched.
  • gh pr checks 10687: Build is CANCELLED; label and notification checks are successful.
  • Local build/tests: not run. This review is based on the local branch, full diff, and the job lifecycle / failover call-chain inspection.

Compatibility, Side Effects, Errors, and Logs

Compatibility impact: mostly compatible, with one intentional operational behavior change. Finished jobs are retained in runtime maps for state-cleanup-delay-ms before cleanup, while JobInfoService.shouldShowAsRunningJob() prevents these tombstones from showing as running jobs. No public API, protocol, or serialization format used by clients is removed, but a new internal Hazelcast data type is added via ResourceDataSerializerHook.
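
The visibility filter amounts to roughly the following (a sketch; keying by job id and the pending-cleanup lookup are assumptions about the real check):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the REST-visibility filter; keying is an assumption.
class JobVisibilitySketch {
    final Map<Long, Object> runningJobInfoIMap = new ConcurrentHashMap<>();
    final Map<Long, Object> pendingJobCleanupIMap = new ConcurrentHashMap<>();

    boolean shouldShowAsRunningJob(long jobId) {
        // A terminal job retained only as a delayed-cleanup tombstone
        // must not be reported as running by REST/overview endpoints.
        return runningJobInfoIMap.containsKey(jobId)
                && !pendingJobCleanupIMap.containsKey(jobId);
    }
}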

Performance and side effects: the default 60s retention adds bounded temporary IMap entries for terminal jobs/pipelines/tasks and checkpoint state keys. Cleanup scheduling uses the existing monitor service and should not add hot-path CPU/network cost. The owner-timestamp guard is present and important: it keeps delayed cleanup from removing a newly submitted job that reuses the same id.

Error handling and logs: cleanup failures are logged and retried through retained cleanup records. Late state transitions now log and skip when the state entry is already missing or terminal, instead of trying to force an invalid transition.
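
The defensive late-transition path is roughly the following (a sketch with an illustrative state model, not the PR's exact code):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the defensive no-op path for late callbacks; names and
// the state model are illustrative.
class LateTransitionSketch {
    enum State { RUNNING, FINISHED, FAILED, CANCELED }

    final Map<Object, State> runningJobStateIMap = new ConcurrentHashMap<>();

    void updateState(Object stateKey, State target) {
        State current = runningJobStateIMap.get(stateKey);
        if (current == null || current != State.RUNNING) {
            // A late callback arriving after cleanup, or after the entry
            // already reached a terminal state, is logged and skipped
            // instead of throwing NullPointerException.
            System.out.println("skip transition for " + stateKey
                    + ": state already cleaned or terminal");
            return;
        }
        runningJobStateIMap.put(stateKey, target);
    }
}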

Findings

I did not find a new source-level blocker in the latest code. The remaining blocker is CI status.

Merge Conclusion

Conclusion: can merge after CI is rerun successfully

  1. Blocking items:
    • CI: Build is currently CANCELLED on the latest head (17750abb9). This must be rerun and pass before merge.
  2. Suggested non-blocking items:
    • None from this latest source review.

Overall assessment: the design is a reasonable long-term fix for the terminal-state cleanup race. It keeps state retention bounded, records enough ownership data to avoid deleting a newer job, and includes targeted unit coverage for cleanup ownership, submit blocking, delayed cleanup, and late state transitions. Once Build is green, I think this PR can move forward.
