[Fix][Connector-V2] Fix Doris sink retry backoff and scheduler leak by JAEKWANG97 · Pull Request #10772 · apache/seatunnel

JAEKWANG97 · 2026-04-16T02:47:01Z

Purpose of this pull request

Fixes three resilience issues in the Doris sink connector that cause cascading OOM when Doris BE nodes are unavailable. Closes #10627.

DorisSinkWriter.checkDone(): Stop the scheduled executor after a stream load failure is detected. Previously the scheduler ran indefinitely, flooding logs with repeated errors.
DorisCommitter.commitTransaction(): Add exponential backoff between retries (1s, 2s, 4s… up to 16s) to reduce load on remaining infrastructure during BE unavailability.
DorisCommitter.abortTransaction(): Fix maxRetry field never initialized from config, causing abort retry to run only once (defaulted to 0).

Does this PR introduce any user-facing change?

No. Internal retry and scheduling behavior is improved. No config options or public APIs changed.

How was this patch tested?

Manual code-path review only. No dedicated tests were added for this change.

Check list

If any new Jar binary package adding in your PR, please add License Notice according
New License Guide
If necessary, please update the documentation to describe the new feature. https://github.com/apache/seatunnel/tree/dev/docs
If necessary, please update incompatible-changes.md to describe the incompatibility caused by this PR.
If you are contributing the connector code, please check that the following files are updated:
1. Update plugin-mapping.properties and add new connector information in it
2. Update the pom file of seatunnel-dist
3. Add ci label in label-scope-conf
4. Add e2e testcase in seatunnel-e2e
5. Update connector plugin_config

… duplication

davidzollo

@DanielLeens Can you review this PR?

bot did, not me

yzeng1618

I’d suggest adding a small set of unit tests to lock in the new retry behavior:

sleepBeforeRetry: verify retry = 1 sleeps ~1s, retry = 5 caps at ~16s, and InterruptedException restores the interrupt flag and rethrows as IOException.
commitTransaction: mock the HttpClient to throw IOException N times before succeeding, and assert total elapsed time is at least the cumulative backoff.
abortTransaction: mock the FE to always fail and assert the number of requests equals maxRetries + 1.

yzeng1618 · 2026-04-18T13:38:25Z


+    private void sleepBeforeRetry(int retry) throws IOException {
+        try {
+            Thread.sleep(1000L * (1L << Math.min(retry - 1, 4)));


if retry <= 0 ever slips in, 1L << -1 becomes Long.MIN_VALUE and Thread.sleep throws.

Suggested change

Thread.sleep(1000L * (1L << Math.min(retry - 1, 4)));

int shift = Math.min(Math.max(retry - 1, 0), 4);

Thread.sleep(1000L * (1L << shift));

Thanks! I applied this in a new commit.

JAEKWANG97 · 2026-04-18T20:41:18Z

I’d suggest adding a small set of unit tests to lock in the new retry behavior:

I think checking the real backoff timing also makes sense, but I’m not sure how stable that would be in CI. It might be a bit flaky.

Would it be okay to use simpler tests for this, like verifying the retry path and retry count instead?

JAEKWANG97 · 2026-04-18T20:55:53Z

While working on this issue, I also found two more points in the same retry path:

response cleanup is not handled in commitTransaction and abortTransaction
sleepBeforeRetry is still called even after the last failed attempt

Since this is my first PR here, I tried to keep the scope small and focused on the retry/backoff behavior from the issue, so I did not include those changes in this PR.

DanielLeens

Hi @JAEKWANG97, thank you for the careful follow-up here, and welcome to the SeaTunnel community. I pulled the PR locally and reviewed the Doris sink retry path end to end.

Local branch: seatunnel-review-10772
Head reviewed: 9ed6a7a0f

Runtime path checked:

Doris sink writes records
  -> DorisSinkWriter.write()
      -> DorisStreamLoad.writeRecord()
  -> scheduled checker
      -> DorisSinkWriter.checkDone()
          -> getLoadFailedMsg() is not null
          -> set loadException
          -> shutdownNow() the scheduler
          -> the next write() fails through checkLoadException()

checkpoint / abort path
  -> DorisSinkWriter.prepareCommit()
      -> flush()
      -> returns DorisCommitInfo(txnId)
  -> DorisCommitter.commit()
      -> commitTransaction()
          -> httpClient.execute()
          -> retry with backoff on IOException or non-200 response
  -> DorisCommitter.abort()
      -> abortTransaction()
          -> retry abort with the same backoff path

The normal Doris 2PC commit/abort path and the streaming failure-check path both hit this PR.

Findings

Issue 1: failed HTTP responses are not closed before retrying

Location: seatunnel-connectors-v2/connector-doris/src/main/java/org/apache/seatunnel/connectors/doris/sink/committer/DorisCommitter.java:100
Why it matters: when Doris returns a non-200 response or an abort response without an entity, the code moves to the next retry without consuming/closing the CloseableHttpResponse. Since this PR makes abortTransaction() retry instead of failing immediately, that resource leak can now be amplified in the abort path.
Suggested fix: close/consume each failed response before retrying, and make sure the success path also closes the response after reading the entity.
Severity: medium

Issue 2: the code sleeps even after the last failed attempt

Location: DorisCommitter.java:97
Why it matters: while (retry++ <= maxRetries) increments retry before the sleep call. On the final failed attempt there is no next retry, but the code still waits. This slows checkpoint commit/abort failure convergence when Doris is unavailable.
Suggested fix: only sleep when another attempt remains, or rewrite the loop as an explicit for (int attempt = 0; attempt <= maxRetries; attempt++).
Severity: low

Issue 3: the new retry behavior has no small unit test yet

Location: DorisCommitter.java:136
Why it matters: this PR changes retry timing and interrupt handling. A small test can lock in retry count, interrupt flag restoration, and the "do not sleep after final failure" behavior.
Suggested fix: it is fine to avoid flaky wall-clock assertions. Mocking CloseableHttpClient and using an injectable/package-private sleeper would keep the test stable.
Severity: low

Local checks performed:

git fetch upstream pull/10772/head:seatunnel-review-10772 --force: passed.
git diff --stat upstream/dev...seatunnel-review-10772: 2 files, +41/-10.
Code inspection covered DorisCommitter and DorisSinkWriter.
GitHub checks observed: Build, labeler, and Notify test workflow are passing.
Local build/tests were not run in this review.

Merge conclusion

Conclusion: can merge after fixes

Blocking items:

Issue 1: please close/consume failed HTTP responses before retrying. This is the one resource-safety point I think should be fixed in this PR.

Non-blocking suggestions:

Issue 2: skip the final unnecessary sleep.
Issue 3: add a small non-flaky retry behavior test.

Overall evaluation:

This is a good first contribution, and the direction is absolutely reasonable. The scheduler shutdown part is helpful, and the retry backoff is the right idea. Once the failed-response cleanup is handled, the patch will be much safer for real Doris failure scenarios. Thanks again for keeping the scope thoughtful.

github-actions bot added connectors-v2 doris labels Apr 16, 2026

[Fix][Connector-V2] Fix Doris sink retry backoff and scheduler leak

fb3372a

JAEKWANG97 force-pushed the fix/doris-sink-retry-backoff-and-scheduler branch from b05a328 to fb3372a Compare April 16, 2026 02:50

[Fix][Connector-V2] Extract sleepBeforeRetry helper to remove backoff…

1a05ae0

… duplication

davidzollo previously requested changes Apr 16, 2026

View reviewed changes

davidzollo added the First-time contributor First-time contributor label Apr 16, 2026

yzeng1618 reviewed Apr 18, 2026

View reviewed changes

[Fix][Connector-V2] Improve Doris committer retry handling

9ed6a7a

DanielLeens suggested changes Apr 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Fix][Connector-V2] Fix Doris sink retry backoff and scheduler leak#10772

[Fix][Connector-V2] Fix Doris sink retry backoff and scheduler leak#10772
JAEKWANG97 wants to merge 3 commits intoapache:devfrom
JAEKWANG97:fix/doris-sink-retry-backoff-and-scheduler

JAEKWANG97 commented Apr 16, 2026 •

edited

Loading

Uh oh!

davidzollo left a comment •

edited

Loading

Uh oh!

yzeng1618 left a comment

Uh oh!

yzeng1618 Apr 18, 2026

Uh oh!

JAEKWANG97 Apr 18, 2026

Uh oh!

JAEKWANG97 commented Apr 18, 2026

Uh oh!

JAEKWANG97 commented Apr 18, 2026

Uh oh!

DanielLeens left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

	Thread.sleep(1000L * (1L << Math.min(retry - 1, 4)));
	int shift = Math.min(Math.max(retry - 1, 0), 4);
	Thread.sleep(1000L * (1L << shift));

Conversation

JAEKWANG97 commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose of this pull request

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

Uh oh!

davidzollo left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yzeng1618 left a comment

Choose a reason for hiding this comment

Uh oh!

yzeng1618 Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

JAEKWANG97 Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

JAEKWANG97 commented Apr 18, 2026

Uh oh!

JAEKWANG97 commented Apr 18, 2026

Uh oh!

DanielLeens left a comment

Choose a reason for hiding this comment

Findings

Merge conclusion

Conclusion: can merge after fixes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

JAEKWANG97 commented Apr 16, 2026 •

edited

Loading

davidzollo left a comment •

edited

Loading