
[Fix][Connector-V2] Fix Doris sink retry backoff and scheduler leak#10772

Open
JAEKWANG97 wants to merge 3 commits into apache:dev from JAEKWANG97:fix/doris-sink-retry-backoff-and-scheduler

Conversation


@JAEKWANG97 JAEKWANG97 commented Apr 16, 2026

Purpose of this pull request

Fixes three resilience issues in the Doris sink connector that cause cascading OOM when Doris BE nodes are unavailable. Closes #10627.

  1. DorisSinkWriter.checkDone(): Stop the scheduled executor after a stream load failure is detected. Previously the scheduler ran indefinitely, flooding logs with repeated errors.

  2. DorisCommitter.commitTransaction(): Add exponential backoff between retries (1s, 2s, 4s… up to 16s) to reduce load on remaining infrastructure during BE unavailability.

  3. DorisCommitter.abortTransaction(): Fix the maxRetry field, which was never initialized from the config; it defaulted to 0, so a failed abort was attempted only once and never retried.
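As a sketch, the capped backoff schedule described in item 2 can be computed like this (the class and method names below are illustrative, not the actual DorisCommitter code):

```java
// Illustrative sketch of the capped exponential backoff (1s, 2s, 4s, 8s, 16s).
// BackoffSketch and delayMillis are hypothetical names, not connector API.
final class BackoffSketch {
    static final long BASE_MS = 1000L; // first retry waits 1s
    static final int MAX_SHIFT = 4;    // cap the delay at 1s << 4 = 16s

    // retry is 1-based; the shift is clamped so retry <= 0 cannot go negative
    static long delayMillis(int retry) {
        int shift = Math.min(Math.max(retry - 1, 0), MAX_SHIFT);
        return BASE_MS * (1L << shift);
    }
}
```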

Does this PR introduce any user-facing change?

No. Internal retry and scheduling behavior is improved. No config options or public APIs changed.

How was this patch tested?

Manual code-path review only. No dedicated tests were added for this change.

Check list

@JAEKWANG97 JAEKWANG97 force-pushed the fix/doris-sink-retry-backoff-and-scheduler branch from b05a328 to fb3372a on April 16, 2026 02:50
Contributor

@davidzollo davidzollo left a comment


@DanielLeens Can you review this PR?

@davidzollo davidzollo added the First-time contributor First-time contributor label Apr 16, 2026
@davidzollo davidzollo dismissed their stale review April 16, 2026 10:44

bot did, not me

Collaborator

@yzeng1618 yzeng1618 left a comment


I’d suggest adding a small set of unit tests to lock in the new retry behavior:

  • sleepBeforeRetry: verify that retry = 1 sleeps ~1s, retry = 5 caps at ~16s, and that InterruptedException restores the interrupt flag and is rethrown as IOException.
  • commitTransaction: mock the HttpClient to throw IOException N times before succeeding, and assert the total elapsed time is at least the cumulative backoff.
  • abortTransaction: mock the FE to always fail and assert the number of requests equals maxRetries + 1.


private void sleepBeforeRetry(int retry) throws IOException {
    try {
        Thread.sleep(1000L * (1L << Math.min(retry - 1, 4)));
Collaborator


If retry <= 0 ever slips in, the shift amount goes negative: Java masks the shift distance to 6 bits for long, so 1L << -1 evaluates to Long.MIN_VALUE, the multiplication by 1000L overflows, and the resulting delay is garbage — Thread.sleep can even throw IllegalArgumentException on a negative value.

Suggested change
Thread.sleep(1000L * (1L << Math.min(retry - 1, 4)));
int shift = Math.min(Math.max(retry - 1, 0), 4);
Thread.sleep(1000L * (1L << shift));

Author


Thanks! I applied this in a new commit.

@JAEKWANG97
Author

I’d suggest adding a small set of unit tests to lock in the new retry behavior:

I think checking the real backoff timing also makes sense, but I’m not sure how stable that would be in CI. It might be a bit flaky.

Would it be okay to use simpler tests for this, like verifying the retry path and retry count instead?
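For what it's worth, a timing-free test can assert the attempt count directly. A minimal sketch, with hypothetical names rather than the real committer API:

```java
// Timing-free retry-count sketch: no clocks, no sleeps, just attempt counting.
// RetryCountSketch and attemptsUsed are hypothetical names for illustration.
final class RetryCountSketch {
    interface Attempt { boolean succeed(); } // true = the call succeeded

    // Try up to maxRetries + 1 times, stopping early on success; return the
    // number of attempts actually made so a test can assert on it.
    static int attemptsUsed(Attempt attempt, int maxRetries) {
        int attempts = 0;
        for (int i = 0; i <= maxRetries; i++) {
            attempts++;
            if (attempt.succeed()) {
                break;
            }
        }
        return attempts;
    }
}
```

A test then asserts, for example, that an always-failing attempt is made exactly maxRetries + 1 times, with no wall-clock dependence.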

@JAEKWANG97
Author

While working on this issue, I also found two more points in the same retry path:

  1. response cleanup is not handled in commitTransaction and abortTransaction
  2. sleepBeforeRetry is still called even after the last failed attempt

Since this is my first PR here, I tried to keep the scope small and focused on the retry/backoff behavior from the issue, so I did not include those changes in this PR.


@DanielLeens DanielLeens left a comment


Hi @JAEKWANG97, thank you for the careful follow-up here, and welcome to the SeaTunnel community. I pulled the PR locally and reviewed the Doris sink retry path end to end.

Local branch: seatunnel-review-10772
Head reviewed: 9ed6a7a0f

Runtime path checked:

Doris sink writes records
  -> DorisSinkWriter.write()
      -> DorisStreamLoad.writeRecord()
  -> scheduled checker
      -> DorisSinkWriter.checkDone()
          -> getLoadFailedMsg() is not null
          -> set loadException
          -> shutdownNow() the scheduler
          -> the next write() fails through checkLoadException()

checkpoint / abort path
  -> DorisSinkWriter.prepareCommit()
      -> flush()
      -> returns DorisCommitInfo(txnId)
  -> DorisCommitter.commit()
      -> commitTransaction()
          -> httpClient.execute()
          -> retry with backoff on IOException or non-200 response
  -> DorisCommitter.abort()
      -> abortTransaction()
          -> retry abort with the same backoff path

The normal Doris 2PC commit/abort path and the streaming failure-check path both hit this PR.

Findings

Issue 1: failed HTTP responses are not closed before retrying

  • Location: seatunnel-connectors-v2/connector-doris/src/main/java/org/apache/seatunnel/connectors/doris/sink/committer/DorisCommitter.java:100
  • Why it matters: when Doris returns a non-200 response or an abort response without an entity, the code moves to the next retry without consuming/closing the CloseableHttpResponse. Since this PR makes abortTransaction() retry instead of failing immediately, that resource leak can now be amplified in the abort path.
  • Suggested fix: close/consume each failed response before retrying, and make sure the success path also closes the response after reading the entity.
  • Severity: medium
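One way to guarantee cleanup on every path is try-with-resources. A minimal sketch with a stand-in type (FakeResponse is hypothetical; the real code would use CloseableHttpResponse and consume the entity before retrying):

```java
// Sketch: try-with-resources closes the response on both the success path and
// the non-200 retry path. FakeResponse is a stand-in for CloseableHttpResponse.
final class ResponseCleanupSketch {
    static final class FakeResponse implements AutoCloseable {
        final int status;
        boolean closed;
        FakeResponse(int status) { this.status = status; }
        @Override public void close() { closed = true; }
    }

    // Returns true on HTTP 200; the response is closed no matter which branch
    // runs, which is the property the current retry loop is missing.
    static boolean attemptOnce(FakeResponse response) {
        try (FakeResponse r = response) {
            return r.status == 200;
        }
    }
}
```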

Issue 2: the code sleeps even after the last failed attempt

  • Location: DorisCommitter.java:97
  • Why it matters: while (retry++ <= maxRetries) increments retry before the sleep call. On the final failed attempt there is no next retry, but the code still waits. This slows checkpoint commit/abort failure convergence when Doris is unavailable.
  • Suggested fix: only sleep when another attempt remains, or rewrite the loop as an explicit for (int attempt = 0; attempt <= maxRetries; attempt++).
  • Severity: low
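A sketch of the suggested loop shape, recording the would-be sleeps so the "no wait after the final failure" property is directly observable (names are illustrative, not the connector's code):

```java
// Sketch of the explicit-attempt loop: sleep only while another retry remains.
// Recording the retry numbers stands in for actually sleeping.
final class NoFinalSleepSketch {
    static java.util.List<Integer> backoffsWhenAllAttemptsFail(int maxRetries) {
        java.util.List<Integer> sleeps = new java.util.ArrayList<>();
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            // every attempt "fails" in this sketch; wait only if a retry
            // remains, so the final failure surfaces immediately
            if (attempt < maxRetries) {
                sleeps.add(attempt + 1); // stands in for sleepBeforeRetry(attempt + 1)
            }
        }
        return sleeps;
    }
}
```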

Issue 3: the new retry behavior has no small unit test yet

  • Location: DorisCommitter.java:136
  • Why it matters: this PR changes retry timing and interrupt handling. A small test can lock in retry count, interrupt flag restoration, and the "do not sleep after final failure" behavior.
  • Suggested fix: it is fine to avoid flaky wall-clock assertions. Mocking CloseableHttpClient and using an injectable/package-private sleeper would keep the test stable.
  • Severity: low
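One shape for the injectable-sleeper idea: a recording fake stands in for Thread.sleep, so a test can assert the exact delays without waiting. The interface and method names below are assumptions, not the connector's API:

```java
// Sketch: an injectable sleeper seam. Production code would pass a sleeper
// backed by Thread.sleep; tests pass a recording fake. All names hypothetical.
final class InjectableSleeperSketch {
    interface Sleeper { void sleep(long millis); } // the real seam would declare InterruptedException

    // Drive `retries` backoffs through the sleeper and report what it saw.
    static java.util.List<Long> backoffsSeen(int retries, long baseMs) {
        java.util.List<Long> seen = new java.util.ArrayList<>();
        Sleeper fake = seen::add; // record the delay instead of waiting
        for (int retry = 1; retry <= retries; retry++) {
            int shift = Math.min(retry - 1, 4);
            fake.sleep(baseMs * (1L << shift));
        }
        return seen;
    }
}
```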

Local checks performed:

  • git fetch upstream pull/10772/head:seatunnel-review-10772 --force: passed.
  • git diff --stat upstream/dev...seatunnel-review-10772: 2 files, +41/-10.
  • Code inspection covered DorisCommitter and DorisSinkWriter.
  • GitHub checks observed: Build, labeler, and Notify test workflow are passing.
  • Local build/tests were not run in this review.

Merge conclusion

Conclusion: can merge after fixes

Blocking items:

  1. Issue 1: please close/consume failed HTTP responses before retrying. This is the one resource-safety point I think should be fixed in this PR.

Non-blocking suggestions:

  • Issue 2: skip the final unnecessary sleep.
  • Issue 3: add a small non-flaky retry behavior test.

Overall evaluation:

This is a good first contribution, and the direction is absolutely reasonable. The scheduler shutdown part is helpful, and the retry backoff is the right idea. Once the failed-response cleanup is handled, the patch will be much safer for real Doris failure scenarios. Thanks again for keeping the scope thoughtful.



Successfully merging this pull request may close these issues.

[Bug] [Doris Connector] DorisSinkWriter lacks circuit breaker, causing cascading OOM when Doris BE is unavailable
