[Fix][Connector-V2] Fix Doris sink retry backoff and scheduler leak#10772
[Fix][Connector-V2] Fix Doris sink retry backoff and scheduler leak#10772JAEKWANG97 wants to merge 3 commits intoapache:devfrom
Conversation
b05a328 to
fb3372a
Compare
yzeng1618
left a comment
There was a problem hiding this comment.
I’d suggest adding a small set of unit tests to lock in the new retry behavior:
sleepBeforeRetry: verify retry = 1 sleeps ~1s, retry = 5 caps at ~16s, and InterruptedException restores the interrupt flag and rethrows as IOException.
commitTransaction: mock the HttpClient to throw IOException N times before succeeding, and assert total elapsed time is at least the cumulative backoff.
abortTransaction: mock the FE to always fail and assert the number of requests equals maxRetries + 1.
|
|
||
| private void sleepBeforeRetry(int retry) throws IOException { | ||
| try { | ||
| Thread.sleep(1000L * (1L << Math.min(retry - 1, 4))); |
There was a problem hiding this comment.
if retry <= 0 ever slips in, 1L << -1 becomes Long.MIN_VALUE and Thread.sleep throws.
| Thread.sleep(1000L * (1L << Math.min(retry - 1, 4))); | |
| int shift = Math.min(Math.max(retry - 1, 0), 4); | |
| Thread.sleep(1000L * (1L << shift)); |
There was a problem hiding this comment.
Thanks! I applied this in a new commit.
I think checking the real backoff timing also makes sense, but I’m not sure how stable that would be in CI. It might be a bit flaky. Would it be okay to use simpler tests for this, like verifying the retry path and retry count instead? |
|
While working on this issue, I also found two more points in the same retry path:
Since this is my first PR here, I tried to keep the scope small and focused on the retry/backoff behavior from the issue, so I did not include those changes in this PR. |
DanielLeens
left a comment
There was a problem hiding this comment.
Hi @JAEKWANG97, thank you for the careful follow-up here, and welcome to the SeaTunnel community. I pulled the PR locally and reviewed the Doris sink retry path end to end.
Local branch: seatunnel-review-10772
Head reviewed: 9ed6a7a0f
Runtime path checked:
Doris sink writes records
-> DorisSinkWriter.write()
-> DorisStreamLoad.writeRecord()
-> scheduled checker
-> DorisSinkWriter.checkDone()
-> getLoadFailedMsg() is not null
-> set loadException
-> shutdownNow() the scheduler
-> the next write() fails through checkLoadException()
checkpoint / abort path
-> DorisSinkWriter.prepareCommit()
-> flush()
-> returns DorisCommitInfo(txnId)
-> DorisCommitter.commit()
-> commitTransaction()
-> httpClient.execute()
-> retry with backoff on IOException or non-200 response
-> DorisCommitter.abort()
-> abortTransaction()
-> retry abort with the same backoff path
The normal Doris 2PC commit/abort path and the streaming failure-check path both hit this PR.
Findings
Issue 1: failed HTTP responses are not closed before retrying
- Location:
seatunnel-connectors-v2/connector-doris/src/main/java/org/apache/seatunnel/connectors/doris/sink/committer/DorisCommitter.java:100 - Why it matters: when Doris returns a non-200 response or an abort response without an entity, the code moves to the next retry without consuming/closing the
CloseableHttpResponse. Since this PR makesabortTransaction()retry instead of failing immediately, that resource leak can now be amplified in the abort path. - Suggested fix: close/consume each failed response before retrying, and make sure the success path also closes the response after reading the entity.
- Severity: medium
Issue 2: the code sleeps even after the last failed attempt
- Location:
DorisCommitter.java:97 - Why it matters:
while (retry++ <= maxRetries)incrementsretrybefore the sleep call. On the final failed attempt there is no next retry, but the code still waits. This slows checkpoint commit/abort failure convergence when Doris is unavailable. - Suggested fix: only sleep when another attempt remains, or rewrite the loop as an explicit
for (int attempt = 0; attempt <= maxRetries; attempt++). - Severity: low
Issue 3: the new retry behavior has no small unit test yet
- Location:
DorisCommitter.java:136 - Why it matters: this PR changes retry timing and interrupt handling. A small test can lock in retry count, interrupt flag restoration, and the "do not sleep after final failure" behavior.
- Suggested fix: it is fine to avoid flaky wall-clock assertions. Mocking
CloseableHttpClientand using an injectable/package-private sleeper would keep the test stable. - Severity: low
Local checks performed:
git fetch upstream pull/10772/head:seatunnel-review-10772 --force: passed.git diff --stat upstream/dev...seatunnel-review-10772: 2 files,+41/-10.- Code inspection covered
DorisCommitterandDorisSinkWriter. - GitHub checks observed:
Build,labeler, andNotify test workfloware passing. - Local build/tests were not run in this review.
Merge conclusion
Conclusion: can merge after fixes
Blocking items:
- Issue 1: please close/consume failed HTTP responses before retrying. This is the one resource-safety point I think should be fixed in this PR.
Non-blocking suggestions:
- Issue 2: skip the final unnecessary sleep.
- Issue 3: add a small non-flaky retry behavior test.
Overall evaluation:
This is a good first contribution, and the direction is absolutely reasonable. The scheduler shutdown part is helpful, and the retry backoff is the right idea. Once the failed-response cleanup is handled, the patch will be much safer for real Doris failure scenarios. Thanks again for keeping the scope thoughtful.
Purpose of this pull request
Fixes three resilience issues in the Doris sink connector that cause cascading OOM when Doris BE nodes are unavailable. Closes #10627.
DorisSinkWriter.checkDone(): Stop the scheduled executor after a stream load failure is detected. Previously the scheduler ran indefinitely, flooding logs with repeated errors.DorisCommitter.commitTransaction(): Add exponential backoff between retries (1s, 2s, 4s… up to 16s) to reduce load on remaining infrastructure during BE unavailability.DorisCommitter.abortTransaction(): FixmaxRetryfield never initialized from config, causing abort retry to run only once (defaulted to 0).Does this PR introduce any user-facing change?
No. Internal retry and scheduling behavior is improved. No config options or public APIs changed.
How was this patch tested?
Manual code-path review only. No dedicated tests were added for this change.
Check list
New License Guide
incompatible-changes.mdto describe the incompatibility caused by this PR.