[Bug](profile) fix wrong count of scan wait worker timer by BiteTheDDDDt · Pull Request #61064 · apache/doris

BiteTheDDDDt · 2026-03-05T07:08:59Z

What problem does this PR solve?

This pull request improves the accuracy and clarity of scanner timing and profiling in the ScannerScheduler implementation. The main changes clarify the order and logic for updating and recording timing counters, ensuring that CPU and wait times are measured more precisely during scanner execution and cleanup.

Enhancements to scanner timing and profiling:

Added detailed comments explaining the correct order for setting and getting timing counters (update_wait_worker_timer, start_scan_cpu_timer, update_scan_cpu_timer, start_wait_worker_timer) to ensure accurate measurement of scanner CPU and wait times.
Refactored the Defer cleanup logic to only start the wait worker timer if the scanner has not failed and has not reached the end-of-stream, preventing redundant or incorrect counter updates.

Check List (For Author)

Test
- Regression test
- Unit Test
- Manual test (add detailed scripts or steps below)
- No need to test or manual test. Explain why:
  - This is a refactor/code format and no logic has been changed.
  - Previous test can cover this change.
  - No code files have been changed.
  - Other reason
Behavior changed:
- No.
- Yes.
Does this need documentation?
- No.
- Yes.

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

hello-stephen · 2026-03-05T07:09:04Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

BiteTheDDDDt · 2026-03-05T07:10:04Z

run buildall

Copilot

Pull request overview

This PR fixes a bug where start_wait_worker_timer() was being called unconditionally in the scanner's Defer cleanup block, even when the scanner had failed (!status.ok()) or had reached end-of-stream (eos == true). It also adds explanatory comments describing the correct ordering of timing counter operations. In the normal scan cycle, a scanner finishes a partial scan round and then gets re-queued; calling start_wait_worker_timer() at this point correctly begins measuring the next wait period. For failed or finished scanners, calling it was incorrect and redundant.

Changes:

Added a block comment (lines 156–160) explaining the required ordering of timing counter operations (update_wait_worker_timer, start_scan_cpu_timer, update_scan_cpu_timer, start_wait_worker_timer) for correctness.
Moved status and bool eos declarations before the Defer block to improve code clarity and make the conditional logic in the lambda readable.
Added a conditional guard in the Defer lambda so start_wait_worker_timer() is only called when status.ok() && !eos, preventing incorrect timer updates for failed or finished scanners.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-05T07:13:23Z

be/src/exec/scan/scanner_scheduler.cpp

    Defer defer_scanner([&] {
        // WorkloadGroup Policy will check cputime realtime, so that should update the counter
        // as soon as possible, could not update it on close.
        update_scanner_profile();
+        if (status.ok() && !eos) {
+            // if status is not ok, it means the scanner is failed, and the counter may be not updated correctly, so no need to update counter again. if eos is true, it means the scanner is finished successfully, and the counter is updated correctly, so no need to update counter again.


The inline comment on this line states "if eos is true, it means the scanner is finished successfully," but this is inaccurate. The eos flag is also set to true in error/cancellation paths (e.g., when scanner->prepare() or scanner->open() fails, or when ctx->done() is true). It would be more accurate to say "if eos is true, the scanner will not be re-queued (either because it completed scanning, encountered an error, or the context was cancelled), so no need to start the wait timer again."

Suggested change

// if status is not ok, it means the scanner is failed, and the counter may be not updated correctly, so no need to update counter again. if eos is true, it means the scanner is finished successfully, and the counter is updated correctly, so no need to update counter again.

// If status is not ok, the scanner has failed and its counters may not be updated

// correctly, so there is no need to restart the wait-worker timer. If eos is true,

// the scanner will not be re-queued (either because it completed scanning,

// encountered an error, or the context was cancelled), so there is no need to

// restart the wait-worker timer either.

github-actions · 2026-03-05T07:46:15Z

PR approved by at least one committer and no changes requested.

github-actions · 2026-03-05T07:46:18Z

PR approved by anyone and no changes requested.

hello-stephen · 2026-03-05T12:01:53Z

BE Regression && UT Coverage Report

Increment line coverage 100.00% (5/5) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	71.52% (26179/36603)
Line Coverage	54.31% (274527/505439)
Region Coverage	51.69% (228600/442241)
Branch Coverage	53.02% (98202/185226)

BiteTheDDDDt · 2026-03-09T07:15:08Z

run buildall

BiteTheDDDDt · 2026-03-09T07:50:22Z

run buildall

BiteTheDDDDt · 2026-03-11T02:54:59Z

run buildall

BiteTheDDDDt · 2026-03-11T03:10:52Z

run buildall

hello-stephen · 2026-03-11T05:50:38Z

BE Regression && UT Coverage Report

Increment line coverage 77.78% (7/9) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	71.58% (26203/36608)
Line Coverage	54.29% (274165/505006)
Region Coverage	51.47% (227611/442247)
Branch Coverage	52.94% (97975/185066)

zclllyybb · 2026-03-13T02:54:48Z

/review

github-actions · 2026-03-13T02:54:57Z

PR approved by at least one committer and no changes requested.

github-actions

Code Review Summary

This PR aims to fix incorrect counting of the scan wait worker timer by reordering timer start/stop calls and adding a _has_prepared guard to OlapScanner::update_realtime_counters(). While the intent is good, there is a correctness regression introduced by the refactoring.

Critical Checkpoints

Goal and correctness: The PR goal is to fix wrong scan wait worker timer counting. The _has_prepared guard move to OlapScanner::update_realtime_counters() is correct. However, the refactoring of the Defer block introduces a regression: update_scanner_profile() is no longer called on the normal yield path (scanner yields back to scheduler without eos or error), which breaks real-time CPU time and scan counter tracking for workload group policies.
Modification scope: The change is small and focused, but the Defer refactoring has unintended side effects.
Parallel code paths: FileScanner::update_realtime_counters() does not have the _has_prepared guard that was added to OlapScanner::update_realtime_counters(). While FileScanner initializes _file_reader_stats in init() (before prepare()), the inconsistency should be noted for maintainability.
Performance: The start_wait_worker_timer() call in the new Defer is redundant because ScannerScheduler::submit() (line 70 of this file) unconditionally calls start_wait_worker_timer() when the task is re-submitted, which resets and overwrites the timer started in the Defer.
Test coverage: No tests are included. The check list shows no test category selected. Given the behavioral change to profiling counters, at least manual verification of profile output should be documented.
Concurrency: No new concurrency issues introduced.
Memory safety: No new memory safety issues introduced.
Error handling: The _has_prepared guard correctly prevents null dereference on prepare failure.
Configuration: No new config items.
Incompatible changes: None.
Observability: The regression actually reduces observability by not updating real-time counters on yield paths.

github-actions · 2026-03-13T03:02:07Z

be/src/exec/scan/scanner_scheduler.cpp

    Status status = Status::OK();
    bool eos = false;
+    Defer defer_scanner([&] {
+        if (status.ok() && !eos) {


Bug: update_scanner_profile() is no longer called on normal yield path

The old code called update_scanner_profile() unconditionally in the Defer:

Defer defer_scanner([&] { // WorkloadGroup Policy will check cputime realtime, so that should update the counter // as soon as possible, could not update it on close. update_scanner_profile(); });

The new code only calls update_scanner_profile() when eos == true (at line 342), and the Defer only calls start_wait_worker_timer() when status.ok() && !eos. This means:

On the normal yield path (scanner yields back to scheduler because it hit the byte threshold, max run time, or other break conditions — where status.ok() && !eos), update_scan_cpu_timer() and update_realtime_counters() are never called.

The old comment explicitly stated: "WorkloadGroup Policy will check cputime realtime, so that should update the counter as soon as possible, could not update it on close." — This requirement is violated by the new code.

This will cause:

CPU time for each scanner slice to not be reported to the workload group until eos, making workload group CPU limits less responsive

Scan bytes/rows realtime counters to not be updated until eos, potentially allowing queries to exceed scan byte limits before being detected

Suggested fix: Keep calling update_scanner_profile() in the Defer (as before), and additionally call start_wait_worker_timer() only on the normal yield path:

Defer defer_scanner([&] { update_scanner_profile(); if (status.ok() && !eos) { scanner->start_wait_worker_timer(); } });

github-actions · 2026-03-13T03:02:07Z

be/src/exec/scan/scanner_scheduler.cpp

+    Defer defer_scanner([&] {
+        if (status.ok() && !eos) {
+            // if status is not ok, it means the scanner is failed, and the counter may be not updated correctly, so no need to update counter again. if eos is true, it means the scanner is finished successfully, and the counter is updated correctly, so no need to update counter again.
+            scanner->start_wait_worker_timer();


Minor: start_wait_worker_timer() here is redundant

start_wait_worker_timer() resets and restarts the internal _watch. However, ScannerScheduler::submit() (line 70 of this file) also calls start_wait_worker_timer() unconditionally when the scan task is re-submitted to the thread pool. Since submit() runs after the current _scanner_scan returns and push_back_scan_task is processed, the timer started here is immediately overwritten.

The actual wait time is measured from the submit() call (line 70) to the next update_wait_worker_timer() call (line 164) when a worker picks it up.

Copilot AI review requested due to automatic review settings March 5, 2026 07:09

Copilot started reviewing on behalf of BiteTheDDDDt March 5, 2026 07:09 View session

BiteTheDDDDt added the dev/4.0.x label Mar 5, 2026

Copilot AI reviewed Mar 5, 2026

View reviewed changes

Gabriel39 previously approved these changes Mar 5, 2026

View reviewed changes

github-actions bot added the approved Indicates a PR has been approved by one committer. label Mar 5, 2026

github-actions bot added the reviewed label Mar 5, 2026

fix wrong count of scan wait worker timer

7268240

BiteTheDDDDt dismissed Gabriel39’s stale review via 7268240 March 9, 2026 07:14

BiteTheDDDDt force-pushed the fix_0305 branch from e7998f7 to 7268240 Compare March 9, 2026 07:14

github-actions bot removed the approved Indicates a PR has been approved by one committer. label Mar 9, 2026

update

f42d2d4

Merge branch 'master' into fix_0305

0f0bd1e

BiteTheDDDDt added the dev/4.1.x label Mar 13, 2026

Gabriel39 approved these changes Mar 13, 2026

View reviewed changes

github-actions bot added the approved Indicates a PR has been approved by one committer. label Mar 13, 2026

zclllyybb approved these changes Mar 13, 2026

View reviewed changes

github-actions bot reviewed Mar 13, 2026

View reviewed changes

BiteTheDDDDt merged commit 986e410 into apache:master Mar 13, 2026
29 of 31 checks passed

github-actions bot added the dev/4.0.x-conflict label Mar 13, 2026

-            // if status is not ok, it means the scanner is failed, and the counter may be not updated correctly, so no need to update counter again. if eos is true, it means the scanner is finished successfully, and the counter is updated correctly, so no need to update counter again.
+            // If status is not ok, the scanner has failed and its counters may not be updated
+            // correctly, so there is no need to restart the wait-worker timer. If eos is true,
+            // the scanner will not be re-queued (either because it completed scanning,
+            // encountered an error, or the context was cancelled), so there is no need to
+            // restart the wait-worker timer either.

Conversation

BiteTheDDDDt commented Mar 5, 2026

What problem does this PR solve?

Check List (For Author)

Check List (For Reviewer who merge this PR)

Uh oh!

hello-stephen commented Mar 5, 2026

Uh oh!

BiteTheDDDDt commented Mar 5, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Mar 5, 2026

Uh oh!

github-actions bot commented Mar 5, 2026

Uh oh!

hello-stephen commented Mar 5, 2026

BE Regression && UT Coverage Report

Uh oh!

BiteTheDDDDt commented Mar 9, 2026

Uh oh!

BiteTheDDDDt commented Mar 9, 2026

Uh oh!

BiteTheDDDDt commented Mar 11, 2026

Uh oh!

BiteTheDDDDt commented Mar 11, 2026

Uh oh!

hello-stephen commented Mar 11, 2026

BE Regression && UT Coverage Report

Uh oh!

zclllyybb commented Mar 13, 2026

Uh oh!

github-actions bot commented Mar 13, 2026

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Code Review Summary

Critical Checkpoints

Uh oh!

github-actions bot Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants