DAOS-18633 rebuild: abort orphaned reclaim rpt after PS leader switch by wangshilong · Pull Request #17652 · daos-stack/daos

wangshilong · 2026-03-06T08:36:18Z

After PS leader switch, ds_rebuild_regenerate_task() only regenerates rebuild tasks for DOWN/DRAIN/UP targets. RECLAIM tasks are not regenerated because reintegrated targets are already UPIN. This leaves orphaned rpt on every target with a stale leader term, whose IV updates are silently dropped by the new leader (no matching rgt). The result is sp_rebuilding > 0 permanently, blocking EC aggregation and causing system-wide performance degradation.

Fix: detect stale leader term in rebuild_tgt_status_check_ult() and abort the orphaned rpt.

TODO: persist in-progress reclaim tasks in RDB so they can be properly re-triggered on PS leader step_up.

Steps for the author:

Commit message follows the guidelines.
Appropriate Features or Test-tag pragmas were used.
Appropriate Functional Test Stages were run.
At least two positive code reviews including at least one code owner from each category referenced in the PR.
Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

Gatekeeper requested (daos-gatekeeper added as a reviewer).

github-actions · 2026-03-06T08:36:35Z

Errors are Unable to load ticket data
https://daosio.atlassian.net/browse/DAOS-18633

liuxuezhao · 2026-03-06T09:03:09Z

src/rebuild/srv.c

+				       DP_UUID(rpt->rt_pool_uuid), rpt->rt_rebuild_ver,
+				       rpt->rt_rebuild_gen, RB_OP_STR(rpt->rt_rebuild_op),
+				       rpt->rt_leader_term, ns->iv_master_term);
+				rpt->rt_abort = 1;


This does not look safe. Currently, when the PS leader changes and the rebuild has not failed before, each target engine simply continues its rebuild job.
See rebuild_task_ult(), "If the leader rebuild is aborted due to a leader change".

It is probably fine to abort only the RECLAIM/FAIL_RECLAIM rpt when the PS leader has switched; please consider whether the check and handling can be combined with rpt_stale().
Even with that change, I am not sure this is the cause of the performance degradation: if the RECLAIM scan has completed locally and only the global RECLAIM has not completed, it should not affect IO performance.
It may be worth checking the logs of the engines that did not report RECLAIM SCAN done.
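The narrower condition suggested in this comment could look roughly like the sketch below. It is a standalone illustration, not code from src/rebuild/srv.c: the enum `toy_rebuild_op` and both helper functions are hypothetical, and the term comparison is an assumption. The point is only that a leader-term mismatch alone does not trigger an abort; the operation must also be a reclaim.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical operation codes; the real DAOS rebuild op enum is not reproduced here. */
enum toy_rebuild_op {
	TOY_OP_REBUILD,
	TOY_OP_RECLAIM,
	TOY_OP_FAIL_RECLAIM,
};

/* True only for the two reclaim-style operations. */
static bool
toy_op_is_reclaim(enum toy_rebuild_op op)
{
	return op == TOY_OP_RECLAIM || op == TOY_OP_FAIL_RECLAIM;
}

/*
 * Decide whether a tracker should be aborted after a leader switch.
 * Regular rebuild operations keep running across a leader change, so only
 * reclaim-style operations with an older term are aborted.
 */
static bool
toy_should_abort(enum toy_rebuild_op op, uint64_t rpt_term, uint64_t master_term)
{
	return toy_op_is_reclaim(op) && rpt_term < master_term;
}
```

Keeping the decision in a single predicate like this would also make it easier to fold into an existing staleness check, if that is the direction taken.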

daosbuild3 · 2026-03-06T09:50:11Z

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17652/2/testReport/

After PS leader switch, ds_rebuild_regenerate_task() only regenerates rebuild tasks for DOWN/DRAIN/UP targets. RECLAIM tasks are not regenerated because reintegrated targets are already UPIN. This leaves orphaned rpt on every target with a stale leader term, whose IV updates are silently dropped by the new leader (no matching rgt). The result is sp_rebuilding > 0 permanently, blocking EC aggregation and causing system-wide performance degradation.

Fix: detect stale leader term in rebuild_tgt_status_check_ult() and abort the orphaned rpt.

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>

liuxuezhao reviewed Mar 6, 2026

wangshilong force-pushed the shilongw/DAOS-18633 branch from c76d9fd to 01a4148 Compare March 6, 2026 09:04

wangshilong force-pushed the shilongw/DAOS-18633 branch from 01a4148 to 0860470 Compare March 6, 2026 14:25

abort leader ult

0fd785d

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>

wangshilong wants to merge 2 commits into master from shilongw/DAOS-18633
