Commit c76d9fd

DAOS-18633 rebuild: abort orphaned reclaim rpt after PS leader switch
After PS leader switch, ds_rebuild_regenerate_task() only regenerates rebuild tasks for DOWN/DRAIN/UP targets. RECLAIM tasks are not regenerated because reintegrated targets are already UPIN. This leaves orphaned rpt on every target with a stale leader term, whose IV updates are silently dropped by the new leader (no matching rgt). The result is sp_rebuilding > 0 permanently, blocking EC aggregation and causing system-wide performance degradation.

Fix: detect stale leader term in rebuild_tgt_status_check_ult() and abort the orphaned rpt.

TODO: persist in-progress reclaim tasks in RDB so they can be properly re-triggered on PS leader step_up.

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>

1 parent b34e4e8 commit c76d9fd

1 file changed, +17 -0 lines changed

src/rebuild/srv.c

Lines changed: 17 additions & 0 deletions
@@ -2972,6 +2972,23 @@ rebuild_tgt_status_check_ult(void *arg)
 		if (!rpt->rt_global_done) {
 			struct ds_iv_ns *ns = rpt->rt_pool->sp_iv_ns;
 
+			/* Abort orphaned rpt whose leader is gone.
+			 * After PS leader switch, reclaim tasks are
+			 * not regenerated (UPIN not in DOWN/UP/DRAIN),
+			 * so this rpt has no matching rgt on the new
+			 * leader and IV updates are silently dropped.
+			 */
+			if (rpt->rt_leader_term < ns->iv_master_term) {
+				D_WARN(DF_UUID " ver %d gen %u op %s: "
+				       "stale term " DF_U64 " < " DF_U64
+				       ", abort orphaned rpt\n",
+				       DP_UUID(rpt->rt_pool_uuid), rpt->rt_rebuild_ver,
+				       rpt->rt_rebuild_gen, RB_OP_STR(rpt->rt_rebuild_op),
+				       rpt->rt_leader_term, ns->iv_master_term);
+				rpt->rt_abort = 1;
+				break;
+			}
+
 			iv.riv_master_rank = ns->iv_master_rank;
 			iv.riv_rank = rpt->rt_rank;
 			iv.riv_ver = rpt->rt_rebuild_ver;
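
For readers without the DAOS tree at hand, the check added in this hunk can be sketched in isolation. This is a minimal toy model and not the DAOS implementation: the `toy_*` types, the `toy_status_check_step()` helper, the `reports_sent` counter, and the `enum toy_step` return values are all hypothetical. Only the field names `rt_leader_term`, `rt_global_done`, `rt_abort`, and `iv_master_term` are borrowed from the diff. It assumes one call corresponds to one pass of the status-check loop, and that a skipped pass (global rebuild already done) does nothing else.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for the DAOS structures in the diff.
 * Only the fields the stale-term check reads or writes are modeled.
 */
struct toy_iv_ns {
	uint64_t iv_master_term; /* term of the current PS leader */
};

struct toy_rpt {
	uint64_t rt_leader_term; /* term of the leader that created this rpt */
	bool     rt_global_done; /* global rebuild already reported as done */
	bool     rt_abort;       /* set when the tracker should be torn down */
	unsigned reports_sent;   /* toy counter standing in for IV updates */
};

/* Outcome of one pass of the (toy) status-check loop. */
enum toy_step { TOY_SKIPPED, TOY_REPORTED, TOY_ABORTED };

static enum toy_step
toy_status_check_step(struct toy_rpt *rpt, const struct toy_iv_ns *ns)
{
	assert(rpt != NULL && ns != NULL);

	/* Nothing to report once the global rebuild is done. */
	if (rpt->rt_global_done)
		return TOY_SKIPPED;

	/* The leader that created this rpt has been replaced: a report
	 * would find no matching global tracker, so abort instead.
	 */
	if (rpt->rt_leader_term < ns->iv_master_term) {
		fprintf(stderr,
			"stale term %" PRIu64 " < %" PRIu64
			", abort orphaned rpt\n",
			rpt->rt_leader_term, ns->iv_master_term);
		rpt->rt_abort = true;
		return TOY_ABORTED;
	}

	/* Terms match: report status to the current leader. */
	rpt->reports_sent++;
	return TOY_REPORTED;
}
```

In this sketch a tracker created under term 5 keeps reporting while the namespace term stays at 5, and the first pass after the term advances to 6 sets `rt_abort` without sending another report. The comparison is a strict `<`, as in the diff, so a tracker whose term equals the current master term is never aborted by this check.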
