DAOS-18368 rebuild: fix bug of ec_agg_boundary and agg peer update#17324
|
Ticket title is 'Data corruption observed with master branch under MDonSSD environment.' |
122e9e8 to
10fa58e
Compare
|
Test stage NLT on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17324/2/display/redirect |
10fa58e to
66f44a9
Compare
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17324/4/testReport/ |
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17324/4/execution/node/1313/log |
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17324/4/execution/node/1323/log |
|
Just refreshed the PR to change a few log messages. |
32db84f to
e1c08ea
Compare
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17324/7/testReport/ |
e1c08ea to
c474fe4
Compare
56751dc to
f4bc272
Compare
kccain
left a comment
rebuild/ source file changes LGTM.
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17324/12/execution/node/1277/log |
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17324/12/execution/node/1318/log |
17ef124
f4bc272 to
17ef124
Compare
1. Fix a bug where ec_agg_boundary was used before checking that it is valid. 2. Add more logs for rebuild fetches that get a zero iod_size, to provide hints about the layout information. Signed-off-by: Xuezhao Liu <xuezhao.liu@hpe.com>
Signed-off-by: Xuezhao Liu <xuezhao.liu@hpe.com>
Signed-off-by: Xuezhao Liu <xuezhao.liu@hpe.com>
Some failures need to be retried. Signed-off-by: Xuezhao Liu <xuezhao.liu@hpe.com>
Reintegrating (reint) ranks are excluded from rebuild/reclaim if co_in_ver exceeds the rebuild version, so their completion should be set on the rebuild leader to avoid a possible hang. Refine the dtx_resync wait handling: there is no need to wait again if the resync was already done. Add some logs. Signed-off-by: Xuezhao Liu <xuezhao.liu@hpe.com>
17ef124 to
90de18f
Compare
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17324/14/testReport/ |
|
Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17324/14/execution/node/1075/log |
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17324/15/testReport/ |
int iod_cnt = 0;
int start;
char iov_buf[OBJ_ENUM_UNPACK_MAX_IODS][MAX_BUF_SIZE];
Honestly, it is not a good idea to put 32KB on the stack; that could cause a ULT stack overflow.
Right, but this is the original code; this PR only changes the indentation.
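One way to address the stack-size concern above is to move the scratch buffer to the heap. The following is only a minimal sketch in plain C: `calloc`/`free` stand in for whatever allocation helpers the project uses, and `DEMO_MAX_IODS`, `DEMO_BUF_SIZE`, and `demo_unpack` are hypothetical names and values (chosen so the buffer totals 32 KiB), not the real DAOS definitions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sizes: 16 rows of 2 KiB each, 32 KiB in total. */
#define DEMO_MAX_IODS 16
#define DEMO_BUF_SIZE 2048

/* One row of the two-dimensional scratch buffer. */
typedef char demo_iov_row_t[DEMO_BUF_SIZE];

/*
 * Copy the payload into every row of a heap-allocated scratch buffer
 * instead of declaring the 32 KiB array as a local variable.
 * Returns 0 on success and -1 on bad input or allocation failure.
 */
static int demo_unpack(const char *payload, size_t len)
{
	demo_iov_row_t *iov_buf;
	size_t          i;
	int             rc;

	assert(payload != NULL);
	if (len > DEMO_BUF_SIZE)
		return -1;

	/* Heap allocation keeps the stack frame of this function small. */
	iov_buf = calloc(DEMO_MAX_IODS, sizeof(*iov_buf));
	if (iov_buf == NULL)
		return -1;

	for (i = 0; i < DEMO_MAX_IODS; i++)
		memcpy(iov_buf[i], payload, len);

	/* Check the last row to confirm that the copy went through. */
	rc = memcmp(iov_buf[DEMO_MAX_IODS - 1], payload, len) == 0 ? 0 : -1;

	free(iov_buf);
	return rc;
}
```

The trade-off is an extra allocation and a new failure path per call, which is why such a change would more naturally belong in a follow-up than in an indentation-only diff.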
if (status->dtx_resync_version != resync_ver)
D_INFO(DF_RB " rank %d, update dtx_resync_version from %d to %d", DP_RB_RGT(rgt),
rank, status->dtx_resync_version, resync_ver);
status->dtx_resync_version = resync_ver;
if (status->dtx_resync_version != resync_ver) {
D_INFO(DF_RB " rank %d, update dtx_resync_version from %d to %d", DP_RB_RGT(rgt),
rank, status->dtx_resync_version, resync_ver);
status->dtx_resync_version = resync_ver;
}
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17324/15/execution/node/1350/log |
|
There is one known NLT error; the only other test failure is interactive-rebuild, tracked in https://daosio.atlassian.net/browse/DAOS-18466 |
mrone->mo_epoch);
if (log_nr <= 128) {
mrone_dump_info(mrone, oh, &mrone->mo_iods[i]);
log_nr++;
We might want to reset this counter at some point.
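One possible way to reset such a counter is to tie it to a time window, so that logging resumes after a quiet period instead of stopping for good once the cap is reached. The sketch below is an assumption about how that could look, not the PR's code: `demo_log_limiter`, `demo_log_allowed`, and the 60-second window are hypothetical, and the caller is expected to pass in the current time in seconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limits: at most 128 dumps per 60-second window. */
#define DEMO_LOG_BURST      128
#define DEMO_LOG_WINDOW_SEC 60

struct demo_log_limiter {
	uint64_t     window_start; /* start of the current window, in seconds */
	unsigned int count;        /* messages emitted in the current window */
};

/*
 * Return true if one more message may be logged at time now_sec.
 * The counter is reset whenever a full window has elapsed.
 */
static bool demo_log_allowed(struct demo_log_limiter *lim, uint64_t now_sec)
{
	assert(lim != NULL);

	if (now_sec - lim->window_start >= DEMO_LOG_WINDOW_SEC) {
		lim->window_start = now_sec;
		lim->count        = 0;
	}

	if (lim->count >= DEMO_LOG_BURST)
		return false;

	lim->count++;
	return true;
}
```

A caller would guard the dump with `if (demo_log_allowed(&lim, now))` in place of the bare `log_nr <= 128` check. Resetting per rebuild operation rather than per time window would be another reasonable choice.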