
DAOS-18976 rebuild: more delay when schedule FAIL_RECLAIM after net err #18317

Open
liuxuezhao wants to merge 2 commits into liang/b2_6_rebuild_hang from lxz/b2_6_rebuild_hang_append

@liuxuezhao
Contributor

When rebuild fails because of a network failure, delay 30 seconds before scheduling the FAIL_RECLAIM task.
For example, when a rank dies, rebuild first gets -DER_HG or another network failure, but the pool map may not yet have been updated to mark that rank as DOWN. If FAIL_RECLAIM runs immediately, the SCAN broadcast may fail, and the retry will then fail again.
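
The delay decision described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: every `sketch_*` / `SKETCH_*` name is hypothetical, the error values are placeholders rather than DAOS's real error numbers, and which codes count as "network failures" beyond -DER_HG is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder error codes for this sketch only; NOT DAOS's real values. */
enum sketch_err {
	SKETCH_ERR_HG       = 1, /* stands in for DER_HG */
	SKETCH_ERR_TIMEDOUT = 2, /* assumed example of "other network failure" */
	SKETCH_ERR_NOMEM    = 3, /* example of a non-network failure */
};

/* The 30-second delay mentioned in the description. */
#define SKETCH_NET_ERR_DELAY_SEC 30

/* Callers pass negative codes, e.g. -SKETCH_ERR_HG. */
static bool
sketch_is_net_err(int rc)
{
	return rc == -SKETCH_ERR_HG || rc == -SKETCH_ERR_TIMEDOUT;
}

/* Extra seconds to wait before scheduling the reclaim task. */
static uint64_t
sketch_reclaim_delay_sec(int rebuild_rc)
{
	return sketch_is_net_err(rebuild_rc) ? SKETCH_NET_ERR_DELAY_SEC : 0;
}

/*
 * Earliest time (in seconds) at which the reclaim task may start.
 * After a network error the start is pushed back, which gives the
 * pool map time to mark the dead rank as DOWN.
 */
static uint64_t
sketch_reclaim_start_time(uint64_t now_sec, int rebuild_rc)
{
	uint64_t delay = sketch_reclaim_delay_sec(rebuild_rc);

	assert(now_sec <= UINT64_MAX - delay); /* guard against overflow */
	return now_sec + delay;
}
```

With these placeholders, a rebuild that failed with `-SKETCH_ERR_HG` at time 100 would have its reclaim scheduled no earlier than 130, while a non-network failure would be scheduled immediately.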

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

- lets rebuild_scan_leader exit quickly so TLS cleanup unblocks
  before retry

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
@github-actions

Ticket title is 'Aurora rebuild failing with DER_HG / DER_SHUTDOWN'
Status is 'In Progress'
Labels: 'test_2.6.5rc1'
https://daosio.atlassian.net/browse/DAOS-18976

@liuxuezhao liuxuezhao force-pushed the lxz/b2_6_rebuild_hang_append branch from 226c8a9 to 4764b16 Compare May 21, 2026 15:04
When rebuild fails because of a network failure, delay 30 seconds before
scheduling the FAIL_RECLAIM task.
For example, when a rank dies, rebuild first gets -DER_HG or another
network failure, but the pool map may not yet have been updated to mark
the rank as DOWN. If FAIL_RECLAIM runs immediately, the SCAN broadcast
may fail, and the retry will then fail again.

Also fix a race between rebuild_tgt_fini() and
rebuild_pool_tls_lookup() in the rebuild SCAN handler.

Signed-off-by: Xuezhao Liu <xuezhao.liu@hpe.com>
@liuxuezhao liuxuezhao force-pushed the lxz/b2_6_rebuild_hang_append branch from 4764b16 to 485d91f Compare May 22, 2026 05:07
@liuxuezhao liuxuezhao force-pushed the liang/b2_6_rebuild_hang branch from 0e86c5d to eae35f3 Compare May 22, 2026 07:56


2 participants