DAOS-18976 rebuild: more delay when schedule FAIL_RECLAIM after net err #18317
Open
liuxuezhao wants to merge 2 commits into
- lets rebuild_scan_leader exit quickly so TLS cleanup unblocks before retry

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Ticket title is 'Aurora rebuild failing with DER_HG / DER_SHUTDOWN'
When rebuild fails because of a network failure, delay scheduling the FAIL_RECLAIM task by 30 seconds. For example, when a rank dies, rebuild first gets -DER_HG or another network failure, but the pool map may not yet have been updated to mark the rank as DOWN. If FAIL_RECLAIM runs immediately, the SCAN bcast may fail, and the retry will fail again.

Also fix a race between rebuild_tgt_fini() and rebuild_pool_tls_lookup() in the rebuild SCAN handler.

Signed-off-by: Xuezhao Liu <xuezhao.liu@hpe.com>
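
A minimal sketch of the delay decision described in this commit message, not the actual patch. Every name prefixed with `ex_` or `EX_` is hypothetical, the error codes are stand-ins rather than the real values from the DAOS headers, and which codes count as a "network failure" is an assumption (DER_SHUTDOWN is included only because the ticket title mentions it).

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in error codes for illustration only; NOT the real DAOS values. */
enum {
	EX_DER_HG       = 1, /* transport (Mercury) failure */
	EX_DER_SHUTDOWN = 2, /* peer is shutting down (assumed network-like) */
	EX_DER_NOMEM    = 3, /* example of a non-network failure */
};

/* The 30-second delay comes from the commit message. */
#define EX_RECLAIM_NET_DELAY_SEC 30

/* Assumption: these codes are the ones treated as network failures. */
static bool
ex_is_net_err(int rc)
{
	return rc == -EX_DER_HG || rc == -EX_DER_SHUTDOWN;
}

/*
 * Delay before scheduling FAIL_RECLAIM: wait after a network failure so
 * the pool map has time to mark the dead rank DOWN; otherwise no delay.
 */
static uint64_t
ex_fail_reclaim_delay_sec(int rc)
{
	return ex_is_net_err(rc) ? EX_RECLAIM_NET_DELAY_SEC : 0;
}

/* Absolute time (in seconds) at which FAIL_RECLAIM would be scheduled. */
static uint64_t
ex_fail_reclaim_sched_time(uint64_t now_sec, int rc)
{
	return now_sec + ex_fail_reclaim_delay_sec(rc);
}
```

With these stand-ins, a rebuild that failed with `-EX_DER_HG` at time 100 would have its reclaim scheduled for time 130, while a non-network failure would be scheduled immediately.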