fix(transport,download): harden reconnect loop and download failure paths (#372, #373) #48
Merged
Transport (#372): fatal auth exit gated on 3 UNAUTHENTICATED strikes AND
>=60s elapsed; 30s deadline on the Hello/HelloAck handshake; redirect hop
budget resets on backoff fallback; schemeless not_leader targets inherit
the issuing connection's TLS mode; new grpc_ca_bundle setting; permanent
FAILED_PRECONDITION (worker_id_mismatch/release_id_mismatch/missing
identity) exits immediately.
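
The fatal-auth gate described above (exit only after 3 UNAUTHENTICATED strikes AND at least 60s elapsed) could be sketched roughly as follows. This is an illustration, not the code from this PR: the `AuthStrikeGate` class, its method names, and the choice to measure elapsed time from the first strike are all assumptions.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AuthStrikeGate:
    """Hypothetical helper: decides when UNAUTHENTICATED errors become fatal."""

    max_strikes: int = 3           # strike threshold named in the commit message
    min_elapsed_s: float = 60.0    # time threshold named in the commit message
    clock: Callable[[], float] = time.monotonic  # injectable for tests
    strikes: int = 0
    first_strike_at: Optional[float] = None

    def record_unauthenticated(self) -> bool:
        """Record one strike; return True only when BOTH conditions hold."""
        now = self.clock()
        if self.first_strike_at is None:
            # Assumption: the 60s window starts at the first strike.
            self.first_strike_at = now
        self.strikes += 1
        enough_strikes = self.strikes >= self.max_strikes
        enough_time = (now - self.first_strike_at) >= self.min_elapsed_s
        return enough_strikes and enough_time

    def reset(self) -> None:
        """Assumption: a successful handshake clears the strike state."""
        self.strikes = 0
        self.first_strike_at = None
```

Requiring both conditions means a short burst of rejected reconnects (for example during a credential rotation) does not kill the worker, while a persistent auth failure still exits instead of looping forever.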
Downloads (#373): typed UrlExpiredError on presigned-URL 4xx with zero
retries -> ModelEvent{FAILED, url_expired} in seconds; fix
_is_terminal_download_error to read exc.response.status_code; verify
(size/blake3) retries capped at 2; ENOSPC and pre-download headroom
shortfall -> InsufficientDiskError; CAS downloads report
progress(bytes_done, bytes_total); missing orchestrator snapshot maps to
RETRYABLE; blanket backoff decorator replaced, backoff dep dropped.
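
A minimal sketch of the download error classification described above, assuming the status code lives on `exc.response.status_code` as it does for `requests.HTTPError`. The function names, the string return values, and the stand-in exception classes are hypothetical; only `UrlExpiredError`, the 408/429 exception, and the retryable 5xx behaviour come from this PR's description.

```python
from typing import Optional


class UrlExpiredError(Exception):
    """Presigned URL rejected with a non-retryable 4xx; do not retry."""


# 4xx codes that stay retryable, per the PR description.
RETRYABLE_4XX = {408, 429}


def status_code_of(exc: BaseException) -> Optional[int]:
    """Read the HTTP status from exc.response, not from exc itself."""
    response = getattr(exc, "response", None)
    return getattr(response, "status_code", None)


def classify_download_error(exc: BaseException) -> str:
    """Return "url_expired" for terminal 4xx, otherwise "retryable"."""
    status = status_code_of(exc)
    if status is None:
        # Assumption: errors without an HTTP response (timeouts, resets)
        # are treated as retryable.
        return "retryable"
    if 400 <= status < 500 and status not in RETRYABLE_4XX:
        return "url_expired"
    return "retryable"  # 408, 429, 5xx, and anything else


# Stand-ins shaped like requests.HTTPError, so the sketch runs without requests.
class FakeResponse:
    def __init__(self, status_code: int) -> None:
        self.status_code = status_code


class FakeHTTPError(Exception):
    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.response = FakeResponse(status_code)
```

A caller would raise `UrlExpiredError` with zero retries when the result is `"url_expired"`, and retry otherwise.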
Closes worker tracker issues #372 and #373 (agents/completed.md).
#372 transport hardening
- `redirect_hops` resets whenever the loop falls back with backoff, so `not_leader` routing survives repeated leadership churn.
- `not_leader` redirect targets inherit the TLS mode of the connection that issued the redirect (no plaintext downgrade); new `grpc_ca_bundle` setting (`GRPC_CA_BUNDLE`) for private CA roots.
- Permanent `FAILED_PRECONDITION` (`worker_id_mismatch`/`release_id_mismatch`/missing identity) exits immediately instead of burning the disconnected timeout.

#373 download failure path
- Typed `UrlExpiredError` on presigned-URL 4xx (except 408/429), zero retries; the ModelOp path emits `ModelEvent{FAILED, url_expired}` within seconds so the hub re-mints fresh URLs (CONTRACT §5/§9) instead of the old ~1h blind retry.
- Fixes `_is_terminal_download_error` reading `exc.status_code` instead of `exc.response.status_code` (`requests.HTTPError`); this also makes civitai 429/5xx correctly retryable.
- … `InsufficientDiskError` immediately; `backoff` dependency dropped.
- … `InsufficientDiskError`.
- CAS downloads report `progress(bytes_done, bytes_total)` end to end.

Tests
`uv run --extra dev pytest -q`: 194 passed, 1 skipped (was 182 tests; +13 new: 6 transport, 7 download-failure-path; 1 pre-existing skip).