fix(transport,download): harden reconnect loop and download failure paths (#372, #373)#48

Merged
PaulFidika merged 1 commit into master from issue/372-373-transport-download-hardening
Jul 4, 2026

@PaulFidika

Contributor

Closes worker tracker issues #372 and #373 (agents/completed.md).

#372 transport hardening

  • Fatal auth exit gated on 3 UNAUTHENTICATED strikes AND >=60s elapsed — a hub pg blip / duplicate-stream teardown no longer kills the worker (counterpart tensorhub #539 fixes the status codes).
  • 30s deadline on the whole dial+Hello+HelloAck handshake: a hub that accepts the stream but never answers can no longer hang the worker forever; bounded attempts make the disconnected-timeout check effective.
  • redirect_hops resets whenever the loop falls back with backoff, so not_leader routing survives repeated leadership churn.
  • Schemeless not_leader redirect targets inherit the TLS mode of the connection that issued the redirect (no plaintext downgrade); new grpc_ca_bundle setting (GRPC_CA_BUNDLE) for private CA roots.
  • Permanent FAILED_PRECONDITION (worker_id_mismatch/release_id_mismatch/missing identity) exits immediately instead of burning the disconnected timeout.
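
The fatal-auth gate in the first bullet can be sketched as a small counter that only reports "fatal" once both conditions hold. This is a minimal illustration, not the worker's actual code: the class name `AuthStrikeGate`, its method names, and the choice to measure the 60s from the first strike and to reset on a successful handshake are all assumptions, since the description does not say what the elapsed time is measured from.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuthStrikeGate:
    """Hypothetical helper: fatal only after N strikes AND a minimum elapsed time."""

    max_strikes: int = 3          # from the PR text: 3 UNAUTHENTICATED strikes
    min_elapsed_s: float = 60.0   # from the PR text: >= 60s elapsed
    strikes: int = 0
    first_strike_at: Optional[float] = None  # assumption: clock starts at first strike

    def record_unauthenticated(self, now: float) -> bool:
        """Record one UNAUTHENTICATED failure; return True if the worker should exit."""
        if self.first_strike_at is None:
            self.first_strike_at = now
        self.strikes += 1
        elapsed = now - self.first_strike_at
        # Both conditions must hold, so a short burst of failures is not fatal.
        return self.strikes >= self.max_strikes and elapsed >= self.min_elapsed_s

    def reset(self) -> None:
        """Assumption: a successful handshake clears the strike history."""
        self.strikes = 0
        self.first_strike_at = None
```

In a reconnect loop, `now` would typically come from `time.monotonic()`, so that wall-clock adjustments cannot shorten or lengthen the window.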

#373 download failure path

  • Typed UrlExpiredError on presigned-URL 4xx (except 408/429), zero retries; ModelOp path emits ModelEvent{FAILED, url_expired} within seconds so the hub re-mints fresh URLs (CONTRACT §5/§9) instead of the old ~1h blind retry.
  • Fixed _is_terminal_download_error reading exc.status_code instead of exc.response.status_code (requests.HTTPError) — also makes civitai 429/5xx correctly retryable.
  • Blanket 30-try/1h backoff decorator replaced with an explicit policy loop: verify (size/blake3) failures capped at initial+2 retries; ENOSPC raises InsufficientDiskError immediately; backoff dependency dropped.
  • Pre-download disk-headroom check (missing blob bytes + 1GiB vs free) raises InsufficientDiskError.
  • CAS downloads now report progress(bytes_done, bytes_total) end to end.
  • Missing orchestrator-resolved snapshot for a tensorhub ref maps to RETRYABLE (hub residency bug), not client-visible 400.
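
The status-code fix and the presigned-URL rule above can be illustrated with a short sketch. With `requests.HTTPError`, the HTTP status is on `exc.response.status_code` (and `exc.response` may be `None`), so reading `exc.status_code` finds nothing. The helper names `http_status` and `classify_presigned_failure`, the string return values, and the `FakeHTTPError` stand-in are hypothetical and exist only to keep the example self-contained. In the PR itself, the "url_expired" case corresponds to raising `UrlExpiredError`.

```python
from typing import Optional


class FakeHTTPError(Exception):
    """Stand-in for requests.HTTPError: the status lives on exc.response."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        # Minimal response-like object exposing only .status_code.
        self.response = type("FakeResponse", (), {"status_code": status_code})()


def http_status(exc: BaseException) -> Optional[int]:
    """Read the status from exc.response.status_code, tolerating a missing response."""
    response = getattr(exc, "response", None)
    return getattr(response, "status_code", None)


def classify_presigned_failure(exc: BaseException) -> str:
    """Return "url_expired" for presigned-URL 4xx (except 408/429), else "retryable"."""
    status = http_status(exc)
    if status is not None and 400 <= status < 500 and status not in (408, 429):
        return "url_expired"  # zero retries: the same URL cannot succeed
    return "retryable"        # 408, 429, 5xx, and non-HTTP errors


print(classify_presigned_failure(FakeHTTPError(403)))  # url_expired
print(classify_presigned_failure(FakeHTTPError(429)))  # retryable
```

Classifying on the response status rather than on the exception type keeps timeouts and connection resets, which carry no response, on the retryable path.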

Tests

uv run --extra dev pytest -q: 194 passed, 1 skipped (was 182 tests; +13 new: 6 transport, 7 download-failure-path; 1 pre-existing skip).

…aths (#372, #373)

Transport (#372): fatal auth exit gated on 3 UNAUTHENTICATED strikes AND
>=60s elapsed; 30s deadline on the Hello/HelloAck handshake; redirect hop
budget resets on backoff fallback; schemeless not_leader targets inherit
the issuing connection's TLS mode; new grpc_ca_bundle setting; permanent
FAILED_PRECONDITION (worker_id_mismatch/release_id_mismatch/missing
identity) exits immediately.

Downloads (#373): typed UrlExpiredError on presigned-URL 4xx with zero
retries -> ModelEvent{FAILED, url_expired} in seconds; fix
_is_terminal_download_error to read exc.response.status_code; verify
(size/blake3) retries capped at 2; ENOSPC and pre-download headroom
shortfall -> InsufficientDiskError; CAS downloads report
progress(bytes_done, bytes_total); missing orchestrator snapshot maps to
RETRYABLE; blanket backoff decorator replaced, backoff dep dropped.
@PaulFidika merged commit c2855b3 into master on Jul 4, 2026
1 check passed
@PaulFidika deleted the issue/372-373-transport-download-hardening branch on July 4, 2026 08:38