Fix UDP retry condition: use $received_data flag instead of count($servers)#10
Fix UDP retry condition: use $received_data flag instead of count($servers)#10mcfnord wants to merge 2 commits intosoftins:masterfrom
Conversation
Replace the single 1.5s socket timeout with a 500ms timeout and up to 3 attempts. If no data is received within the window, the same request is re-sent. This recovers from dropped UDP packets without increasing the worst-case wait time beyond the original 1.5s.
…ers) The original count($servers) <= 1 heuristic fails for directory servers that legitimately have only one registered server: data is received successfully but the trailing timeout still triggers an unnecessary retry, adding ~500ms latency. Using an explicit $received_data flag correctly distinguishes "no response at all" (retry) from "stream ended normally" (stop). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
I bet if you run servers.php in the same datacenter as the directory servers themselves, the response loss rate would be nearly zero. From California, the loss rate is noticeable. In 14 days I'll have logs that reveal precise numbers about this PR's efficacy. |
|
Thanks, I'll read through and understand these changes before merging, as they are non-trivial. |
|
There's quite a bit in this AI-generated report, but I sense some fundamental errors (directory growth? And I thought this was about retrying for reliability, but gets only a tiny mention?) that may suggest this is total hogwash! Perhaps this is just an ill-fated experiment. PR #10 Data Quality Analysis — 2-Week Production Report Period: Apr 1–15, 2026 | Total requests: 754,276 directory queries What PR #10 Changed (vs. upstream)
Network Health Metrics (2 weeks) The week 1 failure rate is inflated by an Apr 6 outage: classical, jazz, rock, and choral directories all went dark simultaneously (12,732 failures in one day, zero-server rate spiked to 24.5%). Excluding that day, both weeks run ~5.1–5.7%. Directory expansion is notable: from 82 unique directories on Apr 2 to 127 by Apr 14, reflecting genuine network growth. Ongoing Issue: "Unexpected" Messages These occur when a server responds from a different IP:port than it registered under in the directory. The same IPs (e.g. 129.159.249.5:2215x) appear repeatedly — these are likely multi-homed or NAT-shifted servers. The count has grown from ~200/day (week 1) to ~300–400/day (week 2), tracking the directory expansion. PR #10 does not fix this. The responses are logged and skipped correctly. No data is lost — the CLM_EMPTY_MESSAGE handler already has partial logic to remap servers by host. But version/ping data for these servers is being silently dropped on every poll. Summary The primary measurable improvement from PR #10 is correct versionsort for servers with non-timestamped dev builds — confirmed clean across all 754K requests on PHP 8.1. The $done = true placement fix is protective but dormant (no individual server queries observed). The production-only timing instrumentation (also not in PR #10 proper) has been valuable for diagnosing the Apr 6 outage and confirming the retry mechanism works. The most significant remaining data quality gap is the ~11K "Unexpected" drops from port-mismatched servers — that's a separate bug worth its own fix. |
Summary
count($servers) <= 1retry guard with an explicit$received_databoolean flagProblem with the count($servers) approach
The
count($servers) <= 1heuristic conflates two different things:?queryand?serverrequests,$serversis pre-populated with 1 entry (the target server), so<= 1correctly means "no response data received yet."?directoryrequests,$serversstarts at 0 and grows asCLM_SERVER_LISTpackets arrive. A directory server that legitimately has only one registered server will receive its full response, hit the trailing timeout, and still satisfycount($servers) <= 1— triggering an unnecessary retry and adding ~500ms latency.Production logs confirm this is not hypothetical. Several real directory endpoints regularly return exactly one server:
Fix
An explicit
$received_dataflag is set totrueon the first successfully received packet, beforeprocess_received()is called. The retry and failure conditions check!$received_datainstead ofcount($servers) <= 1.This cleanly separates "did we receive anything?" from "how many servers were listed?"
Relationship to PR #7
This is a corrected version of the retry logic introduced in #7. The motivation and overall structure are the same.
🤖 Generated with Claude Code