Claude/finish query worker limits bkwxo3 #110

Merged: maximusunc merged 3 commits into main from claude/finish-query-worker-limits-bkwxo3 on Jun 26, 2026

@maximusunc

Collaborator

No description provided.

claude and others added 3 commits June 11, 2026 22:55
Workers are killed mid-task on SIGTERM today (heartbeat writes its marker
and exits immediately), so Kubernetes rollouts/scale-downs/node-drains drop
in-flight work that only Redis reclaim recovers. Make the shared task path a
good shutdown citizen and let ops tune concurrency per Deployment.

shepherd_utils/shared.py:
- install_shutdown_handlers(): asyncio-aware SIGTERM/SIGINT handlers that set
  a shutdown flag (loop.add_signal_handler, with a signal.signal fallback).
- get_tasks() stops pulling new work on shutdown and drains in-flight tasks by
  acquiring all concurrency-semaphore permits (every worker already releases
  its permit when a task finishes, so this needs no per-worker changes),
  bounded by worker_drain_timeout_sec, then exits 0. Stragglers fall to reclaim.
- _resolve_task_limit(): TASK_LIMIT env var overrides a worker's in-code
  default; each worker is its own container so one env per Deployment is
  unambiguous. No behavior change unless set.
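
The shared.py changes above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the names `resolve_task_limit`, `install_handlers`, and `drain`, their signatures, and the handling of an invalid `TASK_LIMIT` (fall back to the in-code default) are all assumptions.

```python
import asyncio
import os
import signal


def resolve_task_limit(default: int, env=os.environ) -> int:
    """Return TASK_LIMIT from the environment if it is a positive integer.

    Assumption: an unset, non-numeric, or non-positive value falls back to
    the worker's in-code default.
    """
    raw = env.get("TASK_LIMIT")
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default


def install_handlers(loop: asyncio.AbstractEventLoop, shutdown: asyncio.Event) -> None:
    """Set a shutdown flag on SIGTERM/SIGINT instead of exiting immediately."""
    for sig in (signal.SIGTERM, signal.SIGINT):
        try:
            loop.add_signal_handler(sig, shutdown.set)
        except NotImplementedError:
            # Fallback for event loops without add_signal_handler support.
            signal.signal(
                sig,
                lambda signum, frame: loop.call_soon_threadsafe(shutdown.set),
            )


async def drain(semaphore: asyncio.Semaphore, limit: int, timeout: float) -> bool:
    """Wait for in-flight tasks by acquiring every concurrency permit.

    Each task releases its permit when it finishes, so holding all `limit`
    permits means nothing is still running. Returns False on timeout.
    """

    async def acquire_all() -> None:
        for _ in range(limit):
            await semaphore.acquire()

    try:
        await asyncio.wait_for(acquire_all(), timeout)
        return True
    except asyncio.TimeoutError:
        return False
```

In this sketch, permits acquired before a timeout are not handed back. That is harmless if the process exits right after draining, which is the behavior the commit message describes; tasks still running at that point are left to reclaim.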

shepherd_utils/heartbeat.py:
- manage_signals flag so get_tasks owns shutdown instead of the immediate-exit
  signal handler; mark_clean_shutdown() writes the marker from the loop so the
  monitor still classifies the exit as a clean scale-down, not a crash.

shepherd_utils/reclaim.py:
- finish_query gets an idle floor of 240s. Its async callback retries can run
  for minutes; at the 30s default, a second consumer could XCLAIM the message
  mid-callback and deliver the callback twice.
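
One way to picture the idle floor is as a per-stream minimum applied to the reclaim idle threshold. The sketch below is illustrative only: the table `IDLE_FLOORS_SEC` and the helper `min_idle_ms` are hypothetical names, not taken from shepherd_utils/reclaim.py. The result is in milliseconds because that is the unit Redis XCLAIM expects for its min-idle-time argument.

```python
DEFAULT_MIN_IDLE_SEC = 30  # default idle threshold quoted in the commit message

# Hypothetical per-stream floors; only the finish_query value comes from the PR.
IDLE_FLOORS_SEC = {"finish_query": 240}


def min_idle_ms(stream: str, requested_sec: float = DEFAULT_MIN_IDLE_SEC) -> int:
    """Return the idle time (ms) a pending message must reach before reclaim.

    The floor only ever raises the threshold, so a stream with long-running
    callbacks is not claimed by a second consumer while the first still works.
    """
    floor_sec = IDLE_FLOORS_SEC.get(stream, 0)
    return int(max(requested_sec, floor_sec) * 1000)
```

With these values, a finish_query message must sit idle for 240 seconds before another consumer may claim it, while streams without a floor keep the 30 second default.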

config.py: worker_drain_timeout_sec (default 30s).
README + tests for the env override, drain/exit, clean marker, and shutdown.

https://claude.ai/code/session_019ZsKWm2SqKkGqvNqNBjfaU
… POST

The async callback built `payload` from the decompressed message but left
`message_bytes` referenced in scope, so both full copies stayed resident for
the entire (up to 120s x retries) POST -- doubling peak memory per in-flight
task under load. Rebind `message_bytes` to the spliced result instead so the
original buffer is freed as soon as the new one is built; only one copy is
held during the POST. Wire format is unchanged (still a single Content-Length
body).
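
The rebinding pattern described above might look like the sketch below. It is an assumption-laden illustration rather than the worker's actual code: the function `post_callback`, the injected `post` coroutine, the use of zlib, and the JSON envelope key are all hypothetical.

```python
import zlib
from typing import Awaitable, Callable


async def post_callback(
    compressed: bytes,
    post: Callable[[bytes], Awaitable[int]],
    envelope_key: str = "message",
) -> int:
    """Build the callback body and POST it while holding one full copy."""
    message_bytes = zlib.decompress(compressed)

    # Rebind the same name to the spliced body. Once the new buffer exists,
    # nothing in this function references the decompressed original, so it
    # can be freed before the (possibly long, retried) POST starts.
    message_bytes = b'{"' + envelope_key.encode() + b'":' + message_bytes + b"}"

    # Still sent as a single body of known length.
    return await post(message_bytes)
```

Both buffers still coexist briefly while the concatenation runs; the saving is for the duration of the POST. It also only holds if the caller does not keep its own reference to the decompressed message.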

https://claude.ai/code/session_019ZsKWm2SqKkGqvNqNBjfaU
…k holding on to too many full callback messages when trying to send them back
maximusunc merged commit 335c170 into main on Jun 26, 2026
2 checks passed
maximusunc deleted the claude/finish-query-worker-limits-bkwxo3 branch on June 26, 2026 16:38
@codecov

codecov Bot commented Jun 26, 2026


Codecov Report

❌ Patch coverage is 62.22222% with 34 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.26%. Comparing base (f06c367) to head (9594469).
⚠️ Report is 27 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| shepherd_utils/shared.py | 62.66% | 25 Missing and 3 partials ⚠️ |
| shepherd_utils/heartbeat.py | 60.00% | 4 Missing ⚠️ |
| workers/finish_query/worker.py | 50.00% | 2 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| shepherd_utils/config.py | 100.00% <100.00%> (ø) |
| shepherd_utils/reclaim.py | 16.98% <ø> (ø) |
| workers/finish_query/worker.py | 78.82% <50.00%> (+0.34%) ⬆️ |
| shepherd_utils/heartbeat.py | 41.58% <60.00%> (+11.47%) ⬆️ |
| shepherd_utils/shared.py | 73.00% <62.66%> (+2.93%) ⬆️ |

... and 6 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e09b904...9594469.

2 participants