Surface DB-write failures and add Postgres disk-capacity alerting by maximusunc · Pull Request #109 · BioPack-team/shepherd

maximusunc · 2026-06-25T23:10:42Z

Three related changes prompted by a Postgres outage where the data volume
filled up and queries silently stopped responding:

A. Server returns a real error at intake instead of stalling. run_query
swallowed failures from add_query/add_task and returned as if the query
was accepted, so a full/unavailable datastore left the sync path polling
a row that never existed until timeout (~6 min) and the async path
returning a fake 200 Accepted for a job that never ran. It now raises
QueryIntakeError, which run_sync_query/run_async_query translate into a
500 with a client-safe description.
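
The intake change in A can be sketched roughly as follows. This is a minimal, framework-free illustration, not the code from this PR: the `Response` dataclass, the dict-backed `store`, the `_full` flag, and the exact response bodies are all assumptions made for the example; only the names `run_query`, `run_async_query`, `add_query`, and `QueryIntakeError` come from the description above.

```python
from dataclasses import dataclass


class QueryIntakeError(Exception):
    """Raised when a query cannot be durably recorded at intake."""


@dataclass
class Response:
    # Hypothetical stand-in for the web framework's response object.
    status_code: int
    body: dict


def add_query(store: dict, query_id: str, payload: dict) -> None:
    # Hypothetical stand-in for the real DB write; fails when the store is "full".
    if store.get("_full"):
        raise OSError("could not extend file: No space left on device")
    store[query_id] = payload


def run_query(store: dict, query_id: str, payload: dict) -> str:
    try:
        add_query(store, query_id, payload)
    except Exception as exc:
        # Do not swallow the failure: surface it to the route handler.
        raise QueryIntakeError("query could not be recorded") from exc
    return query_id


def run_async_query(store: dict, query_id: str, payload: dict) -> Response:
    try:
        run_query(store, query_id, payload)
    except QueryIntakeError:
        # Client-safe description: no SQL, hostnames, or stack traces.
        return Response(
            500,
            {"status": "Error", "description": "Unable to accept query; please retry later."},
        )
    return Response(200, {"status": "Accepted", "job_id": query_id})
```

The point of the pattern is that the caller only sees "Accepted" once the write has actually succeeded; the underlying exception is chained for server-side logs but never echoed to the client.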

B. Make the failure observable. shepherd_utils.db now classifies psycopg
errors by SQLSTATE and logs a stable, greppable PG_DISK_FULL marker for
disk-full (53100) across every worker, and the write helpers no longer
spin their full retry loop on non-transient errors.
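
A rough sketch of the classification and retry behaviour in B, under stated assumptions. psycopg 3 exceptions expose a `sqlstate` attribute, which is read here via `getattr` so the example runs without a database. The set of SQLSTATE classes treated as transient, the treatment of a missing SQLSTATE as transient, the log message layout, and the `write_with_retry` helper name are assumptions for illustration, not the PR's actual implementation.

```python
import logging
import time

logger = logging.getLogger("shepherd.db.sketch")

DISK_FULL = "53100"  # SQLSTATE disk_full
# Assumption: classes treated as transient (08 = connection exception,
# 40 = transaction rollback, e.g. serialization failure or deadlock).
TRANSIENT_CLASSES = ("08", "40")


def classify(exc: BaseException) -> str:
    """Map an exception to 'disk_full', 'transient', or 'permanent'."""
    sqlstate = getattr(exc, "sqlstate", None)
    if sqlstate == DISK_FULL:
        return "disk_full"
    # Assumption: errors without a SQLSTATE (e.g. client-side connection
    # failures) are worth retrying.
    if sqlstate is None or sqlstate[:2] in TRANSIENT_CLASSES:
        return "transient"
    return "permanent"


def write_with_retry(write, retries: int = 3, delay: float = 0.0):
    """Call write(); retry only on transient errors, up to `retries` attempts."""
    for attempt in range(1, retries + 1):
        try:
            return write()
        except Exception as exc:
            kind = classify(exc)
            if kind == "disk_full":
                # Stable, greppable marker for log-based alerting.
                logger.error("PG_DISK_FULL sqlstate=%s: %s", DISK_FULL, exc)
            if kind != "transient" or attempt == retries:
                raise
            time.sleep(delay)
```

With this shape, a disk-full error is logged once with the marker and re-raised on the first attempt instead of burning through the whole retry budget.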

C. Early-warning disk-capacity alert. New PG_VOLUME_CAPACITY setting (set
from the same Helm value that sizes the PVC) lets the monitor compute how
full the volume is from pg_database_size + WAL and fire warning/critical
alerts via a new db_capacity rule type. No-op until the env var is set.
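
The capacity check in C boils down to simple arithmetic, sketched below. The function names and the 80%/90% thresholds are assumptions for illustration (the thresholds are borrowed from the dashboard colouring mentioned later in this PR; the alert rule's real thresholds are not stated here). The byte counts are taken as inputs rather than queried, so the sketch runs without Postgres.

```python
from typing import Optional

# Assumption: thresholds mirror the dashboard colouring (warn >= 80, bad >= 90).
WARN_PCT = 80.0
CRIT_PCT = 90.0


def disk_used_pct(db_bytes: int, wal_bytes: int, capacity_bytes: int) -> float:
    """Percent of the volume consumed by database files plus WAL."""
    return 100.0 * (db_bytes + wal_bytes) / capacity_bytes


def db_capacity_alert(
    db_bytes: int, wal_bytes: int, capacity_bytes: Optional[int]
) -> Optional[str]:
    """Return 'critical', 'warning', or None (no alert)."""
    if not capacity_bytes:
        # No-op when the capacity setting is unset, as described above.
        return None
    pct = disk_used_pct(db_bytes, wal_bytes, capacity_bytes)
    if pct >= CRIT_PCT:
        return "critical"
    if pct >= WARN_PCT:
        return "warning"
    return None
```

For example, 7 GiB of database files plus 1 GiB of WAL on a 10 GiB volume is 80% full, which would fire a warning under these assumed thresholds.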

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SmSCX6Q8Rz9wMM1iy9xVzy

codecov · 2026-06-25T23:21:12Z

Codecov Report

❌ Patch coverage is 19.23077% with 147 lines in your changes missing coverage. Please review.
✅ Project coverage is 47.33%. Comparing base (f06c367) to head (fad56ce).
⚠️ Report is 35 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| shepherd_utils/db.py | 19.48% | 57 Missing and 5 partials ⚠️ |
| workers/monitor/poller.py | 12.19% | 36 Missing ⚠️ |
| workers/monitor/janitor.py | 14.28% | 18 Missing ⚠️ |
| shepherd_server/base_routes.py | 0.00% | 12 Missing ⚠️ |
| workers/monitor/alerts.py | 7.69% | 12 Missing ⚠️ |
| shepherd_utils/config.py | 61.11% | 7 Missing ⚠️ |

| Files with missing lines | Coverage Δ | |
|---|---|---|
| shepherd_utils/logger.py | `100.00% <ø> (ø)` | |
| shepherd_utils/config.py | `89.04% <61.11%> (-10.96%)` | ⬇️ |
| shepherd_server/base_routes.py | `0.00% <0.00%> (ø)` | |
| workers/monitor/alerts.py | `46.33% <7.69%> (+46.33%)` | ⬆️ |
| workers/monitor/janitor.py | `17.92% <14.28%> (+17.92%)` | ⬆️ |
| workers/monitor/poller.py | `12.13% <12.19%> (+12.13%)` | ⬆️ |
| shepherd_utils/db.py | `56.83% <19.48%> (-18.17%)` | ⬇️ |

... and 6 files with indirect coverage changes

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e09b904...fad56ce.

Add a "DB disk" row to the Postgres/Redis infra panel. When PG_VOLUME_CAPACITY is configured the poller populates disk_used_pct, so the row shows percent full plus used/capacity (coloured warn >=80%, bad >=90%); otherwise it falls back to the raw database size.

New "DB disk used %" chart in the Infrastructure section, fed by the pg:disk_used_pct metric already written to the 30-day archive. Y-axis is fixed 0-100%. The series is empty on deployments that don't set PG_VOLUME_CAPACITY.

The shepherd_brain table grew unbounded: rows only ever flip to COMPLETED or ABANDONED and are never removed, so the durable query-state table kept growing even though the callbacks table is reaped on completion and the Redis query payloads expire via redis_ttl. Postgres has no native row TTL, so add the scheduled equivalent.

A new daily janitor sweep (sweep_query_retention -> db.purge_old_queries) deletes terminal queries, and any leftover callbacks, older than query_retention_days (default 30, set 0 to disable). In-flight queries are never touched: only the abandoned-query reaper moves a stuck query to a terminal state, after which it becomes eligible. Runs in run_once, so /api/admin/cleanup triggers it too.
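
A minimal sketch of what such a retention purge could look like. It uses `sqlite3` so it runs anywhere; the real code targets Postgres. The column names (`qid`, `state`, `updated_at`, `query_id`), the use of Unix timestamps, and the function signature are assumptions for illustration; only the table names, the terminal states, and the "0 disables" behaviour come from the commit message above.

```python
import sqlite3
import time
from typing import Optional

TERMINAL_STATES = ("COMPLETED", "ABANDONED")
SECONDS_PER_DAY = 86400


def purge_old_queries(
    conn: sqlite3.Connection, retention_days: int, now: Optional[float] = None
) -> int:
    """Delete terminal queries (and their callbacks) older than the cutoff.

    Returns the number of query rows removed. In-flight rows are never
    matched because only terminal states are selected.
    """
    if retention_days <= 0:
        return 0  # retention disabled
    now = time.time() if now is None else now
    cutoff = now - retention_days * SECONDS_PER_DAY
    params = (*TERMINAL_STATES, cutoff)
    with conn:  # one transaction: callbacks first, then the queries themselves
        conn.execute(
            "DELETE FROM callbacks WHERE query_id IN ("
            "SELECT qid FROM shepherd_brain "
            "WHERE state IN (?, ?) AND updated_at < ?)",
            params,
        )
        cur = conn.execute(
            "DELETE FROM shepherd_brain WHERE state IN (?, ?) AND updated_at < ?",
            params,
        )
    return cur.rowcount
```

Deleting callbacks before the parent rows keeps the subquery valid and avoids leaving orphans if the sweep is interrupted between statements.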

Stopping the DB exposed two problems: a query took ~60s to return an error, and the server/monitor flooded the logs polling the dead service.

- Fast failure: every db.py call used pool.connection(60), so acquiring a connection blocked the full 60s when the DB was down. Add a configurable postgres_pool_timeout (default 5s, was an implicit 60) used as the pool default and at every call site, including the monitor poller. Tune up if pool timeouts show up under load.
- Quiet the monitor poller: it logged a Postgres failure on every ~3s tick. Track reachability process-locally and log the outage once (healthy->down) and recovery once, debug in between.
- Quiet the sync-query poll loop: the per-0.5s "failed to get query state" warning (which also fires normally while a query is in flight) is now debug.
- Quiet psycopg's pool: it logs a WARNING on every failed reconnect attempt to keep min_size connections warm. Raise the psycopg.pool logger to ERROR so a DB outage doesn't flood every service's logs; our own code already logs the outage once.
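
The log-quieting described above can be sketched as a small state tracker. The `Reachability` class, its method name, the logger name, and the message wording are hypothetical; the log-once-per-transition behaviour and the `psycopg.pool` logger level are taken from the commit message.

```python
import logging
from typing import Optional

logger = logging.getLogger("shepherd.monitor.sketch")

# Silence per-attempt reconnect warnings from psycopg's connection pool.
logging.getLogger("psycopg.pool").setLevel(logging.ERROR)

# Assumption: seconds to wait for a pooled connection, per the new default.
POSTGRES_POOL_TIMEOUT = 5.0


class Reachability:
    """Process-local tracker that logs only on state transitions."""

    def __init__(self) -> None:
        self.healthy = True

    def record(self, ok: bool, detail: str = "") -> Optional[int]:
        """Record one poll result; return the log level used, or None."""
        if ok and self.healthy:
            return None  # steady state, nothing to say
        if ok:
            level, msg = logging.INFO, "Postgres reachable again"
        elif self.healthy:
            level, msg = logging.ERROR, f"Postgres unreachable: {detail}"
        else:
            level, msg = logging.DEBUG, f"Postgres still unreachable: {detail}"
        self.healthy = ok
        logger.log(level, msg)
        return level
```

A poller ticking every few seconds would call `record()` once per tick, so an outage of any length produces one ERROR line, one INFO line on recovery, and only DEBUG output in between.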

claude added 2 commits June 25, 2026 14:34

Apply black formatting

a077c12

claude added 4 commits June 25, 2026 23:33

maximusunc merged commit 95baee9 into main Jun 26, 2026
2 checks passed

maximusunc deleted the claude/affectionate-galileo-njdlbl branch June 26, 2026 19:16

maximusunc merged 6 commits into main from claude/affectionate-galileo-njdlbl
