Surface DB-write failures and add Postgres disk-capacity alerting #109

Merged: maximusunc merged 6 commits into main from claude/affectionate-galileo-njdlbl on Jun 26, 2026
@maximusunc

Collaborator

Three related changes prompted by a Postgres outage where the data volume
filled up and queries silently stopped responding:

A. Server returns a real error at intake instead of stalling. run_query
swallowed failures from add_query/add_task and returned as if the query
had been accepted. With a full or unavailable datastore, the sync path
then polled a row that never existed until it timed out (~6 min), and the
async path returned a fake 200 Accepted for a job that never ran.
run_query now raises QueryIntakeError, which run_sync_query/run_async_query
translate into a 500 with a client-safe description.
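
A minimal sketch of the intake behaviour in (A), not the code from this PR. The names run_query, add_query, run_async_query and QueryIntakeError come from the description above; the dict-backed store, the `fail` flag, the `(status, body)` return shape and the response wording are assumptions made for illustration.

```python
import asyncio


class QueryIntakeError(Exception):
    """The query could not be recorded at intake."""


async def add_query(store: dict, query_id: str, fail: bool = False) -> None:
    # Hypothetical stand-in for the real datastore write.
    if fail:
        raise OSError("could not extend file: No space left on device")
    store[query_id] = "QUEUED"


async def run_query(store: dict, query_id: str, fail: bool = False) -> str:
    try:
        await add_query(store, query_id, fail=fail)
    except Exception as exc:
        # Surface the failure instead of returning as if the query was accepted.
        raise QueryIntakeError(f"intake failed for {query_id}") from exc
    return query_id


async def run_async_query(
    store: dict, query_id: str, fail: bool = False
) -> tuple[int, dict]:
    try:
        await run_query(store, query_id, fail=fail)
    except QueryIntakeError:
        # Client-safe text only; the underlying error is not echoed to the caller.
        return 500, {"description": "Query could not be accepted. Please retry later."}
    return 200, {"status": "Accepted", "job_id": query_id}
```

Calling `asyncio.run(run_async_query({}, "q1", fail=True))` in this sketch yields a 500 immediately, rather than an acceptance for a row that was never written.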

B. Make the failure observable. shepherd_utils.db now classifies psycopg
errors by SQLSTATE and logs a stable, greppable PG_DISK_FULL marker for
disk-full (53100) across every worker, and the write helpers no longer
spin their full retry loop on non-transient errors.
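
One way the classification in (B) could look, written as a sketch rather than the actual shepherd_utils.db code. It relies only on the `sqlstate` attribute that psycopg exceptions carry and on 53100 being PostgreSQL's disk_full code. The helper names `classify` and `write_with_retry`, and the choice of which SQLSTATEs count as transient, are assumptions.

```python
import logging
import time

logger = logging.getLogger("db_sketch")

DISK_FULL = "53100"  # PostgreSQL SQLSTATE for disk_full

# Assumption: this transient set is illustrative, not taken from the PR.
TRANSIENT_PREFIXES = ("08",)  # class 08: connection exceptions
TRANSIENT_CODES = {"40001", "40P01", "57P03"}  # serialization, deadlock, starting up


def classify(exc: BaseException) -> str:
    """Return 'disk_full', 'transient' or 'fatal' based on exc.sqlstate."""
    sqlstate = getattr(exc, "sqlstate", None)
    if sqlstate == DISK_FULL:
        return "disk_full"
    if sqlstate and (
        sqlstate in TRANSIENT_CODES or sqlstate.startswith(TRANSIENT_PREFIXES)
    ):
        return "transient"
    # Errors without a SQLSTATE are treated as fatal in this sketch.
    return "fatal"


def write_with_retry(write, attempts: int = 3, delay: float = 0.0):
    """Call write(); retry only when the failure is classified as transient."""
    for attempt in range(1, attempts + 1):
        try:
            return write()
        except Exception as exc:
            kind = classify(exc)
            if kind == "disk_full":
                # Stable marker so the condition can be grepped across workers.
                logger.error("PG_DISK_FULL sqlstate=%s: %s", DISK_FULL, exc)
            if kind != "transient" or attempt == attempts:
                raise
            time.sleep(delay)
```

With this shape a disk-full error is logged once and re-raised on the first attempt, while a dropped connection still gets the full retry budget.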

C. Early-warning disk-capacity alert. New PG_VOLUME_CAPACITY setting (set
from the same Helm value that sizes the PVC) lets the monitor compute how
full the volume is from pg_database_size + WAL and fire warning/critical
alerts via a new db_capacity rule type. No-op until the env var is set.
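
The arithmetic behind (C) can be sketched as follows. This is not the monitor's code: the function names are hypothetical, the assumption that PG_VOLUME_CAPACITY holds a Kubernetes-style quantity such as `20Gi` is a guess based on it sharing a Helm value with the PVC, and the 80/90 alert thresholds are borrowed from the dashboard colouring mentioned in a later commit.

```python
import re
from typing import Optional

# Decimal and binary suffixes in the style of Kubernetes quantities (assumed format).
_UNITS = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}


def parse_capacity(value: Optional[str]) -> Optional[int]:
    """Parse a value such as '20Gi' into bytes; None or '' means not configured."""
    if not value or not value.strip():
        return None
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([A-Za-z]*)", value.strip())
    if match is None or match.group(2) not in _UNITS:
        raise ValueError(f"unrecognised capacity: {value!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2)])


def disk_used_pct(
    db_bytes: int, wal_bytes: int, capacity_bytes: Optional[int]
) -> Optional[float]:
    """Percent of the volume used by database files plus WAL."""
    if not capacity_bytes:
        return None  # no-op when the capacity is not set
    return 100.0 * (db_bytes + wal_bytes) / capacity_bytes


def severity(
    pct: Optional[float], warn: float = 80.0, crit: float = 90.0
) -> Optional[str]:
    """Map a percentage to 'warning', 'critical' or None (no alert)."""
    if pct is None:
        return None
    if pct >= crit:
        return "critical"
    if pct >= warn:
        return "warning"
    return None
```

For example, 8 GiB of database files plus 2 GiB of WAL on a `20Gi` volume gives 50% in this sketch, which is below both thresholds.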

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SmSCX6Q8Rz9wMM1iy9xVzy

claude added 2 commits June 25, 2026 14:34
@codecov

codecov Bot commented Jun 25, 2026


Codecov Report

❌ Patch coverage is 19.23077% with 147 lines in your changes missing coverage. Please review.
✅ Project coverage is 47.33%. Comparing base (f06c367) to head (fad56ce).
⚠️ Report is 35 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| shepherd_utils/db.py | 19.48% | 57 Missing and 5 partials ⚠️ |
| workers/monitor/poller.py | 12.19% | 36 Missing ⚠️ |
| workers/monitor/janitor.py | 14.28% | 18 Missing ⚠️ |
| shepherd_server/base_routes.py | 0.00% | 12 Missing ⚠️ |
| workers/monitor/alerts.py | 7.69% | 12 Missing ⚠️ |
| shepherd_utils/config.py | 61.11% | 7 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| shepherd_utils/logger.py | 100.00% <ø> (ø) |
| shepherd_utils/config.py | 89.04% <61.11%> (-10.96%) ⬇️ |
| shepherd_server/base_routes.py | 0.00% <0.00%> (ø) |
| workers/monitor/alerts.py | 46.33% <7.69%> (+46.33%) ⬆️ |
| workers/monitor/janitor.py | 17.92% <14.28%> (+17.92%) ⬆️ |
| workers/monitor/poller.py | 12.13% <12.19%> (+12.13%) ⬆️ |
| shepherd_utils/db.py | 56.83% <19.48%> (-18.17%) ⬇️ |

... and 6 files with indirect coverage changes

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

claude added 4 commits June 25, 2026 23:33
Add a "DB disk" row to the Postgres/Redis infra panel. When
PG_VOLUME_CAPACITY is configured the poller populates disk_used_pct, so the
row shows percent full plus used/capacity (coloured warn >=80%, bad >=90%);
otherwise it falls back to the raw database size.
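
The row logic described above can be sketched like this. The dashboard's real implementation is not shown in this PR text, so the function names, the returned dict shape and the derivation of "used" from the percentage are all assumptions; only the 80%/90% thresholds and the fallback to raw database size come from the commit message.

```python
from typing import Optional


def human_bytes(n: float) -> str:
    """Format a byte count with binary units, e.g. 5368709120 -> '5.0 GiB'."""
    for unit in ("B", "KiB", "MiB", "GiB"):
        if n < 1024:
            return f"{n:.1f} {unit}"
        n /= 1024
    return f"{n:.1f} TiB"


def db_disk_row(
    db_size_bytes: int,
    disk_used_pct: Optional[float] = None,
    capacity_bytes: Optional[int] = None,
) -> dict:
    """Build the text and colour level for a hypothetical 'DB disk' row."""
    if disk_used_pct is None or not capacity_bytes:
        # Capacity not configured: fall back to the raw database size.
        return {"text": human_bytes(db_size_bytes), "level": "ok"}
    used = capacity_bytes * disk_used_pct / 100
    if disk_used_pct >= 90:
        level = "bad"
    elif disk_used_pct >= 80:
        level = "warn"
    else:
        level = "ok"
    text = f"{disk_used_pct:.0f}% ({human_bytes(used)} / {human_bytes(capacity_bytes)})"
    return {"text": text, "level": level}
```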

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SmSCX6Q8Rz9wMM1iy9xVzy

New "DB disk used %" chart in the Infrastructure section, fed by the
pg:disk_used_pct metric already written to the 30-day archive. Y-axis is
fixed 0-100%. The series is empty on deployments that don't set
PG_VOLUME_CAPACITY.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SmSCX6Q8Rz9wMM1iy9xVzy

The shepherd_brain table grew unbounded: rows only ever flip to COMPLETED or
ABANDONED, never get removed, so the durable query-state table kept growing
even though the callbacks table is reaped on completion and the Redis query
payloads expire via redis_ttl.

Postgres has no native row TTL, so add the scheduled equivalent. A new daily
janitor sweep (sweep_query_retention -> db.purge_old_queries) deletes terminal
queries -- and any leftover callbacks -- older than query_retention_days
(default 30, set 0 to disable). In-flight queries are never touched: only the
abandoned-query reaper moves a stuck query to a terminal state, after which it
becomes eligible. Runs in run_once, so /api/admin/cleanup triggers it too.
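
A sketch of the retention rule, using sqlite3 as a stand-in so it runs without a Postgres server; the real db.purge_old_queries targets Postgres and may differ. The table names shepherd_brain and callbacks and the COMPLETED/ABANDONED states come from the text above. The column names (`qid`, `state`, `updated_at`, `query_id`) and the epoch-seconds timestamp are assumptions.

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from typing import Optional

TERMINAL_STATES = ("COMPLETED", "ABANDONED")


def purge_old_queries(
    conn: sqlite3.Connection,
    retention_days: int,
    now: Optional[datetime] = None,
) -> int:
    """Delete terminal queries older than the cutoff; return how many were removed."""
    if retention_days <= 0:
        return 0  # 0 disables the sweep
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=retention_days)).timestamp()
    params = (*TERMINAL_STATES, cutoff)
    with conn:  # one transaction: callbacks first, then the queries themselves
        conn.execute(
            "DELETE FROM callbacks WHERE query_id IN ("
            " SELECT qid FROM shepherd_brain"
            " WHERE state IN (?, ?) AND updated_at < ?)",
            params,
        )
        cur = conn.execute(
            "DELETE FROM shepherd_brain WHERE state IN (?, ?) AND updated_at < ?",
            params,
        )
    return cur.rowcount
```

Because the state filter lists only terminal states, an in-flight row is left alone in this sketch no matter how old it is.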

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SmSCX6Q8Rz9wMM1iy9xVzy

Stopping the DB exposed two problems: a query took ~60s to return an error,
and the server/monitor flooded the logs polling the dead service.

- Fast failure: every db.py call used pool.connection(60), so acquiring a
  connection blocked for the full 60s when the DB was down. Add a configurable
  postgres_pool_timeout (default 5s, previously an implicit 60s) used as the
  pool default and at every call site, including the monitor poller. Raise it
  if pool timeouts show up under load.

- Quiet the monitor poller: it logged a Postgres failure on every ~3s tick.
  Track reachability process-locally and log the outage once (healthy->down)
  and recovery once, debug in between.

- Quiet the sync-query poll loop: the per-0.5s "failed to get query state"
  warning (which also fires normally while a query is in flight) is now debug.

- Quiet psycopg's pool: it logs a WARNING on every failed reconnect attempt
  to keep min_size connections warm. Raise the psycopg.pool logger to ERROR
  so a DB outage doesn't flood every service's logs; our own code already
  logs the outage once.
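
The log-quieting items above can be sketched with the standard logging module alone. The `ReachabilityTracker` class and the `monitor_sketch` logger name are hypothetical; the poller's actual state tracking is not shown in this text. The `psycopg.pool` logger name is the one the commit message refers to.

```python
import logging

logger = logging.getLogger("monitor_sketch")

# Hide psycopg's per-attempt reconnect warnings; ERROR and above still pass.
logging.getLogger("psycopg.pool").setLevel(logging.ERROR)


class ReachabilityTracker:
    """Process-local state: log transitions loudly, everything else at DEBUG."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True  # assume reachable until a check fails

    def record(self, ok: bool, detail: str = "") -> int:
        """Log one check result and return the level that was used."""
        if ok and not self.healthy:
            level, msg = logging.INFO, f"{self.name} recovered"
        elif not ok and self.healthy:
            level, msg = logging.ERROR, f"{self.name} unreachable: {detail}"
        else:
            state = "ok" if ok else "still down"
            level, msg = logging.DEBUG, f"{self.name} {state}"
        self.healthy = ok
        logger.log(level, msg)
        return level
```

Feeding the tracker the sequence up, down, down, up produces one ERROR at the outage and one INFO at recovery, with DEBUG for the ticks in between.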

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SmSCX6Q8Rz9wMM1iy9xVzy
@maximusunc maximusunc merged commit 95baee9 into main Jun 26, 2026
2 checks passed
@maximusunc maximusunc deleted the claude/affectionate-galileo-njdlbl branch June 26, 2026 19:16