Surface DB-write failures and add Postgres disk-capacity alerting #109
Merged
Three related changes prompted by a Postgres outage where the data volume filled up and queries silently stopped responding:

A. Server returns a real error at intake instead of stalling. run_query swallowed failures from add_query/add_task and returned as if the query was accepted, so a full/unavailable datastore left the sync path polling a row that never existed until timeout (~6 min) and the async path returning a fake 200 Accepted for a job that never ran. It now raises QueryIntakeError, which run_sync_query/run_async_query translate into a 500 with a client-safe description.

B. Make the failure observable. shepherd_utils.db now classifies psycopg errors by SQLSTATE and logs a stable, greppable PG_DISK_FULL marker for disk-full (53100) across every worker, and the write helpers no longer spin their full retry loop on non-transient errors.

C. Early-warning disk-capacity alert. New PG_VOLUME_CAPACITY setting (set from the same Helm value that sizes the PVC) lets the monitor compute how full the volume is from pg_database_size + WAL and fire warning/critical alerts via a new db_capacity rule type. No-op until the env var is set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SmSCX6Q8Rz9wMM1iy9xVzy
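To illustrate change B, here is a minimal sketch of SQLSTATE-based classification with a retry loop that gives up immediately on non-transient errors. It is not the code in shepherd_utils.db: the names classify_sqlstate and write_with_retry, the logger name, and the choice of which SQLSTATE classes count as transient are all assumptions made for the example. The only facts taken from the description are the 53100 code and the PG_DISK_FULL marker. psycopg exposes the code on its exceptions as `.sqlstate`; the sketch reads it with getattr so it runs without a database.

```python
import logging

logger = logging.getLogger("shepherd.db")  # hypothetical logger name

DISK_FULL_SQLSTATE = "53100"  # disk_full in the Postgres error-code table

# Assumption: SQLSTATE classes treated as worth retrying in this sketch.
# 08 = connection exception, 40 = transaction rollback, 57 = operator intervention
TRANSIENT_CLASSES = {"08", "40", "57"}


def classify_sqlstate(sqlstate):
    """Return 'disk_full', 'transient', or 'fatal' for a SQLSTATE string."""
    if sqlstate == DISK_FULL_SQLSTATE:
        return "disk_full"
    if sqlstate and sqlstate[:2] in TRANSIENT_CLASSES:
        return "transient"
    # Errors without a SQLSTATE land here too; a real helper may treat them differently.
    return "fatal"


def write_with_retry(write, retries=3):
    """Call write(); retry only when the error is classified as transient."""
    for attempt in range(1, retries + 1):
        try:
            return write()
        except Exception as exc:
            kind = classify_sqlstate(getattr(exc, "sqlstate", None))
            if kind == "disk_full":
                # Stable marker so the condition can be grepped across workers.
                logger.error("PG_DISK_FULL sqlstate=%s: %s", DISK_FULL_SQLSTATE, exc)
            if kind != "transient" or attempt == retries:
                raise
```

With this shape, a disk-full error surfaces after a single attempt rather than after the whole retry budget, which is the behaviour the description asks for.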
Add a "DB disk" row to the Postgres/Redis infra panel. When PG_VOLUME_CAPACITY is configured the poller populates disk_used_pct, so the row shows percent full plus used/capacity (coloured warn >=80%, bad >=90%); otherwise it falls back to the raw database size.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SmSCX6Q8Rz9wMM1iy9xVzy
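The arithmetic behind disk_used_pct and the row colouring can be sketched as follows. This is an illustration under assumptions, not the poller's code: the function names are hypothetical, the capacity is assumed to be already parsed into bytes (how PG_VOLUME_CAPACITY is parsed is not shown here), and the inputs stand in for the pg_database_size result and the WAL size. The 80% and 90% thresholds are the ones quoted for the panel row.

```python
def disk_used_pct(database_bytes, wal_bytes, capacity_bytes):
    """Percent of the volume in use, or None when no capacity is configured."""
    if not capacity_bytes or capacity_bytes <= 0:
        return None  # mirrors the "no-op until the setting is present" behaviour
    return 100.0 * (database_bytes + wal_bytes) / capacity_bytes


def disk_row_status(pct, warn=80.0, bad=90.0):
    """Map a percentage to the panel colouring: ok, warn (>=80), bad (>=90)."""
    if pct is None:
        return "unknown"  # hypothetical label for the raw-size fallback case
    if pct >= bad:
        return "bad"
    if pct >= warn:
        return "warn"
    return "ok"


GIB = 1024 ** 3
pct = disk_used_pct(70 * GIB, 15 * GIB, 100 * GIB)
print(pct, disk_row_status(pct))
```

Note that database size plus WAL only approximates volume usage, since other files on the volume are not counted.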
New "DB disk used %" chart in the Infrastructure section, fed by the pg:disk_used_pct metric already written to the 30-day archive. Y-axis is fixed 0-100%. The series is empty on deployments that don't set PG_VOLUME_CAPACITY.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SmSCX6Q8Rz9wMM1iy9xVzy
The shepherd_brain table grew unbounded: rows only ever flip to COMPLETED or ABANDONED, never get removed, so the durable query-state table kept growing even though the callbacks table is reaped on completion and the Redis query payloads expire via redis_ttl. Postgres has no native row TTL, so add the scheduled equivalent.

A new daily janitor sweep (sweep_query_retention -> db.purge_old_queries) deletes terminal queries -- and any leftover callbacks -- older than query_retention_days (default 30, set 0 to disable). In-flight queries are never touched: only the abandoned-query reaper moves a stuck query to a terminal state, after which it becomes eligible. Runs in run_once, so /api/admin/cleanup triggers it too.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SmSCX6Q8Rz9wMM1iy9xVzy
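The eligibility rule for the retention sweep can be sketched in a few lines. This is an assumption-laden illustration rather than db.purge_old_queries itself: the helper names are hypothetical, and which timestamp the real sweep compares against the cutoff is not stated above, so the sketch simply calls it last_updated.

```python
from datetime import datetime, timedelta, timezone

TERMINAL_STATES = {"COMPLETED", "ABANDONED"}


def retention_cutoff(now, retention_days):
    """Rows older than the returned time are eligible; None means the sweep is disabled."""
    if retention_days <= 0:
        return None  # query_retention_days = 0 disables the sweep
    return now - timedelta(days=retention_days)


def is_purgeable(state, last_updated, cutoff):
    """Only terminal rows older than the cutoff qualify; in-flight rows never do."""
    if cutoff is None:
        return False
    return state in TERMINAL_STATES and last_updated < cutoff


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
cutoff = retention_cutoff(now, 30)
old = datetime(2024, 12, 1, tzinfo=timezone.utc)
print(is_purgeable("COMPLETED", old, cutoff))  # terminal and older than 30 days
print(is_purgeable("RUNNING", old, cutoff))    # hypothetical in-flight state name
```

In a database-backed version the same predicate would become the WHERE clause of a DELETE, with the cutoff passed as a bound parameter.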
Stopping the DB exposed two problems: a query took ~60s to return an error, and the server/monitor flooded the logs polling the dead service.

- Fast failure: every db.py call used pool.connection(60), so acquiring a connection blocked the full 60s when the DB was down. Add a configurable postgres_pool_timeout (default 5s, was an implicit 60) used as the pool default and at every call site, including the monitor poller. Tune up if pool timeouts show up under load.
- Quiet the monitor poller: it logged a Postgres failure on every ~3s tick. Track reachability process-locally and log the outage once (healthy->down) and recovery once, debug in between.
- Quiet the sync-query poll loop: the per-0.5s "failed to get query state" warning (which also fires normally while a query is in flight) is now debug.
- Quiet psycopg's pool: it logs a WARNING on every failed reconnect attempt to keep min_size connections warm. Raise the psycopg.pool logger to ERROR so a DB outage doesn't flood every service's logs; our own code already logs the outage once.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SmSCX6Q8Rz9wMM1iy9xVzy
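The "log the outage once, recovery once, debug in between" pattern from the poller change might look roughly like this. It is a sketch under assumptions: the class name ReachabilityTracker, the logger name, and the message wording are invented for the example, and the log levels chosen for the two transitions are a guess. The final line uses the standard logging API to raise the psycopg.pool logger to ERROR, as the last bullet describes.

```python
import logging

logger = logging.getLogger("shepherd.monitor")  # hypothetical logger name


class ReachabilityTracker:
    """Process-local state that turns per-tick failures into two log lines."""

    def __init__(self):
        self.healthy = True  # assume the database is up until a poll fails

    def record(self, ok, detail=""):
        if not ok and self.healthy:
            logger.error("Postgres unreachable: %s", detail)  # healthy -> down, once
        elif ok and not self.healthy:
            logger.info("Postgres reachable again")           # down -> healthy, once
        else:
            logger.debug("Postgres poll ok=%s %s", ok, detail)  # steady state
        self.healthy = ok


# Silence the pool's per-attempt reconnect warnings during an outage.
logging.getLogger("psycopg.pool").setLevel(logging.ERROR)
```

A poller would call tracker.record(True) after a successful query and tracker.record(False, str(exc)) in its exception handler, so a long outage produces one error line and one recovery line regardless of how many ticks it spans.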