Overview
src/workers/health/worker-health-check.service.ts exists but there is no evidence it triggers automatic recovery when a worker stalls (stops processing jobs without crashing). Silent worker stalls can halt notification delivery or subscription renewals indefinitely.
Specifications
Features:
- Monitor each worker's last-processed-job timestamp.
- Trigger a graceful worker restart when the timestamp exceeds a configurable stall threshold.
Tasks:
- In each worker, update a Redis key worker:heartbeat:{workerId} on every successful job.
- Create a WorkerStalledDetector scheduled task that checks heartbeats every 60 seconds.
- If heartbeat is older than WORKER_STALL_THRESHOLD_SECONDS (default 300), emit a worker.stalled event and initiate graceful restart.
- Add a Prometheus counter worker_restarts_total{worker_name}.
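The tasks above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: a plain Map stands in for the Redis keys, the StallHandlers interface stands in for the event emitter and restart mechanism, and a second Map stands in for the Prometheus counter. The names recordHeartbeat, findStalledWorkers, runDetectorTick, and StallHandlers are hypothetical, and the worker ID is assumed to double as the worker_name label.

```typescript
// Hypothetical sketch: all names here are illustrative, and a Map stands in for Redis.

const DEFAULT_STALL_THRESHOLD_SECONDS = 300;

/** Builds the heartbeat key named in the task list. */
function heartbeatKey(workerId: string): string {
  return `worker:heartbeat:${workerId}`;
}

/** Called by a worker after every successful job; stores a timestamp in milliseconds. */
function recordHeartbeat(store: Map<string, number>, workerId: string, nowMs: number): void {
  store.set(heartbeatKey(workerId), nowMs);
}

/** Returns the workers whose last heartbeat is older than the threshold. */
function findStalledWorkers(
  store: ReadonlyMap<string, number>,
  workerIds: readonly string[],
  nowMs: number,
  thresholdSeconds: number = DEFAULT_STALL_THRESHOLD_SECONDS,
): string[] {
  return workerIds.filter((workerId) => {
    const last = store.get(heartbeatKey(workerId));
    // Assumption: a worker with no heartbeat yet has not started, so it is skipped.
    if (last === undefined) return false;
    return (nowMs - last) / 1000 > thresholdSeconds;
  });
}

/** Stand-ins for the event emitter and the graceful-restart mechanism. */
interface StallHandlers {
  emitStalled(workerId: string): void;
  restart(workerId: string): void;
}

/** One detector pass; the scheduled task would call this every 60 seconds. */
function runDetectorTick(
  store: ReadonlyMap<string, number>,
  workerIds: readonly string[],
  nowMs: number,
  handlers: StallHandlers,
  restartCounts: Map<string, number>, // stand-in for worker_restarts_total
  thresholdSeconds: number = DEFAULT_STALL_THRESHOLD_SECONDS,
): string[] {
  const stalled = findStalledWorkers(store, workerIds, nowMs, thresholdSeconds);
  for (const workerId of stalled) {
    handlers.emitStalled(workerId);
    handlers.restart(workerId);
    restartCounts.set(workerId, (restartCounts.get(workerId) ?? 0) + 1);
  }
  return stalled;
}
```

One design point this sketch leaves open: unless the restart refreshes the heartbeat key (or the detector ignores a worker while it restarts), the next tick would see the same stale timestamp and trigger another restart.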
Impacted Files:
- src/workers/health/worker-health-check.service.ts
- All processor files.
Acceptance Criteria
- Stalled worker is restarted within 2x the stall threshold.
- Prometheus counter increments on each automatic restart.
- Test simulates a stall by freezing the heartbeat key.
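As a rough check on the first criterion, the sketch below simulates detector ticks against a heartbeat frozen at time 0 and reports when the stall is first noticed. The function name secondsUntilDetection and its parameters are hypothetical, and it assumes a heartbeat counts as stalled only when its age strictly exceeds the threshold. It measures detection time only; the duration of the graceful restart itself comes on top of this.

```typescript
/**
 * Hypothetical helper: steps through detector ticks against a heartbeat
 * frozen at t = 0 and returns the time (in seconds) of the first tick
 * that sees the heartbeat as older than the threshold.
 */
function secondsUntilDetection(
  thresholdSeconds: number,
  checkIntervalSeconds: number,
  firstTickOffsetSeconds: number,
): number {
  const frozenHeartbeatAt = 0;
  let now = firstTickOffsetSeconds;
  // Keep ticking while the heartbeat age is still within the threshold.
  while (now - frozenHeartbeatAt <= thresholdSeconds) {
    now += checkIntervalSeconds;
  }
  return now;
}

// With the defaults from the task list (300 s threshold, 60 s interval),
// a tick that lands exactly on the threshold misses, so detection slips to 360 s.
const worstCaseSeconds = secondsUntilDetection(300, 60, 0); // 360
```

Under these assumptions, detection happens no later than the threshold plus one check interval (360 seconds with the defaults), which leaves up to 240 seconds for the restart to complete within the 2x bound of 600 seconds.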