Add worker health monitoring with automatic restart on stall #855

Description

@RUKAYAT-CODER

Overview

src/workers/health/worker-health-check.service.ts exists, but nothing indicates that it triggers automatic recovery when a worker stalls (stops processing jobs without crashing). A silent stall can halt notification delivery or subscription renewals indefinitely.

Specifications

Features:

  • Monitor each worker's last-processed-job timestamp.
  • Trigger a graceful worker restart when the time elapsed since that timestamp exceeds a configurable stall threshold.

Tasks:

  • In each worker, update a Redis key worker:heartbeat:{workerId} on every successful job.
  • Create a WorkerStalledDetector scheduled task that checks heartbeats every 60 seconds.
  • If a worker's heartbeat is older than WORKER_STALL_THRESHOLD_SECONDS (default 300), emit a worker.stalled event and initiate a graceful restart of that worker.
  • Add a Prometheus counter worker_restarts_total{worker_name}.
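
The tasks above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation. HeartbeatStore is a hypothetical in-memory stand-in for Redis, and StallHooks is a hypothetical interface standing in for the event emitter, the restart mechanism, and the worker_restarts_total counter. Treating a worker with no heartbeat as "not stalled" is also an assumption.

```typescript
// Key format from the task list.
const heartbeatKey = (workerId: string): string => `worker:heartbeat:${workerId}`;

// Default from the issue; a real service would read WORKER_STALL_THRESHOLD_SECONDS.
const DEFAULT_STALL_THRESHOLD_SECONDS = 300;

// Hypothetical in-memory stand-in for Redis: key -> epoch milliseconds as a string.
type HeartbeatStore = Record<string, string | undefined>;

// Hypothetical hooks for the event bus, restart mechanism, and restart counter.
interface StallHooks {
  emit(event: string, payload: { workerId: string; ageSeconds: number }): void;
  restart(workerId: string): void;
  incrementRestarts(workerName: string): void;
}

// Called by a processor after each successful job.
function recordHeartbeat(store: HeartbeatStore, workerId: string, nowMs: number): void {
  store[heartbeatKey(workerId)] = String(nowMs);
}

// One detector pass; the issue schedules this every 60 seconds.
// Returns the ids of the workers it restarted.
function checkForStalls(
  store: HeartbeatStore,
  workerIds: string[],
  nowMs: number,
  hooks: StallHooks,
  thresholdSeconds: number = DEFAULT_STALL_THRESHOLD_SECONDS,
): string[] {
  const restarted: string[] = [];
  for (const workerId of workerIds) {
    const raw = store[heartbeatKey(workerId)];
    // Assumption: a worker that has never written a heartbeat is not stalled.
    if (raw === undefined) continue;
    const ageSeconds = (nowMs - Number(raw)) / 1000;
    if (ageSeconds > thresholdSeconds) {
      hooks.emit("worker.stalled", { workerId, ageSeconds });
      hooks.restart(workerId);
      hooks.incrementRestarts(workerId);
      restarted.push(workerId);
    }
  }
  return restarted;
}
```

Passing nowMs in explicitly, rather than calling Date.now() inside the functions, keeps a detector pass deterministic, which makes the stall simulation required by the acceptance criteria easier to write.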

Impacted Files:

  • src/workers/health/worker-health-check.service.ts
  • All processor files.

Acceptance Criteria

  • A stalled worker is restarted within 2x the stall threshold (600 seconds at the default of 300).
  • Prometheus counter increments on each automatic restart.
  • Test simulates a stall by freezing the heartbeat key.
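
As a rough check on the first criterion, the sketch below simulates a frozen heartbeat against a detector that runs on a fixed interval and reports when the stall is first noticed. It assumes the detector uses a strict "older than" comparison and that the restart begins at detection time. The function name and the firstCheckOffsetSeconds parameter are hypothetical.

```typescript
// The heartbeat is frozen at t = 0. The detector runs at
// firstCheckOffsetSeconds, then every checkIntervalSeconds after that.
// Returns the time (in seconds) of the first check that sees the stall.
function secondsUntilStallDetected(
  thresholdSeconds: number,
  checkIntervalSeconds: number,
  firstCheckOffsetSeconds: number,
): number {
  let t = firstCheckOffsetSeconds;
  // A heartbeat counts as stalled only once its age is strictly greater
  // than the threshold, so an age equal to the threshold waits one more tick.
  while (t <= thresholdSeconds) {
    t += checkIntervalSeconds;
  }
  return t;
}

// Worst case over every whole-second phase of a 60-second detector.
function worstCaseDetectionSeconds(
  thresholdSeconds: number,
  checkIntervalSeconds: number,
): number {
  let worst = 0;
  for (let offset = 0; offset < checkIntervalSeconds; offset++) {
    const detected = secondsUntilStallDetected(thresholdSeconds, checkIntervalSeconds, offset);
    if (detected > worst) worst = detected;
  }
  return worst;
}
```

Under these assumptions, with the default threshold of 300 seconds and a 60-second check, detection happens no later than 360 seconds after the last heartbeat. That leaves about 240 seconds of the 600-second budget for the graceful restart itself.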

Metadata

Labels

  • Stellar Wave: Issues in the Stellar wave program
  • bug: Something isn't working
  • enhancement: New feature or request
