
Add Prometheus metrics for background worker health status #1019

Open
hartz0 wants to merge 1 commit into solutions-plug:main from hartz0:feature/background-worker-health-metrics


hartz0 commented Jun 28, 2026


Closes #962

This PR adds health monitoring for the background workers that run as tokio tasks. Operators can now observe each worker's status in Prometheus/Grafana and receive alerts when a worker crashes, hangs, or stops unexpectedly.

Problem Statement

Background workers (email queue, blockchain sync, rate limit cleanup, newsletter cleanup) exposed no health metrics. When a worker crashed or hung, operators could not detect it from monitoring dashboards, so problems surfaced only through manual log analysis or user reports.

Changes Made

1. Metrics Implementation (services/api/src/metrics.rs)

  • Added worker_status IntGaugeVec metric with name label
  • Status values: 1 = running, 0 = stopped
  • Implemented set_worker_status(name, running) method for workers to report health
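As a rough illustration of the gauge semantics described above, here is a minimal std-only sketch. It is not the code in `metrics.rs`: the `WorkerStatus` type, its `get` and `render` methods, and the use of a `BTreeMap` are assumptions made so the example runs without the Prometheus client crate. Only the metric name `worker_status`, the `name` label, the 1/0 values, and the `set_worker_status(name, running)` shape come from the description.

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

/// Std-only stand-in for a labelled integer gauge.
/// Hypothetical type for illustration; the PR uses an IntGaugeVec instead.
pub struct WorkerStatus {
    values: Mutex<BTreeMap<String, i64>>,
}

impl WorkerStatus {
    pub fn new() -> Self {
        Self { values: Mutex::new(BTreeMap::new()) }
    }

    /// Record 1 for a running worker and 0 for a stopped one.
    pub fn set_worker_status(&self, name: &str, running: bool) {
        let mut values = self.values.lock().unwrap();
        values.insert(name.to_string(), i64::from(running));
    }

    /// Last reported value for a worker, or None if it never reported.
    pub fn get(&self, name: &str) -> Option<i64> {
        self.values.lock().unwrap().get(name).copied()
    }

    /// Render one sample per worker in the Prometheus text exposition format.
    pub fn render(&self) -> String {
        let values = self.values.lock().unwrap();
        let mut out = String::from("# TYPE worker_status gauge\n");
        for (name, value) in values.iter() {
            out.push_str(&format!("worker_status{{name=\"{}\"}} {}\n", name, value));
        }
        out
    }
}
```

With this stand-in, a worker that has reported `true` shows up as `worker_status{name="email_queue"} 1`, and a worker that has never reported has no sample at all, which is the case the no-data alert setting is meant to cover.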

2. Email Queue Worker (services/api/src/email/queue.rs)

  • Added optional metrics parameter to start_worker() method
  • Implemented heartbeat mechanism (30-second interval)
  • Sets status to 1 on startup
  • Updates status every 30s during normal operation
  • Sets status to 0 on clean shutdown
  • Worker name: email_queue
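The startup, heartbeat, and shutdown steps listed above could look roughly like the following. This is a sketch under stated assumptions, not the PR's `start_worker()`: it uses a std thread and `recv_timeout` in place of a tokio task and interval so that it runs without external crates, and the `Health` struct, `spawn_worker` function, and shutdown channel are hypothetical names.

```rust
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};
use std::sync::mpsc::{self, RecvTimeoutError};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Hypothetical shared state standing in for the `worker_status` gauge.
#[derive(Default)]
pub struct Health {
    pub status: AtomicI64,     // 1 = running, 0 = stopped
    pub heartbeats: AtomicU64, // number of heartbeat refreshes so far
}

/// Spawn a worker that reports 1 on startup, refreshes the value on every
/// heartbeat, and reports 0 once shutdown is signalled or the sender is dropped.
pub fn spawn_worker(
    health: Arc<Health>,
    heartbeat: Duration,
    shutdown: mpsc::Receiver<()>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        health.status.store(1, Ordering::SeqCst);
        loop {
            match shutdown.recv_timeout(heartbeat) {
                // No shutdown signal within one interval: refresh the status.
                Err(RecvTimeoutError::Timeout) => {
                    health.status.store(1, Ordering::SeqCst);
                    health.heartbeats.fetch_add(1, Ordering::SeqCst);
                }
                // Explicit signal or dropped sender: leave the loop cleanly.
                Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
            }
        }
        health.status.store(0, Ordering::SeqCst);
    })
}
```

In this sketch the final store to 0 only runs on a clean exit from the loop; a panic inside the loop would leave the last written value in place.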

3. Blockchain Sync Worker (services/api/src/blockchain.rs)

  • Added heartbeat to run_sync_worker() method
  • 30-second heartbeat interval with missed tick handling
  • Reports status using existing metrics from BlockchainClient
  • Sets status to 1 on startup and 0 on shutdown
  • Worker name: blockchain_sync

4. Blockchain Transaction Monitor (services/api/src/blockchain.rs)

  • Added heartbeat to run_transaction_monitor() method
  • Same heartbeat pattern as sync worker
  • Worker name: blockchain_tx_monitor

5. Rate Limiter Cleanup Task (services/api/src/main.rs)

  • Converted to tokio::select! pattern for heartbeat integration
  • 30-second heartbeat alongside 5-minute cleanup interval
  • Fire-and-forget task now reports health
  • Worker name: rate_limiter_cleanup
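Running a 30-second heartbeat alongside a 5-minute cleanup in one task amounts to repeatedly waiting for whichever of two deadlines comes first. The sketch below models only that scheduling decision with plain `Duration` arithmetic; it does not use `tokio::select!`, and the `Timers` and `Event` names are hypothetical. Unlike `tokio::time::interval`, whose first tick completes immediately, each timer here first fires one full period after start.

```rust
use std::time::Duration;

/// What the loop should do when it wakes up (hypothetical enum for this sketch).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Event {
    Heartbeat,
    Cleanup,
}

/// Two periodic timers tracked as "time of next firing", measured from task start.
pub struct Timers {
    heartbeat_every: Duration,
    cleanup_every: Duration,
    next_heartbeat: Duration,
    next_cleanup: Duration,
}

impl Timers {
    pub fn new(heartbeat_every: Duration, cleanup_every: Duration) -> Self {
        Self {
            heartbeat_every,
            cleanup_every,
            next_heartbeat: heartbeat_every,
            next_cleanup: cleanup_every,
        }
    }

    /// Return the earlier deadline and its event, then re-arm that timer.
    /// On a tie the heartbeat goes first; the cleanup follows at the same instant.
    pub fn next_event(&mut self) -> (Duration, Event) {
        if self.next_heartbeat <= self.next_cleanup {
            let at = self.next_heartbeat;
            self.next_heartbeat += self.heartbeat_every;
            (at, Event::Heartbeat)
        } else {
            let at = self.next_cleanup;
            self.next_cleanup += self.cleanup_every;
            (at, Event::Cleanup)
        }
    }
}
```

With the intervals from the description, `Timers::new(Duration::from_secs(30), Duration::from_secs(300))` yields ten heartbeats before the first cleanup, so the status gauge is refreshed many times between cleanup runs.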

6. Newsletter Cleanup Task (services/api/src/main.rs)

  • Converted to tokio::select! pattern for heartbeat integration
  • 30-second heartbeat alongside 1-hour cleanup interval
  • Fire-and-forget task now reports health
  • Worker name: newsletter_cleanup

7. Alert Rules (performance/config/alerts.yaml)

  • BackgroundWorkerDown: Critical alert when worker_status == 0 for at least 1 minute
  • Fires for any worker that stops or crashes
  • noDataState: alerting ensures crashes are detected even if metrics stop

8. Grafana Dashboard (performance/config/grafana-dashboard.json)

  • Added "Background Worker Status" panel
  • Shows all 5 workers with real-time status
  • Color-coded: Green = RUNNING, Red = STOPPED
  • Includes embedded alert rule for dashboard-level alerting
  • Located below "System Health Overview" panel

- Added worker_status gauge metric with worker name labels
- Each worker reports status: 1=running, 0=stopped
- Implemented heartbeat mechanism (30s interval) for all workers:
  * email_queue - email processing worker
  * blockchain_sync - blockchain event sync worker
  * blockchain_tx_monitor - transaction monitoring worker
  * rate_limiter_cleanup - rate limit cleanup task
  * newsletter_cleanup - newsletter cleanup task
- Workers set status to 1 on startup and 0 on shutdown
- Added BackgroundWorkerDown alert rule (fires after 60s of status=0)
- Added Grafana dashboard panel showing all worker statuses with color coding
- Alert configuration includes noDataState=alerting for crash detection

Closes solutions-plug#962

drips-wave Bot commented Jun 28, 2026


@hartz0 Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits.

You can now already apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀

Learn more about application limits


Development

Successfully merging this pull request may close these issues.

Background worker health status not exposed as metrics

2 participants