Add Prometheus metrics for background worker health status by hartz0 · Pull Request #1019 · solutions-plug/predictIQ

hartz0 · 2026-06-28T06:38:23Z

Closes #962

This PR adds health monitoring for all background workers running as tokio tasks. Operators can now observe worker status in real time via Prometheus/Grafana and receive alerts when workers crash, hang, or stop unexpectedly.

Problem Statement

Background workers (email queue, blockchain sync, blockchain transaction monitor, rate limiter cleanup, newsletter cleanup) had no observable health metrics. When a worker crashed or hung, operators could not detect it from monitoring dashboards and had to rely on manual log analysis or user reports.

Changes Made

1. Metrics Implementation (`services/api/src/metrics.rs`)

- Added `worker_status` `IntGaugeVec` metric with a `name` label
- Status values: 1 = running, 0 = stopped
- Implemented `set_worker_status(name, running)` method for workers to report health
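
To make the metric's semantics concrete, here is a minimal std-only sketch. It is not the code from `metrics.rs`: the PR uses an `IntGaugeVec`, while the `WorkerStatus` struct, its `get` and `render` helpers, and the map-based storage below are assumptions made for illustration. Only the metric name, the `name` label, the 1/0 values, and the `set_worker_status(name, running)` shape come from the description above.

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

/// Std-only stand-in for a gauge vector with a single `name` label.
/// Hypothetical type: the PR itself uses an `IntGaugeVec`.
#[derive(Default)]
pub struct WorkerStatus {
    values: Mutex<BTreeMap<String, i64>>,
}

impl WorkerStatus {
    pub fn new() -> Self {
        Self::default()
    }

    /// Record 1 for a running worker and 0 for a stopped one.
    pub fn set_worker_status(&self, name: &str, running: bool) {
        let mut values = self.values.lock().unwrap();
        values.insert(name.to_string(), i64::from(running));
    }

    /// Current value for `name`, or `None` if that worker never reported.
    pub fn get(&self, name: &str) -> Option<i64> {
        self.values.lock().unwrap().get(name).copied()
    }

    /// Render the sample lines in the style of the Prometheus text format,
    /// sorted by worker name (HELP and TYPE lines are left out).
    pub fn render(&self) -> String {
        let values = self.values.lock().unwrap();
        values
            .iter()
            .map(|(name, v)| format!("worker_status{{name=\"{name}\"}} {v}\n"))
            .collect()
    }
}
```

After `set_worker_status("email_queue", true)`, `render()` yields the line `worker_status{name="email_queue"} 1`, which is the shape of sample the alert rule and dashboard panel would query.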

2. Email Queue Worker (`services/api/src/email/queue.rs`)

- Added optional metrics parameter to `start_worker()` method
- Implemented heartbeat mechanism (30-second interval)
- Sets status to 1 on startup
- Updates status every 30s during normal operation
- Sets status to 0 on clean shutdown
- Worker name: `email_queue`
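
The lifecycle above (1 on startup, refresh on every heartbeat, 0 on clean shutdown) can be sketched with the standard library alone. This is a thread-based analogue, not the tokio task from `queue.rs`: `run_worker`, the `String` job type, the `AtomicI64` used as the status gauge, and the channel-based shutdown are all assumptions made for illustration.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::mpsc::{self, RecvTimeoutError};
use std::sync::Arc;
use std::time::{Duration, Instant};

/// Hypothetical worker loop: processes jobs, refreshes the status gauge
/// every `heartbeat`, and reports 0 once the job channel is closed.
/// Returns the number of jobs processed.
pub fn run_worker(
    status: Arc<AtomicI64>,
    jobs: mpsc::Receiver<String>,
    heartbeat: Duration,
) -> usize {
    status.store(1, Ordering::SeqCst); // startup: running
    let mut next_beat = Instant::now() + heartbeat;
    let mut processed = 0;

    loop {
        // Wait for a job, but no longer than the time left until the next heartbeat.
        let wait = next_beat.saturating_duration_since(Instant::now());
        match jobs.recv_timeout(wait) {
            Ok(_job) => processed += 1, // a real worker would send the email here
            Err(RecvTimeoutError::Timeout) => {
                status.store(1, Ordering::SeqCst); // heartbeat: still running
                next_beat += heartbeat;
            }
            Err(RecvTimeoutError::Disconnected) => break, // all senders dropped
        }
    }

    status.store(0, Ordering::SeqCst); // clean shutdown: stopped
    processed
}
```

Dropping the last sender is what triggers the clean shutdown here, so the gauge only reaches 0 on that path. A worker that panics mid-loop would never run the final `store`, which is the case the alert rule's no-data handling is meant to cover.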

3. Blockchain Sync Worker (`services/api/src/blockchain.rs`)

- Added heartbeat to `run_sync_worker()` method
- 30-second heartbeat interval with missed tick handling
- Reports status using existing metrics from `BlockchainClient`
- Sets status to 1 on startup and 0 on shutdown
- Worker name: `blockchain_sync`
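
The description does not say which missed-tick policy the heartbeat uses. As background, tokio's `MissedTickBehavior` offers three variants (`Burst`, the default, plus `Delay` and `Skip`), and the sketch below models how each would pick the next deadline after a late tick. The `MissedTick` enum and `next_deadline` function are hypothetical names, and times are simplified to whole seconds since the interval started.

```rust
/// Hypothetical mirror of the three variants of tokio's `MissedTickBehavior`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MissedTick {
    /// Keep the original schedule; late ticks fire back to back until caught up.
    Burst,
    /// Restart the period from the moment the late tick was observed.
    Delay,
    /// Drop the missed ticks and realign to the next multiple of the period.
    Skip,
}

/// `scheduled` is when the tick was due, `now` is when it was actually observed.
/// Returns the deadline of the following tick.
pub fn next_deadline(policy: MissedTick, scheduled: u64, now: u64, period: u64) -> u64 {
    assert!(period > 0, "period must be non-zero");
    assert!(now >= scheduled, "a tick cannot be observed before it is due");
    match policy {
        MissedTick::Burst => scheduled + period,
        MissedTick::Delay => now + period,
        MissedTick::Skip => now + period - (now - scheduled) % period,
    }
}
```

For a 30-second heartbeat whose tick was due at t=30 but observed at t=100, this gives 60 for `Burst` (already in the past, so it fires immediately), 130 for `Delay`, and 120 for `Skip`. For a status gauge that is only ever re-set to 1, catching up on missed ticks adds no information, so `Delay` or `Skip` would be the more natural fit.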

4. Blockchain Transaction Monitor (`services/api/src/blockchain.rs`)

- Added heartbeat to `run_transaction_monitor()` method
- Same heartbeat pattern as sync worker
- Worker name: `blockchain_tx_monitor`

5. Rate Limiter Cleanup Task (`services/api/src/main.rs`)

- Converted to `tokio::select!` pattern for heartbeat integration
- 30-second heartbeat alongside 5-minute cleanup interval
- Fire-and-forget task now reports health
- Worker name: `rate_limiter_cleanup`

6. Newsletter Cleanup Task (`services/api/src/main.rs`)

- Converted to `tokio::select!` pattern for heartbeat integration
- 30-second heartbeat alongside 1-hour cleanup interval
- Fire-and-forget task now reports health
- Worker name: `newsletter_cleanup`
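
Both cleanup tasks multiplex two timers in one loop: a fast heartbeat and a slow cleanup. The sketch below simulates that schedule on a virtual clock to show how the two interleave. It is a std-only model, not the `tokio::select!` code from `main.rs`: `Tick` and `schedule` are hypothetical names, ties are broken deterministically with the heartbeat first (whereas `tokio::select!` picks among ready branches at random unless `biased;` is used), and the first tick is placed one full period in (whereas `tokio::time::interval` completes its first tick immediately).

```rust
/// Which timer fired (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tick {
    Heartbeat,
    Cleanup,
}

/// Simulate a loop that waits on two periodic timers and handles whichever
/// is due first. Returns `(second, tick)` pairs up to and including `horizon_secs`.
pub fn schedule(heartbeat_secs: u64, cleanup_secs: u64, horizon_secs: u64) -> Vec<(u64, Tick)> {
    assert!(heartbeat_secs > 0 && cleanup_secs > 0, "periods must be non-zero");
    let mut events = Vec::new();
    let mut next_heartbeat = heartbeat_secs;
    let mut next_cleanup = cleanup_secs;

    loop {
        // The earlier deadline wins; on a tie the heartbeat is handled first.
        let (at, tick) = if next_heartbeat <= next_cleanup {
            (next_heartbeat, Tick::Heartbeat)
        } else {
            (next_cleanup, Tick::Cleanup)
        };
        if at > horizon_secs {
            break;
        }
        events.push((at, tick));
        match tick {
            Tick::Heartbeat => next_heartbeat += heartbeat_secs,
            Tick::Cleanup => next_cleanup += cleanup_secs,
        }
    }
    events
}
```

With the rate limiter's 30-second heartbeat and 5-minute cleanup, `schedule(30, 300, 600)` produces 20 heartbeats and 2 cleanups. The newsletter task's 1-hour cleanup gives 120 heartbeats per cleanup. In both cases the health signal is refreshed far more often than the actual work runs, which is why the heartbeat needs its own timer rather than piggybacking on the cleanup interval.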

7. Alert Rules (`performance/config/alerts.yaml`)

- `BackgroundWorkerDown`: critical alert that fires when `worker_status == 0` for at least 1 minute
- Fires for any worker that stops or crashes
- `noDataState: alerting` ensures crashes are detected even if the metric stops being reported

8. Grafana Dashboard (`performance/config/grafana-dashboard.json`)

- Added "Background Worker Status" panel
- Shows all 5 workers with real-time status
- Color-coded: Green = RUNNING, Red = STOPPED
- Includes embedded alert rule for dashboard-level alerting
- Located below "System Health Overview" panel

- Added worker_status gauge metric with worker name labels
- Each worker reports status: 1=running, 0=stopped
- Implemented heartbeat mechanism (30s interval) for all workers:
  * email_queue - email processing worker
  * blockchain_sync - blockchain event sync worker
  * blockchain_tx_monitor - transaction monitoring worker
  * rate_limiter_cleanup - rate limit cleanup task
  * newsletter_cleanup - newsletter cleanup task
- Workers set status to 1 on startup and 0 on shutdown
- Added BackgroundWorkerDown alert rule (fires after 60s of status=0)
- Added Grafana dashboard panel showing all worker statuses with color coding
- Alert configuration includes noDataState=alerting for crash detection

Closes solutions-plug#962

drips-wave · 2026-06-28T06:38:33Z

@hartz0 Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits.

You can now apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀

hartz0 wants to merge 1 commit into solutions-plug:main from hartz0:feature/background-worker-health-metrics
