Add Prometheus metrics for background worker health status #1019
Open
hartz0 wants to merge 1 commit into
- Added worker_status gauge metric with worker name labels
- Each worker reports status: 1=running, 0=stopped
- Implemented heartbeat mechanism (30s interval) for all workers:
  * email_queue - email processing worker
  * blockchain_sync - blockchain event sync worker
  * blockchain_tx_monitor - transaction monitoring worker
  * rate_limiter_cleanup - rate limit cleanup task
  * newsletter_cleanup - newsletter cleanup task
- Workers set status to 1 on startup and 0 on shutdown
- Added BackgroundWorkerDown alert rule (fires after 60s of status=0)
- Added Grafana dashboard panel showing all worker statuses with color coding
- Alert configuration includes noDataState=alerting for crash detection

Closes solutions-plug#962
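The startup/shutdown reporting and heartbeat described in the commit message could look roughly like the sketch below. It is not the code from this PR: it uses only the standard library, with a mutex-protected map standing in for the Prometheus `IntGaugeVec` and a plain thread loop standing in for a tokio task. `WorkerStatusGauge`, `WorkerStatusGuard`, `heartbeat`, and `run_worker` are hypothetical names; only `set_worker_status(name, running)` and the worker names come from the PR text.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::time::Duration;

/// Stand-in for a labelled gauge: one integer value per worker name.
/// (Assumption: the real code uses a Prometheus IntGaugeVec instead.)
#[derive(Default)]
pub struct WorkerStatusGauge {
    values: Mutex<HashMap<String, i64>>,
}

impl WorkerStatusGauge {
    /// Record 1 (running) or 0 (stopped) for the named worker.
    pub fn set_worker_status(&self, name: &str, running: bool) {
        let mut map = self.values.lock().unwrap();
        map.insert(name.to_string(), i64::from(running));
    }

    /// Last reported value, or None if the worker never reported.
    pub fn get(&self, name: &str) -> Option<i64> {
        let map = self.values.lock().unwrap();
        map.get(name).copied()
    }
}

/// Hypothetical RAII helper: reports 1 when created and 0 when dropped,
/// so the gauge is reset on every exit path that unwinds or returns.
pub struct WorkerStatusGuard {
    gauge: Arc<WorkerStatusGauge>,
    name: &'static str,
}

impl WorkerStatusGuard {
    pub fn start(gauge: Arc<WorkerStatusGauge>, name: &'static str) -> Self {
        gauge.set_worker_status(name, true);
        Self { gauge, name }
    }

    /// Re-assert "running"; a worker would call this on each heartbeat tick.
    pub fn heartbeat(&self) {
        self.gauge.set_worker_status(self.name, true);
    }
}

impl Drop for WorkerStatusGuard {
    fn drop(&mut self) {
        self.gauge.set_worker_status(self.name, false);
    }
}

/// Toy worker loop: heartbeats `ticks` times, then shuts down cleanly.
/// The PR describes a 30s interval; the interval is a parameter here.
pub fn run_worker(
    gauge: Arc<WorkerStatusGauge>,
    name: &'static str,
    ticks: u32,
    interval: Duration,
) {
    let guard = WorkerStatusGuard::start(gauge, name);
    for _ in 0..ticks {
        guard.heartbeat();
        std::thread::sleep(interval);
    }
    // `guard` is dropped here, which sets the status back to 0.
}
```

A guard like this does not cover a process abort or a hung task, which is presumably why the alert configuration also treats missing data as alerting.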
@hartz0 Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits. You can now already apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀
Closes #962
This PR adds health monitoring for all background workers running as tokio tasks. Operators can now observe worker status in real time via Prometheus/Grafana and receive alerts when workers crash, hang, or stop unexpectedly.
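To make the alerting behaviour concrete, here is a simplified model of how a rule such as `BackgroundWorkerDown` might be evaluated. It is a sketch under assumptions, not Grafana's evaluator and not code from this PR: `AlertState` and `evaluate` are hypothetical names, and it assumes that missing data is treated the same as `worker_status == 0` and is subject to the same 60s hold period.

```rust
use std::time::Duration;

/// Simplified alert states for illustration only.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum AlertState {
    Ok,
    Pending,
    Firing,
}

/// Evaluate one worker's rule.
/// `latest` is the most recent gauge sample (None means no data was scraped),
/// `breached_for` is how long the condition has held continuously,
/// `hold` is the required duration before firing (60s in the PR).
pub fn evaluate(latest: Option<i64>, breached_for: Duration, hold: Duration) -> AlertState {
    // Assumption: "no data" counts as a breach, mirroring noDataState=alerting.
    let breached = matches!(latest, None | Some(0));
    if !breached {
        AlertState::Ok
    } else if breached_for >= hold {
        AlertState::Firing
    } else {
        AlertState::Pending
    }
}
```

Under this model a clean shutdown (status 0) and a crash that stops metric updates entirely (no data) both end up firing once the hold period has elapsed.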
Problem Statement
Background workers (email queue, blockchain sync, rate limit cleanup, newsletter cleanup) had no observable health metrics. When a worker crashed or hung, operators had no way to detect the issue from monitoring dashboards, requiring manual log analysis or user reports to discover problems.
Changes Made
1. Metrics Implementation (`services/api/src/metrics.rs`)
- `worker_status` IntGaugeVec metric with `name` label
- `1` = running, `0` = stopped
- `set_worker_status(name, running)` method for workers to report health

2. Email Queue Worker (`services/api/src/email/queue.rs`)
- `metrics` parameter to `start_worker()` method
- Status `1` on startup
- Status `0` on clean shutdown
- Worker name: `email_queue`

3. Blockchain Sync Worker (`services/api/src/blockchain.rs`)
- `run_sync_worker()` method
- `BlockchainClient`
- Status `1` on startup and `0` on shutdown
- Worker name: `blockchain_sync`

4. Blockchain Transaction Monitor (`services/api/src/blockchain.rs`)
- `run_transaction_monitor()` method
- Worker name: `blockchain_tx_monitor`

5. Rate Limiter Cleanup Task (`services/api/src/main.rs`)
- Worker name: `rate_limiter_cleanup`

6. Newsletter Cleanup Task (`services/api/src/main.rs`)
- Worker name: `newsletter_cleanup`

7. Alert Rules (`performance/config/alerts.yaml`)
- `worker_status == 0` for 1+ minute
- `noDataState: alerting` ensures crashes are detected even if metrics stop

8. Grafana Dashboard (`performance/config/grafana-dashboard.json`)