fix: add CacheCircuitBreakerOpen Prometheus alert (#966)#1015

Open
euniceamoni wants to merge 3 commits into
solutions-plug:mainfrom
euniceamoni:fix/966-cache-circuit-breaker-alert

@euniceamoni euniceamoni commented Jun 27, 2026

Contributor

What

Adds a Prometheus alert that fires when the Redis cache circuit breaker enters the open state. While the breaker is open, all cache reads fail open, which increases load on the database and RPC endpoints.

Changes

  • services/api/src/metrics.rs: Add cache_circuit_breaker_state IntGauge metric (0=closed, 1=open, 2=half_open) and its setter
  • services/api/src/handlers.rs: Update health handler to set the circuit breaker state metric on each request
  • performance/config/alerts.yaml: Add CacheCircuitBreakerOpen alert to the alert rule group
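
A minimal sketch of the state-to-gauge mapping described in the changes above. It is not the code from this PR: CircuitState, CircuitBreakerGauge, and set_state are hypothetical names, and a standard-library AtomicI64 stands in for the Prometheus IntGauge so the example has no external dependencies. Only the 0/1/2 encoding comes from the PR description.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Circuit breaker states (hypothetical enum; the real type is not shown in the PR).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CircuitState {
    Closed,
    Open,
    HalfOpen,
}

impl CircuitState {
    /// Numeric encoding from the PR description: 0=closed, 1=open, 2=half_open.
    pub fn as_gauge_value(self) -> i64 {
        match self {
            CircuitState::Closed => 0,
            CircuitState::Open => 1,
            CircuitState::HalfOpen => 2,
        }
    }
}

/// Assumption: stand-in for the cache_circuit_breaker_state IntGauge.
/// An AtomicI64 is used here only to keep the sketch dependency-free.
pub struct CircuitBreakerGauge {
    value: AtomicI64,
}

impl CircuitBreakerGauge {
    /// Starts at 0 (closed).
    pub const fn new() -> Self {
        Self { value: AtomicI64::new(0) }
    }

    /// Setter that a health handler could call on each request.
    pub fn set_state(&self, state: CircuitState) {
        self.value.store(state.as_gauge_value(), Ordering::Relaxed);
    }

    /// Current gauge value, as a scrape would observe it.
    pub fn get(&self) -> i64 {
        self.value.load(Ordering::Relaxed)
    }
}
```

Because the gauge is only refreshed when the health handler runs, its value is as fresh as the most recent health request; that trade-off follows from the change list above rather than from this sketch.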

Alert Details

Acceptance Criteria

  • Alert fires when the cache circuit breaker has been open for more than 2 minutes
  • Severity set to page for immediate operator response
  • Runbook link added to alert annotation
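
To illustrate the 2-minute hold in the criteria above, here is a simplified Rust model of how a duration-gated alert behaves: the condition must stay true continuously for the whole hold period before the alert fires, and any evaluation where it is false resets the timer. HoldAlert and evaluate are hypothetical names, the check "gauge value equals 1" is an assumption about the alert expression (the PR text does not show it), and the model fires once the hold duration has been reached. It is a sketch of the semantics, not Prometheus's implementation.

```rust
use std::time::Duration;

/// Simplified, hypothetical model of an alert rule with a hold duration.
pub struct HoldAlert {
    hold: Duration,
    /// Time at which the condition first became true, if it is currently true.
    active_since: Option<Duration>,
}

impl HoldAlert {
    pub fn new(hold: Duration) -> Self {
        Self { hold, active_since: None }
    }

    /// Evaluate the rule at `now` (time elapsed since some fixed start).
    /// Returns true when the alert is firing.
    pub fn evaluate(&mut self, now: Duration, gauge_value: i64) -> bool {
        // Assumption: the alert condition is "gauge equals 1 (open)".
        if gauge_value != 1 {
            // Condition false: drop any pending state.
            self.active_since = None;
            return false;
        }
        // Record when the condition first became true, then compare elapsed time.
        let since = *self.active_since.get_or_insert(now);
        now.saturating_sub(since) >= self.hold
    }
}
```

With a 120-second hold, evaluations at 0 s and 60 s with the breaker open leave the alert pending, and an evaluation at 120 s fires it. A single half-open or closed reading in between restarts the count.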

closes #966

…lug#960)

- Move ecs_tasks SG to root module to avoid circular dependencies
- ALB SG: restrict egress to container port → ecs_tasks SG only
- ecs_tasks SG: egress limited to 5432 (RDS), 6379 (Redis), 443 (AWS APIs)
- RDS SG: replace 10.0.0.0/8 ingress with ecs_tasks SG reference, remove broad egress
- Redis SG: replace 10.0.0.0/8 ingress with ecs_tasks SG reference, remove broad egress
- Add Checkov CI scan job that fails on HIGH/CRITICAL findings before terraform plan
…l and alert

- Add email_queue_depth IntGauge to Metrics struct
- Emit the gauge in the email worker loop after each dequeue cycle
- Also update the gauge on each /api/v1/email/queue/stats request
- Add Grafana panel for queue depth visualization
- Add Prometheus alert when queue depth exceeds 100 for 5 minutes

Resolves solutions-plug#961
- Add cache_circuit_breaker_state IntGauge metric (0=closed, 1=open, 2=half_open)
- Update health handler to set circuit breaker state metric
- Add CacheCircuitBreakerOpen alert with severity page and 2m for duration

Fixes solutions-plug#966

drips-wave Bot commented Jun 27, 2026

@euniceamoni Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits.

You can now apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀

Learn more about application limits

Development

Successfully merging this pull request may close these issues.

No Prometheus alert for cache circuit breaker open state