Commit fd025ed

CCM-14044 Adding eventpub anom alarms config
1 parent c49f0e5 commit fd025ed

4 files changed

Lines changed: 120 additions & 0 deletions

infrastructure/terraform/modules/eventpub/README.md

Lines changed: 46 additions & 0 deletions
@@ -1,3 +1,44 @@
+# EventPub Module
+
+## Overview
+
+The `eventpub` module provides a centralized event publishing infrastructure for NHS Notify bounded contexts. It creates an SNS topic with configurable subscribers (Lambda, Firehose, SQS) and includes comprehensive monitoring via CloudWatch alarms.
+
+```
+ ┌─────────────────┐
+ │ Service Lambda  │
+ │   (Publisher)   │
+ └────────┬────────┘
+          │ publishes to
+
+┌─────────────────────────┐
+│        SNS Topic        │
+│    (eventpub module)    │
+│                         │
+│  - Anomaly Detection    │
+│  - Delivery Logging     │
+│  - KMS Encryption       │
+└─────────┬───────────────┘
+          │ fan-out to:
+          ├─────────────────────────┐
+          │                         │
+          ▼                         ▼
+ ┌─────────────────┐       ┌──────────────────┐
+ │     Kinesis     │       │   EventBridge    │
+ │    Firehose     │       │      Rules       │
+ │      ↓ S3       │       │  ↓ Subscribers   │
+ │  (Event Cache)  │       │   (SQS/Lambda)   │
+ └─────────────────┘       └──────────────────┘
+          │                         │
+          ▼                         ▼
+ ┌─────────────────┐       ┌──────────────────┐
+ │   CloudWatch    │       │    CloudWatch    │
+ │  - DLQ Alarm    │       │  - Anomaly       │
+ │  - Delivery     │       │    Detection     │
+ │    Failures     │       │                  │
+ └─────────────────┘       └──────────────────┘
+```
+
 <!-- BEGIN_TF_DOCS -->
 <!-- markdownlint-disable -->
 <!-- vale off -->
@@ -19,6 +60,7 @@
 | <a name="input_default_tags"></a> [default\_tags](#input\_default\_tags) | Default tag map for application to all taggable resources in the module | `map(string)` | `{}` | no |
 | <a name="input_enable_event_cache"></a> [enable\_event\_cache](#input\_enable\_event\_cache) | Enable caching of events to an S3 bucket | `bool` | `false` | no |
 | <a name="input_enable_firehose_raw_message_delivery"></a> [enable\_firehose\_raw\_message\_delivery](#input\_enable\_firehose\_raw\_message\_delivery) | Enables raw message delivery on firehose subscription | `bool` | `false` | no |
+| <a name="input_enable_event_publishing_anomaly_detection"></a> [enable\_event\_publishing\_anomaly\_detection](#input\_enable\_event\_publishing\_anomaly\_detection) | Enable CloudWatch anomaly detection alarm for SNS message publishing. Detects abnormal drops or spikes in event publishing volume. | `bool` | `true` | no |
 | <a name="input_enable_sns_delivery_logging"></a> [enable\_sns\_delivery\_logging](#input\_enable\_sns\_delivery\_logging) | Enable SNS Delivery Failure Notifications | `bool` | `false` | no |
 | <a name="input_environment"></a> [environment](#input\_environment) | The name of the terraformscaffold environment the module is called for | `string` | n/a | yes |
 | <a name="input_event_cache_buffer_interval"></a> [event\_cache\_buffer\_interval](#input\_event\_cache\_buffer\_interval) | The buffer interval for data firehose | `number` | `500` | no |
@@ -31,6 +73,9 @@
 | <a name="input_log_retention_in_days"></a> [log\_retention\_in\_days](#input\_log\_retention\_in\_days) | The retention period in days for the Cloudwatch Logs events generated by the lambda function | `number` | n/a | yes |
 | <a name="input_name"></a> [name](#input\_name) | A unique name to distinguish this module invocation from others within the same CSI scope | `string` | n/a | yes |
 | <a name="input_project"></a> [project](#input\_project) | The name of the terraformscaffold project calling the module | `string` | n/a | yes |
+| <a name="input_event_publishing_anomaly_band_width"></a> [event\_publishing\_anomaly\_band\_width](#input\_event\_publishing\_anomaly\_band\_width) | The width of the anomaly detection band. Higher values (e.g., 4-6) reduce sensitivity and noise, lower values (e.g., 2-3) increase sensitivity. Recommended: 2-4 depending on traffic patterns. | `number` | `3` | no |
+| <a name="input_event_publishing_anomaly_evaluation_periods"></a> [event\_publishing\_anomaly\_evaluation\_periods](#input\_event\_publishing\_anomaly\_evaluation\_periods) | Number of evaluation periods for the publishing anomaly alarm. Each period is defined by event_publishing_anomaly_period. | `number` | `2` | no |
+| <a name="input_event_publishing_anomaly_period"></a> [event\_publishing\_anomaly\_period](#input\_event\_publishing\_anomaly\_period) | The period in seconds over which the specified statistic is applied for anomaly detection. Minimum 300 seconds (5 minutes). Recommended: 300-600 for event-driven workloads. | `number` | `300` | no |
 | <a name="input_region"></a> [region](#input\_region) | The AWS Region | `string` | n/a | yes |
 | <a name="input_sns_success_logging_sample_percent"></a> [sns\_success\_logging\_sample\_percent](#input\_sns\_success\_logging\_sample\_percent) | Enable SNS Delivery Successful Sample Percentage | `number` | `0` | no |
 ## Modules
## Modules
@@ -42,6 +87,7 @@

 | Name | Description |
 |------|-------------|
+| <a name="output_publishing_anomaly_alarm"></a> [publishing\_anomaly\_alarm](#output\_publishing\_anomaly\_alarm) | CloudWatch anomaly detection alarm details for SNS publishing |
 | <a name="output_s3_bucket_event_cache"></a> [s3\_bucket\_event\_cache](#output\_s3\_bucket\_event\_cache) | S3 Bucket ARN and Name for event cache |
 | <a name="output_sns_topic"></a> [sns\_topic](#output\_sns\_topic) | SNS Topic ARN and Name |
 <!-- vale on -->
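
A minimal sketch of how a caller might set the new inputs, assuming the variable names declared in `variables.tf` (which carry the `event_` prefix). The module label, the `source` path, and every value here are hypothetical placeholders, and any required inputs not visible in this diff are omitted.

```hcl
module "eventpub" {
  source = "../../modules/eventpub" # hypothetical path, adjust to the calling component

  # Required inputs visible in the README table; values are placeholders.
  project               = "example-project"
  environment           = "dev"
  region                = "eu-west-2"
  name                  = "example"
  log_retention_in_days = 30

  # Inputs added by this commit, shown with their documented defaults.
  enable_event_publishing_anomaly_detection   = true
  event_publishing_anomaly_period             = 300 # seconds
  event_publishing_anomaly_evaluation_periods = 2
  event_publishing_anomaly_band_width         = 3
}
```

With these defaults the alarm evaluates two consecutive 300-second periods, so a sustained deviation of roughly ten minutes would be needed before it changes state.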
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+resource "aws_cloudwatch_metric_alarm" "publishing_anomaly" {
+  count = var.enable_event_publishing_anomaly_detection ? 1 : 0
+
+  alarm_name          = "${local.csi}-sns-publishing-anomaly"
+  alarm_description   = "RELIABILITY: Anomaly detection alarm for abnormal SNS message publishing patterns. Detects unexpected drops or spikes in event publishing volume that may indicate service degradation or misconfiguration."
+  comparison_operator = "LessThanLowerOrGreaterThanUpperThreshold"
+  evaluation_periods  = var.event_publishing_anomaly_evaluation_periods # Number of evaluation periods for the publishing anomaly alarm.
+  threshold_metric_id = "ad1"
+  treat_missing_data  = "notBreaching"
+  actions_enabled     = true
+
+  tags = merge(
+    local.default_tags,
+    {
+      AlarmType    = "AnomalyDetection"
+      AlarmPurpose = "EventPublishingAbnormality"
+    }
+  )
+
+  metric_query {
+    id          = "m1"
+    return_data = true
+
+    metric {
+      metric_name = "NumberOfMessagesPublished"
+      namespace   = "AWS/SNS"
+      period      = var.event_publishing_anomaly_period # The period in seconds over which the specified statistic is applied for anomaly detection.
+      stat        = "Sum"
+
+      dimensions = {
+        TopicName = aws_sns_topic.main.name
+      }
+    }
+  }
+
+  metric_query {
+    id          = "ad1"
+    expression  = "ANOMALY_DETECTION_BAND(m1, ${var.event_publishing_anomaly_band_width})" # The width of the anomaly detection band. Higher values (e.g. 4-6) reduce sensitivity and noise, lower values (e.g. 2-3) increase sensitivity.
+    label       = "NumberOfMessagesPublished (expected)"
+    return_data = true
+  }
+}

infrastructure/terraform/modules/eventpub/outputs.tf

Lines changed: 8 additions & 0 deletions
@@ -13,3 +13,11 @@ output "s3_bucket_event_cache" {
     bucket = module.s3bucket_event_cache[0].bucket
   } : {}
 }
+
+output "publishing_anomaly_alarm" {
+  description = "CloudWatch anomaly detection alarm details for SNS publishing"
+  value = var.enable_event_publishing_anomaly_detection ? {
+    arn  = aws_cloudwatch_metric_alarm.publishing_anomaly[0].arn
+    name = aws_cloudwatch_metric_alarm.publishing_anomaly[0].alarm_name
+  } : null
+}
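
The alarm resource in this commit sets `actions_enabled = true` but declares no `alarm_actions`, which may be intentional if alarm state changes are routed elsewhere. One possible way for a caller to consume the new output is sketched below, assuming routing through an EventBridge rule. The `module.eventpub` label and both variables are hypothetical, and the target topic would also need a policy allowing EventBridge to publish to it. Because the output is `null` when the alarm is disabled, the sketch gates both resources on a flag that should mirror the one passed to the module.

```hcl
# Hypothetical: should match enable_event_publishing_anomaly_detection on the module call.
variable "enable_publishing_anomaly_routing" {
  type    = bool
  default = true
}

# Hypothetical: ARN of an SNS topic that receives alarm notifications.
variable "alarm_notification_topic_arn" {
  type = string
}

resource "aws_cloudwatch_event_rule" "publishing_anomaly" {
  count = var.enable_publishing_anomaly_routing ? 1 : 0

  name        = "eventpub-publishing-anomaly"
  description = "Matches the publishing anomaly alarm entering the ALARM state"

  event_pattern = jsonencode({
    source        = ["aws.cloudwatch"]
    "detail-type" = ["CloudWatch Alarm State Change"]
    resources     = [module.eventpub.publishing_anomaly_alarm.arn]
    detail        = { state = { value = ["ALARM"] } }
  })
}

resource "aws_cloudwatch_event_target" "publishing_anomaly" {
  count = var.enable_publishing_anomaly_routing ? 1 : 0

  rule = aws_cloudwatch_event_rule.publishing_anomaly[0].name
  arn  = var.alarm_notification_topic_arn
}
```

Note that this output returns `null` when disabled, whereas the neighbouring `s3_bucket_event_cache` output returns `{}`, so callers may need to handle the two cases differently.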

infrastructure/terraform/modules/eventpub/variables.tf

Lines changed: 24 additions & 0 deletions
@@ -129,3 +129,27 @@ variable "additional_policies_for_event_cache_bucket" {
   description = "A list of JSON policies to use to build the bucket policy"
   default     = []
 }
+
+variable "enable_event_publishing_anomaly_detection" {
+  type        = bool
+  description = "Enable CloudWatch anomaly detection alarm for SNS message publishing. Detects abnormal drops or spikes in event publishing volume."
+  default     = true
+}
+
+variable "event_publishing_anomaly_evaluation_periods" {
+  type        = number
+  description = "Number of evaluation periods for the publishing anomaly alarm. Each period is defined by event_publishing_anomaly_period."
+  default     = 2
+}
+
+variable "event_publishing_anomaly_period" {
+  type        = number
+  description = "The period in seconds over which the specified statistic is applied for anomaly detection. Minimum 300 seconds (5 minutes). Recommended: 300-600."
+  default     = 300
+}
+
+variable "event_publishing_anomaly_band_width" {
+  type        = number
+  description = "The width of the anomaly detection band. Higher values (e.g. 4-6) reduce sensitivity and noise, lower values (e.g. 2-3) increase sensitivity. Recommended: 2-4."
+  default     = 3
+}
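
The description of `event_publishing_anomaly_period` states a 300-second minimum, but nothing in this hunk enforces it. A possible way to make the documented bounds fail at plan time is sketched below with Terraform `validation` blocks. The conditions encode only what the descriptions say plus two assumptions labelled in the comments; they are not part of the commit.

```hcl
variable "event_publishing_anomaly_period" {
  type        = number
  description = "The period in seconds over which the specified statistic is applied for anomaly detection. Minimum 300 seconds (5 minutes). Recommended: 300-600."
  default     = 300

  validation {
    # The 300-second floor comes from the description above.
    # Assumption: the period should also be a whole number of minutes.
    condition     = var.event_publishing_anomaly_period >= 300 && var.event_publishing_anomaly_period % 60 == 0
    error_message = "event_publishing_anomaly_period must be at least 300 seconds and a multiple of 60."
  }
}

variable "event_publishing_anomaly_band_width" {
  type        = number
  description = "The width of the anomaly detection band. Higher values (e.g. 4-6) reduce sensitivity and noise, lower values (e.g. 2-3) increase sensitivity. Recommended: 2-4."
  default     = 3

  validation {
    # Assumption: a zero or negative band width is never meaningful.
    condition     = var.event_publishing_anomaly_band_width > 0
    error_message = "event_publishing_anomaly_band_width must be greater than 0."
  }
}
```

A caller passing `event_publishing_anomaly_period = 60` would then be rejected during `terraform plan` rather than producing an alarm that contradicts the documentation.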
