Skip to content

Merge latest changes from topic-release to main#216

Open
RLIndia wants to merge 29 commits into
mainfrom
topic-release
Open

Merge latest changes from topic-release to main#216
RLIndia wants to merge 29 commits into
mainfrom
topic-release

Conversation

@RLIndia
Copy link
Copy Markdown
Contributor

@RLIndia RLIndia commented Jun 1, 2026

  1. Deployment support for Egress Lambda
  2. New entries in settings-config.json
  3. Updated standardcatalogitems collection
  4. New upload-assets.sh script to upload items to template bucket.
  5. Fix: not able to attach cross account policy to rg portal role.
  6. Update the base account policy to modify the S3 bucket policy
  7. Updated PCS Quick Start template.

This change is Reviewable

Summary by CodeRabbit

  • New Features

    • Numerous new infrastructure templates for SageMaker, Bedrock, PCS (HPC), EC2 variants (Rocky NICE DCV, Windows Remote Desktop, Nginx) and quick-start clusters.
    • Egress archival service: serverless function to package/store S3 data with encrypted notifications and a secured endpoint.
    • Expanded product catalog and UI to surface new services and connect/actions (Bedrock, SageMaker AI, PCS, remote desktops, EC2 variants).
  • Chores

    • Added asset upload/deploy utilities and a deploy option to skip S3 uploads.
    • Updated runtime configuration and IAM/KMS permissions to support new templates and services.

Anusha-janardhan and others added 23 commits July 2, 2025 15:30
Added new Lambdafunction.yml to support Lambda function deployment for egress workflow
…to support new products

fix(bootstrap): restore S3 mounts check in bootstrap script
feat(settings): added settings configuration in settings-config.json to support new products
feat(catalog): remove deprecated catalog items and update Amazon Sagemaker name
… the command to pull the latest version of FUSE, and add permission to describe VPCs in the project account Role.
Update the base account policy to modify the S3 bucket policy, update the command to pull the latest version of FUSE, and add permission to describe VPCs in the project account Role.
Fix: not able to attach cross account policy to rg portal role.
…arameters; enhance bootstrap script for environment instance setup.
Update pcs_quick_lt.yaml: Add new IAM policy document and S3 mounts parameters; enhance bootstrap script for environment instance setup.
Merge updates from main to topic-release
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 1, 2026

Review Change Stack

Warning

Review limit reached

@RLIndia, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 1 minute and 35 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c33e204c-fc23-491d-90b2-a99e8ea5e8bb

📥 Commits

Reviewing files that changed from the base of the PR and between 74724d0 and de1b8af.

📒 Files selected for processing (1)
  • cft-templates/sagemakerUserProfile.yaml
📝 Walkthrough

Walkthrough

This PR expands the RGDeploy platform with multiple new cloud services: an S3-based Lambda file archival system, EC2 remote desktop and HPC cluster provisioning, SageMaker domain/user profile setup, and Bedrock AI integration. Updates include CloudFormation templates, Lambda functions, configuration registrations, and deployment automation.

Changes

Multi-Service Platform Expansion

Layer / File(s) Summary
Egress Store Lambda & S3 Integration
SRE/lambdaresources/index.mjs, SRE/lambdaresources/package.json, SRE/Egress/Setup-templates/egressresources.yml
Lambda streams S3 objects into ZIP archives and uploads to destination buckets. CloudFormation defines KMS-encrypted buckets, SNS topic, IAM execution role, Lambda Function (from S3) with Function URL (IAM auth), managed invoke policy, and outputs for created resource identifiers.
EC2 Service Templates
cft-templates/ec2-nginxWebServer.yml, cft-templates/ec2-rockyLinux.yml, cft-templates/windows-remotedesktop.yml
Three new EC2 provisioning templates supporting Nginx web server, Rocky Linux with NICE DCV, and Windows Remote Desktop. Each includes parameterized IAM roles, security groups, user-data bootstrap, and exported outputs.
AWS PCS HPC Cluster Templates
cft-templates/pcs.yaml, cft-templates/pcs_quick_lt.yaml
Full and quick-start PCS (Slurm) templates provisioning login/compute node groups with conditional Slurm accounting, IAM roles/instance profiles, EC2 Launch Templates, and console/ID outputs.
SageMaker Domain & User Profile Templates
cft-templates/awsSagemakerDomainAI.yaml, cft-templates/sagemakerUserProfile.yaml
SageMaker Domain template with VPC/networking and execution role; UserProfile template includes a Lambda-backed custom resource to create/manage SageMaker Spaces and JupyterLab apps with state polling.
Bedrock Execution Role
cft-templates/bedrock.yaml
CloudFormation template defining a Bedrock execution IAM role with managed and inline policies and exported role ARN.
IAM Policy & Role Updates
config/access-policy.json, config/access-secure-policy.json, rg_main_stack.yml
Adds ec2:DescribeVpcs to access policies; extends KMS key management actions and corrects role resource references in the main stack IAM policy.
Service Configuration & Catalog Updates
config/config.json, config/settings-config.json, dump/configs.json, dump/standardcatalogitems.json
Registers new services (bedrock, sagemaker-ai, aws-pcs, ec2-remote-desktop, ec2-vscode, ec2-igv, etc.), updates action/state definitions, introduces session/event trigger config keys, increases operational limits, and extends product catalog entries.
Deployment & Bootstrap Script Updates
deploy.sh, upload-assets.sh, scripts/bootstrap-scripts/bootstrap.sh
Added upload-assets.sh to stage artifacts to S3; deploy.sh adds SKIP_S3_COPY conditional to skip or validate S3 uploads; bootstrap script repositions S3 mount guard and updates FUSE installation for certain environments.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant LambdaURL as Egress Lambda Function URL
  participant EgressLambda as EgressLambda
  participant S3 as S3
  Client->>LambdaURL: POST {sourceBucket, sourcePrefix, destinationBucket, destinationPrefix}
  LambdaURL->>EgressLambda: Invoke (IAM auth)
  EgressLambda->>S3: ListObjectsV2(sourceBucket, sourcePrefix)
  alt no objects
    EgressLambda-->>Client: 404 Not Found
  else objects found
    EgressLambda->>EgressLambda: Create ZIP (archiver)
    EgressLambda->>S3: Upload ZIP stream to destinationBucket/destinationPrefix
    EgressLambda-->>Client: 201 Created
  end
Loading

🎯 4 (Complex) | ⏱️ ~75 minutes

🐰 New services hop into place,
CloudFormation stacks embrace the space,
Lambda zips files with grace,
HPC clusters find their trace,
SageMaker and Bedrock join the race!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Title check ❓ Inconclusive The title 'Merge latest changes from topic-release to main' is generic and does not clearly convey the main purpose of this substantial changeset, which includes egress Lambda deployment support, new product configurations, and multiple CloudFormation templates. Consider using a more descriptive title that captures the primary change, such as 'Add egress Lambda deployment and expand product ecosystem support' or 'Support egress Lambda and add new AWS product templates (PCS, Bedrock, SageMaker)' to better reflect the scope of changes.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch topic-release

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
config/settings-config.json (1)

1502-1524: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add config blocks for the newly listed services.

services now includes ec2-spyder and sagemaker-studio, but this file has no matching top-level definitions for either key. Any lookup by service name will resolve to undefined, so those products will not get action/state mappings in the UI.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@config/settings-config.json` around lines 1502 - 1524, The services array now
lists "ec2-spyder" and "sagemaker-studio" but there are no corresponding
top-level config objects for those service keys; add two new top-level entries
named "ec2-spyder" and "sagemaker-studio" that mirror the structure used by the
other service configs (include the same action/state mappings, displayName,
icon/key, and any required UI metadata) so lookups by service name return the
expected action/state mappings; ensure the new blocks follow the same property
names and schema as existing entries (e.g., how "ec2" or "sagemaker" are
defined) so the UI can consume them.
SRE/Egress/Setup-templates/egressresources.yml (1)

27-38: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Allow SNS to use the customer-managed KMS key.

In SRE/Egress/Setup-templates/egressresources.yml, EgressNotificationTopic is encrypted with EgressStoreEncryptionKey, but the CMK key policy currently grants only the account root principal. For SNS-encrypted topics, the sns.amazonaws.com service principal must be allowed kms:GenerateData* (via kms:GenerateDataKey*) and kms:Decrypt so publishes can succeed (scoped to the specific topic via a Condition).

Suggested fix
       KeyPolicy:
         Version: "2012-10-17"
         Id: egress-store-kms-policy
         Statement:
           - Sid: Allow root
             Effect: Allow
             Principal:
               AWS:
                 - !Sub "arn:aws:iam::${AWS::AccountId}:root"
             Action:
               - "kms:*"
             Resource: "*"
+          - Sid: AllowSnsToUseKeyForEgressTopic
+            Effect: Allow
+            Principal:
+              Service: sns.amazonaws.com
+            Action:
+              - "kms:Decrypt"
+              - "kms:GenerateDataKey*"
+            Resource: "*"
+            Condition:
+              StringEquals:
+                aws:SourceAccount: !Ref AWS::AccountId
+              ArnLike:
+                aws:SourceArn: !Ref EgressNotificationTopic
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@SRE/Egress/Setup-templates/egressresources.yml` around lines 27 - 38, The KMS
key policy (KeyPolicy on EgressStoreEncryptionKey) only allows the account root;
add a new Statement permitting the sns.amazonaws.com service principal to call
kms:GenerateDataKey* and kms:Decrypt for the EgressNotificationTopic.
Specifically, add a Statement under KeyPolicy that sets Principal.Service to
"sns.amazonaws.com", Action to ["kms:GenerateDataKey*", "kms:Decrypt"], Effect
to Allow, and include a Condition that scopes the permission to the specific
topic (e.g., StringEquals on the SNS encryption context or aws:SourceArn
referencing EgressNotificationTopic) so publishes to EgressNotificationTopic
succeed while remaining scoped to that topic.
🟠 Major comments (23)
cft-templates/bedrock.yaml-64-76 (1)

64-76: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Security: IAM role creation and PassRole permissions enable privilege escalation.

The combination of iam:CreateRole, iam:AttachRolePolicy, iam:PutRolePolicy, and iam:PassRole on Resource: "*" allows this role to create new IAM roles with arbitrary permissions and pass them to AWS services. This is a privilege escalation risk.

If Bedrock Agents or Knowledge Bases require creating service-linked roles, scope these permissions down to specific resource patterns (e.g., arn:aws:iam::${AWS::AccountId}:role/service-role/AmazonBedrock*). If these permissions are not required, remove them entirely.

🔒 Recommended fix to scope IAM permissions
               - Effect: Allow
                 Action:
                   - "iam:CreateRole"
                   - "iam:AttachRolePolicy"
                   - "iam:PutRolePolicy"
+                  - "iam:GetRole"
+                  - "iam:PassRole"
+                  - "iam:ListAttachedRolePolicies"
+                Resource:
+                  - !Sub "arn:aws:iam::${AWS::AccountId}:role/service-role/AmazonBedrock*"
+              - Effect: Allow
+                Action:
                   - "lambda:ListFunctions"
                   - "lambda:GetFunction"
                   - "lambda:InvokeFunction"
-                  - "iam:CreateRole"
-                  - "iam:GetRole"
-                  - "iam:PassRole"
-                  - "iam:ListAttachedRolePolicies"
                 Resource: "*"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cft-templates/bedrock.yaml` around lines 64 - 76, The IAM statement currently
grants iam:CreateRole, iam:AttachRolePolicy, iam:PutRolePolicy, iam:PassRole
(plus related iam:GetRole, iam:ListAttachedRolePolicies and lambda actions) with
Resource: "*", which allows privilege escalation; update the policy so that
iam:CreateRole/AttachRolePolicy/PutRolePolicy/PassRole are either removed if not
required or scoped to a narrow ARN pattern (for example only service-linked
Bedrock roles such as
arn:aws:iam::${AWS::AccountId}:role/service-role/AmazonBedrock*), keep read-only
actions (iam:GetRole, iam:ListAttachedRolePolicies, lambda:ListFunctions,
lambda:GetFunction, lambda:InvokeFunction) as needed, and ensure Resource
entries reference the specific role ARNs rather than "*" to eliminate broad role
creation/passing rights.
cft-templates/bedrock.yaml-29-34 (1)

29-34: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Security: Allowing account root to assume role is overly permissive.

Allowing the account root principal (arn:aws:iam::${AWS::AccountId}:root) to assume this role means any IAM identity in the account with sts:AssumeRole permission can assume this role. This violates the principle of least privilege and could lead to unauthorized access.

If cross-account or cross-service access is not required, consider restricting the trust policy to only bedrock.amazonaws.com. If specific IAM principals need access, explicitly list them instead of using the root principal.

🛡️ Recommended fix to remove root principal
       AssumeRolePolicyDocument:
         Version: "2012-10-17"
         Statement:
           - Effect: Allow
             Principal:
               Service: bedrock.amazonaws.com
             Action: sts:AssumeRole
-          - Effect: Allow
-            Principal:
-              AWS: !Sub "arn:aws:iam::${AWS::AccountId}:root"
-            Action:
-              - sts:AssumeRole
-              - sts:TagSession
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cft-templates/bedrock.yaml` around lines 29 - 34, The trust policy currently
allows the account root principal ("arn:aws:iam::${AWS::AccountId}:root") to
assume the role which is overly permissive; update the Principal section of the
role's trust policy to remove that root ARN and instead specify a narrow
principal such as Service: "bedrock.amazonaws.com" if only Bedrock needs access,
or explicitly list the specific IAM roles/users that require sts:AssumeRole and
sts:TagSession; locate the Principal block and the Action entries
(sts:AssumeRole, sts:TagSession) in the bedrock.yaml trust policy and replace
the AWS root principal with the appropriate Service principal or an explicit
list of allowed IAM principals.
cft-templates/bedrock.yaml-36-37 (1)

36-37: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Security: AmazonS3FullAccess grants unrestricted access to all S3 buckets.

The AmazonS3FullAccess managed policy grants full access to all S3 buckets and objects in the account. This is a significant security risk and violates the principle of least privilege. Bedrock workloads typically require access to specific S3 buckets for data sources, model artifacts, or outputs—not all buckets.

Replace this with a custom inline policy that grants access only to the specific S3 buckets needed for Bedrock operations.

🔒 Recommended fix with scoped S3 permissions
       ManagedPolicyArns:
         - arn:aws:iam::aws:policy/AmazonBedrockFullAccess
-        - arn:aws:iam::aws:policy/AmazonS3FullAccess
       Policies:
         - PolicyName: BedrockExecutionRolePolicy
           PolicyDocument:
             Version: "2012-10-17"
             Statement:
+              - Effect: Allow
+                Action:
+                  - "s3:GetObject"
+                  - "s3:PutObject"
+                  - "s3:DeleteObject"
+                Resource:
+                  - "arn:aws:s3:::your-bedrock-data-bucket/*"
+                  - "arn:aws:s3:::your-bedrock-model-bucket/*"
+              - Effect: Allow
+                Action:
+                  - "s3:ListBucket"
+                Resource:
+                  - "arn:aws:s3:::your-bedrock-data-bucket"
+                  - "arn:aws:s3:::your-bedrock-model-bucket"
               - Effect: Allow

Replace your-bedrock-data-bucket and your-bedrock-model-bucket with the actual bucket names or use parameters to make this configurable.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cft-templates/bedrock.yaml` around lines 36 - 37, Replace the overly broad
managed policy arn:aws:iam::aws:policy/AmazonS3FullAccess in the bedrock.yaml
policy list with a scoped inline IAM policy that grants least-privilege S3
actions only on the specific buckets used by Bedrock (e.g., GetObject,
PutObject, ListBucket on your-bedrock-data-bucket and
your-bedrock-model-bucket), keep the existing AmazonBedrockFullAccess entry, and
make bucket names configurable via CloudFormation parameters; update the role or
policy resource (the location where AmazonS3FullAccess is referenced) to attach
this inline policy instead of the managed ARN.
dump/configs.json-596-605 (1)

596-605: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove the duplicated chatAssistantConfig seed record.

This block repeats the same _id and key that already appear earlier in the dump. Re-importing this file will either fail with a duplicate-key error or leave seed state dependent on import behavior.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dump/configs.json` around lines 596 - 605, Remove the duplicated seed
document for the config with key "chatAssistantConfig" and _id
"663b604476964689564dc998" from the JSON dump so only one instance remains;
locate the object containing "key": "chatAssistantConfig" (and the matching
"$oid": "663b604476964689564dc998") and delete that entire object (and any
trailing comma it leaves) to avoid duplicate-key import errors.
dump/configs.json-606-612 (1)

606-612: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the standard value field name for kmsKeys.

This document stores its payload under Value, but the rest of the config dump uses value. If the loader reads value, this new setting will be ignored at runtime.

Suggested fix
-    "Value": [ ]
+    "value": []
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dump/configs.json` around lines 606 - 612, The document for key "kmsKeys"
uses the nonstandard field name "Value" instead of the expected "value", so the
config loader will ignore it; update the document payload for the "kmsKeys"
entry to use the lowercase "value" field (or add a "value" alias mirroring
"Value") so the loader finds the setting at runtime.
deploy.sh-228-308 (1)

228-308: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate bucket contents before honoring skip_s3_copy.

Skipping this block on a fresh or partially staged bucket leaves the templates, lambdas, bootstrap scripts, and seed data absent, but the rest of the deploy still assumes they exist. That turns the flag into a late deployment failure instead of a safe fast-path for repeat runs.

Suggested guard
 if ! [ "$SKIP_S3_COPY" = "true" ]; then
   echo "Copying deployment files to bucket $bucketname"
   ...
 else
-  echo "Skipping S3 copy as per user request"
+  required_objects=(
+    docker-compose.yml
+    nginx.conf
+    updatescripts.sh
+    config.tar.gz
+    scripts.tar.gz
+    dump.zip
+  )
+  for object in "${required_objects[@]}"; do
+    aws s3api head-object --bucket "$bucketname" --key "$object" >/dev/null 2>&1 || {
+      echo "Cannot skip S3 copy: s3://$bucketname/$object is missing."
+      exit 1
+    }
+  done
+  echo "Skipping S3 copy as per user request"
 fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy.sh` around lines 228 - 308, When SKIP_S3_COPY is true the script
currently skips uploading but does not validate that required objects exist in
the target bucket; add a validation step in the else branch that checks for the
presence of all required S3 objects (use bucketname and the same object keys
referenced in the upload section: docker-compose.yml, nginx.conf,
updatescripts.sh, cft-templates/ (sync target), config.tar.gz or config/*,
scripts.tar.gz, bootstrap-scripts/, the lambda zips, product tarballs and
dump.zip) by calling aws s3 ls or aws s3api head-object for each key; if any are
missing, print a clear error and exit non‑zero instructing the user to run
without SKIP_S3_COPY or re-run the upload so downstream steps relying on those
assets don’t fail.
upload-assets.sh-62-76 (1)

62-76: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Uploads lack error handling—partial failures go unnoticed.

If any aws s3 cp or aws s3 sync command fails (e.g., network timeout, permissions), the script continues and exits with success. This could leave the bucket in an inconsistent state.

Consider adding set -e near the top, or appending || exit 1 to critical upload commands, or wrapping uploads in a function that checks return codes.

Example: add errexit at script start
 #!/bin/bash
+set -euo pipefail
 
 # Verifying required utilities

This causes the script to exit immediately on any command failure.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@upload-assets.sh` around lines 62 - 76, The script performs many uploads (aws
s3 cp, aws s3 sync) and tar operations without checking failures, so partial
failures are possible; modify the script to fail-fast by enabling errexit (add
set -euo pipefail near the top) or wrap critical operations (aws s3 cp, aws s3
sync, tar -czf, and rm -f) in a helper function that checks the command exit
status and exits with a non‑zero code on failure (reference the aws s3 cp/aws s3
sync invocations and the tar -czf config.tar.gz / tar -tf / rm -f sequence and
variables localhome and bucketname when locating where to apply the checks).
upload-assets.sh-50-56 (1)

50-56: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Redundant wait after cloudformation deploy.

aws cloudformation deploy is synchronous by default—it already waits for the stack operation to complete before returning. The subsequent wait stack-create-complete is redundant and can fail if the stack was updated rather than created (wrong waiter). Check the deploy exit code directly instead.

Proposed fix
     echo "Bucket $bucketname does not exist. Creating it using CloudFormation..."
-    aws cloudformation deploy --template-file rg_deploy_bucket.yml --stack-name "$bucketstackname" \
-        --parameter-overrides S3NewBucketName="$bucketname"
-    aws cloudformation wait stack-create-complete --stack-name "$bucketstackname"
-    if [ $? -ne 0 ]; then
+    if ! aws cloudformation deploy --template-file rg_deploy_bucket.yml --stack-name "$bucketstackname" \
+        --parameter-overrides S3NewBucketName="$bucketname"; then
         echo "Error: Failed to create S3 bucket via CloudFormation."
         exit 1
     fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@upload-assets.sh` around lines 50 - 56, The script currently calls "aws
cloudformation deploy" then redundantly runs "aws cloudformation wait
stack-create-complete" which can be incorrect for updates; remove the waiter
call and instead check the exit status of the deploy command (the "aws
cloudformation deploy --template-file rg_deploy_bucket.yml --stack-name
\"$bucketstackname\" --parameter-overrides S3NewBucketName=\"$bucketname\""
line). After the deploy command, test its exit code (e.g. if [ $? -ne 0 ] || use
"|| exit 1") and emit the existing error message/exit if it failed; delete the
"aws cloudformation wait stack-create-complete --stack-name
\"$bucketstackname\"" block and its associated if check.
cft-templates/ec2-nginxWebServer.yml-85-98 (1)

85-98: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing CreationPolicy renders cfn-signal ineffective.

The EC2 instance calls cfn-signal at line 98, but there is no CreationPolicy defined on the resource. Without it, CloudFormation will not wait for the signal and will mark the stack as complete before the bootstrap script finishes. Compare with ec2-rockyLinux.yml which correctly defines a CreationPolicy with ResourceSignal.

Proposed fix to add CreationPolicy
   EC2Instance:
     Type: AWS::EC2::Instance
+    CreationPolicy:
+      ResourceSignal:
+        Timeout: PT15M
     Properties:
       UserData:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cft-templates/ec2-nginxWebServer.yml` around lines 85 - 98, The EC2Instance
resource is missing a CreationPolicy so the /opt/aws/bin/cfn-signal call in the
UserData will be ignored; add a CreationPolicy to the EC2Instance resource with
a ResourceSignal block (e.g., set Count to 1 and an appropriate Timeout like
PT15M) so CloudFormation waits for the cfn-signal from the instance; update the
EC2Instance resource's properties to include CreationPolicy and ResourceSignal
to match the pattern used in ec2-rockyLinux.yml.
cft-templates/ec2-rockyLinux.yml-62-76 (1)

62-76: ⚠️ Potential issue | 🟠 Major | ⚖️ Poor tradeoff

Overly permissive SSM and KMS policy should be scoped down.

The policy grants ssm:GetParameter, ssm:PutParameter, and KMS actions on Resource: "*". This allows the instance to read/write any SSM parameter and use any KMS key in the account. The instance only needs access to /RL/RG/nice-dcv/auth-token/* parameters.

Suggested scope for SSM resources
           - Sid: AllowSSMParamActions
             Effect: Allow
             Action:
               - ssm:GetParameter
               - ssm:PutParameter
               - ssm:DescribeParameters
-            Resource: "*"
+            Resource: !Sub "arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter/RL/RG/nice-dcv/auth-token/*"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cft-templates/ec2-rockyLinux.yml` around lines 62 - 76, The policy statements
Sid: AllowSSMParamActions and Sid: AllowAccessToEncryptionKeys are overly
permissive (Resource: "*"); restrict Sid: AllowSSMParamActions to the exact SSM
parameter ARNs for the instance (e.g.
arn:aws:ssm:<region>:<account-id>:parameter/RL/RG/nice-dcv/auth-token/*) instead
of "*", and restrict Sid: AllowAccessToEncryptionKeys to only the specific KMS
key ARN(s) used to encrypt those parameters (replace "*" with
arn:aws:kms:<region>:<account-id>:key/<key-id>), keeping necessary actions
(kms:Decrypt, kms:GenerateDataKey, kms:DescribeKey) but scoped to that key;
ensure the parameter ARNs and key ARNs use the correct region and account values
for your deployment.
cft-templates/windows-remotedesktop.yml-71-91 (1)

71-91: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing CreationPolicy renders cfn-signal ineffective.

The EC2 instance calls cfn-signal.exe at line 87, but there is no CreationPolicy defined. CloudFormation will mark the stack as complete immediately after launching the instance, without waiting for the user data script to finish.

Proposed fix to add CreationPolicy
   EC2Instance:
     Type: AWS::EC2::Instance
+    CreationPolicy:
+      ResourceSignal:
+        Timeout: PT15M
     Properties:
       UserData:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cft-templates/windows-remotedesktop.yml` around lines 71 - 91, The
EC2Instance resource uses cfn-signal but lacks a CreationPolicy so
CloudFormation won't wait for the signal; add a CreationPolicy block to the
EC2Instance resource with a ResourceSignal specifying Count: 1 and an
appropriate Timeout (e.g. PT15M) so the stack waits for the cfn-signal.exe call,
and ensure the signal's --resource value matches the EC2Instance logical name
used in the template.
cft-templates/windows-remotedesktop.yml-41-44 (1)

41-44: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Overly broad SSM permissions grant full Systems Manager access.

The policy grants ssm:* on all resources, which allows the instance to perform any SSM action across the entire account. This violates the principle of least privilege. Scope the permissions to only the required actions and resources.

Suggested fix to scope permissions
       PolicyDocument:
         Version: "2012-10-17"
         Statement:
           - Effect: Allow
             Action:
-              - ssm:*
-            Resource: "*"
+              - ssm:GetParameter
+              - ssm:PutParameter
+            Resource: !Sub "arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter/RL/RG/nice-dcv/auth-token/*"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cft-templates/windows-remotedesktop.yml` around lines 41 - 44, The IAM policy
currently grants overly broad Systems Manager permissions via Action: "ssm:*"
and Resource: "*" — replace this with a scoped set of required SSM actions
(e.g., "ssm:SendCommand", "ssm:GetCommandInvocation",
"ssm:DescribeInstanceInformation", "ssm:StartSession" only if needed) and narrow
Resource ARNs instead of "*", or restrict to the specific instance(s) by using
the instance profile/instance ARN or parameter ARNs (e.g.,
"arn:aws:ec2:...:instance/INSTANCE_ID" or
"arn:aws:ssm:REGION:ACCOUNT_ID:parameter/...") so the Role only has
least-privilege access rather than "ssm:*" on all resources.
cft-templates/pcs_quick_lt.yaml-68-86 (1)

68-86: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Hardcoded VPC, subnet, and security group IDs will fail in other accounts.

Parameters VPC, DefaultPrivateSubnet, SecurityGroup, DefaultPublicSubnet, and VpcDefaultSecurityGroupId have account-specific default values that won't exist in other AWS accounts or regions.

Proposed fix - Remove defaults or make required
  VPC:
    Description: VPC to launch the Cluster nodes
    Type: String
-   Default: vpc-0e80ca11ecebc561d
+   AllowedPattern: "^vpc-[a-z0-9]+$"

  DefaultPrivateSubnet:
    Description: Private subnet for the PCluster
    Type: String
-   Default: subnet-005ed22f10a495974
+   AllowedPattern: "^subnet-[a-z0-9]+$"

Apply similar changes to SecurityGroup, DefaultPublicSubnet, and VpcDefaultSecurityGroupId.

Also applies to: 112-115

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cft-templates/pcs_quick_lt.yaml` around lines 68 - 86, The CloudFormation
template currently hardcodes account-specific defaults for parameters VPC,
DefaultPrivateSubnet, SecurityGroup, DefaultPublicSubnet, and
VpcDefaultSecurityGroupId; remove these static Default values (or mark the
parameters as Required/no Default) so the template doesn't assume specific AWS
resource IDs and will fail in other accounts/regions—update the parameter blocks
for VPC, DefaultPrivateSubnet, SecurityGroup, DefaultPublicSubnet, and
VpcDefaultSecurityGroupId to omit the Default field (or set no Default and
ensure callers supply values) and adjust any code that expects defaults to
validate required inputs.
cft-templates/pcs_quick_lt.yaml-507-507 (1)

507-507: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Hardcoded AMI IDs are region-specific and not configurable.

Both node groups use ami-08608e2b2243c1f1b which is region-specific and will fail in other regions. Unlike pcs.yaml, this template lacks an AmiId parameter for override.

Proposed fix - Add AmiId parameter and use it

Add to Parameters section:

  AmiId:
    Type: String
    Description: AMI ID to use for PCS cluster nodes

Then update node groups:

-     AmiId: ami-08608e2b2243c1f1b
+     AmiId: !Ref AmiId

Also applies to: 528-528

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cft-templates/pcs_quick_lt.yaml` at line 507, Add a new CloudFormation
parameter named AmiId (Type: String, Description: AMI ID to use for PCS cluster
nodes) and replace the hardcoded ami-08608e2b2243c1f1b occurrences in both node
group definitions with a reference to that parameter (use !Ref AmiId or
{"Ref":"AmiId"} in the AmiId property of the node group resources) so the AMI is
configurable per-deployment.
cft-templates/pcs.yaml-546-546 (1)

546-546: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Hardcoded AMI IDs ignore the AmiId parameter.

The AmiId parameter is defined (line 139) but both PCSNodeGroupLogin and PCSNodeGroupCompute use a hardcoded AMI ID (ami-0bf564070da947c48). This AMI is region-specific and will fail in other regions.

Proposed fix
  PCSNodeGroupLogin:
    Type: AWS::PCS::ComputeNodeGroup
    Properties:
      ...
-     AmiId: ami-0bf564070da947c48
+     AmiId: !Ref AmiId
      InstanceConfigs:

Apply the same change to PCSNodeGroupCompute at line 567.

Also applies to: 567-567

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cft-templates/pcs.yaml` at line 546, Both node group resources are using a
hardcoded AMI ID instead of the template parameter; update the
EC2/AutoScaling/LaunchTemplate property for PCSNodeGroupLogin and
PCSNodeGroupCompute to reference the template parameter AmiId (not the literal
"ami-0bf564070da947c48") so the AMI is region-agnostic; replace the hardcoded
value with a CloudFormation parameter reference (using Ref or !Ref to AmiId) in
the LaunchTemplate/InstanceLaunchTemplate/LaunchTemplateData/ImageId (or
equivalent ImageId field) for both PCSNodeGroupLogin and PCSNodeGroupCompute.
cft-templates/pcs.yaml-194-197 (1)

194-197: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Hardcoded security group ID will fail in other AWS accounts.

The default value sg-00db64cee7ba85cb2 is account-specific and will cause stack creation failures when deployed to different accounts or environments.

Proposed fix
  VpcDefaultSecurityGroupId:
    Type: AWS::EC2::SecurityGroup::Id
    Description: Cluster VPC 'default' security group. Make sure you choose the one from your cluster VPC!
-   Default: sg-00db64cee7ba85cb2 # Default from main stack's SecurityGroup param

Remove the default or use a dynamic lookup via a Lambda-backed custom resource.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cft-templates/pcs.yaml` around lines 194 - 197, The parameter
VpcDefaultSecurityGroupId is hardcoded to an account-specific value; remove the
Default: sg-00db64cee7ba85cb2 and either (A) make VpcDefaultSecurityGroupId a
required parameter with no Default so callers supply the correct SG for their
account, or (B) implement a Lambda-backed custom resource that runs
DescribeSecurityGroups filtered by VpcId and group-name "default" and returns
the SecurityGroupId, then have the template set VpcDefaultSecurityGroupId from
that custom resource output. Ensure the custom resource Lambda uses the cluster
VpcId input (or the existing Vpc parameter) and expose the resulting value for
any resources that reference VpcDefaultSecurityGroupId.
cft-templates/pcs_quick_lt.yaml-302-308 (1)

302-308: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Overly permissive KMS policy grants full access to all keys in the account.

kms:* on Resource: "*" allows instances to decrypt, delete, or modify any KMS key in the account. This violates least-privilege and poses a significant security risk if an instance is compromised.

Proposed fix - Scope to required actions
        - PolicyDocument:
            Version: "2012-10-17"
            Statement:
-             - Action: "kms:*"
+             - Action:
+                 - kms:Decrypt
+                 - kms:GenerateDataKey
+                 - kms:DescribeKey
                Effect: Allow
-               Resource: "*"
+               Resource: !Sub "arn:aws:kms:${AWS::Region}:${AWS::AccountId}:key/*"
          PolicyName: PcsKmsFullAccessPolicy

Scope to only the required KMS actions (e.g., Decrypt, GenerateDataKey) and restrict to specific key ARNs if possible.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cft-templates/pcs_quick_lt.yaml` around lines 302 - 308, The
PcsKmsFullAccessPolicy currently grants overly broad permissions (PolicyDocument
with Action "kms:*" and Resource "*"); replace this with a least-privilege
policy that lists only required KMS actions (e.g., "kms:Decrypt",
"kms:GenerateDataKey", "kms:Encrypt" as needed) and narrow the Resource to the
specific KMS key ARNs used by this stack (or use a parameter like PcsKmsKeyArn)
instead of "*"; update the PolicyName block to reference the tightened
PolicyDocument and ensure any callers (roles or instance profiles) still
reference PcsKmsFullAccessPolicy after the change.
cft-templates/sagemakerUserProfile.yaml-323-338 (1)

323-338: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Delete flow will timeout if app/space doesn't exist or is already deleted.

The wait_for_status function catches ClientError (including ResourceNotFound) but continues polling indefinitely. When waiting for "Deleted" status during delete, if the resource is already gone, describe_app/describe_space will throw ResourceNotFound, causing a 600-second timeout instead of succeeding immediately.

🐛 Proposed fix to handle ResourceNotFound as deleted
          def wait_for_status(describe_fn, status_key, desired, max_wait=600, poll_interval=10, **kwargs):
              """Generic waiter for any SageMaker resource status"""
              start = time.time()
              while time.time() - start < max_wait:
                  try:
                      resp = describe_fn(**kwargs)
                      status = resp[status_key]
                      print(f"Current {status_key}: {status}")
                      if status == desired:
                          return True
                      if status == "Failed":
                          raise Exception(f"Resource failed with status 'Failed'")
                  except ClientError as e:
+                     if e.response["Error"]["Code"] == "ResourceNotFound":
+                         if desired == "Deleted":
+                             print("Resource not found - treating as deleted")
+                             return True
+                         raise
                      print(f"Error while waiting: {e}")
                  time.sleep(poll_interval)
              raise Exception(f"Timeout waiting for {status_key}={desired}")

Also applies to: 431-435

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cft-templates/sagemakerUserProfile.yaml` around lines 323 - 338, The waiter
wait_for_status currently swallows ClientError and keeps polling, causing delete
flows to timeout when describe_* raises ResourceNotFound; update wait_for_status
(the function named wait_for_status used by describe_app/describe_space callers)
to detect ClientError with Error Code "ResourceNotFound" and, if the desired
status is "Deleted", return True immediately; for other ClientErrors preserve
existing behavior (log and continue or re-raise as appropriate) so only
ResourceNotFound during a delete returns success.
cft-templates/sagemakerUserProfile.yaml-387-402 (1)

387-402: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

App creation is not idempotent—Update will fail if app exists.

The Space creation checks for existence first (line 366), but app creation unconditionally calls create_app without checking if default app already exists. On CloudFormation Update, this will fail with ResourceInUseException.

🐛 Proposed fix to add idempotent app creation
                      # --- Create App ---
+                     try:
+                         sm.describe_app(
+                             DomainId=domain_id,
+                             SpaceName=space_name,
+                             AppType="JupyterLab",
+                             AppName="default"
+                         )
+                         print("App 'default' already exists")
+                     except ClientError as e:
+                         if e.response["Error"]["Code"] == "ResourceNotFound":
                              print(f"Creating JupyterLab app 'default' with tags")
                              sm.create_app(
                                  DomainId=domain_id,
                                  SpaceName=space_name,
                                  AppType="JupyterLab",
                                  AppName="default",
                                  ResourceSpec={
                                      "InstanceType": instance_type,
                                      "SageMakerImageArn": f"arn:aws:sagemaker:{region}:081325390199:image/sagemaker-distribution-cpu"
                                  },
                                  Tags=[
                                      *common_tags,
                                      {"Key": "InstanceType", "Value": instance_type}
                                  ]
                              )
+                         else:
+                             raise
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cft-templates/sagemakerUserProfile.yaml` around lines 387 - 402, The app
creation is not idempotent: before calling sm.create_app(...) for
AppName="default" (using DomainId=domain_id, SpaceName=space_name,
AppType="JupyterLab"), first check whether that app already exists (e.g., call
sm.list_apps or sm.describe_app filtered by DomainId/SpaceName/AppName/AppType)
and only call create_app if it is not present; alternatively catch
ResourceInUseException around sm.create_app and treat it as success (log and
continue). Update the logic around the create_app call so it performs the
existence check or exception handling to make app creation idempotent.
SRE/lambdaresources/package.json-5-5 (1)

5-5: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix package.json entrypoint (main) to match existing file
SRE/lambdaresources/package.json sets "main": "index.mjs.mjs", but SRE/lambdaresources/ only contains index.mjs (and start runs node index.mjs), so the manifest entrypoint is inconsistent and likely breaks tooling that reads main.

Suggested fix
-  "main": "index.mjs.mjs",
+  "main": "index.mjs",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@SRE/lambdaresources/package.json` at line 5, The package.json "main" field is
incorrect ("index.mjs.mjs") and must point to the actual entry file; open
package.json and update the "main" property to "index.mjs" so it matches the
existing file and the start script that runs node index.mjs; verify no other
references expect the erroneous name.
SRE/lambdaresources/index.mjs-33-35 (1)

33-35: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t throw from the archiver 'error' listener; propagate via a Promise.

The throw inside archive.on('error', ...) won’t be reliably caught by the surrounding async handler’s try/catch, so failures may bypass your 500 response (or crash with an unhandled error). Reject a Promise from the listener and await it alongside the upload.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@SRE/lambdaresources/index.mjs` around lines 33 - 35, The archiver error
handler currently throws inside archive.on('error', ...) which can escape the
async try/catch; instead create a Promise (e.g., const archiveErrorPromise = new
Promise((_, reject) => archive.on('error', reject))) and reject from the
listener, then await that Promise together with the existing upload promise (or
use Promise.race/Promise.all) in the async flow so archiver errors are
propagated as rejections you can catch; update references to archive.on('error',
...) and any upload awaiting logic to await the new archiveErrorPromise.
SRE/lambdaresources/index.mjs-18-26 (1)

18-26: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Paginate listObjectsV2 results and guard for missing Contents before building the ZIP.

listObjectsV2 returns up to 1,000 keys per call; without pagination (IsTruncated/NextContinuationToken), objects beyond the first page won’t be added. Also, on empty prefixes Contents can be missing, so listedObjects.Contents.length and iterating listedObjects.Contents can fail.

Suggested fix
         const listParams = {
             Bucket: sourceBucket,
             Prefix: sourcePrefix
         };
-        const listedObjects = await s3.listObjectsV2(listParams).promise();
+        const contents = [];
+        let continuationToken;
+
+        do {
+            const page = await s3.listObjectsV2({
+                ...listParams,
+                ContinuationToken: continuationToken
+            }).promise();
+            contents.push(...(page.Contents ?? []));
+            continuationToken = page.IsTruncated ? page.NextContinuationToken : undefined;
+        } while (continuationToken);
 
-        if (listedObjects.Contents.length === 0) {
+        if (contents.length === 0) {
             console.log('No files found in the source folder');
             return { statusCode: 404, body: 'No files found in the source folder' };
         }
@@
-        for (const obj of listedObjects.Contents) {
+        for (const obj of contents) {
             const fileKey = obj.Key;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@SRE/lambdaresources/index.mjs` around lines 18 - 26, The code calling
s3.listObjectsV2 (using listParams/listedObjects) must be converted to a
paginated loop that follows IsTruncated/NextContinuationToken and accumulates
all returned objects into a single array before building the ZIP; also guard for
missing or undefined listedObjects.Contents by treating it as an empty array
(e.g., default to []) and only proceeding to ZIP if the accumulated array length
> 0. Update the block that checks listedObjects.Contents.length and the
iteration over listedObjects.Contents to use the accumulated results and handle
the case where Contents is absent.
SRE/lambdaresources/index.mjs-7-13 (1)

7-13: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard request parsing and fix pagination + archiver error propagation

  • Move JSON.parse(event.body) and required-field validation inside the try block so malformed/missing bodies return a controlled 400 instead of failing before error handling.
  • listObjectsV2 is only called once; if there are more than one page of keys, the code will zip an incomplete set—paginate using IsTruncated + NextContinuationToken.
  • Throwing inside archive.on('error', ...) won’t be reliably caught by the outer try/catch; route the error event into the awaited Promise path (e.g., reject a wrapper Promise).
Suggested fix
 export const handler = async (event) => {
-    const reqBody = JSON.parse(event.body);
-    const sourceBucket = reqBody.sourceBucket;
-    const sourcePrefix = reqBody.sourcePrefix;
-    const destinationBucket = reqBody.destinationBucket;
-    const destinationPrefix = reqBody.destinationPrefix;
-
     try {
+        const reqBody =
+            typeof event.body === 'string'
+                ? JSON.parse(event.body)
+                : (event.body ?? {});
+        const { sourceBucket, sourcePrefix, destinationBucket, destinationPrefix } = reqBody;
+
+        if (!sourceBucket || !sourcePrefix || !destinationBucket || !destinationPrefix) {
+            return {
+                statusCode: 400,
+                headers: { "Content-Type": "application/json" },
+                body: JSON.stringify({
+                    message: "sourceBucket, sourcePrefix, destinationBucket, and destinationPrefix are required"
+                }),
+                isBase64Encoded: false
+            };
+        }
+
         console.log(`Zipping and uploading files from ${sourcePrefix} to ${destinationPrefix}`);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@SRE/lambdaresources/index.mjs` around lines 7 - 13, Move
JSON.parse(event.body) and required-field checks into the try block inside
handler so malformed or missing bodies produce a controlled 400 response;
implement full pagination when gathering object keys by repeatedly calling
listObjectsV2 using IsTruncated and NextContinuationToken until all pages are
fetched (update the logic that currently calls listObjectsV2 once to accumulate
keys across pages); and convert the archiver error event handling so
archive.on('error', ...) rejects a Promise that the handler awaits (e.g., wrap
the archiver streaming/finish/error events in a Promise and reject on error)
instead of throwing inside the event handler, so errors propagate into the outer
try/catch.
🟡 Minor comments (6)
cft-templates/bedrock.yaml-66-72 (1)

66-72: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove duplicate iam:CreateRole action.

The action iam:CreateRole appears twice in the statement (lines 66 and 72). Remove the duplicate to clean up the policy.

🧹 Proposed fix
               - Effect: Allow
                 Action:
-                  - "iam:CreateRole"
                   - "iam:AttachRolePolicy"
                   - "iam:PutRolePolicy"
                   - "lambda:ListFunctions"
                   - "lambda:GetFunction"
                   - "lambda:InvokeFunction"
                   - "iam:CreateRole"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cft-templates/bedrock.yaml` around lines 66 - 72, The policy statement
contains a duplicated action "iam:CreateRole"; remove the second occurrence of
"iam:CreateRole" from the actions array so each IAM action (e.g.,
"iam:CreateRole", "iam:AttachRolePolicy", "iam:PutRolePolicy",
"lambda:ListFunctions", "lambda:GetFunction", "lambda:InvokeFunction") appears
only once in the statement.
dump/standardcatalogitems.json-1531-1534 (1)

1531-1534: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Point the Nginx catalog item at its own docs.

The DetailsLink for Nginx Web Server still targets the RStudio page, so users will be sent to the wrong documentation from the catalog.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dump/standardcatalogitems.json` around lines 1531 - 1534, The DetailsLink for
the "Nginx Web Server" catalog item is incorrectly pointing to the RStudio docs;
update the JSON entry with Key "DetailsLink" for the Nginx item so its Value
points to the Nginx documentation URL (replace the current RStudio URL with the
Nginx docs page, e.g. the project's nginx docs path) to ensure the catalog links
to the correct documentation.
config/settings-config.json-1068-1073 (1)

1068-1073: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the IGV icon path.

Line 1072 uses ./../assets/images/Screen-icon.png, while the rest of the file uses ../../assets/.... This one resolves differently and is likely to break the running-state icon for IGV only.

Suggested fix
-            "imageUrl": "./../assets/images/Screen-icon.png"
+            "imageUrl": "../../assets/images/Screen-icon.png"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@config/settings-config.json` around lines 1068 - 1073, The imageUrl under the
"running" -> "connect" entry uses a wrong relative path
("./../assets/images/Screen-icon.png"); update the "imageUrl" value for that
object to use the consistent "../../assets/images/Screen-icon.png" path so it
matches the rest of the file and resolves correctly (look for the "running"
object and its "connect" array containing the "menu": "Remote Desktop" entry and
change its "imageUrl").
upload-assets.sh-7-12 (1)

7-12: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Missing exit code on error path.

When a required utility is not found, exit should include an explicit non-zero code to signal failure to calling scripts or CI pipelines.

Proposed fix
 	if ! command -v "$program" >/dev/null 2>&1; then
 		echo "$program not found. This Script needs jq and aws cli. Please install the application/s and restart deployment, Exiting."
-		exit
+		exit 1
 	else
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@upload-assets.sh` around lines 7 - 12, The failure path in upload-assets.sh
uses a bare exit which returns 0; update the error handling in the block that
checks the variable program (the `if ! command -v "$program" ...` branch) to
call exit with a non-zero code (e.g., `exit 1`) so calling scripts/CI receive a
failure status; keep the existing error message (optionally redirect it to
stderr) and leave the success branch (`echo "$program found"`) unchanged.
scripts/bootstrap-scripts/bootstrap.sh-89-90 (1)

89-90: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use POSIX-compliant || instead of -o in test expression.

The -o operator inside [ ] is not well defined by POSIX and can behave unexpectedly with certain string values.

Proposed fix
 # Exit if no S3 mounts were specified
-[ -z "$S3_MOUNTS" -o "$S3_MOUNTS" = "[]" ] && exit 0
+[ -z "$S3_MOUNTS" ] || [ "$S3_MOUNTS" = "[]" ] && exit 0
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/bootstrap-scripts/bootstrap.sh` around lines 89 - 90, Replace the
non-POSIX test using "-o" with a POSIX-compliant compound test: evaluate each
predicate separately and combine them with the shell OR operator. For the
S3_MOUNTS check, change the line using [ -z "$S3_MOUNTS" -o "$S3_MOUNTS" = "[]"
] to a form like if [ -z "$S3_MOUNTS" ] || [ "$S3_MOUNTS" = "[]" ]; then exit 0;
fi (or use ([ -z "$S3_MOUNTS" ] || [ "$S3_MOUNTS" = "[]" ]) && exit 0) so the
S3_MOUNTS empty or equals "[]" condition is evaluated portably.
cft-templates/sagemakerUserProfile.yaml-311-311 (1)

311-311: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Upgrade Lambda runtime to a supported Python version (python3.9 → python3.11/3.12)

cft-templates/sagemakerUserProfile.yaml sets Runtime: python3.9 (line 311). AWS Lambda Python 3.9 reached end-of-support on Dec 15, 2025; AWS blocks new function creation starting Feb 1, 2027 and updates starting Mar 3, 2027. Upgrade to python3.11 or python3.12 for continued security/support.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cft-templates/sagemakerUserProfile.yaml` at line 311, The template currently
sets the Lambda runtime to "python3.9" (Runtime: python3.9); update that Runtime
property in the AWS::Lambda::Function resource to a supported value such as
"python3.11" or "python3.12", then run a deployment/build to ensure any
dependencies, layers, or packaging (e.g., compiled wheels) are compatible and
adjust them if necessary.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8b61b836-60f5-4e51-af4f-fd4a292bdc62

📥 Commits

Reviewing files that changed from the base of the PR and between af56d84 and 5f71b66.

⛔ Files ignored due to path filters (1)
  • SRE/lambdaresources/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (21)
  • SRE/Egress/Setup-templates/egressresources.yml
  • SRE/lambdaresources/index.mjs
  • SRE/lambdaresources/package.json
  • cft-templates/awsSagemakerDomainAI.yaml
  • cft-templates/bedrock.yaml
  • cft-templates/ec2-nginxWebServer.yml
  • cft-templates/ec2-rockyLinux.yml
  • cft-templates/pcs.yaml
  • cft-templates/pcs_quick_lt.yaml
  • cft-templates/sagemakerUserProfile.yaml
  • cft-templates/windows-remotedesktop.yml
  • config/access-policy.json
  • config/access-secure-policy.json
  • config/config.json
  • config/settings-config.json
  • deploy.sh
  • dump/configs.json
  • dump/standardcatalogitems.json
  • rg_main_stack.yml
  • scripts/bootstrap-scripts/bootstrap.sh
  • upload-assets.sh

Comment thread cft-templates/ec2-rockyLinux.yml Outdated
Comment thread cft-templates/ec2-rockyLinux.yml Outdated
Comment on lines +209 to +217
SecurityGroupIngress:
- IpProtocol: tcp
FromPort: 0
ToPort: 65535
CidrIp: 0.0.0.0/0 # Allow all incoming traffic (adjust as needed for security)
SecurityGroupEgress:
# Allow all outbound traffic so instances can reach AWS services and the internet
- IpProtocol: '-1'
CidrIp: 0.0.0.0/0
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Security group allows all TCP ports from anywhere – critical security risk.

The ClusterSecurityGroup allows inbound TCP on ports 0-65535 from 0.0.0.0/0. This exposes all services running on cluster nodes to the entire internet, creating a severe attack surface.

Proposed fix - Remove overly permissive ingress
      Tags:
        - Key: cost_resource
          Value: !Sub '${AWS::StackName}'
        - Key: Name
          Value: !Sub 'cluster-${AWS::StackName}'
-     SecurityGroupIngress:
-       - IpProtocol: tcp
-         FromPort: 0
-         ToPort: 65535
-         CidrIp: 0.0.0.0/0  # Allow all incoming traffic (adjust as needed for security)
      SecurityGroupEgress:

The self-referencing ingress rule (ClusterAllowAllInboundFromSelf) and the separate InboundSshSecurityGroup already provide the necessary intra-cluster communication and SSH access.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
SecurityGroupIngress:
- IpProtocol: tcp
FromPort: 0
ToPort: 65535
CidrIp: 0.0.0.0/0 # Allow all incoming traffic (adjust as needed for security)
SecurityGroupEgress:
# Allow all outbound traffic so instances can reach AWS services and the internet
- IpProtocol: '-1'
CidrIp: 0.0.0.0/0
SecurityGroupEgress:
# Allow all outbound traffic so instances can reach AWS services and the internet
- IpProtocol: '-1'
CidrIp: 0.0.0.0/0
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cft-templates/pcs_quick_lt.yaml` around lines 209 - 217, The
SecurityGroupIngress rule on ClusterSecurityGroup currently permits tcp FromPort
0 ToPort 65535 from 0.0.0.0/0; remove or tighten this overly-permissive ingress
in the SecurityGroupIngress block for ClusterSecurityGroup (delete the
0–65535/0.0.0.0/0 rule or replace it with specific allowed ports such as 22 and
any required application ports), relying on the existing
ClusterAllowAllInboundFromSelf for intra-cluster traffic and
InboundSshSecurityGroup for SSH access; validate the template after change to
ensure no other resources depend on the open-all rule.

Comment on lines +376 to +379
- |
cat > /etc/amazon/pcs/bootstrap_config.json <<'EOF'
{ "cluster": { "version": "Slurm_24.11", "disable_multithreading": true, "scheduler": "slurm", "base_os": "alinux2", "cluster_id": "${AWS::StackName}", "slurm": { "endpoint": "https://pcs.${AWS::Region}.api.aws" }}}
EOF
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Bootstrap config hardcodes Slurm version, ignoring SlurmVersion parameter.

Same issue as in pcs.yaml: the bootstrap config embeds "version": "Slurm_24.11" while the SlurmVersion parameter allows other versions. This will cause cluster misconfiguration.

Proposed fix
            - |
              cat > /etc/amazon/pcs/bootstrap_config.json <<'EOF'
-             { "cluster": { "version": "Slurm_24.11", "disable_multithreading": true, "scheduler": "slurm", "base_os": "alinux2", "cluster_id": "${AWS::StackName}", "slurm": { "endpoint": "https://pcs.${AWS::Region}.api.aws" }}}
+             { "cluster": { "version": "Slurm_${SlurmVersion}", "disable_multithreading": true, "scheduler": "slurm", "base_os": "alinux2", "cluster_id": "${AWS::StackName}", "slurm": { "endpoint": "https://pcs.${AWS::Region}.api.aws" }}}
              EOF

Note: Remove the <<'EOF' heredoc syntax when using CloudFormation !Sub interpolation, or use <<EOF (without quotes) to enable variable expansion.

Also applies to: 448-451

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cft-templates/pcs_quick_lt.yaml` around lines 376 - 379, The bootstrap JSON
hardcodes "version": "Slurm_24.11" which ignores the SlurmVersion parameter;
update the heredoc to allow CloudFormation substitution and plug in the
SlurmVersion parameter (e.g. replace the literal "Slurm_24.11" with
${SlurmVersion} via !Sub), and remove the single-quoted heredoc (use <<EOF or
inline !Sub) so variables expand; apply the same change for the second
occurrence that sets the bootstrap_config.json.

Comment thread cft-templates/pcs.yaml
exec > >(tee -a /var/log/amazon/pcs/bootstrap.log | logger -t user-data -s 2>/dev/ttyS0) 2>&1

mkdir -p /etc/amazon/pcs
echo '{ "cluster": { "version": "Slurm_24.11", "disable_multithreading": true, "scheduler": "slurm", "base_os": "alinux2", "cluster_id": "${AWS::StackName}", "slurm": { "endpoint": "https://pcs.${AWS::Region}.api.aws" }}}' > /etc/amazon/pcs/bootstrap_config.json
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Bootstrap config hardcodes Slurm version, ignoring SlurmVersion parameter.

The bootstrap config JSON embeds "version": "Slurm_24.11" but users can select 23.11 or 24.05 via the SlurmVersion parameter. This mismatch will cause bootstrap failures when non-default versions are selected.

Proposed fix for LoginLaunchTemplate UserData (line 427)
-            echo '{ "cluster": { "version": "Slurm_24.11", "disable_multithreading": true, "scheduler": "slurm", "base_os": "alinux2", "cluster_id": "${AWS::StackName}", "slurm": { "endpoint": "https://pcs.${AWS::Region}.api.aws" }}}' > /etc/amazon/pcs/bootstrap_config.json
+            echo '{ "cluster": { "version": "Slurm_${SlurmVersion}", "disable_multithreading": true, "scheduler": "slurm", "base_os": "alinux2", "cluster_id": "${AWS::StackName}", "slurm": { "endpoint": "https://pcs.${AWS::Region}.api.aws" }}}' > /etc/amazon/pcs/bootstrap_config.json

Apply the same fix to both LoginLaunchTemplate (line 427) and ComputeLaunchTemplate (line 489).

Also applies to: 489-489

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cft-templates/pcs.yaml` at line 427, The bootstrap JSON in the
LoginLaunchTemplate and ComputeLaunchTemplate UserData hardcodes "version":
"Slurm_24.11" causing mismatches with the SlurmVersion parameter; update both
UserData snippets (LoginLaunchTemplate and ComputeLaunchTemplate) to inject the
selected SlurmVersion (e.g., using the CloudFormation parameter reference for
SlurmVersion/AWS::NoValue pattern you used elsewhere) so the JSON writes
"version": "Slurm_<SlurmVersion>" (or equivalent) into
/etc/amazon/pcs/bootstrap_config.json, preserving the surrounding JSON structure
and escaping so the template emits the correct string for any SlurmVersion
value.

Comment thread cft-templates/sagemakerUserProfile.yaml
Comment thread rg_main_stack.yml
Comment on lines 130 to 133
- kms:Decrypt
- kms:GetKeyPolicy
- kms:PutKeyPolicy
Resource: "*"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Scope KMS policy mutation more narrowly.

kms:PutKeyPolicy on "*" lets the portal role rewrite any customer-managed key policy in the account. That is a privilege-escalation path if the instance is ever compromised and is much broader than the S3 bucket-policy change described in this PR. Limit this to the specific keys the portal is expected to manage, or gate it with conditions on the allowed key set.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rg_main_stack.yml` around lines 130 - 133, The IAM statement currently grants
kms:PutKeyPolicy (and other KMS actions) on Resource:"*", which is overly broad;
change the statement that mentions kms:PutKeyPolicy (and optionally kms:Decrypt,
kms:GetKeyPolicy) so it only targets the specific customer-managed KMS key ARNs
the portal must manage (replace Resource: "*" with an array of the exact
arn:aws:kms:...:key/<key-id> values used by the portal) or add an explicit
Condition that limits allowed keys (e.g., ArnLike/StringEquals on kms:KeyId or
kms:ResourceArn such that only the portal-managed key ARNs are permitted);
update the policy block that lists kms:PutKeyPolicy to use those key
ARNs/condition instead of the wildcard.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@config/config.json`:
- Around line 3-5: The Express app is bound to localhost which prevents other
containers from reaching advertised endpoints; update the express.host setting
in config.json (the "express" object, key "host") to bind to all interfaces
(e.g. "0.0.0.0") or an appropriate container-facing address so endpoints like
http://sp2_cc-3102:3002/... are reachable from sibling containers while keeping
express.port unchanged.
- Around line 184-188: sessionConfig currently contains a checked-in static
"secret" value; remove the hardcoded sessionConfig.secret and change runtime
config to read the secret from an external source (e.g., environment variable
SESSION_SECRET or a secrets manager) so deployments inject the real value;
update the config loader to fallback to process.env.SESSION_SECRET (or Secrets
Manager API) when sessionConfig.secret is absent, and adjust deployment tooling
that writes configs (scripts/fixconfigs.sh) and docker-compose runtime mounts to
pass the secret via environment/secret mechanism rather than embedding it in
config.json.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0d6da0f0-68cb-40bd-a7c6-c5a1cd2ecd8b

📥 Commits

Reviewing files that changed from the base of the PR and between 5f71b66 and 9230589.

📒 Files selected for processing (1)
  • config/config.json

Comment thread config/config.json
Comment thread config/config.json
RLIndia added 2 commits June 2, 2026 09:36
feat(ec2-rockyLinux): add password parameter and update user data script for ec2-user
sessionConfig secret should be replaced at deployment time.

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
SRE/lambdaresources/index.mjs (2)

43-45: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Archive error handler throws inside event callback—error won't be caught.

Throwing from an 'error' event listener doesn't propagate to the surrounding try/catch. The error will surface as an unhandled rejection (or crash), bypassing the 500 response. Consider rejecting a tracked promise or setting an error flag.

🛡️ Proposed fix using a deferred rejection
+        let archiveError = null;
+
         archive.on('error', (error) => {
-            throw new Error(`Archiving error: ${error.message}`);
+            archiveError = new Error(`Archiving error: ${error.message}`);
+            archive.abort();
         });
 
         archive.pipe(passThroughStream);
 
         // Start S3 upload process asynchronously
         const uploadParams = {
             Bucket: destinationBucket,
             Key: destinationPrefix,
             Body: passThroughStream,
             ContentType: 'application/zip'
         };
         const uploadPromise = s3.upload(uploadParams).promise();
 
         // Add files to archive
         for (const obj of contents) {
+            if (archiveError) throw archiveError;
             const fileKey = obj.Key;

Alternatively, wrap the entire archive/upload section in a new Promise that rejects on the 'error' event.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@SRE/lambdaresources/index.mjs` around lines 43 - 45, The 'archive.on("error",
(error) => { throw new Error(...) })' handler throws inside an event callback so
it won't be caught by surrounding try/catch; change this to signal failure to
the outer flow by rejecting a Promise or setting an error flag instead.
Specifically, wrap the archive/upload flow in a new Promise and inside
archive.on('error', ...) call reject(error) (or set a local error variable and
ensure the outer Promise rejects after stream completion), and update any await
logic to await that Promise so the error is propagated and results in the 500
response; locate the handler on the archive instance and the surrounding
archive/upload logic to implement the Promise-based rejection.

34-37: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical: listedObjects is undefined—function will crash at runtime.

The pagination logic populates contents, but line 34 references listedObjects.Contents.length which was never defined. This causes a ReferenceError on every invocation, preventing both the 404 early-return and the archiving path from ever executing.

🐛 Proposed fix
-        if (listedObjects.Contents.length === 0) {
+        if (contents.length === 0) {
             console.log('No files found in the source folder');
             return { statusCode: 404, body: 'No files found in the source folder' };
         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@SRE/lambdaresources/index.mjs` around lines 34 - 37, The code checks
listedObjects.Contents but the pagination routine populates the variable named
contents, so replace the undefined reference by using the populated variable
(e.g., check contents.length) or wrap contents into an object (e.g.,
listedObjects = { Contents: contents }) before the existing if; update the
conditional that returns the 404 to reference the correct symbol (contents or
the newly-created listedObjects) so the function no longer throws a
ReferenceError and correctly handles the empty-case in the archive flow.
upload-assets.sh (1)

110-112: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing -r flag may exclude subdirectories from dump archive.

zip dump.zip dump/* without -r will not recursively include subdirectories. If dump/ contains nested directories, their contents will be silently omitted, potentially causing incomplete deployments.

🐛 Proposed fix
-zip dump.zip dump/*
+zip -r dump.zip dump/*
 unzip -l dump.zip
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@upload-assets.sh` around lines 110 - 112, The zip command currently uses "zip
dump.zip dump/*" which omits nested directories; change it to use the recursive
flag (e.g., "zip -r dump.zip dump") so the entire dump/ directory and its
subdirectories are included when creating dump.zip, then keep the subsequent
unzip -l and aws s3 cp steps unchanged; update the command that references
dump.zip and dump/* in the script accordingly.
cft-templates/windows-remotedesktop.yml (2)

24-31: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Default CIDR 0.0.0.0/0 exposes RDP and WinRM to the internet.

The default value opens ports 3389 (RDP), 5986 (WinRM over HTTPS), and 8443 (DCV) to the entire internet. This is a significant security risk, especially for Windows instances which are frequent targets for brute-force attacks.

Consider requiring users to explicitly specify their IP range or using a more restrictive default.

🔒 Suggested fix: Remove default or use placeholder
   AllowedSSHLocation:
     Description: The IP address range that can be used to SSH to the EC2 instances
     Type: String
     MinLength: "9"
     MaxLength: "18"
-    Default: 0.0.0.0/0
     AllowedPattern: (\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})/(\d{1,2})
     ConstraintDescription: must be a valid IP CIDR range of the form x.x.x.x/x.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cft-templates/windows-remotedesktop.yml` around lines 24 - 31, The
AllowedSSHLocation parameter currently defaults to 0.0.0.0/0 which exposes
RDP/WinRM/DCV to the internet; remove the Default value (or replace it with a
non-routable placeholder like "REPLACE_ME" or an empty string) and update the
parameter validation/description so users must explicitly supply a CIDR (alter
AllowedSSHLocation's Default and Description and keep
AllowedPattern/ConstraintDescription as-is) to prevent an insecure open-default
configuration.

78-95: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

UserData lacks error handling; failures before set_user_token.bat won't be signaled.

PowerShell doesn't stop on errors by default. If net user fails (e.g., password doesn't meet complexity requirements), the script continues and $lastexitcode captures only the exit code of the most recent native command (set_user_token.bat), potentially signaling success despite the password failure.

🛠️ Suggested fix: Add error handling
       Fn::Base64: !Sub |
         <powershell>     
         # Redirect stdout and stderr to a log file and the console
         $scriptPath = "C:\Users\Administrator\log\user-data.log"
-        Start-Transcript -Path $scriptPath -Append  
+        Start-Transcript -Path $scriptPath -Append
+        $ErrorActionPreference = "Stop"
+        $exitCode = 0
         
-        net user Administrator ${Password}
+        try {
+          net user Administrator ${Password}
+          if ($LASTEXITCODE -ne 0) { throw "net user failed with exit code $LASTEXITCODE" }

-        echo "setting user token for nice-dcv"
-        C:\Users\Administrator\set_user_token.bat  
+          echo "setting user token for nice-dcv"
+          C:\Users\Administrator\set_user_token.bat
+          if ($LASTEXITCODE -ne 0) { throw "set_user_token.bat failed with exit code $LASTEXITCODE" }
+        } catch {
+          Write-Error $_
+          $exitCode = 1
+        }

         echo "signal cft to success/failure"
-        cfn-signal.exe -e $lastexitcode --stack ${AWS::StackId} --resource EC2Instance --region ${AWS::Region}
+        cfn-signal.exe -e $exitCode --stack ${AWS::StackId} --resource EC2Instance --region ${AWS::Region}

         # End the transcript (stop logging)
         Stop-Transcript
         </powershell>
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cft-templates/windows-remotedesktop.yml` around lines 78 - 95, Make the
UserData PowerShell script fail fast and always send the correct exit code to
cfn-signal: set PowerShell to stop on errors (e.g., set $ErrorActionPreference =
'Stop'), wrap critical native calls (net user Administrator and
C:\Users\Administrator\set_user_token.bat) in a try/catch, and in the catch call
cfn-signal.exe with a non‑zero error code and then exit; after running the
native set_user_token.bat, capture the native process result using $LASTEXITCODE
and pass that value to cfn-signal.exe (instead of relying on a last-invoked cmd
variable), then Stop-Transcript. Reference symbols: UserData block,
Start-Transcript/Stop-Transcript, net user Administrator,
C:\Users\Administrator\set_user_token.bat, cfn-signal.exe, and $LASTEXITCODE.
🧹 Nitpick comments (2)
upload-assets.sh (1)

51-57: ⚡ Quick win

Redundant wait and dead-code error check.

aws cloudformation deploy already waits for the stack operation to complete by default. The subsequent wait stack-create-complete is redundant and will fail if this is an update (not create). Additionally, with set -e enabled, the if [ $? -ne 0 ] check is dead code—if wait fails, the script exits before reaching the check.

♻️ Proposed fix
     aws cloudformation deploy --template-file rg_deploy_bucket.yml --stack-name "$bucketstackname" \
-        --parameter-overrides S3NewBucketName="$bucketname"
-    aws cloudformation wait stack-create-complete --stack-name "$bucketstackname"
-    if [ $? -ne 0 ]; then
-        echo "Error: Failed to create S3 bucket via CloudFormation."
-        exit 1
-    fi
+        --parameter-overrides S3NewBucketName="$bucketname" \
+    || { echo "Error: Failed to create S3 bucket via CloudFormation."; exit 1; }
 fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@upload-assets.sh` around lines 51 - 57, Remove the redundant explicit waiter
and dead-code error check: delete the aws cloudformation wait
stack-create-complete --stack-name "$bucketstackname" line and the following if
[ $? -ne 0 ]; then ... fi block; rely on aws cloudformation deploy (and the
script's set -e) to surface failures instead. This touches the aws
cloudformation deploy invocation and the wait/exit check surrounding
"$bucketstackname" and "$bucketname".
cft-templates/windows-remotedesktop.yml (1)

111-131: 💤 Low value

Security group description is misleading; port 80 purpose is unclear.

The GroupDescription says "Enable SSH access" but this is a Windows instance that uses RDP, not SSH. Additionally, port 80 (HTTP) is opened but DCV typically only needs port 8443 (HTTPS). If port 80 is unnecessary, consider removing it to reduce the attack surface.

📝 Suggested fix
   InstanceSecurityGroup:
     Type: AWS::EC2::SecurityGroup
     Properties:
-      GroupDescription: Enable SSH access
+      GroupDescription: Enable RDP and DCV access for Windows Remote Desktop
       SecurityGroupIngress:
-        - IpProtocol: tcp
-          FromPort: "80"
-          ToPort: "80"
-          CidrIp: !Ref "AllowedSSHLocation"
         - IpProtocol: tcp
           FromPort: "3389"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cft-templates/windows-remotedesktop.yml` around lines 111 - 131, The security
group InstanceSecurityGroup has a misleading GroupDescription ("Enable SSH
access") and an unnecessary ingress rule for port 80; update GroupDescription to
reflect RDP/management access (e.g., "Enable RDP and management ports") and
remove the SecurityGroupIngress entry that uses FromPort: "80" ToPort: "80";
keep the RDP (3389), WinRM/management (5986) and DCV (8443) rules as needed and
ensure the ingress CIDR reference (AllowedSSHLocation) is still appropriate or
renamed if necessary.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@cft-templates/windows-remotedesktop.yml`:
- Around line 24-31: The AllowedSSHLocation parameter currently defaults to
0.0.0.0/0 which exposes RDP/WinRM/DCV to the internet; remove the Default value
(or replace it with a non-routable placeholder like "REPLACE_ME" or an empty
string) and update the parameter validation/description so users must explicitly
supply a CIDR (alter AllowedSSHLocation's Default and Description and keep
AllowedPattern/ConstraintDescription as-is) to prevent an insecure open-default
configuration.
- Around line 78-95: Make the UserData PowerShell script fail fast and always
send the correct exit code to cfn-signal: set PowerShell to stop on errors
(e.g., set $ErrorActionPreference = 'Stop'), wrap critical native calls (net
user Administrator and C:\Users\Administrator\set_user_token.bat) in a
try/catch, and in the catch call cfn-signal.exe with a non‑zero error code and
then exit; after running the native set_user_token.bat, capture the native
process result using $LASTEXITCODE and pass that value to cfn-signal.exe
(instead of relying on a last-invoked cmd variable), then Stop-Transcript.
Reference symbols: UserData block, Start-Transcript/Stop-Transcript, net user
Administrator, C:\Users\Administrator\set_user_token.bat, cfn-signal.exe, and
$LASTEXITCODE.

In `@SRE/lambdaresources/index.mjs`:
- Around line 43-45: The 'archive.on("error", (error) => { throw new Error(...)
})' handler throws inside an event callback so it won't be caught by surrounding
try/catch; change this to signal failure to the outer flow by rejecting a
Promise or setting an error flag instead. Specifically, wrap the archive/upload
flow in a new Promise and inside archive.on('error', ...) call reject(error) (or
set a local error variable and ensure the outer Promise rejects after stream
completion), and update any await logic to await that Promise so the error is
propagated and results in the 500 response; locate the handler on the archive
instance and the surrounding archive/upload logic to implement the Promise-based
rejection.
- Around line 34-37: The code checks listedObjects.Contents but the pagination
routine populates the variable named contents, so replace the undefined
reference by using the populated variable (e.g., check contents.length) or wrap
contents into an object (e.g., listedObjects = { Contents: contents }) before
the existing if; update the conditional that returns the 404 to reference the
correct symbol (contents or the newly-created listedObjects) so the function no
longer throws a ReferenceError and correctly handles the empty-case in the
archive flow.

In `@upload-assets.sh`:
- Around line 110-112: The zip command currently uses "zip dump.zip dump/*"
which omits nested directories; change it to use the recursive flag (e.g., "zip
-r dump.zip dump") so the entire dump/ directory and its subdirectories are
included when creating dump.zip, then keep the subsequent unzip -l and aws s3 cp
steps unchanged; update the command that references dump.zip and dump/* in the
script accordingly.

---

Nitpick comments:
In `@cft-templates/windows-remotedesktop.yml`:
- Around line 111-131: The security group InstanceSecurityGroup has a misleading
GroupDescription ("Enable SSH access") and an unnecessary ingress rule for port
80; update GroupDescription to reflect RDP/management access (e.g., "Enable RDP
and management ports") and remove the SecurityGroupIngress entry that uses
FromPort: "80" ToPort: "80"; keep the RDP (3389), WinRM/management (5986) and
DCV (8443) rules as needed and ensure the ingress CIDR reference
(AllowedSSHLocation) is still appropriate or renamed if necessary.

In `@upload-assets.sh`:
- Around line 51-57: Remove the redundant explicit waiter and dead-code error
check: delete the aws cloudformation wait stack-create-complete --stack-name
"$bucketstackname" line and the following if [ $? -ne 0 ]; then ... fi block;
rely on aws cloudformation deploy (and the script's set -e) to surface failures
instead. This touches the aws cloudformation deploy invocation and the wait/exit
check surrounding "$bucketstackname" and "$bucketname".

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c3fd259a-21c3-430f-bc5a-ebb79171b23c

📥 Commits

Reviewing files that changed from the base of the PR and between 9230589 and 74724d0.

📒 Files selected for processing (11)
  • SRE/lambdaresources/index.mjs
  • SRE/lambdaresources/package.json
  • cft-templates/ec2-nginxWebServer.yml
  • cft-templates/ec2-rockyLinux.yml
  • cft-templates/windows-remotedesktop.yml
  • config/config.json
  • config/settings-config.json
  • deploy.sh
  • dump/configs.json
  • scripts/bootstrap-scripts/bootstrap.sh
  • upload-assets.sh
🚧 Files skipped from review as they are similar to previous changes (8)
  • deploy.sh
  • SRE/lambdaresources/package.json
  • scripts/bootstrap-scripts/bootstrap.sh
  • dump/configs.json
  • config/config.json
  • cft-templates/ec2-nginxWebServer.yml
  • config/settings-config.json
  • cft-templates/ec2-rockyLinux.yml

The image should be taken from the region in which the lambda is created.

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants