feat(security): per-session IAM scoping — session-tagged AssumeRole, DynamoDB leading-key conditions, Bedrock ARN allowlist #209

@krokoko

Summary

⚠️ SCOPE REVISED 2026-05-28 (v4) — pending RE-APPROVAL. Evolution captured in this issue's comments: v2 reframed to the achievable design; v3 added design-review corrections (role-chaining 1h cap, Bedrock credential chain, full client surface); v4 makes the design backend-agnostic across both compute backends (AgentCore Runtime + ECS Fargate).

Reduce cross-task blast radius from a compromised agent session by scoping each task's tenant-data access (DynamoDB, trace S3) to short-lived IAM credentials tagged with task_id and user_id, on both compute backends, instead of using the long-lived shared compute role.

Roadmap: "Per-session IAM scoping" (Credentials and authorization). Design: docs/design/SECURITY.md. Resolves the TODO at cdk/src/stacks/agent.ts:376-389.

Backend context

The platform runs the agent under a ComputeStrategy:

  • AgentCore Runtime (live): agent boots under the per-runtime ExecutionRole. Bedrock already ARN-scoped via grantInvoke.
  • ECS Fargate (ecs-agent-cluster.ts, currently commented out in agent.ts:540-583; gating is tracked in #164, "refactor(compute): gate ECS construct on compute_type context instead of comment toggle"): agent boots under the Fargate task role (ECS credential endpoint). Bedrock is currently Resource:'*'; the task role is also missing the Approvals/Nudges/trace/attachments grants.

Both base credentials are themselves assumed-role creds → an agent-side sts:AssumeRole is role chaining on both → 1h cap applies on both.

Design (backend-agnostic)

  1. Per-task SessionRole + refreshable agent-side AssumeRole (portable). Agent (agent/src/) assumes a per-task SessionRole with session tags {user_id, repo, task_id} at startup and uses the derived creds for tenant-data clients only. Identical Python on both backends. SessionRole self-constrains via aws:PrincipalTag/*.
    • Refreshable credential provider (botocore RefreshableCredentials) re-assumes before the 1h role-chaining cap; tasks run up to maxLifetime 8h. A one-shot assume_role() would fail with ExpiredToken mid-task, so it is forbidden.
    • SessionRole trust policy permits both compute roles (AgentCore exec role + ECS task role) as principals allowed to sts:AssumeRole/sts:TagSession, constrained so only they may pass tag values.
  2. task_id leading-key conditions on TaskTable, TaskEventsTable, TaskApprovalsTable, TaskNudgesTable (dynamodb:LeadingKeys/FirstPartitionKeyValues matched against aws:PrincipalTag/task_id). Scan denied/omitted. Cross-table approval TransactWriteItems satisfied (shared task_id). Lives on the SessionRole, so it is backend-agnostic.
  3. S3 trace-prefix condition on aws:PrincipalTag/user_id (resolves agent.ts:376-389); attachments read scoped. On the SessionRole.
  4. Bedrock per backend:
    • AgentCore exec role: already ARN-scoped — leave as-is (optional action-glob tidy).
    • ECS task role: replace Resource:'*' on InvokeModel with the explicit model + inference-profile ARNs (parity with AgentCore). Net-new work.
  5. Compute-role slimming + parity: both compute roles keep baseline (Bedrock, logs, secrets, Memory) plus sts:AssumeRole/sts:TagSession on the SessionRole. Tenant-data grants move OFF the compute roles onto the SessionRole. ECS task role gains the currently-missing grants (Approvals, Nudges, trace, attachments) — now expressed only via the SessionRole, so parity is achieved by construction.
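
A minimal sketch of the refreshable provider from design item 1, assuming boto3/botocore are available in the agent image. The helper names (`make_refresher`, `build_tenant_session`) are hypothetical and not taken from agent/src/; `RefreshableCredentials.create_from_metadata` and `sts.assume_role` with `Tags` are existing botocore/STS calls. Assigning the private `_credentials` attribute is a commonly used workaround, not a public API, so treat it as an assumption to validate.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict

# Role chaining caps each assumed session at one hour.
MAX_CHAINED_SECONDS = 3600


def make_refresher(
    sts_client: Any, role_arn: str, tags: Dict[str, str]
) -> Callable[[], Dict[str, str]]:
    """Return a zero-argument callable that (re-)assumes the SessionRole."""
    session_tags = [{"Key": k, "Value": v} for k, v in sorted(tags.items())]
    # RoleSessionName is limited to 64 characters.
    session_name = f"task-{tags['task_id']}"[:64]

    def refresh() -> Dict[str, str]:
        resp = sts_client.assume_role(
            RoleArn=role_arn,
            RoleSessionName=session_name,
            DurationSeconds=MAX_CHAINED_SECONDS,
            Tags=session_tags,
        )
        creds = resp["Credentials"]
        # Metadata shape expected by RefreshableCredentials.create_from_metadata.
        return {
            "access_key": creds["AccessKeyId"],
            "secret_key": creds["SecretAccessKey"],
            "token": creds["SessionToken"],
            "expiry_time": creds["Expiration"].astimezone(timezone.utc).isoformat(),
        }

    return refresh


def build_tenant_session(sts_client: Any, role_arn: str, tags: Dict[str, str]) -> Any:
    """Build a boto3 Session whose credentials re-assume before they expire."""
    # Imported lazily so the refresher above stays testable without boto3.
    import boto3
    from botocore.credentials import RefreshableCredentials
    from botocore.session import get_session

    refresh = make_refresher(sts_client, role_arn, tags)
    creds = RefreshableCredentials.create_from_metadata(
        metadata=refresh(),
        refresh_using=refresh,
        method="sts-assume-role",
    )
    core = get_session()
    core._credentials = creds  # private attribute: assumption, not a public API
    return boto3.Session(botocore_session=core)
```

Under these assumptions, tenant-data clients would be built from the returned session (for example `session.client("dynamodb")`), while the STS client itself stays on the compute role's default credential chain so that every refresh can re-assume.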

Tenant-data clients to switch to the session (6) vs. left on compute role (8)

  • Switch: DDB task_state.py:59/549, nudge_reader.py:82, progress_writer.py:380; S3 trace telemetry.py:456; S3 attachments read attachments.py:61.
  • Leave: secrets config.py:33/97 (PAT read once at startup, pre-assume), CloudWatch logs shell.py:67/server.py:153,180/telemetry.py:59,167, AgentCore Memory memory.py:43.
  • AgentCore Memory session-tag scoping: DEFERRED (not leading-key-able; namespace isolation actorId=repo/sessionId=task_id is the current boundary).

Acceptance criteria

  • Per-task SessionRole; trust policy admits both the AgentCore exec role and ECS task role as assuming principals (synth-verified).
  • Agent assumes SessionRole with tags {user_id, repo, task_id} via a refreshable provider; test simulates >1h run → auto re-assume (no ExpiredToken). Verifiable via CloudTrail / sts:GetCallerIdentity.
  • 6 tenant-data boto3 constructions use the session; 8 non-tenant remain on the compute role (agent tests assert).
  • task_id leading-key conditions on the 4 task tables (on SessionRole); Scan denied/omitted; synth tests assert; approval TransactWriteItems still succeeds.
  • user_id prefix condition on trace-bucket PutObject; agent.ts:376-389 TODO removed.
  • ECS task-role Bedrock Resource:'*' replaced with explicit model/inference-profile ARNs; ECS NagSuppression reason (ecs-agent-cluster.ts:141) updated accordingly.
  • Both compute roles slimmed to baseline + sts:AssumeRole/TagSession; tenant-data grants live only on the SessionRole; ECS parity (Approvals/Nudges/trace/attachments) achieved via SessionRole.
  • In-account validation that ${aws:PrincipalTag/...} drives dynamodb:LeadingKeys (policy simulator or live test).
  • mise //cdk:test, //cdk:synth, //agent:quality pass; no new unsuppressed cdk-nag findings (incl. the dormant ECS construct path).
  • docs/design/SECURITY.md + docs/guides/ROADMAP.md updated; Starlight mirrors regenerated (mise //docs:sync).
  • No task-lifecycle regression on a >1h task; verified on AgentCore (live). ECS construct synth-verified (backend dormant; no live deploy test is possible until #164, "refactor(compute): gate ECS construct on compute_type context instead of comment toggle", lands).
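
For the leading-key and trace-prefix criteria above, the intended policy shape can be sketched as plain policy documents. This is an illustration only: the action list, the `traces/` prefix layout, and the helper names are assumptions rather than the actual CDK output, and `render_policy_variables` is a toy substitution for reading the result, not an IAM evaluator. Whether `${aws:PrincipalTag/...}` actually drives `dynamodb:LeadingKeys` still needs the in-account validation listed above.

```python
from typing import Any, Dict, List

TASK_ID_VAR = "${aws:PrincipalTag/task_id}"
USER_ID_VAR = "${aws:PrincipalTag/user_id}"


def leading_key_statement(table_arns: List[str]) -> Dict[str, Any]:
    """Allow item-level access only where the partition key equals the task_id tag."""
    return {
        "Sid": "TaskScopedItems",
        "Effect": "Allow",
        # Assumed action set; Scan is deliberately omitted.
        "Action": [
            "dynamodb:GetItem",
            "dynamodb:PutItem",
            "dynamodb:UpdateItem",
            "dynamodb:Query",
        ],
        "Resource": table_arns,
        "Condition": {
            "ForAllValues:StringEquals": {"dynamodb:LeadingKeys": [TASK_ID_VAR]}
        },
    }


def trace_prefix_statement(bucket_arn: str, prefix: str = "traces") -> Dict[str, Any]:
    """Allow PutObject only under a per-user prefix (prefix layout is hypothetical)."""
    return {
        "Sid": "UserScopedTraces",
        "Effect": "Allow",
        "Action": ["s3:PutObject"],
        "Resource": [f"{bucket_arn}/{prefix}/{USER_ID_VAR}/*"],
    }


def render_policy_variables(text: str, tags: Dict[str, str]) -> str:
    """Toy helper: show what a resource string looks like for given session tags."""
    for key, value in tags.items():
        text = text.replace("${aws:PrincipalTag/%s}" % key, value)
    return text
```

A synth test could assert on the same two properties this sketch encodes: no `dynamodb:Scan` in the action list, and the principal-tag variable present in the condition and the resource string.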

Out of scope

GitHub App/Token Vault PAT replacement; MicroVM attestation; layered per-tool derivation; principal-to-repo auth; table remodel to PK=user_id; AgentCore Memory session-tag scoping; enabling the ECS backend itself (#164).

Risks

  • Major change — CDK (SessionRole construct, AgentCore exec role, dormant ECS construct) + agent Python (refreshable STS provider, client wiring).
  • Touching the commented-out ECS block: must keep it synth-clean without enabling it (coordinate with #164, "refactor(compute): gate ECS construct on compute_type context instead of comment toggle").
  • SessionRole trust must prevent tag impersonation (only the two compute roles may pass tags).
  • Refreshable-creds bug class: clock skew, refresh-on-error, thread-safety across async hooks — explicit tests.
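
On the tag-impersonation risk, one possible shape for the SessionRole trust policy is sketched below as a plain policy document. The function name and role ARNs are hypothetical, and the condition block is an assumption about how "only the two compute roles may pass tags" could be expressed: `aws:RequestTag/<key>` with `StringLike "*"` requires each tag to be present, and `aws:TagKeys` limits the request to exactly these keys. It does not by itself prove that a compute role passes the right values, which remains the agent's responsibility.

```python
from typing import Any, Dict, Iterable

REQUIRED_TAGS = ("user_id", "repo", "task_id")


def session_role_trust_policy(compute_role_arns: Iterable[str]) -> Dict[str, Any]:
    """Trust policy admitting only the given compute roles, with mandatory session tags."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": sorted(compute_role_arns)},
                "Action": ["sts:AssumeRole", "sts:TagSession"],
                "Condition": {
                    # Every required tag must be supplied on the request.
                    "StringLike": {
                        f"aws:RequestTag/{key}": "*" for key in REQUIRED_TAGS
                    },
                    # No tag keys beyond the required set may be passed.
                    "ForAllValues:StringEquals": {
                        "aws:TagKeys": list(REQUIRED_TAGS)
                    },
                },
            }
        ],
    }
```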

Metadata

Labels

approved, enhancement
