[plugin] Retry CreateFunction on IAM propagation instead of a fixed 10s wait #684

Merged
sebsto merged 3 commits into main from fix/iam-role-propagation-wait
Jun 29, 2026

@sebsto sebsto commented Jun 29, 2026

Collaborator

Motivation

On first-time lambda-deploy, the step printed as Resolving IAM role... always took ~10 seconds. After creating the execution role, createIAMRole ended with an unconditional Task.sleep(for: .seconds(10)) to wait out IAM's eventual consistency — so every new-role deploy paid a flat 10s regardless of how quickly the role actually propagated.

The wait exists for a real reason: a freshly created IAM role is not always assumable by Lambda immediately, so CreateFunction can fail with InvalidParameterValueException ("The role defined for the function cannot be assumed by Lambda") until the role propagates. But a fixed sleep is the worst of both worlds — slow for the common case, and still only a guess.

Change

Replace the fixed sleep with a bounded retry around CreateFunction, and introduce a reusable retry primitive.

  • withRetry(...) (Sources/AWSLambdaPluginHelper/Retry.swift) — a general retry utility:

    • Exponential backoff with equal jitter. The base delay doubles each attempt (initialDelay, 2×, 4×, …) capped at maxDelay; jitter then randomises the actual wait within [base/2, base]. Backoff bounds the load on a slow dependency; jitter prevents many concurrent deployments from retrying in lockstep (thundering herd).
    • Sensible defaults (maxAttempts: 8, initialDelay: 500ms, maxDelay: 20s) so the common call site stays clean: withRetry(isRetryable:) { ... }.
    • The delay maths is factored into backoffDelay(attempt:initialDelay:maxDelay:jitter:) so the curve and jitter are unit-tested deterministically (randomness is injected only inside withRetry).
  • Deployer.createFunction now wraps the API call in withRetry, retrying only the specific transient error (InvalidParameterValueException + "cannot be assumed by Lambda"). The flat 10s sleep in createIAMRole is removed. Most deploys now proceed as soon as the role propagates (often < 2s) instead of always waiting 10s.
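The backoff curve and retry loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in Retry.swift: the function and parameter names follow the PR description, but the bodies, the use of `Double` seconds, 1-based attempt numbering, a jitter factor in `[0, 1]`, and the `onRetry` callback shape are all assumptions.

```swift
import Foundation

// Hypothetical sketch of the delay maths. Assumptions: delays are Double
// seconds, `attempt` is 1-based, and `jitter` is a factor in [0, 1].
func backoffDelay(
    attempt: Int,
    initialDelay: Double,
    maxDelay: Double,
    jitter: Double
) -> Double {
    // Base delay doubles each attempt (initialDelay, 2x, 4x, ...), capped at maxDelay.
    let exponent = Double(max(attempt - 1, 0))
    let base = min(initialDelay * pow(2.0, exponent), maxDelay)
    // Equal jitter: half the base is fixed, the other half is scaled by the
    // jitter factor, so the result lies in [base/2, base].
    let factor = min(max(jitter, 0.0), 1.0)  // clamp out-of-range jitter
    return base / 2 + (base / 2) * factor
}

// Hypothetical sketch of the retry loop. Randomness is drawn only here, so
// backoffDelay stays deterministic and easy to test.
func withRetry<T>(
    maxAttempts: Int = 8,
    initialDelay: Double = 0.5,
    maxDelay: Double = 20,
    isRetryable: (Error) -> Bool,
    onRetry: (Int, Double) -> Void = { _, _ in },
    operation: () async throws -> T
) async throws -> T {
    var attempt = 1
    while true {
        do {
            return try await operation()
        } catch {
            // Rethrow non-retryable errors, and the last error once attempts run out.
            guard attempt < maxAttempts, isRetryable(error) else { throw error }
            let delay = backoffDelay(
                attempt: attempt,
                initialDelay: initialDelay,
                maxDelay: maxDelay,
                jitter: Double.random(in: 0...1)
            )
            onRetry(attempt, delay)
            try await Task.sleep(nanoseconds: UInt64(delay * 1_000_000_000))
            attempt += 1
        }
    }
}
```

With the stated defaults, the base delays under this sketch would be 0.5s, 1s, 2s, 4s, 8s, 16s, 20s across the seven waits between eight attempts, with each actual wait falling between half the base and the full base.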

Note on scope

I searched the deploy plugin for other hand-rolled retry/poll loops to migrate behind the utility — there were none (the only other loop is the stdout read loop in Process.swift, which is not a retry). withRetry is now the single retry primitive, ready for future call sites.

Testing

  • RetryTests — success-first-attempt, retry-then-succeed, exhaustion, non-retryable-rethrow, onRetry callback, plus deterministic backoffDelay coverage (exponential growth, maxDelay cap, jitter bounds, out-of-range jitter clamping).
  • DeployerIAMRoleTests — the transient-error detection predicate (matching error, unrelated InvalidParameterValueException, different error code, nil message).
  • The package builds and all 13 tests pass on macOS and in the swiftlang/swift:nightly-6.4.x-bookworm (Linux) container.
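For illustration, the transient-error predicate and the four cases listed for DeployerIAMRoleTests might look like the sketch below. `ServiceError` and `isRolePropagationError` are hypothetical stand-ins: the real predicate matches on whatever error type the plugin's AWS client surfaces, which is not shown in this PR description.

```swift
import Foundation

// Hypothetical stand-in for the AWS client's error type (an assumption).
struct ServiceError: Error {
    let code: String
    let message: String?
}

// Hypothetical predicate: retry only when Lambda reports that the freshly
// created role cannot be assumed yet. Everything else is rethrown.
func isRolePropagationError(_ error: Error) -> Bool {
    guard let serviceError = error as? ServiceError,
          serviceError.code == "InvalidParameterValueException",
          let message = serviceError.message
    else { return false }
    return message.contains("cannot be assumed by Lambda")
}

// The four cases named in the test list.
let matching = ServiceError(
    code: "InvalidParameterValueException",
    message: "The role defined for the function cannot be assumed by Lambda"
)
let unrelated = ServiceError(
    code: "InvalidParameterValueException",
    message: "Some other validation failure"
)
let otherCode = ServiceError(
    code: "ResourceConflictException",
    message: "The role defined for the function cannot be assumed by Lambda"
)
let nilMessage = ServiceError(code: "InvalidParameterValueException", message: nil)

print(isRolePropagationError(matching))    // true
print(isRolePropagationError(unrelated))   // false
print(isRolePropagationError(otherCode))   // false
print(isRolePropagationError(nilMessage))  // false
```

Matching on both the error code and the message substring keeps the retry narrow: an InvalidParameterValueException caused by a genuinely bad parameter fails fast instead of being retried until the attempts run out.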

🤖 Generated with Claude Code

@sebsto sebsto added the 🔨 semver/patch No public API change. label Jun 29, 2026
@sebsto sebsto force-pushed the fix/iam-role-propagation-wait branch 2 times, most recently from d75f02b to 72a5851 Compare June 29, 2026 14:13
@sebsto sebsto force-pushed the fix/iam-role-propagation-wait branch from 72a5851 to c232970 Compare June 29, 2026 14:18
@sebsto sebsto merged commit 42bd3bb into main Jun 29, 2026
53 checks passed
@sebsto sebsto deleted the fix/iam-role-propagation-wait branch June 29, 2026 14:40