[plugin] Retry CreateFunction on IAM propagation instead of a fixed 10s wait #684
Merged
Motivation
On first-time `lambda-deploy`, the step printed as `Resolving IAM role...` always took about 10 seconds. After creating the execution role, `createIAMRole` ended with an unconditional `Task.sleep(for: .seconds(10))` to wait out IAM's eventual consistency, so every deploy that created a new role paid a flat 10s regardless of how quickly the role actually propagated.
The wait exists for a real reason: a freshly created IAM role is not always assumable by Lambda immediately, so `CreateFunction` can fail with `InvalidParameterValueException` ("The role defined for the function cannot be assumed by Lambda") until the role propagates. But a fixed sleep is the worst of both worlds: slow in the common case, and still only a guess.
Change
Replace the fixed sleep with a bounded retry around `CreateFunction`, and introduce a reusable retry primitive.
- `withRetry(...)` (`Sources/AWSLambdaPluginHelper/Retry.swift`): a general retry utility.
  - Delays grow exponentially (`initialDelay`, 2×, 4×, …) and are capped at `maxDelay`; jitter then randomises the actual wait within `[base/2, base]`. Backoff bounds the load on a slow dependency; jitter prevents many concurrent deployments from retrying in lockstep (thundering herd).
  - Defaults (`maxAttempts: 8`, `initialDelay: 500ms`, `maxDelay: 20s`) keep the common call site clean: `withRetry(isRetryable:) { ... }`.
  - The delay calculation is exposed as `backoffDelay(attempt:initialDelay:maxDelay:jitter:)`, so the curve and jitter are unit-tested deterministically (randomness is injected only inside `withRetry`).
- `Deployer.createFunction` now wraps the API call in `withRetry`, retrying only the specific transient error (`InvalidParameterValueException` with "cannot be assumed by Lambda" in the message). The flat 10s sleep in `createIAMRole` is removed. Most deploys now proceed as soon as the role propagates (often under 2s) instead of always waiting 10s.
Note on scope
I searched the deploy plugin for other hand-rolled retry/poll loops to migrate behind the utility and found none (the only other loop is the stdout read loop in `Process.swift`, which is not a retry). `withRetry` is now the single retry primitive, ready for future call sites.
Testing
- `RetryTests`: success on the first attempt, retry-then-succeed, exhaustion, non-retryable rethrow, the `onRetry` callback, plus deterministic `backoffDelay` coverage (exponential growth, `maxDelay` cap, jitter bounds, clamping of out-of-range jitter).
- `DeployerIAMRoleTests`: the transient-error detection predicate (matching error, unrelated `InvalidParameterValueException`, different error code, nil message).
- Tests run in the `swiftlang/swift:nightly-6.4.x-bookworm` (Linux) container.
🤖 Generated with Claude Code
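The retry behaviour described under Change can be sketched in Swift as follows. This is a minimal sketch, not the code merged in this PR: it assumes Swift 5.7+ (`Duration`, `Task.sleep(for:)`), and `ServiceError` and `isRolePropagationError` are hypothetical stand-ins for the Lambda client's error type and the plugin's transient-error predicate. The parameter names of `withRetry` and `backoffDelay` follow the description above; the real signatures in `Retry.swift` may differ.

```swift
import Foundation

// Sketch only (assumed signatures): exponential backoff with jitter in [base/2, base].

/// Deterministic delay before retry number `attempt` (1-based).
/// base = initialDelay * 2^(attempt - 1), capped at maxDelay;
/// `jitter` is clamped to 0...1 and picks a point in [base/2, base].
func backoffDelay(attempt: Int, initialDelay: Duration, maxDelay: Duration, jitter: Double) -> Duration {
    let initialSeconds: Double = initialDelay / .seconds(1)
    let capSeconds: Double = maxDelay / .seconds(1)
    let growth = pow(2.0, Double(max(0, attempt - 1)))
    let base = min(initialSeconds * growth, capSeconds)  // pow may overflow to +inf; min caps it
    let clampedJitter = min(max(jitter, 0.0), 1.0)
    return .seconds(base / 2 + base / 2 * clampedJitter)
}

/// Runs `operation`, retrying retryable errors with backoff until `maxAttempts` calls are spent.
/// Randomness is drawn only here, so `backoffDelay` stays testable.
func withRetry<T>(
    maxAttempts: Int = 8,
    initialDelay: Duration = .milliseconds(500),
    maxDelay: Duration = .seconds(20),
    isRetryable: (any Error) -> Bool,
    onRetry: (_ attempt: Int, _ error: any Error, _ delay: Duration) -> Void = { _, _, _ in },
    operation: () async throws -> T
) async throws -> T {
    var attempt = 1
    while true {
        do {
            return try await operation()
        } catch {
            // Rethrow non-retryable errors immediately, and any error once the budget is used up.
            guard attempt < maxAttempts, isRetryable(error) else { throw error }
            let delay = backoffDelay(attempt: attempt, initialDelay: initialDelay,
                                     maxDelay: maxDelay, jitter: Double.random(in: 0...1))
            onRetry(attempt, error, delay)
            try await Task.sleep(for: delay)
            attempt += 1
        }
    }
}

// Hypothetical stand-in for the AWS service error thrown by the Lambda client.
struct ServiceError: Error {
    let code: String
    let message: String?
}

// Hypothetical predicate: retry only the IAM-propagation failure from CreateFunction.
func isRolePropagationError(_ error: any Error) -> Bool {
    guard let serviceError = error as? ServiceError,
          serviceError.code == "InvalidParameterValueException",
          let message = serviceError.message else { return false }
    return message.contains("cannot be assumed by Lambda")
}

// Usage: a fake CreateFunction that fails twice while the role "propagates".
var createCalls = 0
let functionName: String = try await withRetry(
    initialDelay: .milliseconds(10),
    maxDelay: .milliseconds(100),
    isRetryable: isRolePropagationError
) {
    createCalls += 1
    if createCalls < 3 {
        throw ServiceError(code: "InvalidParameterValueException",
                           message: "The role defined for the function cannot be assumed by Lambda.")
    }
    return "demo-function"
}
print("created \(functionName) after \(createCalls) attempts")  // prints "created demo-function after 3 attempts"
```

Keeping `backoffDelay` free of randomness is what lets the curve, the `maxDelay` cap and the jitter clamping be asserted exactly; only `withRetry` draws a random jitter value.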