Skip to content

fix(lambda/grpc): code function-returned errors as FailedPrecondition#970

Open
arreyder wants to merge 1 commit into
mainfrom
arreyder/lambda-transport-classify-function-errors
Open

fix(lambda/grpc): code function-returned errors as FailedPrecondition#970
arreyder wants to merge 1 commit into
mainfrom
arreyder/lambda-transport-classify-function-errors

Conversation

@arreyder

Copy link
Copy Markdown
Contributor

Problem

When the lambda gRPC transport invokes a function successfully but the function's own code returns an error (invokeResp.FunctionError set), RoundTrip returned a bare fmt.Errorf. That surfaces as gRPC codes.Unknown, and our otel pipeline counts Unknown as a span error / HTTP 5xx.

At low call volume this trips the 99.99% failure-rate monitor on be-temporal-lambda-manager — a handful of function-application faults (e.g. a connector that exits Unhandled during init, or a function whose startup config fetch is rejected) pages even though nothing on the server is broken.

Fix

Recode both return paths inside the if invokeResp.FunctionError != nil block from bare error → status.Errorf(codes.FailedPrecondition, ...). A function-returned error is a caller/function-state fault, not a server fault, so FailedPrecondition (4xx, uncounted) is the correct code. The message text is preserved.

Unchanged on purpose:

  • The retryable paths above — transient-network → Unavailable, function timeout → DeadlineExceeded.
  • The invoke-failure (failed to invoke lambda function) and marshal/unmarshal paths — genuine infra/our-bug faults that stay counted.

No retry-behavior change: pkg/retry only retries Unavailable/DeadlineExceeded. Unknown was never retried, and neither is FailedPrecondition — so recoding is behavior-preserving for retries. (The connector-side Hello retry classifier, isRetryableHelloError, is the opposite direction — connector→c1api over a normal dial — and never sees these errors.)

Test

Adds a lambdaInvoker interface seam (subset of *lambda.Client; the concrete client still satisfies it, no caller changes) so RoundTrip is unit-testable, plus a regression test asserting FailedPrecondition on both function-error paths (meaningful-logs and no-meaningful-logs).

go vet clean; go test ./pkg/lambda/grpc/ passes.

Downstream

c1 picks this up on the next baton-sdk release + re-vendor; be-temporal-lambda-manager's existing classifyLambdaError pass-through then carries the FailedPrecondition code through unchanged.

🤖 Generated with Claude Code

When a Lambda function is invoked successfully but its own code returns an
error (invokeResp.FunctionError set), RoundTrip returned a bare error, which
surfaces as gRPC codes.Unknown. otel counts Unknown as a span error / 5xx, so
a handful of function-application faults at low call volume trips the 99.99%
failure-rate monitor on be-temporal-lambda-manager.

A function-returned error is a caller/function-state fault, not a server
fault: recode both return paths inside the FunctionError block to
FailedPrecondition. The retryable paths above (transient-network -> Unavailable,
function timeout -> DeadlineExceeded) are unchanged, and the package retryer
(pkg/retry) only retries Unavailable/DeadlineExceeded, so retry behavior is
identical (Unknown was never retried either).

Add a lambdaInvoker interface seam so RoundTrip is unit-testable and a
regression test asserting FailedPrecondition on the function-error paths.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@arreyder arreyder requested a review from a team June 17, 2026 14:26
Comment thread pkg/lambda/grpc/client.go
@github-actions

Copy link
Copy Markdown
Contributor

General PR Review: fix(lambda/grpc): code function-returned errors as FailedPrecondition

Blocking Issues: 0 | Suggestions: 1 | Threads Resolved: 0
Review mode: full
View review run

Review Summary
Reviewed the recode of the two FunctionError return paths in lambdaTransport.RoundTrip from bare fmt.Errorf (gRPC Unknown) to status.Errorf(codes.FailedPrecondition, ...), plus the new lambdaInvoker interface seam and its regression test. The interface seam is correct (the concrete *lambda.Client still satisfies it, no caller changes) and the retry claim holds — pkg/retry/retry.go:60 only retries Unavailable/DeadlineExceeded, so neither Unknown nor FailedPrecondition is retried. No security or confident correctness issues found; one suggestion about an undocumented secondary effect.

Security Issues
None found.

Correctness Issues
None found.

Suggestions

  • pkg/lambda/grpc/client.go:86 — The Unknown->FailedPrecondition recode also flips IsSyncPreservable (pkg/sync/syncer.go:71-83) from false to true for function-application faults; util.go:138 documents that function as a consumer of these codes. Confirm preserving the sync artifact on these faults is intended.
Prompt for AI agents
Verify each finding against the current code and only fix it if needed.

Suggestions:

In pkg/lambda/grpc/client.go:
- Around line 81-90: The two FunctionError return paths were recoded from bare error (gRPC codes.Unknown) to status.Errorf(codes.FailedPrecondition, ...). This has a secondary effect beyond the intended otel 5xx counting fix: pkg/sync/syncer.go IsSyncPreservable (lines 71-83) lists codes.FailedPrecondition in its preservable/recoverable allowlist (returns true) whereas codes.Unknown was not in it (returns false via default). The comment at pkg/lambda/grpc/util.go:138 explicitly names IsSyncPreservable as a downstream consumer of these codes. So a function that invokes successfully but returns an application error (e.g. connector exits Unhandled during init) will now have its sync artifact treated as recoverable/preserved, whereas before it was discarded. Confirm this artifact-preservation change is intended; if only the otel counting behavior was meant to change, consider a code that is uncounted by otel but still excluded from IsSyncPreservable, or update IsSyncPreservable and document the new behavior.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blocking issues found.

@arreyder

Copy link
Copy Markdown
Contributor Author

Re: the IsSyncPreservable suggestion — confirmed intended.

pkg/sync/syncer.go already treats FailedPrecondition as preservable, alongside PermissionDenied, Unavailable, Unauthenticated, Aborted, and ResourceExhausted — i.e. external/transient faults where the partial sync artifact is still valid and worth keeping for resume. A function-application error is the same shape (the connector synced some resources, then its own code errored), so preserving the partial is consistent with how every other non-server fault is handled and strictly better than the previous Unknown → discard behavior. No change needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant