fix(lambda/grpc): code function-returned errors as FailedPrecondition #970
arreyder wants to merge 1 commit into
When a Lambda function is invoked successfully but its own code returns an error (invokeResp.FunctionError set), RoundTrip returned a bare error, which surfaces as gRPC codes.Unknown. otel counts Unknown as a span error / 5xx, so a handful of function-application faults at low call volume trips the 99.99% failure-rate monitor on be-temporal-lambda-manager.

A function-returned error is a caller/function-state fault, not a server fault: recode both return paths inside the FunctionError block to FailedPrecondition. The retryable paths above (transient-network -> Unavailable, function timeout -> DeadlineExceeded) are unchanged, and the package retryer (pkg/retry) only retries Unavailable/DeadlineExceeded, so retry behavior is identical (Unknown was never retried either).

Add a lambdaInvoker interface seam so RoundTrip is unit-testable, and a regression test asserting FailedPrecondition on the function-error paths.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
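To illustrate the recoding and why retries are unaffected, here is a minimal, stdlib-only sketch. It is not the code from this PR: `Code`, `statusError`, `codeOf`, `functionError`, and `isRetryable` are hypothetical stand-ins for `codes.Code`, `status.Errorf`, `status.Code`, the `FunctionError` return paths, and the `pkg/retry` classifier, and the error message text is invented. Only the numeric code values are taken from the gRPC status code numbering.

```go
package main

import (
	"errors"
	"fmt"
)

// Code is a local stand-in for google.golang.org/grpc/codes.Code.
// The numeric values follow the gRPC status code numbering.
type Code uint32

const (
	Unknown            Code = 2
	DeadlineExceeded   Code = 4
	FailedPrecondition Code = 9
	Unavailable        Code = 14
)

// statusError is a stand-in for an error built with status.Errorf.
type statusError struct {
	code Code
	msg  string
}

func (e *statusError) Error() string { return e.msg }

// codeOf mimics status.Code for non-nil errors: an error that carries
// no status is reported as Unknown.
func codeOf(err error) Code {
	var se *statusError
	if errors.As(err, &se) {
		return se.code
	}
	return Unknown
}

// functionError builds the error for a function-returned fault.
// recoded=false models the old bare error; recoded=true models the fix.
// The message text is hypothetical and identical in both cases.
func functionError(recoded bool, detail string) error {
	msg := fmt.Sprintf("lambda function returned error: %s", detail)
	if recoded {
		return &statusError{code: FailedPrecondition, msg: msg}
	}
	return errors.New(msg)
}

// isRetryable models a retryer that only retries Unavailable and
// DeadlineExceeded, as the commit message describes for pkg/retry.
func isRetryable(err error) bool {
	switch codeOf(err) {
	case Unavailable, DeadlineExceeded:
		return true
	default:
		return false
	}
}

func main() {
	for _, recoded := range []bool{false, true} {
		err := functionError(recoded, "Unhandled")
		fmt.Printf("recoded=%v code=%d retryable=%v\n", recoded, codeOf(err), isRetryable(err))
	}
}
```

In this sketch the bare error classifies as `Unknown` and the recoded one as `FailedPrecondition`, and neither is retryable, which is the sense in which the change preserves retry behavior.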
General PR Review: fix(lambda/grpc): code function-returned errors as FailedPrecondition
Blocking Issues: 0 | Suggestions: 1 | Threads Resolved: 0
Problem
When the lambda gRPC transport invokes a function successfully but the function's own code returns an error (`invokeResp.FunctionError` set), `RoundTrip` returned a bare `fmt.Errorf`. That surfaces as gRPC `codes.Unknown`, and our otel pipeline counts `Unknown` as a span error / HTTP 5xx.

At low call volume this trips the 99.99% failure-rate monitor on be-temporal-lambda-manager: a handful of function-application faults (e.g. a connector that exits `Unhandled` during init, or a function whose startup config fetch is rejected) pages even though nothing on the server is broken.

Fix
Recode both return paths inside the `if invokeResp.FunctionError != nil` block from a bare error to `status.Errorf(codes.FailedPrecondition, ...)`. A function-returned error is a caller/function-state fault, not a server fault, so `FailedPrecondition` (4xx, uncounted) is the correct code. The message text is preserved.

Unchanged on purpose:
- Transient network → `Unavailable`, function timeout → `DeadlineExceeded`.
- The invoke-failure (`failed to invoke lambda function`) and marshal/unmarshal paths: genuine infra/our-bug faults that stay counted.

No retry-behavior change: `pkg/retry` only retries `Unavailable`/`DeadlineExceeded`. `Unknown` was never retried, and neither is `FailedPrecondition`, so recoding is behavior-preserving for retries. (The connector-side Hello retry classifier, `isRetryableHelloError`, is the opposite direction, connector→c1api over a normal dial, and never sees these errors.)

Test
Adds a `lambdaInvoker` interface seam (subset of `*lambda.Client`; the concrete client still satisfies it, no caller changes) so `RoundTrip` is unit-testable, plus a regression test asserting `FailedPrecondition` on both function-error paths (meaningful-logs and no-meaningful-logs). `go vet` clean; `go test ./pkg/lambda/grpc/` passes.

Downstream
c1 picks this up on the next baton-sdk release + re-vendor; be-temporal-lambda-manager's existing `classifyLambdaError` pass-through then carries the `FailedPrecondition` code through unchanged.

🤖 Generated with Claude Code
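For readers unfamiliar with the interface-seam pattern named in the Test section, the following stdlib-only sketch shows the general shape. Everything in it is an assumption made for illustration: `invokeOutput`, the `Invoke` signature, `transport`, `roundTrip`, `fakeInvoker`, `invokeWith`, `isFailedPrecondition`, and the sentinel errors are hypothetical and far simpler than the AWS SDK types and the real `RoundTrip`. In particular it has a single function-error path and does not model the meaningful-logs distinction.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// invokeOutput is a simplified stand-in for the AWS SDK's invoke response.
type invokeOutput struct {
	FunctionError *string // non-nil when the function's own code failed
	Payload       []byte
}

// lambdaInvoker is the seam: the one method the transport needs.
// The signature here is simplified; the real one would match the SDK client.
type lambdaInvoker interface {
	Invoke(ctx context.Context, functionName string, payload []byte) (*invokeOutput, error)
}

// errFailedPrecondition stands in for a gRPC FailedPrecondition status.
var errFailedPrecondition = errors.New("failed precondition")

// errNetwork is a sample invoke-level failure used by the fake.
var errNetwork = errors.New("network down")

// transport is a hypothetical, pared-down transport that depends only
// on the interface, so tests can substitute a fake.
type transport struct {
	invoker      lambdaInvoker
	functionName string
}

func (t *transport) roundTrip(ctx context.Context, req []byte) ([]byte, error) {
	resp, err := t.invoker.Invoke(ctx, t.functionName, req)
	if err != nil {
		// The invoke itself failed: stays a plain, counted error here.
		return nil, fmt.Errorf("failed to invoke lambda function: %w", err)
	}
	if resp.FunctionError != nil {
		// The invoke succeeded but the function's code returned an error.
		return nil, fmt.Errorf("%w: function error: %s", errFailedPrecondition, *resp.FunctionError)
	}
	return resp.Payload, nil
}

// fakeInvoker returns canned results without touching the network.
type fakeInvoker struct {
	out *invokeOutput
	err error
}

func (f fakeInvoker) Invoke(context.Context, string, []byte) (*invokeOutput, error) {
	return f.out, f.err
}

// invokeWith runs one round trip against the given invoker.
func invokeWith(inv lambdaInvoker) error {
	t := &transport{invoker: inv, functionName: "example"}
	_, err := t.roundTrip(context.Background(), nil)
	return err
}

// isFailedPrecondition reports whether err carries the stand-in status.
func isFailedPrecondition(err error) bool {
	return errors.Is(err, errFailedPrecondition)
}

func main() {
	unhandled := "Unhandled"
	err := invokeWith(fakeInvoker{out: &invokeOutput{FunctionError: &unhandled}})
	fmt.Println(isFailedPrecondition(err), err)
}
```

Because the transport holds an interface rather than a concrete client, a regression test can drive the function-error branch with a fake and assert on the resulting code, with no AWS access required.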