OtlpProxy: remove in-app proxy; route OTLP at the ALB#3521
Conversation
PooledConnectionIdleTimeout was unset (defaulted to 60s), so .NET reused sockets closed server-side across Lambda freeze/thaw cycles, causing batches to be silently dropped. Set PooledConnectionIdleTimeout=2s so idle sockets are discarded before the ADOT sidecar can reset them. Replace the one-dimensional StaleConnectionDrops counter with a dimensioned otlp.proxy.forward metric (outcome x signal_type) for forwarding success-rate visibility. Set Activity status deliberately so stale drops don't surface as red spans. Suppress the duplicate auto-instrumented HttpClient span for the proxy hop (already covered by the app-level span) via a loopback filter on AddHttpClientInstrumentation. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (13)
💤 Files with no reviewable changes (9)
📝 WalkthroughWalkthroughThe pull request removes the OTLP proxy subsystem from the docs API entirely. All proxy-related types are deleted: Possibly Related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches✨ Simplify code
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/api/Elastic.Documentation.Api/Telemetry/AdotOtlpService.cs`:
- Around line 58-60: The ForwardCounter.Add method currently hardcodes the
outcome as "error" for all non-success responses, which prevents proper
differentiation between different failure modes. Replace the hardcoded "error"
string in the outcome KeyValuePair with a status-derived value that maps
specific HTTP status codes to appropriate outcome labels, such as
"collector_unavailable" for 503 responses and "timeout" for 504 responses, so
that the outcome dimension contract is properly honored and forwarding metrics
are not skewed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 0fb7a0ab-c504-427b-be1c-468d8873b4c6
📒 Files selected for processing (5)
src/Elastic.Documentation.ServiceDefaults/Telemetry/OpenTelemetryRegistrationExtensions.cssrc/api/Elastic.Documentation.Api/Program.cssrc/api/Elastic.Documentation.Api/ServicesExtension.cssrc/api/Elastic.Documentation.Api/Telemetry/AdotOtlpService.cstests/Elastic.Documentation.Api.Tests/OtlpProxyTests.cs
The in-app proxy (AdotOtlpService) forwarded every browser OTLP batch over loopback to the otel-collector sidecar. It was the source of Connection-reset-by-peer errors (stale pooled connections) and silent data loss that PR #3521 mitigated. The otel-collector sidecar already binds 0.0.0.0:4318 (OTLP HTTP) and is reachable from the ALB on the shared awsvpc network. Browser telemetry will be routed at the ALB directly to port 4318 via a dedicated listener rule and target group (infra change, separate PR). The collector's OTLP HTTP receiver will be configured to accept the same adblock-safe paths (/docs/_api/v1/o/{t,l,m}) directly (docs-internal-workflows change). This commit removes the proxy from docs-builder: - Delete AdotOtlpService, IOtlpService, OtlpForwardResult, OtlpProxyOptions, OtlpProxyRequest, and OtlpProxyTests - Remove OTLP endpoint mappings (/o/t, /o/l, /o/m) from MappingsExtensions - Remove AddOtlpProxyService from ServicesExtension - Revert OtelRegistration.HttpClientTracingFilter (added solely to silence the proxy's own forwarding hops, no longer needed) - Remove OtlpProxySourceName from TelemetryConstants and tracing setup The app's own server-side EDOT export (localhost:4317) is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Mpdreamz
left a comment
There was a problem hiding this comment.
Such a better and improved solution ❤️
|
The only thing is local development won't have the endpoints how are we handling that @reakaleek ? I think we should leverage Aspire more to start services mirroring production, wdyt? |
yup yup, we probably need to run a local otel collector to make this work locally (or what ever aspire can provide) |
Why
The in-app `AdotOtlpService` proxy forwarded every browser OTLP batch over loopback to the
`otel-collector` sidecar. It was the source of `Connection reset by peer` errors (stale pooled
connections) and silent data loss. The root fix is to remove the app from the hot path entirely:
the collector sidecar already binds `0.0.0.0:4318` (OTLP HTTP) on the shared awsvpc task ENI
and is directly reachable from the ALB.
What
Deletes the proxy layer from docs-builder (`AdotOtlpService`, `IOtlpService`, `OtlpForwardResult`,
`OtlpProxyOptions`, `OtlpProxyRequest`, and `OtlpProxyTests`) and removes all call sites
(`MappingsExtensions` `/o/*` routes, `ServicesExtension.AddOtlpProxyService`,
`OtelRegistration.HttpClientTracingFilter`). Net: −769 lines.
The browser's existing paths (`/docs/_api/v1/o/{t,l,m}`) and the app's own server-side EDOT
export (`localhost:4317`) are unchanged.
How
Two companion changes complete the migration (separate PRs in their respective repos):
group on collector port 4318, health-checked on port 13133; SG ingress for 4318 + 13133.
`logs_url_path`, `metrics_url_path` to the adblock-safe paths so the collector accepts them
without any ALB rewrite.
Deploy sequence: collector config → infra (ALB rule) → this app change.