Skip to content

OtlpProxy: remove in-app proxy; route OTLP at the ALB#3521

Merged
reakaleek merged 2 commits into
mainfrom
rhetorical-fine
Jun 17, 2026
Merged

OtlpProxy: remove in-app proxy; route OTLP at the ALB#3521
reakaleek merged 2 commits into
mainfrom
rhetorical-fine

Conversation

@reakaleek

@reakaleek reakaleek commented Jun 17, 2026

Copy link
Copy Markdown
Member

Why

The in-app `AdotOtlpService` proxy forwarded every browser OTLP batch over loopback to the
`otel-collector` sidecar. It was the source of `Connection reset by peer` errors (stale pooled
connections) and silent data loss. The root fix is to remove the app from the hot path entirely:
the collector sidecar already binds `0.0.0.0:4318` (OTLP HTTP) on the shared awsvpc task ENI
and is directly reachable from the ALB.

What

Deletes the proxy layer from docs-builder (`AdotOtlpService`, `IOtlpService`, `OtlpForwardResult`,
`OtlpProxyOptions`, `OtlpProxyRequest`, and `OtlpProxyTests`) and removes all call sites
(`MappingsExtensions` `/o/*` routes, `ServicesExtension.AddOtlpProxyService`,
`OtelRegistration.HttpClientTracingFilter`). Net: −769 lines.

The browser's existing paths (`/docs/_api/v1/o/{t,l,m}`) and the app's own server-side EDOT
export (`localhost:4317`) are unchanged.

How

Two companion changes complete the migration (separate PRs in their respective repos):

  • docs-infra: ALB listener rule (priority < 200) routing `/docs/_api/v1/o/*` → new target
    group on collector port 4318, health-checked on port 13133; SG ingress for 4318 + 13133.
  • docs-internal-workflows: otel-collector `COLLECTOR_CONFIG` sets `traces_url_path`,
    `logs_url_path`, `metrics_url_path` to the adblock-safe paths so the collector accepts them
    without any ALB rewrite.

Deploy sequence: collector config → infra (ALB rule) → this app change.

PooledConnectionIdleTimeout was unset (defaulted to 60s), so .NET
reused sockets closed server-side across Lambda freeze/thaw cycles,
causing batches to be silently dropped.

Set PooledConnectionIdleTimeout=2s so idle sockets are discarded
before the ADOT sidecar can reset them. Replace the one-dimensional
StaleConnectionDrops counter with a dimensioned otlp.proxy.forward
metric (outcome x signal_type) for forwarding success-rate visibility.
Set Activity status deliberately so stale drops don't surface as red
spans. Suppress the duplicate auto-instrumented HttpClient span for
the proxy hop (already covered by the app-level span) via a loopback
filter on AddHttpClientInstrumentation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@reakaleek reakaleek requested a review from a team as a code owner June 17, 2026 11:27
@reakaleek reakaleek added the bug label Jun 17, 2026
@reakaleek reakaleek requested a review from theletterf June 17, 2026 11:27
@reakaleek reakaleek temporarily deployed to integration-tests June 17, 2026 11:27 — with GitHub Actions Inactive
@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1e865f9d-4958-40ca-9872-d948995f2c58

📥 Commits

Reviewing files that changed from the base of the PR and between ae597d1 and d978984.

📒 Files selected for processing (13)
  • src/api/Elastic.Documentation.Api/MappingsExtensions.cs
  • src/api/Elastic.Documentation.Api/OpenTelemetry/OpenTelemetryExtensions.cs
  • src/api/Elastic.Documentation.Api/Program.cs
  • src/api/Elastic.Documentation.Api/ServicesExtension.cs
  • src/api/Elastic.Documentation.Api/Telemetry/AdotOtlpService.cs
  • src/api/Elastic.Documentation.Api/Telemetry/IOtlpService.cs
  • src/api/Elastic.Documentation.Api/Telemetry/OtlpForwardResult.cs
  • src/api/Elastic.Documentation.Api/Telemetry/OtlpProxyOptions.cs
  • src/api/Elastic.Documentation.Api/Telemetry/OtlpProxyRequest.cs
  • src/api/Elastic.Documentation.Api/TelemetryConstants.cs
  • src/tooling/docs-builder/Http/DocumentationWebHost.cs
  • src/tooling/docs-builder/Http/StaticWebHost.cs
  • tests/Elastic.Documentation.Api.Tests/OtlpProxyTests.cs
💤 Files with no reviewable changes (9)
  • src/api/Elastic.Documentation.Api/Telemetry/OtlpForwardResult.cs
  • src/api/Elastic.Documentation.Api/Telemetry/OtlpProxyRequest.cs
  • src/api/Elastic.Documentation.Api/Telemetry/OtlpProxyOptions.cs
  • src/api/Elastic.Documentation.Api/Telemetry/IOtlpService.cs
  • tests/Elastic.Documentation.Api.Tests/OtlpProxyTests.cs
  • src/api/Elastic.Documentation.Api/TelemetryConstants.cs
  • src/api/Elastic.Documentation.Api/OpenTelemetry/OpenTelemetryExtensions.cs
  • src/api/Elastic.Documentation.Api/ServicesExtension.cs
  • src/api/Elastic.Documentation.Api/Telemetry/AdotOtlpService.cs

📝 Walkthrough

Walkthrough

The pull request removes the OTLP proxy subsystem from the docs API entirely. All proxy-related types are deleted: IOtlpService, OtlpForwardResult, OtlpProxyOptions, OtlpSignalType, OtlpProxyRequest, and AdotOtlpService. The AddOtlpProxyService DI registration is removed from ServicesExtension, the /o/* forwarding routes are removed from MappingsExtensions, and the mapOtlpEndpoints parameter is dropped from MapElasticDocsApiEndpoints. OtlpProxySourceName is removed from TelemetryConstants and from the AddDocsApiTracing source list. All call sites in Program.cs, DocumentationWebHost.cs, and StaticWebHost.cs are updated to the parameterless overload. OtlpProxyTests.cs is deleted.

Possibly Related PRs

  • elastic/docs-builder#3356: Injects per-instance OTEL_EXPORTER_OTLP_ENDPOINT into test fixtures for mocked OTLP services — directly coupled to the OTLP proxy infrastructure being removed here.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 18.18% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: removing the in-app OTLP proxy and routing it at the ALB instead.
Description check ✅ Passed The description clearly explains the rationale, components removed, and architectural changes without being off-topic.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
✨ Simplify code
  • Create PR with simplified code
  • Commit simplified code in branch rhetorical-fine

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/api/Elastic.Documentation.Api/Telemetry/AdotOtlpService.cs`:
- Around line 58-60: The ForwardCounter.Add method currently hardcodes the
outcome as "error" for all non-success responses, which prevents proper
differentiation between different failure modes. Replace the hardcoded "error"
string in the outcome KeyValuePair with a status-derived value that maps
specific HTTP status codes to appropriate outcome labels, such as
"collector_unavailable" for 503 responses and "timeout" for 504 responses, so
that the outcome dimension contract is properly honored and forwarding metrics
are not skewed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0fb7a0ab-c504-427b-be1c-468d8873b4c6

📥 Commits

Reviewing files that changed from the base of the PR and between a77e2be and ae597d1.

📒 Files selected for processing (5)
  • src/Elastic.Documentation.ServiceDefaults/Telemetry/OpenTelemetryRegistrationExtensions.cs
  • src/api/Elastic.Documentation.Api/Program.cs
  • src/api/Elastic.Documentation.Api/ServicesExtension.cs
  • src/api/Elastic.Documentation.Api/Telemetry/AdotOtlpService.cs
  • tests/Elastic.Documentation.Api.Tests/OtlpProxyTests.cs

Comment thread src/api/Elastic.Documentation.Api/Telemetry/AdotOtlpService.cs Outdated
@reakaleek reakaleek marked this pull request as draft June 17, 2026 11:41
The in-app proxy (AdotOtlpService) forwarded every browser OTLP batch
over loopback to the otel-collector sidecar. It was the source of
Connection-reset-by-peer errors (stale pooled connections) and silent
data loss that PR #3521 mitigated.

The otel-collector sidecar already binds 0.0.0.0:4318 (OTLP HTTP) and
is reachable from the ALB on the shared awsvpc network. Browser telemetry
will be routed at the ALB directly to port 4318 via a dedicated listener
rule and target group (infra change, separate PR). The collector's OTLP
HTTP receiver will be configured to accept the same adblock-safe paths
(/docs/_api/v1/o/{t,l,m}) directly (docs-internal-workflows change).

This commit removes the proxy from docs-builder:
- Delete AdotOtlpService, IOtlpService, OtlpForwardResult,
  OtlpProxyOptions, OtlpProxyRequest, and OtlpProxyTests
- Remove OTLP endpoint mappings (/o/t, /o/l, /o/m) from MappingsExtensions
- Remove AddOtlpProxyService from ServicesExtension
- Revert OtelRegistration.HttpClientTracingFilter (added solely to silence
  the proxy's own forwarding hops, no longer needed)
- Remove OtlpProxySourceName from TelemetryConstants and tracing setup

The app's own server-side EDOT export (localhost:4317) is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@reakaleek reakaleek temporarily deployed to integration-tests June 17, 2026 11:51 — with GitHub Actions Inactive
@reakaleek reakaleek changed the title OtlpProxy: fix stale-connection drops and error span noise OtlpProxy: remove in-app proxy; route OTLP at the ALB Jun 17, 2026

@Mpdreamz Mpdreamz left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such a better and improved solution ❤️

@Mpdreamz

Copy link
Copy Markdown
Member

The only thing is local development won't have the endpoints how are we handling that @reakaleek ?

I think we should leverage Aspire more to start services mirroring production, wdyt?

@reakaleek

Copy link
Copy Markdown
Member Author

The only thing is local development won't have the endpoints how are we handling that @reakaleek ?

I think we should leverage Aspire more to start services mirroring production, wdyt?

yup yup, we probably need to run a local otel collector to make this work locally (or what ever aspire can provide)

@reakaleek reakaleek marked this pull request as ready for review June 17, 2026 13:33
@reakaleek reakaleek merged commit aad7c23 into main Jun 17, 2026
25 checks passed
@reakaleek reakaleek deleted the rhetorical-fine branch June 17, 2026 18:42
@reakaleek reakaleek added chore and removed bug labels Jun 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants