fix(change-buffer): skip ops and chunk spans when the span is not live #2192
Draft
bengl wants to merge 1 commit into
Previously, a single operation referencing a span that was no longer live -- already extracted (a late or duplicate op for an exported span), or whose Create was never applied -- made flush_change_buffer return SpanNotFound, aborting the entire batch. The caller then reset the change queue, discarding every still-pending operation including unrelated Creates; those spans were then orphaned and cascaded into further SpanNotFound errors at export time. flush_chunk shared the same fragility: one missing span id aborted extraction of the whole chunk. Now both paths skip the offending operation or span and continue. interpret_operation_cached still parses each op's payload so the read cursor stays aligned for subsequent ops, but applies mutations only when the target span is present. flush_chunk establishes its segment from the first live span and skips absent ids rather than returning an error. Skips are tallied in a new dropped_for_missing_span counter (exposed via a getter) and logged at debug level; non-zero values indicate benign late or duplicate operations rather than a fault.
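The skip-and-continue behaviour described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the wire format, the opcodes, and the `State`, `Span`, `take`, `encode`, and `flush` names are all hypothetical. Only the idea (always consume the payload, mutate only when the span is live, count the skip) is taken from the description.

```rust
use std::collections::HashMap;

// Hypothetical wire format, for illustration only:
//   [opcode: u8][span_id: u64 LE][payload...]
// Opcode 0 = Create (no payload), opcode 1 = SetDuration (i64 LE payload).
const OP_CREATE: u8 = 0;
const OP_SET_DURATION: u8 = 1;

#[derive(Debug, Default, PartialEq)]
struct Span {
    duration: i64,
}

#[derive(Default)]
struct State {
    spans: HashMap<u64, Span>,
    dropped_for_missing_span: u64,
}

// Reads N bytes at the cursor and advances it; None on a truncated buffer.
fn take<const N: usize>(buf: &[u8], cursor: &mut usize) -> Option<[u8; N]> {
    let end = cursor.checked_add(N)?;
    let mut out = [0u8; N];
    out.copy_from_slice(buf.get(*cursor..end)?);
    *cursor = end;
    Some(out)
}

// Helper to build one encoded op in the hypothetical format above.
fn encode(opcode: u8, span_id: u64, payload: Option<i64>) -> Vec<u8> {
    let mut out = vec![opcode];
    out.extend_from_slice(&span_id.to_le_bytes());
    if let Some(value) = payload {
        out.extend_from_slice(&value.to_le_bytes());
    }
    out
}

// Applies every op in the buffer. An op targeting a span that is not live
// is skipped and counted instead of aborting the batch.
fn flush(state: &mut State, buf: &[u8]) -> Option<()> {
    let mut cursor = 0;
    while cursor < buf.len() {
        let [opcode] = take::<1>(buf, &mut cursor)?;
        let span_id = u64::from_le_bytes(take::<8>(buf, &mut cursor)?);
        match opcode {
            OP_CREATE => {
                state.spans.insert(span_id, Span::default());
            }
            OP_SET_DURATION => {
                // Always consume the payload so the cursor stays aligned
                // for the ops that follow.
                let duration = i64::from_le_bytes(take::<8>(buf, &mut cursor)?);
                // Mutate only when the target span is live.
                match state.spans.get_mut(&span_id) {
                    Some(span) => span.duration = duration,
                    None => state.dropped_for_missing_span += 1,
                }
            }
            // An unknown opcode means the buffer itself is malformed.
            _ => return None,
        }
    }
    Some(())
}
```

In this sketch, a buffer holding a SetDuration for an unknown span followed by a Create and a SetDuration for span 1 ends with span 1 updated and the counter at 1. Under the old all-or-nothing behaviour the first op would have aborted the batch before the Create was applied.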
All green: all tests passed (commit SHA: 2f34b53).
What does this PR do?
Makes the WASM change-buffer flush resilient to operations (and chunk spans) whose target span is no longer live in the span map. Previously a single such operation returned SpanNotFound, which aborted the entire flush batch; now the offending op/span is skipped and the rest of the batch is processed.
Motivation
The change buffer is consumed by dd-trace-js's native-spans pipeline. When an operation referenced a span that wasn't live, either because it was already extracted (a late or duplicate op for an exported span) or because its Create was never applied, flush_change_buffer returned Err(SpanNotFound), aborting the whole batch. The caller then reset the change queue, discarding every still-pending operation, including unrelated Creates. Those spans were then orphaned and cascaded into further SpanNotFound errors at export time. In a deep, high-churn workload (e.g. the plugin-graphql-long benchmark under Node 20) this reliably crashed the host process. flush_chunk had the same all-or-nothing fragility.
Additional Notes
interpret_operation_cached: the span lookup is now an Option. Each op's payload is always parsed (keeping the read cursor aligned for subsequent ops), but mutations apply only when the span is present. Trace-level ops resolve segment_id to a 0 sentinel when the span is absent, matching the existing non-cached interpret_operation path, so they no-op rather than polluting a live segment.
flush_chunk: establishes the segment from the first live span rather than the nominal first id, and skips absent ids instead of aborting. Best-effort first_is_local_root/chunk-root flags fall through to the first extracted span.
Skips are tallied in a new dropped_for_missing_span counter (with a getter) and logged at debug level. Non-zero values indicate benign late/duplicate operations, not a fault.
How to test the change?
cargo test -p libdd-trace-utils --features change-buffer change_buffer
25 tests pass, including 7 new regression tests: a skipped op keeps the read cursor aligned; a missing-span op no longer aborts pending Creates (the cascade regression); flush_chunk skips missing spans and returns empty when all are missing; and BatchSetMeta/BatchSetMetric/trace-level ops for a missing span consume their full payload.
End-to-end: with a debug build of this change wired into dd-trace-js's native pipeline, the plugin-graphql-long benchmark (WITH_TRACER=1 WITH_DEPTH=6 OPERATIONS=800, Node 20) goes from crashing 8/8 runs to passing 8/8 with zero "span not found" errors reaching the host, even with the JS-side error swallow disabled.
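For readers unfamiliar with the chunk path, the flush_chunk behaviour noted above (segment taken from the first live span, absent ids skipped, empty result when every id is missing) could look roughly like the sketch below. The `Span` and `State` types, the return shape, and the function signature are assumptions made for illustration; they are not the types or API in libdd-trace-utils.

```rust
use std::collections::HashMap;

// Hypothetical, simplified span record.
#[derive(Debug, Clone, PartialEq)]
struct Span {
    span_id: u64,
    segment_id: u64,
}

#[derive(Default)]
struct State {
    spans: HashMap<u64, Span>,
    dropped_for_missing_span: u64,
}

// Extracts the listed spans, skipping ids that are not live.
// Returns the segment id of the first live span (None if no span was live)
// together with the spans that were actually extracted.
fn flush_chunk(state: &mut State, span_ids: &[u64]) -> (Option<u64>, Vec<Span>) {
    let mut segment = None;
    let mut extracted = Vec::with_capacity(span_ids.len());
    for id in span_ids {
        match state.spans.remove(id) {
            Some(span) => {
                // The first span actually found defines the chunk's segment,
                // not the nominal first id in the list.
                if segment.is_none() {
                    segment = Some(span.segment_id);
                }
                extracted.push(span);
            }
            // Missing span: count it and keep going instead of erroring.
            None => state.dropped_for_missing_span += 1,
        }
    }
    (segment, extracted)
}
```

With spans 2 and 3 live in segment 7, calling this sketch with ids 1, 2, 3 yields segment 7, two extracted spans, and a counter of 1. A second call with ids 2 and 3 (now already extracted) yields no segment and an empty list rather than an error.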