fix(change-buffer): skip ops and chunk spans when the span is not live #2192

Draft
bengl wants to merge 1 commit into main from bengl/change-buffer-tolerate-missing-spans

bengl commented Jul 2, 2026

What does this PR do?

Makes the WASM change-buffer flush resilient to operations (and chunk spans) whose target span is no longer live in the span map. Previously a single such operation returned SpanNotFound, which aborted the entire flush batch; now the offending op/span is skipped and the rest of the batch is processed.

Motivation

The change buffer is consumed by dd-trace-js's native-spans pipeline. When an operation references a span that isn't live — because it was already extracted (a late or duplicate op for an exported span), or because its Create was never applied — flush_change_buffer returned Err(SpanNotFound), aborting the whole batch. The caller then reset the change queue, discarding every still-pending operation, including unrelated Creates. Those spans were then orphaned and cascaded into further SpanNotFound errors at export time. In a deep, high-churn workload (e.g. the plugin-graphql-long benchmark under Node 20) this reliably crashed the host process. flush_chunk had the same all-or-nothing fragility.
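The failure mode described above can be modelled with a deliberately simplified sketch. Everything here is hypothetical and for illustration only: the `Op` enum, the `Store` type, and both flush functions are assumptions, not the crate's API, and the real change buffer is a binary format rather than a slice of enums. The sketch only contrasts an all-or-nothing flush with a skip-and-continue flush on the same batch.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified operation type (not the real wire format).
#[derive(Debug, Clone, PartialEq)]
pub enum Op {
    Create { span_id: u64 },
    SetName { span_id: u64, name: String },
}

/// Error returned by the strict flush when an op targets a span that is not live.
#[derive(Debug, PartialEq)]
pub struct SpanNotFound(pub u64);

/// Hypothetical span store: span id -> span name.
#[derive(Debug, Default)]
pub struct Store {
    pub spans: HashMap<u64, String>,
    pub dropped_for_missing_span: u64,
}

impl Store {
    /// All-or-nothing flush: the first op on a missing span aborts the batch,
    /// so every later op (including unrelated Creates) is never applied.
    pub fn flush_strict(&mut self, ops: &[Op]) -> Result<(), SpanNotFound> {
        for op in ops {
            match op {
                Op::Create { span_id } => {
                    self.spans.insert(*span_id, String::new());
                }
                Op::SetName { span_id, name } => {
                    let span = self.spans.get_mut(span_id).ok_or(SpanNotFound(*span_id))?;
                    *span = name.clone();
                }
            }
        }
        Ok(())
    }

    /// Tolerant flush: an op on a missing span is counted and skipped,
    /// and the rest of the batch is still processed.
    pub fn flush_tolerant(&mut self, ops: &[Op]) {
        for op in ops {
            match op {
                Op::Create { span_id } => {
                    self.spans.insert(*span_id, String::new());
                }
                Op::SetName { span_id, name } => match self.spans.get_mut(span_id) {
                    Some(span) => *span = name.clone(),
                    None => self.dropped_for_missing_span += 1,
                },
            }
        }
    }
}
```

With a batch of `SetName` for an unknown span 7 followed by `Create` and `SetName` for span 1, `flush_strict` returns `Err(SpanNotFound(7))` before span 1 is ever created, while `flush_tolerant` creates and names span 1 and records one dropped op.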

Additional Notes

  • interpret_operation_cached: the span lookup is now an Option. Each op's payload is always parsed (keeping the read cursor aligned for subsequent ops), but mutations apply only when the span is present. Trace-level ops resolve segment_id to a 0 sentinel when the span is absent — matching the existing non-cached interpret_operation path — so they no-op rather than polluting a live segment.
  • flush_chunk: establishes the segment from the first live span rather than the nominal first id, and skips absent ids instead of aborting. Best-effort first_is_local_root/chunk-root flags fall through to the first extracted span.
  • Skips are counted in a new dropped_for_missing_span counter (with a getter) and logged at debug level. Non-zero values indicate benign late/duplicate operations, not a fault.
  • Behavior/robustness fix; no public signatures change (only an additive getter).
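The cursor-alignment point in the notes above is easiest to see on a byte buffer. The following is a minimal sketch under stated assumptions: the wire format (one opcode byte, a little-endian u64 span id, then length-prefixed strings), the opcode values, `SpanStore`, `FlushError`, and the encoder helpers are all invented for illustration and do not reflect the actual libdd-trace-utils layout or API. Only the counter name `dropped_for_missing_span` comes from the PR. The idea shown is that the payload is always parsed, so the next op starts at the right offset, and the mutation is applied only when the span lookup returns `Some`.

```rust
use std::collections::HashMap;

// Hypothetical wire format, assumed for illustration only:
//   [opcode: u8][span_id: u64 LE][payload...]
// OP_CREATE has no payload; OP_SET_META carries two u16-length-prefixed strings.
const OP_CREATE: u8 = 1;
const OP_SET_META: u8 = 2;

#[derive(Debug, PartialEq)]
pub enum FlushError {
    Truncated,
    InvalidUtf8,
    UnknownOpcode(u8),
}

#[derive(Debug, Default)]
pub struct SpanStore {
    pub spans: HashMap<u64, HashMap<String, String>>, // span id -> meta tags
    dropped_for_missing_span: u64,
}

/// Read `n` bytes at `*pos` and advance the cursor past them.
fn take<'a>(buf: &'a [u8], pos: &mut usize, n: usize) -> Result<&'a [u8], FlushError> {
    let end = (*pos).checked_add(n).ok_or(FlushError::Truncated)?;
    let bytes = buf.get(*pos..end).ok_or(FlushError::Truncated)?;
    *pos = end;
    Ok(bytes)
}

/// Read a u16-length-prefixed UTF-8 string and advance the cursor.
fn read_str<'a>(buf: &'a [u8], pos: &mut usize) -> Result<&'a str, FlushError> {
    let len = take(buf, pos, 2)?;
    let len = u16::from_le_bytes([len[0], len[1]]) as usize;
    std::str::from_utf8(take(buf, pos, len)?).map_err(|_| FlushError::InvalidUtf8)
}

impl SpanStore {
    /// Number of ops skipped because their target span was not live.
    pub fn dropped_for_missing_span(&self) -> u64 {
        self.dropped_for_missing_span
    }

    pub fn flush(&mut self, buf: &[u8]) -> Result<(), FlushError> {
        let mut pos = 0;
        while pos < buf.len() {
            let opcode = take(buf, &mut pos, 1)?[0];
            let mut id = [0u8; 8];
            id.copy_from_slice(take(buf, &mut pos, 8)?);
            let span_id = u64::from_le_bytes(id);
            match opcode {
                OP_CREATE => {
                    self.spans.entry(span_id).or_default();
                }
                OP_SET_META => {
                    // Always consume the full payload so the cursor stays aligned,
                    // whether or not the span turns out to be live.
                    let key = read_str(buf, &mut pos)?;
                    let value = read_str(buf, &mut pos)?;
                    match self.spans.get_mut(&span_id) {
                        Some(meta) => {
                            meta.insert(key.to_owned(), value.to_owned());
                        }
                        None => self.dropped_for_missing_span += 1,
                    }
                }
                other => return Err(FlushError::UnknownOpcode(other)),
            }
        }
        Ok(())
    }
}

/// Hypothetical encoders, used only to build example buffers.
pub fn encode_create(out: &mut Vec<u8>, span_id: u64) {
    out.push(OP_CREATE);
    out.extend_from_slice(&span_id.to_le_bytes());
}

pub fn encode_set_meta(out: &mut Vec<u8>, span_id: u64, key: &str, value: &str) {
    out.push(OP_SET_META);
    out.extend_from_slice(&span_id.to_le_bytes());
    for s in [key, value].iter() {
        // Assumes strings shorter than 64 KiB; longer ones would be truncated by the cast.
        out.extend_from_slice(&(s.len() as u16).to_le_bytes());
        out.extend_from_slice(s.as_bytes());
    }
}
```

Encoding a `SetMeta` for a never-created span 9, then a `Create` and a `SetMeta` for span 1, and flushing the result leaves span 1 with its tag applied and the counter at 1. If the skipped op's payload were not consumed, the next opcode would be read from the middle of the skipped strings.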

How to test the change?

Run cargo test -p libdd-trace-utils --features change-buffer change_buffer. All 25 tests pass, including 7 new regression tests. These cover: a skipped op keeps the read cursor aligned; a missing-span op no longer aborts pending Creates (the cascade regression); flush_chunk skips missing spans and returns empty when all are missing; and BatchSetMeta/BatchSetMetric/trace-level ops for a missing span consume their full payload.

End-to-end: with a debug build of this change wired into dd-trace-js's native pipeline, the plugin-graphql-long benchmark (WITH_TRACER=1 WITH_DEPTH=6 OPERATIONS=800, Node 20) goes from crashing in 8/8 runs to passing in 8/8, with zero "span not found" errors reaching the host, even with the JS-side error swallow disabled.

Previously, a single operation referencing a span that was no longer
live -- already extracted (a late or duplicate op for an exported span),
or whose Create was never applied -- made flush_change_buffer return
SpanNotFound, aborting the entire batch. The caller then reset the
change queue, discarding every still-pending operation including
unrelated Creates; those spans were then orphaned and cascaded into
further SpanNotFound errors at export time. flush_chunk shared the same
fragility: one missing span id aborted extraction of the whole chunk.

Now both paths skip the offending operation or span and continue.
interpret_operation_cached still parses each op's payload so the read
cursor stays aligned for subsequent ops, but applies mutations only when
the target span is present. flush_chunk establishes its segment from the
first live span and skips absent ids rather than returning an error.
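The flush_chunk behavior described in the paragraph above can be sketched as follows. This is a minimal model under assumptions: `Span`, `SpanMap`, the return type, and the use of `HashMap::remove` for extraction are hypothetical and not taken from the crate. It shows only that the segment comes from the first live span rather than the nominal first id, and that absent ids are counted and skipped.

```rust
use std::collections::HashMap;

/// Hypothetical span record for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub span_id: u64,
    pub segment_id: u64,
}

/// Hypothetical map of live spans plus the skip counter named in the PR.
#[derive(Debug, Default)]
pub struct SpanMap {
    pub live: HashMap<u64, Span>,
    pub dropped_for_missing_span: u64,
}

impl SpanMap {
    /// Extract the spans named by `ids`, skipping ids that are not live.
    /// Returns the segment of the first live span (None if no id was live)
    /// together with the extracted spans, in request order.
    pub fn flush_chunk(&mut self, ids: &[u64]) -> (Option<u64>, Vec<Span>) {
        let mut segment = None;
        let mut chunk = Vec::with_capacity(ids.len());
        for id in ids {
            match self.live.remove(id) {
                Some(span) => {
                    // Only the first live span establishes the segment.
                    segment.get_or_insert(span.segment_id);
                    chunk.push(span);
                }
                None => self.dropped_for_missing_span += 1,
            }
        }
        (segment, chunk)
    }
}
```

Requesting ids 1, 2, 3 when only 2 and 3 are live yields the segment of span 2 and a two-span chunk; requesting only missing ids yields an empty chunk instead of an error.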

Skips are tallied in a new dropped_for_missing_span counter (exposed via
a getter) and logged at debug level; non-zero values indicate benign
late or duplicate operations rather than a fault.

github-actions Bot commented Jul 2, 2026

📚 Documentation Check Results

⚠️ 731 documentation warning(s) found

📦 libdd-trace-utils - 731 warning(s)


Updated: 2026-07-02 21:05:34 UTC | Commit: 23e408b | missing-docs job results

github-actions Bot commented Jul 2, 2026

Clippy Allow Annotation Report

Comparing clippy allow annotations between branches:

  • Base Branch: origin/main
  • PR Branch: origin/bengl/change-buffer-tolerate-missing-spans

Summary by Rule

Rule Base Branch PR Branch Change

Annotation Counts by File

File Base Branch PR Branch Change

Annotation Stats by Crate

| Crate | Base Branch | PR Branch | Change |
| --- | --- | --- | --- |
| clippy-annotation-reporter | 5 | 5 | No change (0%) |
| datadog-ffe-ffi | 1 | 1 | No change (0%) |
| datadog-ipc | 22 | 22 | No change (0%) |
| datadog-live-debugger | 4 | 4 | No change (0%) |
| datadog-live-debugger-ffi | 10 | 10 | No change (0%) |
| datadog-profiling-replayer | 4 | 4 | No change (0%) |
| datadog-sidecar | 45 | 45 | No change (0%) |
| libdd-common | 13 | 13 | No change (0%) |
| libdd-common-ffi | 12 | 12 | No change (0%) |
| libdd-data-pipeline | 6 | 6 | No change (0%) |
| libdd-ddsketch | 2 | 2 | No change (0%) |
| libdd-dogstatsd-client | 1 | 1 | No change (0%) |
| libdd-profiling | 13 | 13 | No change (0%) |
| libdd-remote-config | 3 | 3 | No change (0%) |
| libdd-telemetry | 20 | 20 | No change (0%) |
| libdd-tinybytes | 4 | 4 | No change (0%) |
| libdd-trace-normalization | 2 | 2 | No change (0%) |
| libdd-trace-obfuscation | 3 | 3 | No change (0%) |
| libdd-trace-stats | 1 | 1 | No change (0%) |
| libdd-trace-utils | 11 | 11 | No change (0%) |
| Total | 182 | 182 | No change (0%) |

About This Report

This report tracks Clippy allow annotations for specific rules, showing how they've changed in this PR. Decreasing the number of these annotations generally improves code quality.

github-actions Bot commented Jul 2, 2026

🔒 Cargo Deny Results

⚠️ 1 issue(s) found, showing only errors (advisories, bans, sources)

📦 libdd-trace-utils - 1 error(s)

error[unsound]: Rand is unsound with a custom logger using `rand::rng()`
    ┌─ /home/runner/work/libdatadog/libdatadog/Cargo.lock:181:1
    │
181 │ rand 0.8.5 registry+https://github.com/rust-lang/crates.io-index
    │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ unsound advisory detected
    │
    ├ ID: RUSTSEC-2026-0097
    ├ Advisory: https://rustsec.org/advisories/RUSTSEC-2026-0097
    ├ It has been reported (by @lopopolo) that the `rand` library is [unsound](https://rust-lang.github.io/unsafe-code-guidelines/glossary.html#soundness-of-code--of-a-library) (i.e. that safe code using the public API can cause Undefined Behaviour) when all the following conditions are met:
      
      - The `log` and `thread_rng` features are enabled
      - A [custom logger](https://docs.rs/log/latest/log/#implementing-a-logger) is defined
      - The custom logger accesses `rand::rng()` (previously `rand::thread_rng()`) and calls any `TryRng` (previously `RngCore`) methods on `ThreadRng`
      - The `ThreadRng` (attempts to) reseed while called from the custom logger (this happens every 64 kB of generated data)
      - Trace-level logging is enabled or warn-level logging is enabled and the random source (the `getrandom` crate) is unable to provide a new seed
      
      `TryRng` (previously `RngCore`) methods for `ThreadRng` use `unsafe` code to cast `*mut BlockRng<ReseedingCore>` to `&mut BlockRng<ReseedingCore>`. When all the above conditions are met this results in an aliased mutable reference, violating the Stacked Borrows rules. Miri is able to detect this violation in sample code. Since construction of [aliased mutable references is Undefined Behaviour](https://doc.rust-lang.org/stable/nomicon/references.html), the behaviour of optimized builds is hard to predict.
    ├ Announcement: https://github.com/rust-random/rand/pull/1763
    ├ Solution: Upgrade to >=0.10.1 OR <0.10.0, >=0.9.3 OR <0.9.0, >=0.8.6 (try `cargo update -p rand`)
    ├ rand v0.8.5
      ├── (dev) libdd-common v5.0.0
      │   ├── libdd-capabilities-impl v2.0.0
      │   │   └── libdd-trace-utils v8.0.0
      │   │       └── (dev) libdd-trace-utils v8.0.0 (*)
      │   └── libdd-trace-utils v8.0.0 (*)
      ├── (dev) libdd-trace-normalization v2.0.0
      │   └── libdd-trace-utils v8.0.0 (*)
      ├── libdd-trace-utils v8.0.0 (*)
      └── proptest v1.5.0
          └── (dev) libdd-tinybytes v1.1.1
              ├── (dev) libdd-tinybytes v1.1.1 (*)
              └── libdd-trace-utils v8.0.0 (*)

advisories FAILED, bans ok, sources ok

Updated: 2026-07-02 21:07:33 UTC | Commit: 23e408b | dependency-check job results

@datadog-datadog-prod-us1-2

Tests

🎉 All green!

🧪 All tests passed
❄️ No new flaky tests detected

🎯 Code Coverage (details)
Patch Coverage: 85.44%
Overall Coverage: 74.44% (-0.01%)

🔗 Commit SHA: 2f34b53

dd-octo-sts Bot commented Jul 2, 2026

Artifact Size Benchmark Report

aarch64-alpine-linux-musl

| Artifact | Baseline | Commit | Change |
| --- | --- | --- | --- |
| /aarch64-alpine-linux-musl/lib/libdatadog_profiling.a | 85.88 MB | 85.88 MB | 0% (0 B) 👌 |
| /aarch64-alpine-linux-musl/lib/libdatadog_profiling.so | 7.88 MB | 7.88 MB | 0% (0 B) 👌 |

aarch64-unknown-linux-gnu

| Artifact | Baseline | Commit | Change |
| --- | --- | --- | --- |
| /aarch64-unknown-linux-gnu/lib/libdatadog_profiling.so | 10.61 MB | 10.61 MB | 0% (0 B) 👌 |
| /aarch64-unknown-linux-gnu/lib/libdatadog_profiling.a | 97.09 MB | 97.09 MB | 0% (0 B) 👌 |

libdatadog-x64-windows

| Artifact | Baseline | Commit | Change |
| --- | --- | --- | --- |
| /libdatadog-x64-windows/debug/dynamic/datadog_profiling_ffi.dll | 25.45 MB | 25.45 MB | 0% (0 B) 👌 |
| /libdatadog-x64-windows/debug/dynamic/datadog_profiling_ffi.lib | 88.04 KB | 88.04 KB | 0% (0 B) 👌 |
| /libdatadog-x64-windows/debug/dynamic/datadog_profiling_ffi.pdb | 184.55 MB | 184.55 MB | -0% (-8.00 KB) 👌 |
| /libdatadog-x64-windows/debug/static/datadog_profiling_ffi.lib | 945.16 MB | 945.16 MB | 0% (0 B) 👌 |
| /libdatadog-x64-windows/release/dynamic/datadog_profiling_ffi.dll | 8.32 MB | 8.32 MB | 0% (0 B) 👌 |
| /libdatadog-x64-windows/release/dynamic/datadog_profiling_ffi.lib | 88.04 KB | 88.04 KB | 0% (0 B) 👌 |
| /libdatadog-x64-windows/release/dynamic/datadog_profiling_ffi.pdb | 24.61 MB | 24.61 MB | 0% (0 B) 👌 |
| /libdatadog-x64-windows/release/static/datadog_profiling_ffi.lib | 49.02 MB | 49.02 MB | 0% (0 B) 👌 |

libdatadog-x86-windows

| Artifact | Baseline | Commit | Change |
| --- | --- | --- | --- |
| /libdatadog-x86-windows/debug/dynamic/datadog_profiling_ffi.dll | 22.05 MB | 22.05 MB | 0% (0 B) 👌 |
| /libdatadog-x86-windows/debug/dynamic/datadog_profiling_ffi.lib | 89.42 KB | 89.42 KB | 0% (0 B) 👌 |
| /libdatadog-x86-windows/debug/dynamic/datadog_profiling_ffi.pdb | 188.58 MB | 188.58 MB | -0% (-8.00 KB) 👌 |
| /libdatadog-x86-windows/debug/static/datadog_profiling_ffi.lib | 934.16 MB | 934.16 MB | 0% (0 B) 👌 |
| /libdatadog-x86-windows/release/dynamic/datadog_profiling_ffi.dll | 6.43 MB | 6.43 MB | 0% (0 B) 👌 |
| /libdatadog-x86-windows/release/dynamic/datadog_profiling_ffi.lib | 89.42 KB | 89.42 KB | 0% (0 B) 👌 |
| /libdatadog-x86-windows/release/dynamic/datadog_profiling_ffi.pdb | 26.42 MB | 26.42 MB | 0% (0 B) 👌 |
| /libdatadog-x86-windows/release/static/datadog_profiling_ffi.lib | 46.64 MB | 46.64 MB | 0% (0 B) 👌 |

x86_64-alpine-linux-musl

| Artifact | Baseline | Commit | Change |
| --- | --- | --- | --- |
| /x86_64-alpine-linux-musl/lib/libdatadog_profiling.a | 76.57 MB | 76.57 MB | 0% (0 B) 👌 |
| /x86_64-alpine-linux-musl/lib/libdatadog_profiling.so | 8.78 MB | 8.78 MB | 0% (0 B) 👌 |

x86_64-unknown-linux-gnu

| Artifact | Baseline | Commit | Change |
| --- | --- | --- | --- |
| /x86_64-unknown-linux-gnu/lib/libdatadog_profiling.a | 92.08 MB | 92.08 MB | 0% (0 B) 👌 |
| /x86_64-unknown-linux-gnu/lib/libdatadog_profiling.so | 10.69 MB | 10.69 MB | 0% (0 B) 👌 |
