Skip to content

feat(library-config): otel process context reader#2176

Open
cataphract wants to merge 14 commits into
mainfrom
levi/otel-thread-context
Open

feat(library-config): otel process context reader#2176
cataphract wants to merge 14 commits into
mainfrom
levi/otel-thread-context

Conversation

@cataphract

@cataphract cataphract commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Add read(), threadlocal_attribute_key_map(), and read_threadlocal_attribute_key_map() to otel_process_ctx::linux, along with the find_otel_mapping() / is_named_otel_mapping() / parse_mapping_start().

What does this PR do?

A brief description of the change being made with this pull request.

Motivation

What inspired you to submit this pull request?

Additional Notes

Anything else we should know when reviewing?

How to test the change?

Describe here in detail how the change can be validated.

BREAKING CHANGE: unpublish becomes unsafe.

@cataphract cataphract requested a review from a team as a code owner June 29, 2026 15:41
@cataphract cataphract requested review from mabdinur and removed request for a team June 29, 2026 15:41
Add `read()`, `threadlocal_attribute_key_map()`, and
`read_threadlocal_attribute_key_map()` to `otel_process_ctx::linux`,
along with the `find_otel_mapping()` / `is_named_otel_mapping()` /
`parse_mapping_start()`.
@cataphract cataphract force-pushed the levi/otel-thread-context branch from 2887d0b to cfe279d Compare June 29, 2026 15:42
@github-actions

Copy link
Copy Markdown
Contributor

Clippy Allow Annotation Report

Comparing clippy allow annotations between branches:

  • Base Branch: origin/main
  • PR Branch: origin/levi/otel-thread-context

Summary by Rule

Rule Base Branch PR Branch Change

Annotation Counts by File

File Base Branch PR Branch Change

Annotation Stats by Crate

Crate Base Branch PR Branch Change
clippy-annotation-reporter 5 5 No change (0%)
datadog-ffe-ffi 1 1 No change (0%)
datadog-ipc 22 22 No change (0%)
datadog-live-debugger 4 4 No change (0%)
datadog-live-debugger-ffi 10 10 No change (0%)
datadog-profiling-replayer 4 4 No change (0%)
datadog-sidecar 45 45 No change (0%)
libdd-common 13 13 No change (0%)
libdd-common-ffi 12 12 No change (0%)
libdd-data-pipeline 6 6 No change (0%)
libdd-ddsketch 2 2 No change (0%)
libdd-dogstatsd-client 1 1 No change (0%)
libdd-profiling 13 13 No change (0%)
libdd-remote-config 3 3 No change (0%)
libdd-telemetry 20 20 No change (0%)
libdd-tinybytes 4 4 No change (0%)
libdd-trace-normalization 2 2 No change (0%)
libdd-trace-obfuscation 3 3 No change (0%)
libdd-trace-stats 1 1 No change (0%)
libdd-trace-utils 11 11 No change (0%)
Total 182 182 No change (0%)

About This Report

This report tracks Clippy allow annotations for specific rules, showing how they've changed in this PR. Decreasing the number of these annotations generally improves code quality.

@github-actions

github-actions Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

📚 Documentation Check Results

⚠️ 267 documentation warning(s) found

📦 libdd-library-config - 267 warning(s)


Updated: 2026-07-03 00:57:05 UTC | Commit: d3773a9 | missing-docs job results

@github-actions

github-actions Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

🔒 Cargo Deny Results

⚠️ 1 issue(s) found, showing only errors (advisories, bans, sources)

📦 libdd-library-config - 1 error(s)

Show output
error[unsound]: Rand is unsound with a custom logger using `rand::rng()`
   ┌─ /home/runner/work/libdatadog/libdatadog/Cargo.lock:62:1
   │
62 │ rand 0.8.5 registry+https://github.com/rust-lang/crates.io-index
   │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ unsound advisory detected
   │
   ├ ID: RUSTSEC-2026-0097
   ├ Advisory: https://rustsec.org/advisories/RUSTSEC-2026-0097
   ├ It has been reported (by @lopopolo) that the `rand` library is [unsound](https://rust-lang.github.io/unsafe-code-guidelines/glossary.html#soundness-of-code--of-a-library) (i.e. that safe code using the public API can cause Undefined Behaviour) when all the following conditions are met:
     
     - The `log` and `thread_rng` features are enabled
     - A [custom logger](https://docs.rs/log/latest/log/#implementing-a-logger) is defined
     - The custom logger accesses `rand::rng()` (previously `rand::thread_rng()`) and calls any `TryRng` (previously `RngCore`) methods on `ThreadRng`
     - The `ThreadRng` (attempts to) reseed while called from the custom logger (this happens every 64 kB of generated data)
     - Trace-level logging is enabled or warn-level logging is enabled and the random source (the `getrandom` crate) is unable to provide a new seed
     
     `TryRng` (previously `RngCore`) methods for `ThreadRng` use `unsafe` code to cast `*mut BlockRng<ReseedingCore>` to `&mut BlockRng<ReseedingCore>`. When all the above conditions are met this results in an aliased mutable reference, violating the Stacked Borrows rules. Miri is able to detect this violation in sample code. Since construction of [aliased mutable references is Undefined Behaviour](https://doc.rust-lang.org/stable/nomicon/references.html), the behaviour of optimized builds is hard to predict.
   ├ Announcement: https://github.com/rust-random/rand/pull/1763
   ├ Solution: Upgrade to >=0.10.1 OR <0.10.0, >=0.9.3 OR <0.9.0, >=0.8.6 (try `cargo update -p rand`)
   ├ rand v0.8.5
     └── libdd-library-config v2.0.0

advisories FAILED, bans ok, sources ok

Updated: 2026-07-03 00:59:10 UTC | Commit: d3773a9 | dependency-check job results

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2887d0b569

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +484 to +485
let payload = unsafe { std::slice::from_raw_parts(payload_ptr, payload_size as usize) };
let context = ProcessContext::decode(payload)?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Synchronize before decoding the publisher's payload

When read() runs concurrently with another thread calling publish()/update(), the timestamp check does not keep the old payload allocation alive: after the first published_at load, ProcessContextHandle::update can assign self.payload = payload, dropping the Vec that payload_ptr still references, and this code then builds a slice and lets prost read from a dangling pointer. The final timestamp comparison happens only after the unsafe read, so it cannot prevent a crash/UB; the same-process reader needs to coordinate with the publisher or otherwise copy from memory that cannot be freed during decode.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah... This file seems to want to update some sort of seqlock, but there are several problems with it, one of these is this reclamation problem; a seqlock can protects only static region. Also a seqlock depends on a counter being odd to detect in-progress changes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likely the solution involves in-process readers using the write lock.

@yannham yannham Jul 1, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lso a seqlock depends on a counter being odd to detect in-progress changes.

I think the odd thing is because of multiple writers. If you have one writer only, and multiple reader, a monotonic value should be enough to detect torn reads (which I believe is the model for the process context).

one of these is this reclamation problem; a seqlock can protects only static region

I think I don't understand what you mean by that?

@cataphract cataphract Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yannham

I think the odd thing is because of multiple writers.

It's not, it's to detect that a modification is progress, so the read may be give torn/inconsistent data. Assume there is no reordering:

  1. update data
  2. ++seq

and the reader does

  1. read seq
  2. read data
  3. re-read seq to confirm it's the same

then the reader may read partially updated data and see the same seq.

But anyway, I was wrong about this, because the writer uses 0 to signal the update in progress.

one of these is this reclamation problem; a seqlock can protects only static region

What I mean is that the reader may fetch a pointer to the payload, and by the time it reads it, it's been freed already. This is a problem if the memory region has been unmapped (segfault).

@cataphract cataphract Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think the wrong publishing protocol was chosen for this. What the seqlock is protecting is a single pointer, so you could just update the pointer atomically with release-acquire (more or less, it also has a size, but that's not an obstacle because amd64/arm64 support 16-byte atomics, or the size could be moved to the beginning of the pointed to region). This doesn't eliminate the reclamation problem -- you'd still need something else to prevent in progress readers from segfaulting, but it does show that the seqlock buys nothing here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see, thanks for the clarification for the odd counter part. I didn't remember that correctly.

I guess in the process context case the information is just split across two different fields.

What I mean is that the reader may fetch a pointer to the payload, and by the time it reads it, it's been freed already. This is a problem if the memory region has been unmapped (segfault).

@ivoanjo I don't remember all the details from the spec, but I think the model here is to map a page at init time and then never unmap it (beyond in-place update)? Of course you can't enforce that from the reader side, you have to rely on the writer to be compliant.

But it seems that process_vm_readv() returns an error, and doesn't cause SEGFAULT, for invalid/unmapped addresses (not clear from the doc, but so says my clanker). So I won't be worried about unmapping the context page.

However, if the payload is being modified, I think you're right. I don't see how you can avoid:

  • reader see a consistent context and read it locally
  • writer starts and completes an update by re-allocating the payload, causing the previous one to be dropped if heap-allocated, or overwritten if stored in the mapping.
  • reader starts to dereference the payload ptr

This case might be even worse because you won't get EINVAL, since the memory will most likely still be mapped (even in the heap), just potential gibberish 🤔

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess one fix would be to restrict the reader to always use the same mapping for the payload, and include in the bracket as part of the "writing". The pointer would effectively be mostly useless, and the payload would be a static part (beside the length) of the whole context.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yannham

I think the model here is to map a page at init time and then never unmap it (beyond in-place update)?

This is the case for the header. The problem is the payload.

But it seems that process_vm_readv() returns an error, and doesn't cause SEGFAULT, for invalid/unmapped addresses

Yes, out-of-process reads don't have this problem. A more difficult question, with process_vm_readv doing plain memcpy page-by-page and having the payload and header in different pages is whether they participate correctly in the protocol, which something that needs to be seen architecture-by-architecture, but probably it's not a big obstacle.

However, if the payload is being modified, I think you're right. I don't see how you can avoid:

The way this is usually avoided for in-process readers is to use specific reclamation strategies like RCU, hazard pointers, refcounting, etc... or use read-write locks.

I think that reading gibberish, even in-process, is not that bad, because it can be detected by the seqlock protocol (assuming you follow the pointer and read all your data before re-checking the sequence number). It's just the if the memory has been unmapped that it becomes a problem.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've already discussed in the channel, but looping back on this -- my suggestion is to either have an out-of-band mechanism for in-process reads synchronization OR use a process_vm_readv or equivalent reader.

Comment on lines +415 to +417
name.starts_with("/memfd:OTEL_CTX")
|| name.starts_with("[anon_shmem:OTEL_CTX]")
|| name.starts_with("[anon:OTEL_CTX]")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Match only the exact OTEL_CTX mapping name

Because these checks use starts_with, a process that also has a mapping such as /memfd:OTEL_CTX_BACKUP or [anon:OTEL_CTX_old] can be selected before the real /memfd:OTEL_CTX entry, since /proc/self/maps is ordered by address rather than by mapping name. In that case read() will return an invalid-signature error (or read the wrong context) even though a valid OTEL context mapping exists later in the maps file; the name match should be exact, with only the kernel's separate (deleted) token handled as metadata.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would there be mappings with names like "/memfd:OTEL_CTX_BACKUP" or "[anon:OTEL_CTX_old]"? Yes, sure, it's technically true but... what are we supposed to do about this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't there a suggestion on adding several of these mappings for PHP to account for it processing several service/env? Ultimately we didn't do it, but I think that's indeed the idea of the spec, that you could have several.

This implements literally what the spec says: https://github.com/open-telemetry/opentelemetry-specification/blob/14137d0022f98bb59392f82255a70f3d9c2a7d4c/oteps/profiles/4719-process-ctx.md?plain=1#L164

Locate mapping: Parse /proc/<pid>/maps and search for entries with name starting with [anon_shmem:OTEL_CTX], [anon:OTEL_CTX] or /memfd:OTEL_CTX

though I think it should say [anon_shmem:OTEL_CTX and [anon:OTEL_CTX. Or maybe it just wanted to allow the (deleted) afterwards

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's for the (deleted) stuff. It's not for multiple mappings to find.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What @bwoebi said -- the process context is expected to be a singleton for a process, and anything that's not a singleton for a process should not go in there.

E.g. if a process is 1:1 to a service, then service name goes in there (for instance, this is the plan for dd-trace-rb); if a process is somehow multi-tenant and supports multiple services (e.g. dd-trace-php), then it shouldn't go in there.

It's not supported by the spec publishing several OTEL_CTX memory locations -- readers are only looking for one so they'll either error or ignore the others.

@datadog-prod-us1-3

datadog-prod-us1-3 Bot commented Jun 29, 2026

Copy link
Copy Markdown

Pipelines  Tests

Fix all issues with BitsAI

⚠️ Warnings

🚦 6 Pipeline jobs failed

Lint | clippy #windows-latest msrv   View in Datadog   GitHub Actions

Lint | clippy #windows-latest nightly   View in Datadog   GitHub Actions

Lint | clippy #windows-latest stable   View in Datadog   GitHub Actions

View all 6 failed jobs.

ℹ️ Info

No other issues found (see more)

🧪 All tests passed
❄️ No new flaky tests detected

🎯 Code Coverage (details)
Patch Coverage: 80.50%
Overall Coverage: 74.42% (-0.03%)

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 99e3aaa | Docs | Datadog PR Page | Give us feedback!

@dd-octo-sts

dd-octo-sts Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Artifact Size Benchmark Report

aarch64-alpine-linux-musl
Artifact Baseline Commit Change
/aarch64-alpine-linux-musl/lib/libdatadog_profiling.so 7.88 MB 7.88 MB 0% (0 B) 👌
/aarch64-alpine-linux-musl/lib/libdatadog_profiling.a 85.88 MB 85.87 MB --.01% (-13.34 KB) 💪
aarch64-unknown-linux-gnu
Artifact Baseline Commit Change
/aarch64-unknown-linux-gnu/lib/libdatadog_profiling.so 10.61 MB 10.61 MB --.01% (-1.21 KB) 💪
/aarch64-unknown-linux-gnu/lib/libdatadog_profiling.a 97.08 MB 97.06 MB --.01% (-14.97 KB) 💪
libdatadog-x64-windows
Artifact Baseline Commit Change
/libdatadog-x64-windows/debug/dynamic/datadog_profiling_ffi.dll 25.45 MB 25.45 MB 0% (0 B) 👌
/libdatadog-x64-windows/debug/dynamic/datadog_profiling_ffi.lib 88.04 KB 88.04 KB 0% (0 B) 👌
/libdatadog-x64-windows/debug/dynamic/datadog_profiling_ffi.pdb 184.55 MB 184.55 MB +0% (+8.00 KB) 👌
/libdatadog-x64-windows/debug/static/datadog_profiling_ffi.lib 945.18 MB 945.18 MB 0% (0 B) 👌
/libdatadog-x64-windows/release/dynamic/datadog_profiling_ffi.dll 8.32 MB 8.32 MB 0% (0 B) 👌
/libdatadog-x64-windows/release/dynamic/datadog_profiling_ffi.lib 88.04 KB 88.04 KB 0% (0 B) 👌
/libdatadog-x64-windows/release/dynamic/datadog_profiling_ffi.pdb 24.61 MB 24.60 MB --.03% (-8.00 KB) 💪
/libdatadog-x64-windows/release/static/datadog_profiling_ffi.lib 49.02 MB 49.02 MB 0% (0 B) 👌
libdatadog-x86-windows
Artifact Baseline Commit Change
/libdatadog-x86-windows/debug/dynamic/datadog_profiling_ffi.dll 22.05 MB 22.05 MB 0% (0 B) 👌
/libdatadog-x86-windows/debug/dynamic/datadog_profiling_ffi.lib 89.42 KB 89.42 KB 0% (0 B) 👌
/libdatadog-x86-windows/debug/dynamic/datadog_profiling_ffi.pdb 188.58 MB 188.59 MB +0% (+8.00 KB) 👌
/libdatadog-x86-windows/debug/static/datadog_profiling_ffi.lib 934.19 MB 934.19 MB 0% (0 B) 👌
/libdatadog-x86-windows/release/dynamic/datadog_profiling_ffi.dll 6.43 MB 6.43 MB 0% (0 B) 👌
/libdatadog-x86-windows/release/dynamic/datadog_profiling_ffi.lib 89.42 KB 89.42 KB 0% (0 B) 👌
/libdatadog-x86-windows/release/dynamic/datadog_profiling_ffi.pdb 26.42 MB 26.42 MB 0% (0 B) 👌
/libdatadog-x86-windows/release/static/datadog_profiling_ffi.lib 46.63 MB 46.63 MB 0% (0 B) 👌
x86_64-alpine-linux-musl
Artifact Baseline Commit Change
/x86_64-alpine-linux-musl/lib/libdatadog_profiling.a 76.56 MB 76.55 MB --.01% (-9.21 KB) 💪
/x86_64-alpine-linux-musl/lib/libdatadog_profiling.so 8.77 MB 8.77 MB 0% (0 B) 👌
x86_64-unknown-linux-gnu
Artifact Baseline Commit Change
/x86_64-unknown-linux-gnu/lib/libdatadog_profiling.a 92.07 MB 92.06 MB --.01% (-9.57 KB) 💪
/x86_64-unknown-linux-gnu/lib/libdatadog_profiling.so 10.69 MB 10.69 MB -0% (-576 B) 👌

Comment on lines 279 to 282
fence(Ordering::SeqCst);
AtomicU64::from_ptr(addr_of_mut!((*header).monotonic_published_at_ns))
(*header)
.monotonic_published_at_ns
.store(published_at_ns, Ordering::Relaxed);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the fence is unnecessary if I change the store to use Ordering::SeqCst, yeah? In the previous state, it did a SeqCst fence with a Relaxed store, I didn't change it, I just changed to use the AtomicU64 natively instead of through a ptr.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've read a bunch of material which helped me understand some other cases better but... not precisely this one. If anyone can educate me here, I'd appreciate understanding how they'd be different and which one is desired!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and seqcst is even too strong, it only needs release. If you're interested, you can read this paper on synchronization for the seqlock algorithm.

But there is a problem that makes reasoning purely about the C++ memory model here insufficient, and that is that we need to support out-of-process readers with process_vm_readv. This requires understanding both a) what happens architecturally on the writer and b) how process_vm_readv (which I think does just memcpy page by page) and access_remote_vm work exactly.

In any case, we only need to support amd64 and arm64, so just a Release store should be enough.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion is -- let's go with the strongest barrier we have. Why? Because the process context is not performance-sensitive, and the cost of getting this detail wrong is high.

So even though we need barriers here for correctness, I don't think we should be trying to think about them in terms of performance. If we had a lock that would use a much stronger atomic and nobody would be mad if we grab a log once an hour so that's the mental model to go for here ;)

-- my 0.02c :D

@yannham yannham Jul 1, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the fence is unnecessary if I change the store to use Ordering::SeqCst, yeah? In the previous state, it did a SeqCst fence with a Relaxed store, I didn't change it, I just changed to use the AtomicU64 natively instead of through a ptr.

Yes, I believe SeqCst store is more constraining than a SeqCst fence + a relaxed store. The difference is rather technical, but it's that in the latter case, the store isn't part of the total sequential consistent order. For all other orderings, a fence with ordering O followed (resp. preceded by) a relaxed store (resp. a relaxed load) is strictly equivalent to a store (resp. a load) with ordering O. The main usage for fences is that you can only use one for several atomic operations at once, or that you can use them conditionally. Here there's no strong motivation other than following closely the spec.

Yes, and seqcst is even too strong, it only needs release. If you're interested, you can read this paper on synchronization for the seqlock algorithm.

Yup. We used SeqCst mostly because:

  • the process context publication isn't performance-critical (happens once or not very often)
  • since the whole seqlock-like thing is fishy in Rust, we upgraded to the strongest ordering.

But there is a problem that makes reasoning purely about the C++ memory model here insufficient, and that is that we need to support out-of-process readers with process_vm_readv. This requires understanding both a) what happens architecturally on the writer and b) how process_vm_readv (which I think does just memcpy page by page) and access_remote_vm work exactly.

Yes, I personally have no idea what is the memory semantics of process_vm_readv, and in any case it can't be considered by the C++ memory model, which doesn't model OS-level operations like these. So we're a bit in best-effort places.

@yannham yannham Jul 1, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though, process_vm_readv can't have stronger requirements than direct memory accesses, I believe. So if we're direct-memory-access-safe, I don't see how we wouldn't be syscall-read-memory-safe.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think the way to reason about this is about constraining what the compiler does in such a way that it can't be anything else than what we want (even though the spec may not technically promise it...)

@morrisonlevi morrisonlevi force-pushed the levi/otel-thread-context branch from 19c2470 to d363e03 Compare June 30, 2026 22:49
@yannham

yannham commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Out of curiosity, what's the motivation for implementing a reader on the libdatadog side? A seqlock-like reader is currently impossible to implement in Rust without UB, at least in theory (see RFC). People still do it, it seems to compile fine as of today, but it's still fragile looking ahead.

@cataphract

Copy link
Copy Markdown
Contributor Author

@yannham

Out of curiosity, what's the motivation for implementing a reader on the libdatadog side? A seqlock-like reader is currently impossible to implement in Rust without UB, at least in theory (see RFC). People still do it, it seems to compile fine as of today, but it's still fragile looking ahead.

In this case, we could just make the pointer and the size atomic and access them with relaxed memory order. This would remove the technicality of the race being UB. But even without it, I don't think it's much of a problem. I'm pretty sure the exact same code would be generated.

@yannham

yannham commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

My original question still stands (as it's not written in the PR description - to be clear I haven't look at the PR code yet in details): what is the motivation for having a reader in libdatadog?

@cataphract

cataphract commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

My original question still stands (as it's not written in the PR description - to be clear I haven't look at the PR code yet in details): what is the motivation for having a reader in libdatadog?

I could be implemented somewhere else. I guess the same reason the writer is implemented in libdatadog -- to do it uniformly across tracers. And with the writer already implemented in it, and the reader being sort of implemented already for the tests, everything stays more cohesive.

That said, I have no great objection to this being moved to dd-trace-php

@bwoebi

bwoebi commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

More precisely: You need a reader in (PHP) profiling to support using the profiler with another otel compatible tracer (other than our tracer) running in the same process and being able to access that tracers context data.

libdatadog is the right place for this - it's not necessarily a pure PHP concern.

@morrisonlevi

morrisonlevi commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

In practice we're allocating a whole OS page to put this header in there, and the header is like 32 bytes. That leaves roughly 4,064 bytes (because we're talking about Linux here, and nearly everything is running 4096 pages or more on Linux) to put a serialized payload into. Why use a separate allocation for the payload? Nobody should use more than ~4k bytes for this purpose, methinks? And this utilizes otherwise wasted space. If the spec is flexible at this stage, instead of storing a ptr you can store a length for the usable portion of the page, which can vary e.g. 16 KiB on macOS if we ever port there.

I think someone may have hinted at doing this in the discussion already, and the spec hints at an optimization, but does anyone (@ivoanjo?) have a problem mandating that? Because otherwise you can't really do a safe read, if the ptr has a separate lifetime. I suppose it could point to some other safe place like a static section of memory, but that doesn't really seem too useful compared to just using the memory page. Torn reads can be handled, but a separate lifetime complicates making that torn read safe for both threads and out-of-process readers.

@yannham

yannham commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

@morrisonlevi Regarding the allocation living elsewhere and not in the header, it's a pure matter of doing the simplest thing for the first implementation. The plan has indeed been to put it in the header as a next step. No other technical reason than TODO. I don't know why this is not mandated in the spec, or why you'd want a separate allocation. I'll leave that part to Ivo 🙂

For the correctness part, we've casually discussed with @scottgerring and @nsavoire yesterday. My current understanding is that having a separate allocation is OK because:

  • if you try to read a page that has been unmapped before the call to process_vm_readv, you'll get EINVAL
  • if you try to read a page that is being unmapped in the middle of a process_vm_readv, presumably you'll either get EINVAL or the returned number of bytes read will be lower than what you expect
  • if you try to read an allocation that has been deallocated but the page is still mapped, you'll successfully read gibberish

But most importantly, in any of those cases, you don't get a crash. From there, y you'll also be able to know because you check the published_at after fully reading the payload, and if the writer started a modification/dropped the payload, you should see either 0 or a different value than the first one.

@cataphract

cataphract commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

@yannham so the plan is to use process_vm_readv even for in-process reads? I checked the linux source and actually this seems doable because the kernel special cases a process reading itself and doesn't require the ptrace capability. The only problems I could find were CONFIG_CROSS_MEMORY_ATTACH=n or something like seccomp blocking it.

@yannham

yannham commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

@yannham so the plan is to use process_vm_readv even for in-process reads? I checked the linux source and actually this seems doable because the kernel special cases a process reading itself and doesn't require the ptrace capability. The only problems I could find were CONFIG_CROSS_MEMORY_ATTACH=n or something like seccomp blocking it.

I think the spec was mainly written with an out-of-process, privileged reader in mind, typically an eBPF profiler. If you really need to read from the same process, I would be inclined to go with process_vm_readv if possible indeed, or really any syscall or Unix API that let you read your own memory without crashes/interrupts. For example ChatGPT proposes to create a pipe, write to it ssize_t n = write(pipe_fd, addr, len); and then read back safely from the pipe. write will convert converts faults or unmapped memory to an error return value. It's a bit hacky, but might work as well?

@cataphract

Copy link
Copy Markdown
Contributor Author

For example ChatGPT proposes to create a pipe, write to it ssize_t n = write(pipe_fd, addr, len); and then read back safely from the pipe. write will convert converts faults or unmapped memory to an error return value. It's a bit hacky, but might work as well?

I don't think that would help. Yes, you could tell whether the memory was mapped at the point of the write call, but between then and the time of your actual read it could have become unmapped

@yannham

yannham commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

I don't think that would help. Yes, you could tell whether the memory was mapped at the point of the write call, but between then and the time of your actual read it could have become unmapped

I would expect that if you write to a file pipe, the memory is copied somewhere, or double mapped, or whatever. In any case, however the kernel handles this, the important point is that neither write nor read are APIs that will SIGSEGV because your source buffer is invalid or becomes invalid during the operation. It will always return an error instead. Which is the property we're looking for, I believe: either those error values or the bracket check of monotonic_published_at will let us discard gibberish if something like that did happen.

In the memfd path, the header becomes immediately discoverable.
Therefore, we cannot do an non-atomic write to monotonic_published_at_ns
with value 0, as this could race with a reader that did an acquire read
on monotonic_published_at_ns.

This write is also unnecessary because the mapping comes
zero-initialized, and we'd be writing a zero (so it would also _likely_
be harmless anyway, even if violating the memory model).

Note that there can be no race in signature/version. These are only read
if the published_at is nonzero, and the release-acquire relationship
guarantees that their final non-zero values are visible if published_at
is nonzero.
@cataphract

Copy link
Copy Markdown
Contributor Author

I've pushed a commit that uses process_vm_readv to read the payload, among other changes. Please take a look

Comment on lines 5 to +6
//! context](https://github.com/open-telemetry/opentelemetry-specification/pull/4719)
//! specification.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//! context](https://github.com/open-telemetry/opentelemetry-specification/pull/4719)
//! specification.
//! context specification](https://github.com/open-telemetry/opentelemetry-specification/pull/4719).

//! seq1 = seq.load(m_o_relaxed);
//! } while (seq0 & 1 || seq0 != seq1);
//! ...
//! }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a bit too low-level for a module-level documentation. I would remove the C-like code from here, at best give a short description, or just mention seqlock.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, the most useful thing about documentation like this is the intuition about why the algorithm is correct / what invariants it upholds. The implementation (even in pseudocode) should fall out of the invariants, not the other way around.

//! instead of even numbers.
//! We also forbid concurrent writers, and leave the reader retries to the
//! discretion of the caller.
//! We ignore the corner case where time returns 0.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep the part about reading from syscall here, especially given the discussion on this PR around reading from the same process.

/// Initial publication of the process context. Creates an appropriate memory mapping.
fn publish(payload: Vec<u8>) -> anyhow::Result<Self> {
fn publish(payload: Vec<u8>) -> io::Result<Self> {
let payload_size = payload

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: add a type annotation on payload_size to know when we read this line what conversion we're doing, and in which case it would fail.

fence(Ordering::SeqCst);
AtomicU64::from_ptr(addr_of_mut!((*header).monotonic_published_at_ns))
.store(published_at_ns, Ordering::Relaxed);
ptr::addr_of_mut!((*header).signature).write(*SIGNATURE);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we start by writing 0 to monotonic_published_at? Usually the memory is allocated zeroed, but I'm not sure we can rely on that in every environment possible. I don't really remember all the ins and outs of the naming part but we already use MAPPING_NAME in MemMapping::new(), so I wonder if in some cases the context could already be discoverable at this point.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, on the mmap side, MAP_ANONYMOUS has this guarantee:

   MAP_ANONYMOUS
          The mapping is not backed by any file; its contents are
          initialized to zero.

And on memfd_create + ftruncate, the ftruncate linux man page says:

 If the file previously was larger than this size, the extra data
 is lost.  If the file previously was shorter, it is extended, and
 the extended part reads as null bytes ('\0').

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough 👍 maybe adding a small comment saying that would be a nice plus for future code readers


// The returned size is guaranteed to be larger or equal to the size of `MappingHeader`.
fn mapping_size() -> usize {
const fn mapping_size() -> usize {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A previous version was using a non-const implementation (using libc to get the page size typically). If not especially needed I would avoid making the const guarantee. I trust rustc to be smart enough to inline and constant-propagate the same way anyway, if it's for performance reason.

/// Locates the OTEL_CTX mapping at construction. Call [`read`](Self::read) repeatedly to fetch
/// updated context data without re-parsing `/proc/self/maps`, as long as the process has not
/// forked. After a `fork()`, reads fail and a new reader must be constructed.
pub struct ProcessContextSelfReader {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the reader side go in a submodule? And even maybe the writer side as well? I feel like the use-case is different from the publisher, which is what SDK/tracers will ever need, and this source file starts to grow quite a bit.


let published_at = header.monotonic_published_at_ns.load(Ordering::Acquire);
if published_at == 0 {
return Err(io::Error::new(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are going to be the consumer of this API, but don't you want to be able to detect this error? This sounds like a non fatal one, as opposed to the fork issue, or other IO errors. One possibility would be to have a retry-loop version, or just return more structured errors.

Comment on lines +744 to +747
/// # Safety
///
/// This function may only be called if it can be guaranteed that there are no in-process
/// readers, or at least that they will not be used after the call.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be mirrored in the FFI version's documentation.

Comment on lines +763 to +765
.store(0, Ordering::Relaxed);
}
fence(Ordering::Release);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds equivalent and a bit simpler? Note that unfortunately I don't think this guarantees anything like the payload will be dropped after the store. Stuff can move up before a release operation (fence/store). The latter just guarantees that whatever happens-before the release operation, will happens-before a matching acquire store that will load 0. It makes no guarantee about what happens after the release operation. You'd need acquire, but that doesn't exist for stores. Maybe an acquire-swap would do.

Suggested change
.store(0, Ordering::Relaxed);
}
fence(Ordering::Release);
.store(0, Ordering::Release);
}

@cataphract cataphract Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change here is a bit of a corner case, but what we want here is that these two stores are not reordered, that we store 0 before doing the drop. The atomic release prevents reordering with memory operations that are sequenced before it, not after. So for an atomic release to work, it would have to be on the drop, which is not possible (the whole thing is also of undefined anyway, because the drop is not doing any sort of atomic operation)

@yannham yannham Jul 2, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The atomic release prevents reordering with memory operations that are sequenced before it, not after.

Yes, and that is my point. Here the drop happens after the release store. So you don't get any guarantee regarding the respective order of the drop and the release store from a reader that acquire-load the value 0. As far as the C++20 memory model is concerned, the drop could look like it happens before the store from the reader POV. Release doesn't prevent that re-ordering.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the reader loads the value 0 it will not even attempt to load payload. The risk is that it doesn't read the value 0 and then attempt to read a dropped value. Note that this can still happen (it's a variation on the torn read case), it's just less likely with this change.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or are you saying that it's a "best-effort-although-technically-incorrect" approach, where you reason as if the drop was performing atomic operations somehow?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, yes, you're generally right that this doesn't provide any formal guarantee of correctness. For the synchronization to be complete, you'd need to read-acquire on the dropped value. This is more an attempt of having of the compiler generate a StoreStore barrier on the writer side so that it's less likely that another cpu sees value nonzero and then a dropped payload

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. If it's a safety net, what do you think about using a stronger SeqCst barrier? On ARM for example I think it's always compiled to a DMB ISH, while release fences might be compiled to something slightly weaker, such as DMB ISHLD; DMB ISHST. If it's a best-effort/out of the memory model thing and not performance sensitive... let's be as strong as we can ? 🤷

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was going to say the same

You'd need acquire, but that doesn't exist for stores

But SeqCst does

@danielsn danielsn left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a broader comment, this whole file seems large. I wonder if it would benefit from being split into modules

//! seq1 = seq.load(m_o_relaxed);
//! } while (seq0 & 1 || seq0 != seq1);
//! ...
//! }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, the most useful thing about documentation like this is the intuition about why the algorithm is correct / what invariants it upholds. The implementation (even in pseudocode) should fall out of the invariants, not the other way around.

//! instead of even numbers.
//! We also forbid concurrent writers, and leave the reader retries to the
//! discretion of the caller.
//! We ignore the corner case where time returns 0.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

//!
//! Although we instead use 0 to signal the writer is progress and a timestamp
//! instead of even numbers.
//! We also forbid concurrent writers, and leave the reader retries to the

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How?

//! seq.store(seq0 + 2, m_o_release);
//! }
//!
//! Although we instead use 0 to signal the writer is progress and a timestamp

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of magic constants, can we use defined enums?

//! }
//!
//! Although we instead use 0 to signal the writer is progress and a timestamp
//! instead of even numbers.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did the original algorithm constrain to even numbers? Do timestamps uphold the same invariant?

Comment on lines +763 to +765
.store(0, Ordering::Relaxed);
}
fence(Ordering::Release);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was going to say the same

You'd need acquire, but that doesn't exist for stores

But SeqCst does

///
/// # Safety
///
/// This function may only be called if it can be guaranteed that there are no in-process

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to check this, even if just on debug builds?

.collect()
}

fn find_attr<'a>(attrs: &'a [KeyValue], key: &str) -> Option<&'a AnyValue> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there few enough attrs that a linear scan is fine?

fn is_named_otel_mapping(line: &str) -> bool {
let trimmed = line.trim_end();

// The name of the mapping is the 6th column. The separator changes (both ' ' and '\t')

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this format documented anywhere, or is this reverse engineering?


// pairs with the first release fence on update() to ensure that, if we read data
// updated after the initial published time, we at least see the published
// time being set to 0 in the next load of the published time (or we could

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it allowed to go back from a published time to 0?

monotonic_published_at_ns being zero is reserved to indicate the context is being mutated and is not yet ready for reading.

@morrisonlevi morrisonlevi requested a review from a team as a code owner July 3, 2026 00:54
@morrisonlevi morrisonlevi force-pushed the levi/otel-thread-context branch from 3379921 to 99e3aaa Compare July 3, 2026 00:55

@yannham yannham left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't feel comfortable reviewing a PR that touches to many moving parts of the process context. We somehow arrived to an agreement on the initial version of the reader and minor publish changes. Would that be possible to revert to that state, address the last review comment one way or another, and keep other changes as separate PRs with proper motivation?

@ivoanjo ivoanjo left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll +1 on splitting this PR. Also, if we're making the reader directly read from the output of the writer, we can move it to a separate file as at that point they're very independent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants