Skip to content

Add main_thread_forkserver backend#463

Open
goodboy wants to merge 53 commits into
mainfrom
mtf_backend
Open

Add main_thread_forkserver backend#463
goodboy wants to merge 53 commits into
mainfrom
mtf_backend

Conversation

@goodboy

@goodboy goodboy commented Jun 22, 2026

Copy link
Copy Markdown
Owner

Add main_thread_forkserver spawn backend

Motivation

tractor wants a fork-without-exec spawn path: forking a subactor
inherits the parent's warm memory image (no re-import, no exec
cold-start), which is the groundwork for a future
subinterpreter-isolated child runtime. The trouble is os.fork() is
hostile on two fronts.

Two empirical CPython properties drive the whole design:

  • os.fork() from a non-main subinterpreter is refused by
    CPython — PyOS_AfterFork_Child() aborts the child with Fatal Python error: not main interpreter.
  • os.fork() from a regular threading.Thread attached to the
    main interpreter — one that has never entered a subint or
    trio — works cleanly (validated across four scenarios on py3.14).

This branch operationalizes the second fact as a first-class tractor
spawn backend: spawn a clean worker thread, fork in it, hand the
child pid back to the trio task, and wrap it in a
trio.Process-shaped shim so the existing soft_kill/hard-reap
machinery keeps working unchanged.

It supersedes #447 (the original subint_forkserver
attempt), rebuilt on the factored stack with an honest rename: the
working impl forks from a main-interp worker thread — there is no
subinterpreter anywhere in parent or child — so it's now variant-1
main_thread_forkserver
, while the aspirational
subint-isolated-child design is demoted to a reserved variant-2
subint_forkserver
stub. The one-module impl is deceptively small;
the bulk of the series is the supporting subint scaffold plus the
cancel-cascade / orphan-SIGINT hang diagnosis that only fork-based
teardown surfaces.

Src of research

Summary of changes

  • Land variant-1 main_thread_forkserver plus the generic
    fork-from-main-interp-worker-thread primitives it's built on
    (spawn._main_thread_forkserver): fork_from_worker_thread(), the
    _ForkedProc trio.Process shim (pidfd-cancellable .wait(),
    .terminate(), .kill()), and main_thread_forkserver_proc()
    which reuses _spawn's soft_kill/hard-reap unchanged.
  • Add the min-viable subint spawn backend (spawn._subint,
    py3.14+): _interpreters-based subactor exec driven on a dedicated
    OS thread, a destroy-race fix via that driver thread, and
    hard-kill-timeout-bounded teardown shields.
  • Reserve variant-2 subint_forkserver as a placeholder stub
    (spawn._subint_forkserver): the legacy key aliases to variant-1's
    spawn-coro for now; the real subint-isolated-child runtime is gated
    on msgspec PEP-684 support.
  • Add the subint_fork (variant-2b) CPython-blocked workaround
    smoketest documenting the non-main-interp fork abort.
  • Refactor spawn._spawn backend dispatch and wire
    runtime._runtime.Actor._from_parent() so both forkserver keys
    route through the SpawnSpec IPC handshake.
  • Add a tests.spawn subpkg — test_main_thread_forkserver (incl.
    an xfail(strict=True) orphan-SIGINT DRAFT) and
    test_subint_cancellation audits.
  • Add main_thread_forkserver CI matrix rows with a matrix-driven
    --capture= mode (sys for fork backends to dodge the fork-child
    × capture-fd deadlock).
  • Document the supporting diagnosis as ai/conc-anal/ post-mortems
    (SIGINT starvation, cancel-cascade too-slow, capture-pipe leak,
    PEP-684 thread constraints) plus ai/prompt-io/ session logs.

Scopes changed

  • tractor.spawn._main_thread_forkserver
    • new; variant-1 impl + fork primitives, extracted/renamed out of
      the original _subint_forkserver.
  • tractor.spawn._subint_forkserver
    • shrunk to a variant-2 placeholder stub.
  • tractor.spawn._subint, tractor.spawn._subint_fork
    • new backend scaffolds (subint exec + CPython-blocked fork
      smoketest).
  • tractor.spawn._spawn
    • backend dispatch + new SpawnMethodKey entries.
  • tests.spawn
    • new subpkg; test file renamed test_subint_forkserver ->
      test_main_thread_forkserver.

TODOs before landing

Future follow up

  • Migrate the worker-thread primitives from a raw
    threading.Thread to trio.to_thread.run_sync() once msgspec
    PEP-684 isolated subints dissolve the GIL-starvation /
    tstate-recycling hazards — see #450. The dedicated worker
    thread itself stays (MTF's fork-from-main-tstate invariant is
    permanent); only its trio integration changes, and the destroy-race
    hazard needs empirical re-test.
  • (#444) Per-backend spawn-module split — landed in Spawner modules: split up subactor spawning backends #444;
    _subint / _*_forkserver extend that scaffold, and this PR lands
    #445's Phase B (_subint) + Phase C (harness).
  • Resolve the capture-fd cancel-cascade hang — see #449.
  • Resolve the mtf cancel-cascade >9s contention hang — see #451.
  • Implement variant-2 subint_forkserver (subint-isolated child
    runtime) once msgspec#1026 / PEP-684 isolated
    subints land. Currently a stub in spawn._subint_forkserver.
  • Surface subint-bootstrap excs to the parent task —
    spawn._subint ?TODO (~L256): hard ImportError/syntax failures
    only hit the thread excepthook today.
  • Enrich SpawnSpec bootstrap via
    _interpreters.set___main___attrs() for non-repr-roundtrippable
    values — spawn._subint ?TODO (~L205).
  • Wire the trio-native child-SIGINT mode (child_sigint='trio')
    to close the orphan-SIGINT gap — scaffolded behind the
    xfail(strict=True) DRAFT in
    tests.spawn.test_main_thread_forkserver.
  • (non-gating) Guard os.pidfd_open() for non-Linux in
    _ForkedProc (spawn._main_thread_forkserver ~L698) — it's
    Linux-only, so the macOS CI row AttributeErrors. The
    no_pidfd_open branch already carries the fix (hasattr guard +
    waitpid fallback). Not a merge blocker — macOS support isn't
    required while Add main_thread_forkserver backend #463 is pre-production.

Links

  • #379 — subint experiment tracker (this is its fork() payoff)
  • #447 — original subint_forkserver PR, superseded by this
  • #444 — spawner-module split this branch builds on (merged)
  • #445 — post-Spawner modules: split up subactor spawning backends #444 subint tracker; Phase B/C land here
  • #449 — capture-fd cancel-cascade hang
  • #450 — audit thread constraints once PEP-684 lands
  • #451 — mtf cancel-cascade >9s under contention
  • PEP 684 — per-interpreter GIL
  • msgspec#1026 — msgspec PEP-684 support (variant-2 gate)
  • trio#1614os.fork() × trio post-fork hazards

(this pr content was generated in some part by claude-code)

@goodboy goodboy changed the base branch from main to custom_log_levels_api June 22, 2026 18:42
@goodboy goodboy changed the title Mtf backend Add main_thread_forkserver backend Jun 22, 2026
@goodboy goodboy force-pushed the mtf_backend branch 3 times, most recently from fbaf933 to 9584bc7 Compare June 22, 2026 18:55
@goodboy goodboy force-pushed the custom_log_levels_api branch from 92bdc97 to 7387bde Compare June 22, 2026 19:19
@goodboy goodboy requested a review from Copilot June 22, 2026 19:55

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds new spawn backends to tractor’s spawning subsystem, centered on a fork-without-exec path (main_thread_forkserver) and a Python 3.14+ subinterpreter-based backend (subint), plus supporting tests and CI matrix adjustments to exercise fork-based backends safely.

Changes:

  • Add main_thread_forkserver spawn backend (fork from a clean main-interpreter worker thread) and supporting fork primitives/shims.
  • Add subint spawn backend (legacy-config subinterpreters) and reserve subint_forkserver / subint_fork as explicit placeholders/stubs.
  • Add tests/spawn coverage and update CI to use --capture=sys for fork-based backends to avoid known fork × --capture=fd deadlocks.

Reviewed changes

Copilot reviewed 25 out of 26 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tractor/spawn/_main_thread_forkserver.py Implements the new forkserver backend and fork/wait shims.
tractor/spawn/_subint.py Implements the subint backend using _interpreters (py3.14+ gate).
tractor/spawn/_subint_forkserver.py Variant-2 placeholder: reserved key that currently raises NotImplementedError.
tractor/spawn/_subint_fork.py Explicit stub for the CPython-blocked fork-from-subinterpreter attempt.
tractor/spawn/_spawn.py Registers new spawn-method keys and adds wait_for_peer_or_proc_death() helper.
tractor/runtime/_runtime.py Routes forkserver keys through the SpawnSpec handshake path.
tests/spawn/test_subint_cancellation.py Adds cancellation/teardown audit tests for the subint backend.
tests/spawn/test_main_thread_forkserver.py Adds integration tests for the new forkserver backend and stubs.
.github/workflows/ci.yml Adds matrix rows + capture-mode switching for fork-based backends.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tractor/spawn/_subint.py
Comment on lines +297 to +301
driver_thread.start()

try:
event, chan = await ipc_server.wait_for_peer(uid)
except trio.Cancelled:
Comment on lines +22 to +27
Spawn-method key: `'main_thread_forkserver'`. The legacy
`'subint_forkserver'` key currently aliases here too — see
`tractor.spawn._subint_forkserver` for the future variant-2
(subint-isolated-child runtime, gated on
[jcrist/msgspec#1026](https://github.com/jcrist/msgspec/issues/1026))
that key is reserved for.
Comment thread tractor/spawn/_spawn.py
Comment on lines +224 to +232
# `concurrent.interpreters` wrapper (PEP 734). See
# `tractor.spawn._subint` for the detailed
# reasoning. `subint_fork` is blocked at the
# CPython level (raises `NotImplementedError`);
# `main_thread_forkserver` is the working
# variant-1 backend; `subint_forkserver` aliases
# to it today, reserved for the future variant-2
# subint-isolated-child runtime once upstream
# msgspec#1026 unblocks.
Comment on lines +174 to +177
@pytest.mark.timeout(
3, # NOTE never passes pre-3.14+ subints support.
method='thread',
)
Comment on lines +572 to +575
# step 4+5: SIGINT the orphan, poll for exit.
os.kill(child_pid, signal.SIGINT)
timeout: float = 6.0
cleanup_deadline: float = time.monotonic() + timeout
@goodboy goodboy force-pushed the custom_log_levels_api branch from 7387bde to ee79939 Compare June 22, 2026 21:15
@goodboy goodboy force-pushed the custom_log_levels_api branch from ee79939 to 3b022d4 Compare June 23, 2026 22:53
@goodboy goodboy changed the base branch from custom_log_levels_api to trio_033_upgrade June 24, 2026 14:02
@goodboy goodboy force-pushed the trio_033_upgrade branch from e86b5d4 to beca584 Compare June 24, 2026 18:51
@goodboy goodboy force-pushed the trio_033_upgrade branch from beca584 to 8142fec Compare June 24, 2026 20:21
@goodboy goodboy force-pushed the trio_033_upgrade branch from 8142fec to 7f6782b Compare June 24, 2026 22:19
Base automatically changed from trio_033_upgrade to main June 25, 2026 03:30
goodboy added a commit that referenced this pull request Jun 28, 2026
`run_in_actor()` is slated to become a hilevel wrapper
(`runtime/_supervise.py` "TODO: DEPRECATE THIS"), so the showcase
`we_are_processes.py` shouldn't lead with it. Move it to the modern
API: each `worker_<i>` subactor runs a `@tractor.context`
`endpoint()` that `started()`-hands its name + pid back over
`Portal.open_context()` and parks; the root sleeps then raises on
purpose so the runtime reaps the whole tree (zero zombies).

The subs spawn concurrently from bg `trio.Task`s so each child's
cold `import tractor` (~0.4s, see #470) overlaps instead of
stacking; a comment flags the coming `main_thread_forkserver`
backend (#463) which'll make serial spawns cheap enough to just
loop.

Match the landing prose to the snippet — name the `Context` +
`started()` handshake it now leads with.

Also, document `--watch examples` on the `sphinx-autobuild` cmds so
edits to `literalinclude`-d example scripts (which live outside
`docs/`) live-reload too.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
@goodboy goodboy marked this pull request as ready for review June 28, 2026 02:23
goodboy added a commit that referenced this pull request Jun 28, 2026
`run_in_actor()` is slated to become a hilevel wrapper
(`runtime/_supervise.py` "TODO: DEPRECATE THIS"), so the showcase
`we_are_processes.py` shouldn't lead with it. Move it to the modern
API: each `worker_<i>` subactor runs a `@tractor.context`
`endpoint()` that `started()`-hands its name + pid back over
`Portal.open_context()` and parks; the root sleeps then raises on
purpose so the runtime reaps the whole tree (zero zombies).

The subs spawn concurrently from bg `trio.Task`s so each child's
cold `import tractor` (~0.4s, see #470) overlaps instead of
stacking; a comment flags the coming `main_thread_forkserver`
backend (#463) which'll make serial spawns cheap enough to just
loop.

Match the landing prose to the snippet — name the `Context` +
`started()` handshake it now leads with.

Also, document `--watch examples` on the `sphinx-autobuild` cmds so
edits to `literalinclude`-d example scripts (which live outside
`docs/`) live-reload too.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
goodboy added 28 commits June 29, 2026 19:20
Follow-up to 72d1b90 (was prev commit adding `debug_mode` for
`subint_forkserver`): that commit wired the runtime-side
`subint_forkserver` SpawnSpec-recv gate in `Actor._from_parent`, but the
`subint_forkserver_proc` child-target was still passing
`spawn_method='trio'` to `_trio_main` — so `Actor.pformat()` / log lines
would report the subactor as plain `'trio'` instead of the actual
parent-side spawn mechanism. Flip the label to `'subint_forkserver'`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit e31eb8d)
Also, some slight touchups in `.spawn._subint`.

(cherry picked from commit 1e357dc)
New `ai/conc-anal/
subint_forkserver_test_cancellation_leak_issue.md`
captures a descendant-leak surfaced while wiring
`subint_forkserver` into the full test matrix:
running `tests/test_cancellation.py` under
`--spawn-backend=subint_forkserver` reproducibly
leaks **exactly 5** `subint-forkserv` comm-named
child processes that survive session exit, each
holding a `LISTEN` on `:1616` (the tractor default
registry addr) — and therefore poisons every
subsequent test session that defaults to that addr.

Deats,
- TL;DR + ruled-out checks confirming the procs are
  ours (not piker / other tractor-embedding apps) —
  `/proc/$pid/cmdline` + cwd both resolve to this
  repo's `py314/` venv
- root cause: `_ForkedProc.kill()` is PID-scoped
  (plain `os.kill(SIGKILL)` to the direct child),
  not tree-scoped — grandchildren spawned during a
  multi-level cancel test get reparented to init and
  inherit the registry listen socket
- proposed fix directions ranked: (1) put each
  forkserver-spawned subactor in its own process-
  group (`os.setpgrp()` in fork-child) + tree-kill
  via `os.killpg(pgid, SIGKILL)` on teardown,
  (2) `PR_SET_CHILD_SUBREAPER` on root, (3) explicit
  `/proc/<pid>/task/*/children` walk. Vote: (1) —
  POSIX-standard, aligns w/ `start_new_session=True`
  semantics in `subprocess.Popen` / trio's
  `open_process`
- inline reproducer + cleanup recipe scoped to
  `$(pwd)/py314/bin/python.*pytest.*spawn-backend=
  subint_forkserver` so cleanup doesn't false-flag
  unrelated tractor procs (consistent w/
  `run-tests` skill's zombie-check guidance)

Stopgap hygiene fix (wiring `reg_addr` through the 5
leaky tests in `test_cancellation.py`) is incoming as
a follow-up — that one stops the blast radius, but
zombies still accumulate per-run until the real
tree-kill fix lands.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit e3f4f5a)
Major rewrite of
`subint_forkserver_test_cancellation_leak_issue.md`
after empirical investigation revealed the earlier
"descendant-leak + missing tree-kill" diagnosis
conflated two unrelated symptoms:

1. **5-zombie leak holding `:1616`** — turned out to
   be a self-inflicted cleanup bug: `pkill`-ing a bg
   pytest task (SIGTERM/SIGKILL, no SIGINT) skipped
   the SC graceful cancel cascade entirely. Codified
   the real fix — SIGINT-first ladder w/ bounded
   wait before SIGKILL — in e5e2afb (`run-tests`
   SKILL) and
   `feedback_sc_graceful_cancel_first.md`.
2. **`test_nested_multierrors[subint_forkserver]`
   hangs indefinitely** — the actual backend bug,
   and it's a deadlock not a leak.

Deats,
- new diagnosis: all 5 procs are kernel-`S` in
  `do_epoll_wait`; pytest-main's trio-cache workers
  are in `os.waitpid` waiting for children that are
  themselves waiting on IPC that never arrives —
  graceful `Portal.cancel_actor` cascade never
  reaches its targets
- tree-structure evidence: asymmetric depth across
  two identical `run_in_actor` calls — child 1
  (3 threads) spawns both its grandchildren; child 2
  (1 thread) never completes its first nursery
  `run_in_actor`. Smells like a race on fork-
  inherited state landing differently per spawn
  ordering
- new hypothesis: `os.fork()` from a subactor
  inherits the ROOT parent's IPC listener FDs
  transitively. Grandchildren end up with three
  overlapping FD sets (own + direct-parent + root),
  so IPC routing becomes ambiguous. Predicts bug
  scales with fork depth — matches reality: single-
  level spawn works, multi-level hangs
- ruled out: `_ForkedProc.kill()` tree-kill (never
  reaches hard-kill path), `:1616` contention (fixed
  by `reg_addr` fixture wiring), GIL starvation
  (each subactor has its own OS process+GIL),
  child-side KBI absorption (`_trio_main` only
  catches KBI at `trio.run()` callsite, reached
  only on trio-loop exit)
- four fix directions ranked: (1) blanket post-fork
  `closerange()`, (2) `FD_CLOEXEC` + audit,
  (3) targeted FD cleanup via `actor.ipc_server`
  handle, (4) `os.posix_spawn` w/ `file_actions`.
  Vote: (3) — surgical, doesn't break the "no exec"
  design of `subint_forkserver`
- standalone repro added (`spawn_and_error(breadth=
  2, depth=1)` under `trio.fail_after(20)`)
- stopgap: skip `test_nested_multierrors` + multi-
  level-spawn tests under the backend via
  `@pytest.mark.skipon_spawn_backend(...)` until
  fix lands

Killing the "tree-kill descendants" fix-direction
section: it addressed a bug that didn't exist.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 35da808)
Implements fix-direction (1)/blunt-close-all-FDs from
b71705b (`subint_forkserver` nested-cancel hang
diag), targeting the multi-level cancel-cascade
deadlock in
`test_nested_multierrors[subint_forkserver]`.

The diagnosis doc voted for surgical FD cleanup via
`actor.ipc_server` handle as the cleanest approach,
but going blunt is actually the right call: after
`os.fork()`, the child immediately enters
`_actor_child_main()` which opens its OWN IPC
sockets / wakeup-fd / epoll-fd / etc. — none of the
parent's FDs are needed. Closing everything except
stdio is safe AND defends against future
listener/IPC additions to the parent inheriting
silently into children.

Deats,
- new `_close_inherited_fds(keep={0,1,2}) -> int`
  helper. Linux fast-path enumerates `/proc/self/fd`;
  POSIX fallback uses `RLIMIT_NOFILE` range. Matches
  the stdlib `subprocess._posixsubprocess.close_fds`
  strategy. Returns close-count for sanity logging
- wire into `fork_from_worker_thread._worker()`'s
  post-fork child prelude — runs immediately after
  the pid-pipe `os.close(rfd/wfd)`, before the user
  `child_target` callable executes
- docstring cross-refs the diagnosis doc + spells
  out the FD-inheritance-cascade mechanism and why
  the close-all approach is safe for our spawn shape

Validation pending: re-run `test_nested_multierrors[subint_forkserver]`
to confirm the deadlock is gone.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 9993db0)
Two coordinated improvements to the `subint_forkserver` backend:

1. Replace `trio.to_thread.run_sync(os.waitpid, ...,
   abandon_on_cancel=False)` in `_ForkedProc.wait()`
   with `trio.lowlevel.wait_readable(pidfd)`. The
   prior version blocked a trio cache thread on a
   sync syscall — outer cancel scopes couldn't
   unwedge it when something downstream got stuck.
   Same pattern `trio.Process.wait()` and
   `proc_waiter` (the mp backend) already use.

2. Drop the `@pytest.mark.xfail(strict=True)` from
   `test_orphaned_subactor_sigint_cleanup_DRAFT` —
   the test now PASSES after 0cd0b63 (fork-child
   FD scrub). Same root cause as the nested-cancel
   hang: inherited IPC/trio FDs were poisoning the
   child's event loop. Closing them lets SIGINT
   propagation work as designed.

Deats,
- `_ForkedProc.__init__` opens a pidfd via
  `os.pidfd_open(pid)` (Linux 5.3+, Python 3.9+)
- `wait()` parks on `trio.lowlevel.wait_readable()`,
  then non-blocking `waitpid(WNOHANG)` to collect
  the exit status (correct since the pidfd signal
  IS the child-exit notification)
- `ChildProcessError` swallow handles the rare race
  where someone else reaps first
- pidfd closed after `wait()` completes (one-shot
  semantics) + `__del__` belt-and-braces for
  unexpected-teardown paths
- test docstring's `@xfail` block replaced with a
  `# NOTE` comment explaining the historical
  context + cross-ref to the conc-anal doc; test
  remains in place as a regression guard

The two changes are interdependent — the
cancellable `wait()` matters for the same nested-
cancel scenarios the FD scrub fixes, since the
original deadlock had trio cache workers wedged in
`os.waitpid` swallowing the outer cancel.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit c20b05e)
Two new sections in
`subint_forkserver_test_cancellation_leak_issue.md`
documenting continued investigation of the
`test_nested_multierrors[subint_forkserver]` peer-
channel-loop hang:

1. **"Attempted fix (DID NOT work) — hypothesis
   (3)"**: tried sync-closing peer channels' raw
   socket fds from `_serve_ipc_eps`'s finally block
   (iterate `server._peers`, `_chan._transport.
   stream.socket.close()`). Theory was that sync
   close would propagate as `EBADF` /
   `ClosedResourceError` into the stuck
   `recv_some()` and unblock it. Result: identical
   hang. Either trio holds an internal fd
   reference that survives external close, or the
   stuck recv isn't even the root blocker. Either
   way: ruled out, experiment reverted, skip-mark
   restored.
2. **"Aside: `-s` flag changes behavior for peer-
   intensive tests"**: noticed
   `test_context_stream_semantics.py` under
   `subint_forkserver` hangs with default
   `--capture=fd` but passes with `-s`
   (`--capture=no`). Working hypothesis: subactors
   inherit pytest's capture pipe (fds 1,2 — which
   `_close_inherited_fds` deliberately preserves);
   verbose subactor logging fills the buffer,
   writes block, deadlock. Fix direction (if
   confirmed): redirect subactor stdout/stderr to
   `/dev/null` or a file in `_actor_child_main`.
   Not a blocker on the main investigation;
   deserves its own mini-tracker.

Both sections are diagnosis-only — no code changes
in this commit.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 7cd47ef)
Three places that previously swallowed exceptions silently now log via
`log.exception()` so they surface in the runtime log when something
weird happens — easier to track down sneaky failures in the
fork-from-worker-thread / subint-bootstrap primitives.

Deats,
- `_close_inherited_fds()`: post-fork child's per-fd `os.close()`
  swallow now logs the fd that failed to close. The comment notes the
  expected failure modes (already-closed-via-listdir-race,
  otherwise-unclosable) — both still fine to ignore semantically, but
  worth flagging in the log.
- `fork_from_worker_thread()` parent-side timeout branch: the
  `os.close(rfd)` + `os.close(wfd)` cleanup now logs each pipe-fd close
  failure separately before raising the `worker thread didn't return`
  RuntimeError.
- `run_subint_in_worker_thread._drive()`: when
  `_interpreters.exec(interp_id, bootstrap)` raises a `BaseException`,
  log the full call signature (interp_id + bootstrap) along with the
  captured exception, before stashing into `err` for the outer caller.

Behavior unchanged — only adds observability.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 458a35c)
Third diagnostic pass on
`test_nested_multierrors[subint_forkserver]` hang.
Two prior hypotheses ruled out + a new, more
specific deadlock shape identified.

Ruled out,
- **capture-pipe fill** (`-s` flag changes test):
  retested explicitly — `test_nested_multierrors`
  hangs identically with and without `-s`. The
  earlier observation was likely a competing
  pytest process I had running in another session
  holding registry state
- **stuck peer-chan recv that cancel can't
  break**: pivot from the prior pass. With
  `handle_stream_from_peer` instrumented at ENTER
  / `except trio.Cancelled:` / finally: 40
  ENTERs, ZERO `trio.Cancelled` hits. Cancel never
  reaches those tasks at all — the recvs are
  fine, nothing is telling them to stop

Actual deadlock shape: multi-level mutual wait.

    root              blocks on spawner.wait()
      spawner         blocks on grandchild.wait()
        grandchild    blocks on errorer.wait()
          errorer     Actor.cancel() ran, but proc
                      never exits

`Actor.cancel()` fired in 12 PIDs — but NOT in
root + 2 direct spawners. Those 3 have peer
handlers stuck because their own `Actor.cancel()`
never runs, which only runs when the enclosing
`tractor.open_nursery()` exits, which waits on
`_ForkedProc.wait()` for the child pidfd to
signal, which only signals when the child
process fully exits.

Refined question: **why does an errorer process
not exit after its `Actor.cancel()` completes?**
Three hypotheses (unverified):
1. `_parent_chan_cs.cancel()` fires but the
   shielded loop's recv is stuck in a way cancel
   still can't break
2. `async_main`'s post-cancel unwind has other
   tasks in `root_tn` awaiting something that
   never arrives (e.g. outbound IPC reply)
3. `os._exit(rc)` in `_worker` never runs because
   `_child_target` never returns

Next-session probes (priority order):
1. instrument `_worker`'s fork-child branch —
   confirm whether `child_target()` returns /
   `os._exit(rc)` is reached for errorer PIDs
2. instrument `async_main`'s final unwind — see
   which await in teardown doesn't complete
3. compare under `trio_proc` backend at the
   equivalent level to spot divergence

No code changes — diagnosis-only.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit ab86f76)
Fourth diagnostic pass — instrument `_worker`'s
fork-child branch (`pre child_target()` / `child_
target RETURNED rc=N` / `about to os._exit(rc)`)
and `_trio_main` boundaries (`about to trio.run` /
`trio.run RETURNED NORMALLY` / `FINALLY`). Test
config: depth=1/breadth=2 = 1 root + 14 forked =
15 actors total.

Fresh-run results,
- **9 processes complete the full flow**:
  `trio.run RETURNED NORMALLY` → `child_target
  RETURNED rc=0` → `os._exit(0)`. These are tree
  LEAVES (errorers) plus their direct parents
  (depth-0 spawners) — they actually exit
- **5 processes stuck INSIDE `trio.run(trio_
  main)`**: hit "about to trio.run" but never
  see "trio.run RETURNED NORMALLY". These are
  root + top-level spawners + one intermediate

The deadlock is in `async_main` itself, NOT the
peer-channel loops. Specifically, the outer
`async with root_tn:` in `async_main` never exits
for the 5 stuck actors, so the cascade wedges:

    trio.run never returns
      → _trio_main finally never runs
        → _worker never reaches os._exit(rc)
          → process never dies
            → parent's _ForkedProc.wait() blocks
              → parent's nursery hangs
                → parent's async_main hangs
                  → (recurse up)

The precise new question: **what task in the 5
stuck actors' `async_main` never completes?**
Candidates:
1. shielded parent-chan `process_messages` task
   in `root_tn` — but we cancel it via
   `_parent_chan_cs.cancel()` in `Actor.cancel()`,
   which only runs during
   `open_root_actor.__aexit__`, which itself runs
   only after `async_main`'s outer unwind — which
   doesn't happen. So the shield isn't broken in
   this path.
2. `actor_nursery._join_procs.wait()` or similar
   inline in the backend `*_proc` flow.
3. `_ForkedProc.wait()` on a grandchild that DID
   exit — but pidfd_open watch didn't fire (race
   between `pidfd_open` and the child exiting?).

Most specific next probe: add DIAG around
`_ForkedProc.wait()` enter/exit to see whether
pidfd-based wait returns for every grandchild
exit. If a stuck parent's `_ForkedProc.wait()`
never returns despite its child exiting → pidfd
mechanism has a race bug under nested forkserver.

Asymmetry observed in the cascade tree: some d=0
spawners exit cleanly, others stick, even though
they started identically. Not purely depth-
determined — some race condition in nursery
teardown when multiple siblings error
simultaneously.

No code changes — diagnosis-only.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 4d05554)
Sixth and final diagnostic pass — after all 4
cascade fixes landed (FD hygiene, pidfd wait,
`_parent_chan_cs` wiring, bounded peer-clear), the
actual last gate on
`test_nested_multierrors[subint_forkserver]`
turned out to be **pytest's default
`--capture=fd` stdout/stderr capture**, not
anything in the runtime cascade.

Empirical result: `pytest -s` → test PASSES in
6.20s. Default `--capture=fd` → hangs forever.

Mechanism: pytest replaces the parent's fds 1,2
with pipe write-ends it reads from. Fork children
inherit those pipes (since `_close_inherited_fds`
correctly preserves stdio). The error-propagation
cascade in a multi-level cancel test generates
7+ actors each logging multiple `RemoteActorError`
/ `ExceptionGroup` tracebacks — enough output to
fill Linux's 64KB pipe buffer. Writes block,
subactors can't progress, processes don't exit,
`_ForkedProc.wait` hangs.

Self-critical aside: I earlier tested w/ and w/o
`-s` and both hung, concluding "capture-pipe
ruled out". That was wrong — at that time fixes
1-4 weren't all in place, so the test was
failing at deeper levels long before reaching
the "produce lots of output" phase. Once the
cascade could actually tear down cleanly, enough
output flowed to hit the pipe limit. Order-of-
operations mistake: ruling something out based
on a test that was failing for a different
reason.

Deats,
- `subint_forkserver_test_cancellation_leak_issue
  .md`: new section "Update — VERY late: pytest
  capture pipe IS the final gate" w/ DIAG timeline
  showing `trio.run` fully returns, diagnosis of
  pipe-fill mechanism, retrospective on the
  earlier wrong ruling-out, and fix direction
  (redirect subactor stdout/stderr to `/dev/null`
  in fork-child prelude, conditional on
  pytest-detection or opt-in flag)
- `tests/test_cancellation.py`: skip-mark reason
  rewritten to describe the capture-pipe gate
  specifically; cross-refs the new doc section
- `tests/spawn/test_subint_forkserver.py`: the
  orphan-SIGINT test regresses back to xfail.
  Previously passed after the FD-hygiene fix,
  but the new `wait_for_no_more_peers(
  move_on_after=3.0)` bound in `async_main`'s
  teardown added up to 3s latency, pushing
  orphan-subactor exit past the test's 10s poll
  window. Real fix: faster orphan-side teardown
  OR extend poll window to 15s

No runtime code changes in this commit — just
test-mark adjustments + doc wrap-up.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit eceed29)
(MTF-only portion: kept ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md tests/spawn/test_subint_forkserver.py)
Re-classify `test_orphaned_subactor_sigint_cleanup_DRAFT` from
flakey-env-sensitive (`strict=False` w/ "passes in isolation, flakey in
full suite") to a hard known-gap (`strict=True`) with the orphan-SIGINT
hang as the documented cause. The previous framing ("env pollution") let
the test silently pass when ordering happened to favor it; the new
framing forces an XPASS-as-FAIL the moment the underlying gap is
actually closed, so we can drop the mark intentionally instead of
accidentally.

Reason text + leading `# Known-gap test —` comment both point at
`ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
for the full diagnosis.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 44bdb16)
Two diagnostic gaps in `tractor.spawn._subint.subint_proc()` that hid
otherwise-silent failures, plus tracking-issue links on the two open
`subint_forkserver` follow-ups.

Deats,
- bootstrap-exc visibility: wrap the call to
  `_interpreters.exec(interp_id, bootstrap)` with
  `try/except BaseException` + `log.exception(...)`.
  * Without it, an `ImportError` / `SyntaxError` raised inside the
    dedicated driver thread goes only to Python's default thread
    excepthook — invisible to the parent, which then waits forever on
    `subint_exited.wait()`.
  * `?TODO` notes `anyio`'s `to_interpreter._interp_call` +
    `(retval, is_exception)` pattern as the next step for re-raising;
    skipped now bc it must coordinate with the `trio.Cancelled` paths
    around the existing `.wait()` calls.

- cancel-leak disambiguation: when the driver thread doesn't exit within
  `_HARD_KILL_TIMEOUT`, also log `_interpreters.is_running(interp_id)`
  as `subint_still_running=...` so the operator can tell "thread leaked,
  subint already done" apart from "thread alive bc subint is wedged".
  * pattern borrowed from `trio-parallel`'s `_sint.SintWorker.is_alive()`.

- `?TODO` near the `bootstrap` literal: future switch to
  `_interpreters.set___main___attrs()` — same API `anyio`
  uses in `to_interpreter._Worker.call()` — for passing
  non-`repr()`-roundtrippable values (`SpawnSpec` struct, callables,
  etc).
  * add cross-refs tracking issue `#379`.

Also,
- `Tracked at: [#449]` link on
  `subint_forkserver_test_cancellation_leak_issue.md`.
- `Tracked at: [#450]` link on
  `subint_forkserver_thread_constraints_on_pep684_issue.md`.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 5456195)
Major expansion of the module docstring. Code is
unchanged; this lands the architectural reasoning that
was previously implicit, plus the POSIX/trio fork
mechanics the design relies on.

New sections:
- "Design rationale" — answers two implicit questions:
  (1) why a forkserver pattern at all (vs. forking
  directly from a trio task), (2) why in-process (vs.
  stdlib `mp.forkserver`'s sidecar process). Documents
  the three costs the in-process design avoids
  (sidecar lifecycle, per-spawn IPC, cold-start child)
  and the tradeoffs we accept in exchange (3.14-only,
  heavier than `to_thread.run_sync`).
- "Implementation status" — clarifies what's actually
  landed today vs. the envisioned arch: parent's
  `trio.run()` still lives on main interp (subint-
  hosted root gated on msgspec/msgspec#1026). Names
  why the "subint" prefix is correct anyway — same PR
  series as `_subint.py` / `_subint_fork.py`.
- "What survives the fork? — POSIX semantics" — POSIX
  preserves only the calling thread, so the
  `trio.run()` thread is gone in the child. Includes
  a small parent/child thread-survival table and
  covers the four artifact classes that DO cross the
  fork boundary (inherited fds, COW memory, Python
  thread state, user-level locks) and how each is
  handled.
- "FYI: how this dodges the `trio.run()` × `fork()`
  hazards" — itemizes each class of trio process-
  global state (wakeup-fd, `epoll`/`kqueue`,
  threadpool, cancel scopes / nurseries, `atexit`,
  foreign-language I/O) and explains how the
  forkserver-thread design avoids each.

Also,
- bump the gated msgspec issue link from
  `msgspec/msgspec#563` to `msgspec/msgspec#1026` (the
  PEP 684 isolated-mode tracker).

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 3ab99d5)
Adds a "Future arch — what subints would buy us" section to
the module docstring, complementing the prior commit's
current-state rationale. Code is unchanged.

Frames the `subint` prefix as family-naming today (no actual
subinterp is created yet), then lays out the three concrete
wins that land once msgspec/msgspec#1026 unblocks PEP 684
isolated-mode subints:

- Cheaper forks — moving the parent's `trio.run()` into a
  subint shrinks the main-interp COW image the child inherits.
  The main interp becomes the literal forkserver: an
  intentionally-empty execution ctx whose only job is to call
  `os.fork()` cleanly.

- True parallelism — per-interp GIL means the forkserver
  thread on main and the trio thread on subint actually run in
  parallel. Spawn latency stops stalling the trio loop.

- Multi-actor-per-process — the architectural payoff. With
  per-interp-GIL subints, one process can host main + N
  subint-resident actor `trio.run()`s, and `os.fork()` reverts
  to the last-resort spawn (only when OS-level isolation is
  actually needed). Joins the story with the in-thread
  `_subint.py` backend: `subint` → in-process spawn,
  `subint_forkserver` → cross-process when a real OS boundary
  is required.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 4b5176e)
Move the truly-generic main-interp-worker-thread fork primitives
(`fork_from_worker_thread`, `_close_inherited_fds`, `_ForkedProc`,
`wait_child`, `_format_child_exit`) out of `_subint_forkserver.py` into
a sibling `_main_thread_forkserver.py` module so the primitive layer is
honestly named — none of these helpers touch a subint, they just fork
from a main-interp worker thread.

`_subint_forkserver.py` keeps its public surface intact via re-export so
any existing `from tractor.spawn._subint_forkserver import ...` callsite
still resolves.

Net: zero behavior change, preps the way for the upcoming spawn-method
key split where `main_thread_forkserver` ships as the working backend
and `subint_forkserver` becomes reserved for the future
subint-isolated-child variant (gated on msgspec/msgspec#1026).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 99dade0)
The `subint_forkserver` name was always aspirational —
today's impl forks from a regular main-interp worker
thread and the child runs trio on its own main interp;
NO subinterp anywhere in parent or child. Splitting the
backend into two clearly-named variants drops the lie:

- **variant 1** — `main_thread_forkserver` (the working
  impl). New `SpawnMethodKey` literal + `_methods`
  dispatch entry + `_runtime.Actor._from_parent()`
  match-arm. The spawn-coro `subint_forkserver_proc`
  moves to `_main_thread_forkserver` and is renamed
  `main_thread_forkserver_proc()`.

- **variant 2** — `subint_forkserver` (future, reserved).
  Module shrinks to a placeholder describing the
  variant-2 design (subint-isolated child runtime, gated
  on msgspec/msgspec#1026 + PEP 684). Today the legacy
  `'subint_forkserver'` key aliases to
  `main_thread_forkserver_proc` so existing
  `--spawn-backend=subint_forkserver` invocations keep
  working; flipped to a `NotImplementedError` stub in a
  follow-up.

Deats,
- `Actor._from_parent()` spawn-method gate now accepts
  both `'main_thread_forkserver'` and
  `'subint_forkserver'` (both go through the
  IPC-`SpawnSpec` path).
- the variant-1 spawn-coro stamps its own `SpawnSpec` /
  log lines with `spawn_method='main_thread_forkserver'`
  so subactor renders reflect the actual mechanism.
- docstring reorg: trio×fork hazard breakdown, POSIX
  fork-survival semantics, in-process-vs-stdlib
  forkserver design notes, and the TODO/cleanup section
  all move from `_subint_forkserver` to
  `_main_thread_forkserver` (lives with the working
  code). `_subint_forkserver` keeps a tight forward-
  looking doc that motivates the reserved key.
- `run_subint_in_worker_thread()` stays in
  `_subint_forkserver` as the companion primitive — it's
  the subint counterpart to `fork_from_worker_thread()`
  and will plug into the future variant-2 spawn-coro.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 57dae0e)
Reduce `_subint_forkserver.py` to its variant-2 placeholder shape:

- Add `subint_forkserver_proc` async stub raising `NotImplementedError`
  with a redirect msg pointing at the working variant-1 backend
  (`main_thread_forkserver`), msgspec/msgspec#1026 (upstream PEP 684
  blocker), and #379 (subint umbrella).

- `tractor.spawn._spawn._methods['subint_forkserver']` now dispatches to
  the stub instead of aliasing the variant-1 coroutine
  — `--spawn-backend=subint_forkserver` errors cleanly.

- Drop now-dead module-scope: `ChildSigintMode`
  / `_DEFAULT_CHILD_SIGINT` defs, `_has_subints` try/except (replaced
  with import from `._subint`), unused imports (`partial`, `Literal`,
  `sys`, msgtypes/pretty_struct, `current_actor`,
  `cancel_on_completion`/`soft_kill`, `_server` TYPE_CHECKING).

- Backward-compat re-exports of fork primitives kept until the follow-up
  commit migrates external test imports.

- `tests/spawn/test_subint_forkserver.py::forkserver_spawn_method`
  fixture: flip hardcoded `'subint_forkserver'`
  → `'main_thread_forkserver'` so the test still exercises the working
  backend (full file rename comes in the test-import migration commit).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 5e83881)
Rename `tests/spawn/test_subint_forkserver.py` →
`test_main_thread_forkserver.py` and migrate its imports +
internal refs to the new canonical names:

- `fork_from_worker_thread`, `wait_child` → from
  `tractor.spawn._main_thread_forkserver`.
- `run_subint_in_worker_thread` → still from `_subint_forkserver`
  (variant-2 primitive).
- Module docstring + tier-3 fixture + the `*_spawn_basic` test fn
  renamed for variant-1-honesty.
- Orphan-harness subprocess argv flipped from `'subint_forkserver'`
  → `'main_thread_forkserver'`.

`ai/conc-anal/subint_fork_from_main_thread_smoketest.py` imports split
the same way.

`tractor/spawn/_subint_forkserver.py` drops the backward- compat
re-exports of the fork primitives — the only consumers (test file
+ smoketest) now import from `_main_thread_forkserver` directly.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 9f0709e)
Add `test_subint_forkserver_key_errors_cleanly` — a tn-tier
regression guard that pins down the variant-2 reservation
contract: the `'subint_forkserver'` key in
`_spawn._methods` MUST raise `NotImplementedError` today,
not silently dispatch to `main_thread_forkserver_proc`.

The transient alias-state existed briefly during the rename
(commit `57dae0e4`'s "Split forkserver backend into variant
1/2 mods" landed the alias; `5e83881f` flipped it to the
stub). Without a guard, a future refactor could easily
re-collapse the two keys back to a single coro and silently
break the variant-1 / variant-2 contract.

Also asserts the stub's error msg surfaces the two pointers
an operator hitting it actually needs:

- `'main_thread_forkserver'` — the working backend they
  prolly meant,
- `'msgspec#1026'` — the upstream blocker that has to land
  before variant-2 can ship.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit cbdf1eb)
Two cleanup tweaks in `_main_thread_forkserver`:

Doc, "what survives the fork?" section — expand the
"non-calling threads are gone in the child" claim with
the precise execution-vs-memory split that reconciles
this module's prior framing with trio's (canonical
[python-trio/trio#1614][trio-1614]) "leaked stacks"
framing:

- execution-side: only the calling thread runs
  post-fork; all others never execute another
  instruction.
- memory-side: those non-running threads' stacks +
  per-thread heap structures are still COW-inherited
  as orphaned bytes — what trio means by "leaked".

Same POSIX reality, opposite sides; the table is
extended to a 4-col `parent | child (executing) |
child (memory)` layout to make both views explicit.
Also blank-line-padded the bulleted hazard classes
for cleaner markdown rendering.

[trio-1614]: python-trio/trio#1614

Code, `_close_inherited_fds()` log noise — split the
catch-all `except OSError` into:

- `EBADF` — benign race where the dirfd that
  `os.listdir('/proc/self/fd')` itself opened ends up
  in `candidates`, then auto-closes before the loop
  reaches it. Demote to `log.debug()` + `continue`;
  prior `log.exception` drowned the post-fork log
  channel with stack traces every spawn.
- other errnos (EIO / EPERM / EINTR / ...) keep the
  loud `log.exception` surface — those ARE genuinely
  unexpected.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 8c73019)
`main_thread_forkserver` doesn't actually need py3.14
`concurrent.interpreters` (PEP 734) — it forks from a
non-trio worker thread and runs `_trio_main` in the child,
same shape as `trio_proc`. The previous `_has_subints`
gate + subint-family `case` arm were a copy-paste error.

In `tractor.spawn._main_thread_forkserver`,
- drop the `_has_subints` import + the `RuntimeError`
  raise in `main_thread_forkserver_proc()`.
- drop the now-unused `import sys` (only used by the
  prior error msg).

In `tractor.spawn._spawn.try_set_start_method()`,
- pull `'main_thread_forkserver'` out of the subint-
  family arm (which still gates on `_has_subints`).
- merge it into the `'trio'` arm — both set `_ctx = None`
  bc neither needs an `mp.context`.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit fc5e80f)
Document the ~0.3% rotating `trio.TooSlowError`
flake under `--spawn-backend=main_thread_forkserver`
full-suite runs. Root cause: `hard_kill`'s per-sub
1.6s graceful timeout compounding across N subactors
in a cancel cascade, plus cumulative autouse-reaper
teardown overhead.

Covers symptom, observed flaking tests, root-cause
family, ranked mitigations (cap bump -> CPU-count-
aware cap -> `pytest-rerunfailures` -> `hard_kill`
tuning -> targeted profiling), and a verification
protocol.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 60ce713)
Sends `SIGTERM` (graceful shutdown) instead of the existing `kill()`
which sends `SIGKILL`. Mirrors the `trio.Process.terminate()`
/ `multiprocessing.Process.terminate()` interface.

Used by `ActorNursery.cancel()`'s per-child escalation when
`Portal.cancel_actor()` raises `ActorTooSlowError`, and by the legacy
`hard_kill=True` branch. Swallows `ProcessLookupError` (child already
dead) same as `kill()`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 4c00913)
Race `IPCServer.wait_for_peer(uid)` against the sub-proc's
`.wait()` inside a `trio` nursery; whichever completes first
cancels the other.

Prevents the spawning task from parking forever on an unsignalled
`_peer_connected[uid]` event when a sub-actor dies during boot
(e.g. crashed on import before reaching `_actor_child_main`).
Instead of hanging, raises `ActorFailure` w/ the proc's exit code
for clean supervisor error reporting.

Also,
- use the new racer in `main_thread_forkserver_proc()` spawn path.
- keep `proc_wait` generic so each backend passes its own callable
  (`trio.Process.wait`, `_ForkedProc.wait`, etc.).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 3b0724e)
Comment/docstring updates: `subint_forkserver` is a clean
`NotImplementedError` stub — not an alias to variant-1
(`main_thread_forkserver`). Key reserved in-place (not aliased) so
the subint-hosted-child impl can flip without API churn once
msgspec/msgspec#1026 unblocks PEP 684 subints.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 7d1e446)
Add `capture` dimension to CI matrix so fork-based
backends run `--capture=sys` (fork-child × `--capture=fd`
is a known deadlock). Non-fork backends keep `fd`.

Deats,
- two `include:` rows for `main_thread_forkserver` on
  linux py3.13: tcp + uds, both `capture: 'sys'`
- job name updated to show `capture=` mode
- timeout bumped 16 -> 20 min to accommodate the
  additional matrix cells
- `--capture=${{ matrix.capture }}` replaces hardcoded
  `--capture=fd`

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit a24600f)
Append "Snapshot evidence (2026-05-13)" section to
`cancel_cascade_too_slow_under_main_thread_forkserver_issue.md`
documenting `fail_after_w_trace` diag capture results for
`test_nested_multierrors` under the MTF backend — reproduction cmd,
ptree analysis, observed hang signature, and updated triage plan.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 5372fd1)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Roll our own os.fork() approach (at least on linux?)

2 participants