Add main_thread_forkserver backend#463
Open
goodboy wants to merge 53 commits into
Open
Conversation
fbaf933 to
9584bc7
Compare
92bdc97 to
7387bde
Compare
10 tasks
There was a problem hiding this comment.
Pull request overview
This PR adds new spawn backends to tractor’s spawning subsystem, centered on a fork-without-exec path (main_thread_forkserver) and a Python 3.14+ subinterpreter-based backend (subint), plus supporting tests and CI matrix adjustments to exercise fork-based backends safely.
Changes:
- Add
main_thread_forkserverspawn backend (fork from a clean main-interpreter worker thread) and supporting fork primitives/shims. - Add
subintspawn backend (legacy-config subinterpreters) and reservesubint_forkserver/subint_forkas explicit placeholders/stubs. - Add
tests/spawncoverage and update CI to use--capture=sysfor fork-based backends to avoid knownfork × --capture=fddeadlocks.
Reviewed changes
Copilot reviewed 25 out of 26 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
tractor/spawn/_main_thread_forkserver.py |
Implements the new forkserver backend and fork/wait shims. |
tractor/spawn/_subint.py |
Implements the subint backend using _interpreters (py3.14+ gate). |
tractor/spawn/_subint_forkserver.py |
Variant-2 placeholder: reserved key that currently raises NotImplementedError. |
tractor/spawn/_subint_fork.py |
Explicit stub for the CPython-blocked fork-from-subinterpreter attempt. |
tractor/spawn/_spawn.py |
Registers new spawn-method keys and adds wait_for_peer_or_proc_death() helper. |
tractor/runtime/_runtime.py |
Routes forkserver keys through the SpawnSpec handshake path. |
tests/spawn/test_subint_cancellation.py |
Adds cancellation/teardown audit tests for the subint backend. |
tests/spawn/test_main_thread_forkserver.py |
Adds integration tests for the new forkserver backend and stubs. |
.github/workflows/ci.yml |
Adds matrix rows + capture-mode switching for fork-based backends. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+297
to
+301
| driver_thread.start() | ||
|
|
||
| try: | ||
| event, chan = await ipc_server.wait_for_peer(uid) | ||
| except trio.Cancelled: |
Comment on lines
+22
to
+27
| Spawn-method key: `'main_thread_forkserver'`. The legacy | ||
| `'subint_forkserver'` key currently aliases here too — see | ||
| `tractor.spawn._subint_forkserver` for the future variant-2 | ||
| (subint-isolated-child runtime, gated on | ||
| [jcrist/msgspec#1026](https://github.com/jcrist/msgspec/issues/1026)) | ||
| that key is reserved for. |
Comment on lines
+224
to
+232
| # `concurrent.interpreters` wrapper (PEP 734). See | ||
| # `tractor.spawn._subint` for the detailed | ||
| # reasoning. `subint_fork` is blocked at the | ||
| # CPython level (raises `NotImplementedError`); | ||
| # `main_thread_forkserver` is the working | ||
| # variant-1 backend; `subint_forkserver` aliases | ||
| # to it today, reserved for the future variant-2 | ||
| # subint-isolated-child runtime once upstream | ||
| # msgspec#1026 unblocks. |
Comment on lines
+174
to
+177
| @pytest.mark.timeout( | ||
| 3, # NOTE never passes pre-3.14+ subints support. | ||
| method='thread', | ||
| ) |
Comment on lines
+572
to
+575
| # step 4+5: SIGINT the orphan, poll for exit. | ||
| os.kill(child_pid, signal.SIGINT) | ||
| timeout: float = 6.0 | ||
| cleanup_deadline: float = time.monotonic() + timeout |
7387bde to
ee79939
Compare
1 task
ee79939 to
3b022d4
Compare
e86b5d4 to
beca584
Compare
beca584 to
8142fec
Compare
8142fec to
7f6782b
Compare
6 tasks
goodboy
added a commit
that referenced
this pull request
Jun 28, 2026
`run_in_actor()` is slated to become a hilevel wrapper (`runtime/_supervise.py` "TODO: DEPRECATE THIS"), so the showcase `we_are_processes.py` shouldn't lead with it. Move it to the modern API: each `worker_<i>` subactor runs a `@tractor.context` `endpoint()` that `started()`-hands its name + pid back over `Portal.open_context()` and parks; the root sleeps then raises on purpose so the runtime reaps the whole tree (zero zombies). The subs spawn concurrently from bg `trio.Task`s so each child's cold `import tractor` (~0.4s, see #470) overlaps instead of stacking; a comment flags the coming `main_thread_forkserver` backend (#463) which'll make serial spawns cheap enough to just loop. Match the landing prose to the snippet — name the `Context` + `started()` handshake it now leads with. Also, document `--watch examples` on the `sphinx-autobuild` cmds so edits to `literalinclude`-d example scripts (which live outside `docs/`) live-reload too. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
goodboy
added a commit
that referenced
this pull request
Jun 28, 2026
`run_in_actor()` is slated to become a hilevel wrapper (`runtime/_supervise.py` "TODO: DEPRECATE THIS"), so the showcase `we_are_processes.py` shouldn't lead with it. Move it to the modern API: each `worker_<i>` subactor runs a `@tractor.context` `endpoint()` that `started()`-hands its name + pid back over `Portal.open_context()` and parks; the root sleeps then raises on purpose so the runtime reaps the whole tree (zero zombies). The subs spawn concurrently from bg `trio.Task`s so each child's cold `import tractor` (~0.4s, see #470) overlaps instead of stacking; a comment flags the coming `main_thread_forkserver` backend (#463) which'll make serial spawns cheap enough to just loop. Match the landing prose to the snippet — name the `Context` + `started()` handshake it now leads with. Also, document `--watch examples` on the `sphinx-autobuild` cmds so edits to `literalinclude`-d example scripts (which live outside `docs/`) live-reload too. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Follow-up to 72d1b90 (was prev commit adding `debug_mode` for `subint_forkserver`): that commit wired the runtime-side `subint_forkserver` SpawnSpec-recv gate in `Actor._from_parent`, but the `subint_forkserver_proc` child-target was still passing `spawn_method='trio'` to `_trio_main` — so `Actor.pformat()` / log lines would report the subactor as plain `'trio'` instead of the actual parent-side spawn mechanism. Flip the label to `'subint_forkserver'`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit e31eb8d)
Also, some slight touchups in `.spawn._subint`. (cherry picked from commit 1e357dc)
New `ai/conc-anal/ subint_forkserver_test_cancellation_leak_issue.md` captures a descendant-leak surfaced while wiring `subint_forkserver` into the full test matrix: running `tests/test_cancellation.py` under `--spawn-backend=subint_forkserver` reproducibly leaks **exactly 5** `subint-forkserv` comm-named child processes that survive session exit, each holding a `LISTEN` on `:1616` (the tractor default registry addr) — and therefore poisons every subsequent test session that defaults to that addr. Deats, - TL;DR + ruled-out checks confirming the procs are ours (not piker / other tractor-embedding apps) — `/proc/$pid/cmdline` + cwd both resolve to this repo's `py314/` venv - root cause: `_ForkedProc.kill()` is PID-scoped (plain `os.kill(SIGKILL)` to the direct child), not tree-scoped — grandchildren spawned during a multi-level cancel test get reparented to init and inherit the registry listen socket - proposed fix directions ranked: (1) put each forkserver-spawned subactor in its own process- group (`os.setpgrp()` in fork-child) + tree-kill via `os.killpg(pgid, SIGKILL)` on teardown, (2) `PR_SET_CHILD_SUBREAPER` on root, (3) explicit `/proc/<pid>/task/*/children` walk. Vote: (1) — POSIX-standard, aligns w/ `start_new_session=True` semantics in `subprocess.Popen` / trio's `open_process` - inline reproducer + cleanup recipe scoped to `$(pwd)/py314/bin/python.*pytest.*spawn-backend= subint_forkserver` so cleanup doesn't false-flag unrelated tractor procs (consistent w/ `run-tests` skill's zombie-check guidance) Stopgap hygiene fix (wiring `reg_addr` through the 5 leaky tests in `test_cancellation.py`) is incoming as a follow-up — that one stops the blast radius, but zombies still accumulate per-run until the real tree-kill fix lands. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit e3f4f5a)
Major rewrite of `subint_forkserver_test_cancellation_leak_issue.md` after empirical investigation revealed the earlier "descendant-leak + missing tree-kill" diagnosis conflated two unrelated symptoms: 1. **5-zombie leak holding `:1616`** — turned out to be a self-inflicted cleanup bug: `pkill`-ing a bg pytest task (SIGTERM/SIGKILL, no SIGINT) skipped the SC graceful cancel cascade entirely. Codified the real fix — SIGINT-first ladder w/ bounded wait before SIGKILL — in e5e2afb (`run-tests` SKILL) and `feedback_sc_graceful_cancel_first.md`. 2. **`test_nested_multierrors[subint_forkserver]` hangs indefinitely** — the actual backend bug, and it's a deadlock not a leak. Deats, - new diagnosis: all 5 procs are kernel-`S` in `do_epoll_wait`; pytest-main's trio-cache workers are in `os.waitpid` waiting for children that are themselves waiting on IPC that never arrives — graceful `Portal.cancel_actor` cascade never reaches its targets - tree-structure evidence: asymmetric depth across two identical `run_in_actor` calls — child 1 (3 threads) spawns both its grandchildren; child 2 (1 thread) never completes its first nursery `run_in_actor`. Smells like a race on fork- inherited state landing differently per spawn ordering - new hypothesis: `os.fork()` from a subactor inherits the ROOT parent's IPC listener FDs transitively. Grandchildren end up with three overlapping FD sets (own + direct-parent + root), so IPC routing becomes ambiguous. Predicts bug scales with fork depth — matches reality: single- level spawn works, multi-level hangs - ruled out: `_ForkedProc.kill()` tree-kill (never reaches hard-kill path), `:1616` contention (fixed by `reg_addr` fixture wiring), GIL starvation (each subactor has its own OS process+GIL), child-side KBI absorption (`_trio_main` only catches KBI at `trio.run()` callsite, reached only on trio-loop exit) - four fix directions ranked: (1) blanket post-fork `closerange()`, (2) `FD_CLOEXEC` + audit, (3) targeted FD cleanup via `actor.ipc_server` handle, (4) `os.posix_spawn` w/ `file_actions`. Vote: (3) — surgical, doesn't break the "no exec" design of `subint_forkserver` - standalone repro added (`spawn_and_error(breadth= 2, depth=1)` under `trio.fail_after(20)`) - stopgap: skip `test_nested_multierrors` + multi- level-spawn tests under the backend via `@pytest.mark.skipon_spawn_backend(...)` until fix lands Killing the "tree-kill descendants" fix-direction section: it addressed a bug that didn't exist. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit 35da808)
Implements fix-direction (1)/blunt-close-all-FDs from b71705b (`subint_forkserver` nested-cancel hang diag), targeting the multi-level cancel-cascade deadlock in `test_nested_multierrors[subint_forkserver]`. The diagnosis doc voted for surgical FD cleanup via `actor.ipc_server` handle as the cleanest approach, but going blunt is actually the right call: after `os.fork()`, the child immediately enters `_actor_child_main()` which opens its OWN IPC sockets / wakeup-fd / epoll-fd / etc. — none of the parent's FDs are needed. Closing everything except stdio is safe AND defends against future listener/IPC additions to the parent inheriting silently into children. Deats, - new `_close_inherited_fds(keep={0,1,2}) -> int` helper. Linux fast-path enumerates `/proc/self/fd`; POSIX fallback uses `RLIMIT_NOFILE` range. Matches the stdlib `subprocess._posixsubprocess.close_fds` strategy. Returns close-count for sanity logging - wire into `fork_from_worker_thread._worker()`'s post-fork child prelude — runs immediately after the pid-pipe `os.close(rfd/wfd)`, before the user `child_target` callable executes - docstring cross-refs the diagnosis doc + spells out the FD-inheritance-cascade mechanism and why the close-all approach is safe for our spawn shape Validation pending: re-run `test_nested_multierrors[subint_forkserver]` to confirm the deadlock is gone. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit 9993db0)
Two coordinated improvements to the `subint_forkserver` backend: 1. Replace `trio.to_thread.run_sync(os.waitpid, ..., abandon_on_cancel=False)` in `_ForkedProc.wait()` with `trio.lowlevel.wait_readable(pidfd)`. The prior version blocked a trio cache thread on a sync syscall — outer cancel scopes couldn't unwedge it when something downstream got stuck. Same pattern `trio.Process.wait()` and `proc_waiter` (the mp backend) already use. 2. Drop the `@pytest.mark.xfail(strict=True)` from `test_orphaned_subactor_sigint_cleanup_DRAFT` — the test now PASSES after 0cd0b63 (fork-child FD scrub). Same root cause as the nested-cancel hang: inherited IPC/trio FDs were poisoning the child's event loop. Closing them lets SIGINT propagation work as designed. Deats, - `_ForkedProc.__init__` opens a pidfd via `os.pidfd_open(pid)` (Linux 5.3+, Python 3.9+) - `wait()` parks on `trio.lowlevel.wait_readable()`, then non-blocking `waitpid(WNOHANG)` to collect the exit status (correct since the pidfd signal IS the child-exit notification) - `ChildProcessError` swallow handles the rare race where someone else reaps first - pidfd closed after `wait()` completes (one-shot semantics) + `__del__` belt-and-braces for unexpected-teardown paths - test docstring's `@xfail` block replaced with a `# NOTE` comment explaining the historical context + cross-ref to the conc-anal doc; test remains in place as a regression guard The two changes are interdependent — the cancellable `wait()` matters for the same nested- cancel scenarios the FD scrub fixes, since the original deadlock had trio cache workers wedged in `os.waitpid` swallowing the outer cancel. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit c20b05e)
Two new sections in `subint_forkserver_test_cancellation_leak_issue.md` documenting continued investigation of the `test_nested_multierrors[subint_forkserver]` peer- channel-loop hang: 1. **"Attempted fix (DID NOT work) — hypothesis (3)"**: tried sync-closing peer channels' raw socket fds from `_serve_ipc_eps`'s finally block (iterate `server._peers`, `_chan._transport. stream.socket.close()`). Theory was that sync close would propagate as `EBADF` / `ClosedResourceError` into the stuck `recv_some()` and unblock it. Result: identical hang. Either trio holds an internal fd reference that survives external close, or the stuck recv isn't even the root blocker. Either way: ruled out, experiment reverted, skip-mark restored. 2. **"Aside: `-s` flag changes behavior for peer- intensive tests"**: noticed `test_context_stream_semantics.py` under `subint_forkserver` hangs with default `--capture=fd` but passes with `-s` (`--capture=no`). Working hypothesis: subactors inherit pytest's capture pipe (fds 1,2 — which `_close_inherited_fds` deliberately preserves); verbose subactor logging fills the buffer, writes block, deadlock. Fix direction (if confirmed): redirect subactor stdout/stderr to `/dev/null` or a file in `_actor_child_main`. Not a blocker on the main investigation; deserves its own mini-tracker. Both sections are diagnosis-only — no code changes in this commit. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit 7cd47ef)
Three places that previously swallowed exceptions silently now log via `log.exception()` so they surface in the runtime log when something weird happens — easier to track down sneaky failures in the fork-from-worker-thread / subint-bootstrap primitives. Deats, - `_close_inherited_fds()`: post-fork child's per-fd `os.close()` swallow now logs the fd that failed to close. The comment notes the expected failure modes (already-closed-via-listdir-race, otherwise-unclosable) — both still fine to ignore semantically, but worth flagging in the log. - `fork_from_worker_thread()` parent-side timeout branch: the `os.close(rfd)` + `os.close(wfd)` cleanup now logs each pipe-fd close failure separately before raising the `worker thread didn't return` RuntimeError. - `run_subint_in_worker_thread._drive()`: when `_interpreters.exec(interp_id, bootstrap)` raises a `BaseException`, log the full call signature (interp_id + bootstrap) along with the captured exception, before stashing into `err` for the outer caller. Behavior unchanged — only adds observability. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit 458a35c)
Third diagnostic pass on
`test_nested_multierrors[subint_forkserver]` hang.
Two prior hypotheses ruled out + a new, more
specific deadlock shape identified.
Ruled out,
- **capture-pipe fill** (`-s` flag changes test):
retested explicitly — `test_nested_multierrors`
hangs identically with and without `-s`. The
earlier observation was likely a competing
pytest process I had running in another session
holding registry state
- **stuck peer-chan recv that cancel can't
break**: pivot from the prior pass. With
`handle_stream_from_peer` instrumented at ENTER
/ `except trio.Cancelled:` / finally: 40
ENTERs, ZERO `trio.Cancelled` hits. Cancel never
reaches those tasks at all — the recvs are
fine, nothing is telling them to stop
Actual deadlock shape: multi-level mutual wait.
root blocks on spawner.wait()
spawner blocks on grandchild.wait()
grandchild blocks on errorer.wait()
errorer Actor.cancel() ran, but proc
never exits
`Actor.cancel()` fired in 12 PIDs — but NOT in
root + 2 direct spawners. Those 3 have peer
handlers stuck because their own `Actor.cancel()`
never runs, which only runs when the enclosing
`tractor.open_nursery()` exits, which waits on
`_ForkedProc.wait()` for the child pidfd to
signal, which only signals when the child
process fully exits.
Refined question: **why does an errorer process
not exit after its `Actor.cancel()` completes?**
Three hypotheses (unverified):
1. `_parent_chan_cs.cancel()` fires but the
shielded loop's recv is stuck in a way cancel
still can't break
2. `async_main`'s post-cancel unwind has other
tasks in `root_tn` awaiting something that
never arrives (e.g. outbound IPC reply)
3. `os._exit(rc)` in `_worker` never runs because
`_child_target` never returns
Next-session probes (priority order):
1. instrument `_worker`'s fork-child branch —
confirm whether `child_target()` returns /
`os._exit(rc)` is reached for errorer PIDs
2. instrument `async_main`'s final unwind — see
which await in teardown doesn't complete
3. compare under `trio_proc` backend at the
equivalent level to spot divergence
No code changes — diagnosis-only.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit ab86f76)
Fourth diagnostic pass — instrument `_worker`'s
fork-child branch (`pre child_target()` / `child_
target RETURNED rc=N` / `about to os._exit(rc)`)
and `_trio_main` boundaries (`about to trio.run` /
`trio.run RETURNED NORMALLY` / `FINALLY`). Test
config: depth=1/breadth=2 = 1 root + 14 forked =
15 actors total.
Fresh-run results,
- **9 processes complete the full flow**:
`trio.run RETURNED NORMALLY` → `child_target
RETURNED rc=0` → `os._exit(0)`. These are tree
LEAVES (errorers) plus their direct parents
(depth-0 spawners) — they actually exit
- **5 processes stuck INSIDE `trio.run(trio_
main)`**: hit "about to trio.run" but never
see "trio.run RETURNED NORMALLY". These are
root + top-level spawners + one intermediate
The deadlock is in `async_main` itself, NOT the
peer-channel loops. Specifically, the outer
`async with root_tn:` in `async_main` never exits
for the 5 stuck actors, so the cascade wedges:
trio.run never returns
→ _trio_main finally never runs
→ _worker never reaches os._exit(rc)
→ process never dies
→ parent's _ForkedProc.wait() blocks
→ parent's nursery hangs
→ parent's async_main hangs
→ (recurse up)
The precise new question: **what task in the 5
stuck actors' `async_main` never completes?**
Candidates:
1. shielded parent-chan `process_messages` task
in `root_tn` — but we cancel it via
`_parent_chan_cs.cancel()` in `Actor.cancel()`,
which only runs during
`open_root_actor.__aexit__`, which itself runs
only after `async_main`'s outer unwind — which
doesn't happen. So the shield isn't broken in
this path.
2. `actor_nursery._join_procs.wait()` or similar
inline in the backend `*_proc` flow.
3. `_ForkedProc.wait()` on a grandchild that DID
exit — but pidfd_open watch didn't fire (race
between `pidfd_open` and the child exiting?).
Most specific next probe: add DIAG around
`_ForkedProc.wait()` enter/exit to see whether
pidfd-based wait returns for every grandchild
exit. If a stuck parent's `_ForkedProc.wait()`
never returns despite its child exiting → pidfd
mechanism has a race bug under nested forkserver.
Asymmetry observed in the cascade tree: some d=0
spawners exit cleanly, others stick, even though
they started identically. Not purely depth-
determined — some race condition in nursery
teardown when multiple siblings error
simultaneously.
No code changes — diagnosis-only.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 4d05554)
Sixth and final diagnostic pass — after all 4 cascade fixes landed (FD hygiene, pidfd wait, `_parent_chan_cs` wiring, bounded peer-clear), the actual last gate on `test_nested_multierrors[subint_forkserver]` turned out to be **pytest's default `--capture=fd` stdout/stderr capture**, not anything in the runtime cascade. Empirical result: `pytest -s` → test PASSES in 6.20s. Default `--capture=fd` → hangs forever. Mechanism: pytest replaces the parent's fds 1,2 with pipe write-ends it reads from. Fork children inherit those pipes (since `_close_inherited_fds` correctly preserves stdio). The error-propagation cascade in a multi-level cancel test generates 7+ actors each logging multiple `RemoteActorError` / `ExceptionGroup` tracebacks — enough output to fill Linux's 64KB pipe buffer. Writes block, subactors can't progress, processes don't exit, `_ForkedProc.wait` hangs. Self-critical aside: I earlier tested w/ and w/o `-s` and both hung, concluding "capture-pipe ruled out". That was wrong — at that time fixes 1-4 weren't all in place, so the test was failing at deeper levels long before reaching the "produce lots of output" phase. Once the cascade could actually tear down cleanly, enough output flowed to hit the pipe limit. Order-of- operations mistake: ruling something out based on a test that was failing for a different reason. Deats, - `subint_forkserver_test_cancellation_leak_issue .md`: new section "Update — VERY late: pytest capture pipe IS the final gate" w/ DIAG timeline showing `trio.run` fully returns, diagnosis of pipe-fill mechanism, retrospective on the earlier wrong ruling-out, and fix direction (redirect subactor stdout/stderr to `/dev/null` in fork-child prelude, conditional on pytest-detection or opt-in flag) - `tests/test_cancellation.py`: skip-mark reason rewritten to describe the capture-pipe gate specifically; cross-refs the new doc section - `tests/spawn/test_subint_forkserver.py`: the orphan-SIGINT test regresses back to xfail. Previously passed after the FD-hygiene fix, but the new `wait_for_no_more_peers( move_on_after=3.0)` bound in `async_main`'s teardown added up to 3s latency, pushing orphan-subactor exit past the test's 10s poll window. Real fix: faster orphan-side teardown OR extend poll window to 15s No runtime code changes in this commit — just test-mark adjustments + doc wrap-up. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit eceed29) (MTF-only portion: kept ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md tests/spawn/test_subint_forkserver.py)
Re-classify `test_orphaned_subactor_sigint_cleanup_DRAFT` from
flakey-env-sensitive (`strict=False` w/ "passes in isolation, flakey in
full suite") to a hard known-gap (`strict=True`) with the orphan-SIGINT
hang as the documented cause. The previous framing ("env pollution") let
the test silently pass when ordering happened to favor it; the new
framing forces an XPASS-as-FAIL the moment the underlying gap is
actually closed, so we can drop the mark intentionally instead of
accidentally.
Reason text + leading `# Known-gap test —` comment both point at
`ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
for the full diagnosis.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 44bdb16)
Two diagnostic gaps in `tractor.spawn._subint.subint_proc()` that hid
otherwise-silent failures, plus tracking-issue links on the two open
`subint_forkserver` follow-ups.
Deats,
- bootstrap-exc visibility: wrap the call to
`_interpreters.exec(interp_id, bootstrap)` with
`try/except BaseException` + `log.exception(...)`.
* Without it, an `ImportError` / `SyntaxError` raised inside the
dedicated driver thread goes only to Python's default thread
excepthook — invisible to the parent, which then waits forever on
`subint_exited.wait()`.
* `?TODO` notes `anyio`'s `to_interpreter._interp_call` +
`(retval, is_exception)` pattern as the next step for re-raising;
skipped now bc it must coordinate with the `trio.Cancelled` paths
around the existing `.wait()` calls.
- cancel-leak disambiguation: when the driver thread doesn't exit within
`_HARD_KILL_TIMEOUT`, also log `_interpreters.is_running(interp_id)`
as `subint_still_running=...` so the operator can tell "thread leaked,
subint already done" apart from "thread alive bc subint is wedged".
* pattern borrowed from `trio-parallel`'s `_sint.SintWorker.is_alive()`.
- `?TODO` near the `bootstrap` literal: future switch to
`_interpreters.set___main___attrs()` — same API `anyio`
uses in `to_interpreter._Worker.call()` — for passing
non-`repr()`-roundtrippable values (`SpawnSpec` struct, callables,
etc).
* add cross-refs tracking issue `#379`.
Also,
- `Tracked at: [#449]` link on
`subint_forkserver_test_cancellation_leak_issue.md`.
- `Tracked at: [#450]` link on
`subint_forkserver_thread_constraints_on_pep684_issue.md`.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 5456195)
Major expansion of the module docstring. Code is unchanged; this lands the architectural reasoning that was previously implicit, plus the POSIX/trio fork mechanics the design relies on. New sections: - "Design rationale" — answers two implicit questions: (1) why a forkserver pattern at all (vs. forking directly from a trio task), (2) why in-process (vs. stdlib `mp.forkserver`'s sidecar process). Documents the three costs the in-process design avoids (sidecar lifecycle, per-spawn IPC, cold-start child) and the tradeoffs we accept in exchange (3.14-only, heavier than `to_thread.run_sync`). - "Implementation status" — clarifies what's actually landed today vs. the envisioned arch: parent's `trio.run()` still lives on main interp (subint- hosted root gated on msgspec/msgspec#1026). Names why the "subint" prefix is correct anyway — same PR series as `_subint.py` / `_subint_fork.py`. - "What survives the fork? — POSIX semantics" — POSIX preserves only the calling thread, so the `trio.run()` thread is gone in the child. Includes a small parent/child thread-survival table and covers the four artifact classes that DO cross the fork boundary (inherited fds, COW memory, Python thread state, user-level locks) and how each is handled. - "FYI: how this dodges the `trio.run()` × `fork()` hazards" — itemizes each class of trio process- global state (wakeup-fd, `epoll`/`kqueue`, threadpool, cancel scopes / nurseries, `atexit`, foreign-language I/O) and explains how the forkserver-thread design avoids each. Also, - bump the gated msgspec issue link from `msgspec/msgspec#563` to `msgspec/msgspec#1026` (the PEP 684 isolated-mode tracker). (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit 3ab99d5)
Adds a "Future arch — what subints would buy us" section to the module docstring, complementing the prior commit's current-state rationale. Code is unchanged. Frames the `subint` prefix as family-naming today (no actual subinterp is created yet), then lays out the three concrete wins that land once msgspec/msgspec#1026 unblocks PEP 684 isolated-mode subints: - Cheaper forks — moving the parent's `trio.run()` into a subint shrinks the main-interp COW image the child inherits. The main interp becomes the literal forkserver: an intentionally-empty execution ctx whose only job is to call `os.fork()` cleanly. - True parallelism — per-interp GIL means the forkserver thread on main and the trio thread on subint actually run in parallel. Spawn latency stops stalling the trio loop. - Multi-actor-per-process — the architectural payoff. With per-interp-GIL subints, one process can host main + N subint-resident actor `trio.run()`s, and `os.fork()` reverts to the last-resort spawn (only when OS-level isolation is actually needed). Joins the story with the in-thread `_subint.py` backend: `subint` → in-process spawn, `subint_forkserver` → cross-process when a real OS boundary is required. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit 4b5176e)
Move the truly-generic main-interp-worker-thread fork primitives (`fork_from_worker_thread`, `_close_inherited_fds`, `_ForkedProc`, `wait_child`, `_format_child_exit`) out of `_subint_forkserver.py` into a sibling `_main_thread_forkserver.py` module so the primitive layer is honestly named — none of these helpers touch a subint, they just fork from a main-interp worker thread. `_subint_forkserver.py` keeps its public surface intact via re-export so any existing `from tractor.spawn._subint_forkserver import ...` callsite still resolves. Net: zero behavior change, preps the way for the upcoming spawn-method key split where `main_thread_forkserver` ships as the working backend and `subint_forkserver` becomes reserved for the future subint-isolated-child variant (gated on msgspec/msgspec#1026). (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit 99dade0)
The `subint_forkserver` name was always aspirational — today's impl forks from a regular main-interp worker thread and the child runs trio on its own main interp; NO subinterp anywhere in parent or child. Splitting the backend into two clearly-named variants drops the lie: - **variant 1** — `main_thread_forkserver` (the working impl). New `SpawnMethodKey` literal + `_methods` dispatch entry + `_runtime.Actor._from_parent()` match-arm. The spawn-coro `subint_forkserver_proc` moves to `_main_thread_forkserver` and is renamed `main_thread_forkserver_proc()`. - **variant 2** — `subint_forkserver` (future, reserved). Module shrinks to a placeholder describing the variant-2 design (subint-isolated child runtime, gated on msgspec/msgspec#1026 + PEP 684). Today the legacy `'subint_forkserver'` key aliases to `main_thread_forkserver_proc` so existing `--spawn-backend=subint_forkserver` invocations keep working; flipped to a `NotImplementedError` stub in a follow-up. Deats, - `Actor._from_parent()` spawn-method gate now accepts both `'main_thread_forkserver'` and `'subint_forkserver'` (both go through the IPC-`SpawnSpec` path). - the variant-1 spawn-coro stamps its own `SpawnSpec` / log lines with `spawn_method='main_thread_forkserver'` so subactor renders reflect the actual mechanism. - docstring reorg: trio×fork hazard breakdown, POSIX fork-survival semantics, in-process-vs-stdlib forkserver design notes, and the TODO/cleanup section all move from `_subint_forkserver` to `_main_thread_forkserver` (lives with the working code). `_subint_forkserver` keeps a tight forward- looking doc that motivates the reserved key. - `run_subint_in_worker_thread()` stays in `_subint_forkserver` as the companion primitive — it's the subint counterpart to `fork_from_worker_thread()` and will plug into the future variant-2 spawn-coro. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit 57dae0e)
Reduce `_subint_forkserver.py` to its variant-2 placeholder shape: - Add `subint_forkserver_proc` async stub raising `NotImplementedError` with a redirect msg pointing at the working variant-1 backend (`main_thread_forkserver`), msgspec/msgspec#1026 (upstream PEP 684 blocker), and #379 (subint umbrella). - `tractor.spawn._spawn._methods['subint_forkserver']` now dispatches to the stub instead of aliasing the variant-1 coroutine — `--spawn-backend=subint_forkserver` errors cleanly. - Drop now-dead module-scope: `ChildSigintMode` / `_DEFAULT_CHILD_SIGINT` defs, `_has_subints` try/except (replaced with import from `._subint`), unused imports (`partial`, `Literal`, `sys`, msgtypes/pretty_struct, `current_actor`, `cancel_on_completion`/`soft_kill`, `_server` TYPE_CHECKING). - Backward-compat re-exports of fork primitives kept until the follow-up commit migrates external test imports. - `tests/spawn/test_subint_forkserver.py::forkserver_spawn_method` fixture: flip hardcoded `'subint_forkserver'` → `'main_thread_forkserver'` so the test still exercises the working backend (full file rename comes in the test-import migration commit). (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit 5e83881)
Rename `tests/spawn/test_subint_forkserver.py` → `test_main_thread_forkserver.py` and migrate its imports + internal refs to the new canonical names: - `fork_from_worker_thread`, `wait_child` → from `tractor.spawn._main_thread_forkserver`. - `run_subint_in_worker_thread` → still from `_subint_forkserver` (variant-2 primitive). - Module docstring + tier-3 fixture + the `*_spawn_basic` test fn renamed for variant-1-honesty. - Orphan-harness subprocess argv flipped from `'subint_forkserver'` → `'main_thread_forkserver'`. `ai/conc-anal/subint_fork_from_main_thread_smoketest.py` imports split the same way. `tractor/spawn/_subint_forkserver.py` drops the backward- compat re-exports of the fork primitives — the only consumers (test file + smoketest) now import from `_main_thread_forkserver` directly. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit 9f0709e)
Add `test_subint_forkserver_key_errors_cleanly` — a tn-tier regression guard that pins down the variant-2 reservation contract: the `'subint_forkserver'` key in `_spawn._methods` MUST raise `NotImplementedError` today, not silently dispatch to `main_thread_forkserver_proc`. The transient alias-state existed briefly during the rename (commit `57dae0e4`'s "Split forkserver backend into variant 1/2 mods" landed the alias; `5e83881f` flipped it to the stub). Without a guard, a future refactor could easily re-collapse the two keys back to a single coro and silently break the variant-1 / variant-2 contract. Also asserts the stub's error msg surfaces the two pointers an operator hitting it actually needs: - `'main_thread_forkserver'` — the working backend they prolly meant, - `'msgspec#1026'` — the upstream blocker that has to land before variant-2 can ship. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit cbdf1eb)
Two cleanup tweaks in `_main_thread_forkserver`: Doc, "what survives the fork?" section — expand the "non-calling threads are gone in the child" claim with the precise execution-vs-memory split that reconciles this module's prior framing with trio's (canonical [python-trio/trio#1614][trio-1614]) "leaked stacks" framing: - execution-side: only the calling thread runs post-fork; all others never execute another instruction. - memory-side: those non-running threads' stacks + per-thread heap structures are still COW-inherited as orphaned bytes — what trio means by "leaked". Same POSIX reality, opposite sides; the table is extended to a 4-col `parent | child (executing) | child (memory)` layout to make both views explicit. Also blank-line-padded the bulleted hazard classes for cleaner markdown rendering. [trio-1614]: python-trio/trio#1614 Code, `_close_inherited_fds()` log noise — split the catch-all `except OSError` into: - `EBADF` — benign race where the dirfd that `os.listdir('/proc/self/fd')` itself opened ends up in `candidates`, then auto-closes before the loop reaches it. Demote to `log.debug()` + `continue`; prior `log.exception` drowned the post-fork log channel with stack traces every spawn. - other errnos (EIO / EPERM / EINTR / ...) keep the loud `log.exception` surface — those ARE genuinely unexpected. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit 8c73019)
`main_thread_forkserver` doesn't actually need py3.14 `concurrent.interpreters` (PEP 734) — it forks from a non-trio worker thread and runs `_trio_main` in the child, same shape as `trio_proc`. The previous `_has_subints` gate + subint-family `case` arm were a copy-paste error. In `tractor.spawn._main_thread_forkserver`, - drop the `_has_subints` import + the `RuntimeError` raise in `main_thread_forkserver_proc()`. - drop the now-unused `import sys` (only used by the prior error msg). In `tractor.spawn._spawn.try_set_start_method()`, - pull `'main_thread_forkserver'` out of the subint- family arm (which still gates on `_has_subints`). - merge it into the `'trio'` arm — both set `_ctx = None` bc neither needs an `mp.context`. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit fc5e80f)
Document the ~0.3% rotating `trio.TooSlowError` flake under `--spawn-backend=main_thread_forkserver` full-suite runs. Root cause: `hard_kill`'s per-sub 1.6s graceful timeout compounding across N subactors in a cancel cascade, plus cumulative autouse-reaper teardown overhead. Covers symptom, observed flaking tests, root-cause family, ranked mitigations (cap bump -> CPU-count- aware cap -> `pytest-rerunfailures` -> `hard_kill` tuning -> targeted profiling), and a verification protocol. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit 60ce713)
Sends `SIGTERM` (graceful shutdown) instead of the existing `kill()` which sends `SIGKILL`. Mirrors the `trio.Process.terminate()` / `multiprocessing.Process.terminate()` interface. Used by `ActorNursery.cancel()`'s per-child escalation when `Portal.cancel_actor()` raises `ActorTooSlowError`, and by the legacy `hard_kill=True` branch. Swallows `ProcessLookupError` (child already dead) same as `kill()`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit 4c00913)
Race `IPCServer.wait_for_peer(uid)` against the sub-proc's `.wait()` inside a `trio` nursery; whichever completes first cancels the other. Prevents the spawning task from parking forever on an unsignalled `_peer_connected[uid]` event when a sub-actor dies during boot (e.g. crashed on import before reaching `_actor_child_main`). Instead of hanging, raises `ActorFailure` w/ the proc's exit code for clean supervisor error reporting. Also, - use the new racer in `main_thread_forkserver_proc()` spawn path. - keep `proc_wait` generic so each backend passes its own callable (`trio.Process.wait`, `_ForkedProc.wait`, etc.). (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit 3b0724e)
Comment/docstring updates: `subint_forkserver` is a clean `NotImplementedError` stub — not an alias to variant-1 (`main_thread_forkserver`). Key reserved in-place (not aliased) so the subint-hosted-child impl can flip without API churn once msgspec/msgspec#1026 unblocks PEP 684 subints. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit 7d1e446)
Add `capture` dimension to CI matrix so fork-based
backends run `--capture=sys` (fork-child × `--capture=fd`
is a known deadlock). Non-fork backends keep `fd`.
Deats,
- two `include:` rows for `main_thread_forkserver` on
linux py3.13: tcp + uds, both `capture: 'sys'`
- job name updated to show `capture=` mode
- timeout bumped 16 -> 20 min to accommodate the
additional matrix cells
- `--capture=${{ matrix.capture }}` replaces hardcoded
`--capture=fd`
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit a24600f)
Append "Snapshot evidence (2026-05-13)" section to `cancel_cascade_too_slow_under_main_thread_forkserver_issue.md` documenting `fail_after_w_trace` diag capture results for `test_nested_multierrors` under the MTF backend — reproduction cmd, ptree analysis, observed hang signature, and updated triage plan. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit 5372fd1)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Add
main_thread_forkserverspawn backendMotivation
tractor wants a fork-without-exec spawn path: forking a subactor
inherits the parent's warm memory image (no re-
import, noexeccold-start), which is the groundwork for a future
subinterpreter-isolated child runtime. The trouble is
os.fork()ishostile on two fronts.
Two empirical CPython properties drive the whole design:
os.fork()from a non-main subinterpreter is refused byCPython —
PyOS_AfterFork_Child()aborts the child withFatal Python error: not main interpreter.os.fork()from a regularthreading.Threadattached to themain interpreter — one that has never entered a subint or
trio — works cleanly (validated across four scenarios on py3.14).
This branch operationalizes the second fact as a first-class tractor
spawn backend: spawn a clean worker thread, fork in it, hand the
child pid back to the trio task, and wrap it in a
trio.Process-shaped shim so the existingsoft_kill/hard-reapmachinery keeps working unchanged.
It supersedes #447 (the original
subint_forkserverattempt), rebuilt on the factored stack with an honest rename: the
working impl forks from a main-interp worker thread — there is no
subinterpreter anywhere in parent or child — so it's now variant-1
main_thread_forkserver, while the aspirationalsubint-isolated-child design is demoted to a reserved variant-2
subint_forkserverstub. The one-module impl is deceptively small;the bulk of the series is the supporting
subintscaffold plus thecancel-cascade / orphan-SIGINT hang diagnosis that only fork-based
teardown surfaces.
Src of research
concurrent.interpreterssupport incpython3.14+ msgspec/msgspec#1026ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.mdsubint_fork_from_main_thread_smoketest.pysubint_forkserver_thread_constraints_on_pep684_issue.mdSummary of changes
main_thread_forkserverplus the genericfork-from-main-interp-worker-thread primitives it's built on
(
spawn._main_thread_forkserver):fork_from_worker_thread(), the_ForkedProctrio.Processshim (pidfd-cancellable.wait(),.terminate(),.kill()), andmain_thread_forkserver_proc()which reuses
_spawn'ssoft_kill/hard-reap unchanged.subintspawn backend (spawn._subint,py3.14+):
_interpreters-based subactor exec driven on a dedicatedOS thread, a destroy-race fix via that driver thread, and
hard-kill-timeout-bounded teardown shields.
subint_forkserveras a placeholder stub(
spawn._subint_forkserver): the legacy key aliases to variant-1'sspawn-coro for now; the real subint-isolated-child runtime is gated
on msgspec PEP-684 support.
subint_fork(variant-2b) CPython-blocked workaroundsmoketest documenting the non-main-interp fork abort.
spawn._spawnbackend dispatch and wireruntime._runtime.Actor._from_parent()so both forkserver keysroute through the
SpawnSpecIPC handshake.tests.spawnsubpkg —test_main_thread_forkserver(incl.an
xfail(strict=True)orphan-SIGINT DRAFT) andtest_subint_cancellationaudits.main_thread_forkserverCI matrix rows with a matrix-driven--capture=mode (sysfor fork backends to dodge the fork-child× capture-fd deadlock).
ai/conc-anal/post-mortems(SIGINT starvation, cancel-cascade too-slow, capture-pipe leak,
PEP-684 thread constraints) plus
ai/prompt-io/session logs.Scopes changed
tractor.spawn._main_thread_forkserverthe original
_subint_forkserver.tractor.spawn._subint_forkservertractor.spawn._subint,tractor.spawn._subint_forksmoketest).
tractor.spawn._spawnSpawnMethodKeyentries.tests.spawntest_subint_forkserver->test_main_thread_forkserver.TODOs before landing
trio_033_upgrade— must mergeafter the lower stack lands; rebase forward as each lands.
skip/xfail-gated for CI:
subint_forkserver:test_nested_multierrorscancel-cascade hang gated by pytest--capture=fd#449 (capture-fd cancel-cascade) andmain_thread_forkserver: cancel-cascade occasionally hangs >9s under heavy fork-spawn contention #451(mtf cancel-cascade >9s).
Future follow up
threading.Threadtotrio.to_thread.run_sync()once msgspecPEP-684 isolated subints dissolve the GIL-starvation /
tstate-recycling hazards — see #450. The dedicated worker
thread itself stays (MTF's fork-from-main-tstate invariant is
permanent); only its trio integration changes, and the destroy-race
hazard needs empirical re-test.
_subint/_*_forkserverextend that scaffold, and this PR lands#445's Phase B (
_subint) + Phase C (harness).subint_forkserver(subint-isolated childruntime) once msgspec#1026 / PEP-684 isolated
subints land. Currently a stub in
spawn._subint_forkserver.spawn._subint?TODO(~L256): hardImportError/syntax failuresonly hit the thread excepthook today.
SpawnSpecbootstrap via_interpreters.set___main___attrs()for non-repr-roundtrippablevalues —
spawn._subint?TODO(~L205).child_sigint='trio')to close the orphan-SIGINT gap — scaffolded behind the
xfail(strict=True)DRAFT intests.spawn.test_main_thread_forkserver.os.pidfd_open()for non-Linux in_ForkedProc(spawn._main_thread_forkserver~L698) — it'sLinux-only, so the macOS CI row
AttributeErrors. Theno_pidfd_openbranch already carries the fix (hasattrguard +waitpidfallback). Not a merge blocker — macOS support isn'trequired while Add
main_thread_forkserverbackend #463 is pre-production.Links
fork()payoff)subint_forkserverPR, superseded by thisos.fork()× trio post-fork hazards(this pr content was generated in some part by
claude-code)