
fix(python): restore deterministic teardown for async-callback machinery #2009

Open
chaliy wants to merge 1 commit into main from claude/determined-goldberg-1n00vy

@chaliy chaliy commented Jun 10, 2026


What

Restores deterministic teardown for the async-callback machinery, removing the tradeoffs introduced by #2007/#2008 (which fixed the TM-PY-030 deadlocks/crash by going hands-off: shutdown_background(), deferred loop close). Determinism is now full while the interpreter is alive; hands-off behavior remains only at interpreter finalization, where CPython itself forbids native threads from attaching (< 3.13).

Protocol

  • Exit boundary: an atexit handler registered at module import sets INTERPRETER_AT_EXIT. atexit runs at the very start of Py_FinalizeEx, strictly before the phase in which native threads may no longer attach — so the flag cleanly splits "interpreter alive → deterministic teardown is safe" from "exiting → hands off, OS reclaims".
  • Cooperative cancellation: each private-loop callback runs as a published asyncio.Task; teardown cancels it via call_soon_threadsafe(task.cancel), and a closing flag rejects queued-but-unstarted items. Joins are bounded by cancellation, not by callback duration.
  • Deterministic joins, GIL released: PyPrivateAsyncLoop::shutdown (Drop) joins its worker, which closes its asyncio loop before exiting, freeing fds before drop returns. PyRuntime::drop joins the tokio blocking pool again (pre-#2007 semantics). Every join goes through join_without_gil (detach first when PyGILState_Check says the dropping thread holds the GIL), eliminating the deadlock instead of avoiding the join.
  • Ordering: pyclass Drop impls cancel in-flight callbacks via an engine registry of live per-session loops before the rt field drop joins the pool (an abandoned task holds its own session Arc, so the registry is the only path to reach it).
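The cooperative-cancellation and deterministic-join steps above can be sketched in pure Python. This is only an illustration under stated assumptions, not the Rust implementation in lib.rs: `PrivateLoopWorker`, `slow_callback`, and the `outcome` field are hypothetical names, and the queue with its closing flag is omitted for brevity.

```python
import asyncio
import threading
import time


class PrivateLoopWorker:
    """Hypothetical pure-Python stand-in for one private-loop worker thread."""

    def __init__(self, coro_fn):
        self.loop = asyncio.new_event_loop()
        self.task = None                      # the "published" in-flight task
        self.outcome = None
        self._published = threading.Event()
        self._thread = threading.Thread(target=self._run, args=(coro_fn,))
        self._thread.start()

    def _run(self, coro_fn):
        try:
            # Publish the task so another thread can cancel it later.
            self.task = self.loop.create_task(coro_fn())
            self._published.set()
            try:
                self.loop.run_until_complete(self.task)
                self.outcome = "completed"
            except asyncio.CancelledError:
                self.outcome = "cancelled"
            except Exception:
                self.outcome = "failed"
        finally:
            # The worker closes its own loop, so fds are freed before join() returns.
            self.loop.close()

    def shutdown(self):
        self._published.wait()
        try:
            # Cancel from outside the loop thread; the loop delivers it at an await.
            self.loop.call_soon_threadsafe(self.task.cancel)
        except RuntimeError:
            pass  # loop already closed: the callback finished on its own
        self._thread.join()  # bounded by cancellation, not by callback duration


started = threading.Event()
observed = []


async def slow_callback():
    started.set()
    try:
        await asyncio.sleep(30)
    except asyncio.CancelledError:
        observed.append("CancelledError")
        raise


worker = PrivateLoopWorker(slow_callback)
started.wait()                       # make sure the callback is parked at its await
t0 = time.monotonic()
worker.shutdown()
elapsed = time.monotonic() - t0
print(worker.outcome, observed, worker.loop.is_closed(), f"{elapsed:.3f}s")
```

The join returns as soon as the cancelled task unwinds, which is the property the bounded-cancellation test relies on; the 30-second sleep never runs to completion.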

This also removes both #2008 tradeoffs: the loop is closed synchronously at teardown (no deferred-__del__ fd lifetime, no ResourceWarning under strict warning filters).

Proof of determinism

New tests/test_teardown_determinism.py:

  • Exact native-thread-count equality immediately after del tool; gc.collect(), across repeated churn — proves worker + runtime threads are joined synchronously in drop (/proc/self/task).
  • Exact fd stability across 20 create/execute/drop cycles — proves the asyncio loop is closed by drop, not a later gc pass.
  • Bounded cancellation: a timed-out callback sleeping 30 s is cancelled at teardown; drop returns in < 5 s and the callback observes CancelledError.
  • Interpreter-exit subprocess stress (10× each): module-level tool torn down by Py_Finalize, both clean and with an abandoned in-flight callback — no SIGABRT.
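The fd-stability check can be approximated without bashkit. The sketch below is not the committed test: it assumes a Linux-style /proc, uses a bare asyncio loop in place of a tool, and `open_fd_count`, `native_thread_count`, and `churn_once` are hypothetical helpers.

```python
import asyncio
import os


def open_fd_count():
    """Open fd count, or None where /proc/self/fd is unavailable."""
    try:
        return len(os.listdir("/proc/self/fd"))
    except OSError:
        return None


def native_thread_count():
    """Native thread count from /proc/self/task, or None if unavailable."""
    try:
        return len(os.listdir("/proc/self/task"))
    except OSError:
        return None


def churn_once():
    # Stand-in for one create/execute/drop cycle: the loop is closed
    # synchronously, so its selector and self-pipe fds are released here.
    loop = asyncio.new_event_loop()
    try:
        loop.run_until_complete(asyncio.sleep(0))
    finally:
        loop.close()


churn_once()  # warm-up so one-time lazy allocations do not skew the baseline
baseline = open_fd_count()
counts = []
for _ in range(20):
    churn_once()
    counts.append(open_fd_count())

fd_stable = all(count == baseline for count in counts)
print("fd baseline:", baseline, "stable:", fd_stable, "threads:", native_thread_count())
```

If the loop were closed by a later gc pass instead of inside the cycle, the count sampled right after each cycle would drift above the baseline.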

Adversarial verification beyond the committed tests:

  • Race-sensitive suites (test_teardown_determinism.py + test_async_callbacks.py) looped 20× — includes 400 subprocess exit checks.
  • Concurrent stress: 8 threads × 40 mixed iterations (fast callbacks, timeout+immediate-drop, multi-call tools) looped 10× (3,200 iterations) with a hang watchdog — zero deadlocks.
  • langgraph_async_tool.py (the #2008 crasher) run 40×: zero aborts.
  • Full bashkit-python suite: 705 passed, 1 skipped. just pre-pr green (fmt, clippy, tests, vet).

One existing test updated: test_dealloc_during_inflight_callback_does_not_deadlock asserted the abandoned callback "finishes on its own" (the old hands-off semantics); it now asserts prompt teardown plus completion-or-cancellation, which is the new contract.

Known bound (documented)

Cancellation is cooperative: a callback blocking without awaiting (time.sleep inside async def) can't be interrupted mid-section; teardown waits for it to reach an await point or return — with the GIL released, so it cannot deadlock. Documented in specs/python-package.md and the TM-PY-030 entry.
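The bound can be seen with plain asyncio. In this sketch (hypothetical names, no bashkit involved) a cancel is requested while the callback is inside a blocking time.sleep, and it only takes effect at the first await.

```python
import asyncio
import threading
import time

BLOCK_SECONDS = 0.3
entered = threading.Event()
events = []


async def callback():
    entered.set()
    time.sleep(BLOCK_SECONDS)        # blocking: the loop cannot deliver a cancel here
    events.append("blocking section finished")
    try:
        await asyncio.sleep(30)      # first await point: cancellation lands here
    except asyncio.CancelledError:
        events.append("cancelled at await")
        raise


loop = asyncio.new_event_loop()
task = loop.create_task(callback())


def cancel_when_entered():
    entered.wait()                   # request the cancel mid-blocking-section
    loop.call_soon_threadsafe(task.cancel)


canceller = threading.Thread(target=cancel_when_entered)
start = time.monotonic()
canceller.start()
try:
    loop.run_until_complete(task)
except asyncio.CancelledError:
    pass
elapsed = time.monotonic() - start
canceller.join()
loop.close()
print(events, f"{elapsed:.3f}s")
```

Teardown in this situation waits for the blocking section but not for the 30-second await that follows it.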

Specs

  • specs/threat-model.md: TM-PY-030 rewritten around the deterministic teardown protocol.
  • specs/python-package.md: new "Teardown determinism" section.

Do not merge yet — awaiting full CI matrix (3.9–3.14) as the final leg of the proof.


Generated by Claude Code

Copilot AI review requested due to automatic review settings June 10, 2026 04:33

Copilot AI left a comment


Pull request overview

This PR restores deterministic teardown for the Python async-callback “private loop” machinery (TM-PY-030), bringing back bounded, synchronous cleanup while the interpreter is alive, while preserving “hands-off” behavior during interpreter finalization (where native thread attachment is unsafe on CPython < 3.13).

Changes:

  • Implement interpreter-exit boundary tracking via an atexit-set flag and use it to switch between deterministic teardown vs. exit-time hands-off behavior.
  • Make private-loop callbacks cancellable via published asyncio.Tasks, and perform deterministic joins with the GIL released (including restoring tokio runtime drop joins while the interpreter is alive).
  • Add/adjust tests to assert deterministic thread/fd cleanup, bounded cancellation, and non-crashing interpreter-exit behavior; update specs to document the protocol and threat-model mitigation.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| specs/threat-model.md | Updates TM-PY-030 to reflect deterministic teardown + exit-time crash variant and the new mitigation protocol. |
| specs/python-package.md | Documents the “Teardown determinism” contract and its interpreter-exit boundary behavior. |
| crates/bashkit-python/tests/test_teardown_determinism.py | New regression tests for deterministic joins, fd stability, bounded cancellation, and subprocess exit robustness. |
| crates/bashkit-python/tests/test_async_callbacks.py | Updates TM-PY-030 regression expectations to match “prompt teardown + completion-or-cancellation.” |
| crates/bashkit-python/src/lib.rs | Implements the teardown protocol: atexit flag, GIL-free joins, task publication/cancellation, worker joining/loop closing, engine registry for per-session loops, and Drop ordering hooks. |


PRs #2007/#2008 fixed the TM-PY-030 deadlocks and exit crash by making
teardown hands-off (shutdown_background, no loop close), trading
deterministic cleanup for forward progress. This restores determinism
while the interpreter is alive and keeps hands-off behavior only where
CPython makes determinism impossible (interpreter finalization).

Protocol:
- An atexit handler registered at module import sets INTERPRETER_AT_EXIT.
  atexit runs at the very start of Py_FinalizeEx, strictly before the
  phase in which native threads may no longer attach, so the flag cleanly
  separates 'interpreter alive' from 'process exiting'.
- Each private-loop callback runs as a published asyncio.Task. Teardown
  cancels it via call_soon_threadsafe(task.cancel) and a closing flag
  rejects queued-but-unstarted items, so joins are bounded by cooperative
  cancellation instead of full callback duration.
- PyPrivateAsyncLoop::shutdown (Drop) joins its worker thread, which
  closes its asyncio loop before exiting — fds freed before drop returns.
- PyRuntime::drop joins the tokio blocking pool again (the pre-#2007
  semantics) instead of shutdown_background.
- Every join runs through join_without_gil: detach first when
  PyGILState_Check reports the dropping thread attached. This removes the
  GIL deadlock rather than avoiding the join.
- Pyclass Drop impls (ScriptedTool/Bash/BashTool) cancel in-flight
  callbacks through an engine registry of live per-session loops before
  the rt field drop joins the pool.
- At exit: threads skip Python entirely (flag check, no Python::attach),
  runtime falls back to shutdown_background, OS reclaims resources.

Verification:
- New tests/test_teardown_determinism.py: exact native-thread-count and
  fd-count stability across tool churn (joins are synchronous in drop),
  bounded cancellation of abandoned callbacks, and 10x subprocess
  interpreter-exit checks for both clean and abandoned-callback exits.
- Race-sensitive suites looped 20x, concurrent stress (8 threads x 40
  mixed iterations incl. timeout+drop churn) looped 10x, langgraph
  example 40x: zero hangs, zero aborts.
- Full bashkit-python suite: 705 passed, 1 skipped. just pre-pr green.
@chaliy chaliy force-pushed the claude/determined-goldberg-1n00vy branch from c14b1fa to c93b4ed Compare June 10, 2026 11:58

2 participants