tests: layered defense against IPC child-process hangs (#2004)#2124
Conversation
Replace the per-test kill-on-timeout approach with three layered fixes that defend the IPC test suite against compute-sanitizer + CUDA driver / toolkit combinations that wedge child processes during IPC teardown. CI layer: ci/tools/setup-sanitizer now passes --target-processes=application-only so spawned multiprocessing.Process children run without the sanitizer attached. This eliminates the root cause of the hang reported in issue NVIDIA#2004 (compute-sanitizer's IPC teardown analysis getting stuck under Python 3.12 + CUDA 12.9.1). The parent pytest process is still fully sanitized. Fixture layer: ipc_device and ipc_mempool_device_x2 fixtures now wrap their yield in track_child_processes(), a context manager that patches multiprocessing.process.BaseProcess.__init__ to record every Process instance constructed during the test and kills any survivor at teardown. This protects ipc_memory_resource.mr.close() from blocking on IPC handles held by a stuck child, regardless of the original failure mode. Outer-guard layer: pytest-timeout is added to the test dep group, and tests/memory_ipc/conftest.py applies pytest.mark.timeout(300) to every test in the directory. This is the final fallback so no IPC test can wedge the GHA runner for hours if a new failure mode defeats the earlier layers. The hardcoded CHILD_TIMEOUT_SEC = 30 in every IPC test module is replaced with a call to helpers.child_processes.child_timeout_sec(), which returns 30 by default and 120 under compute-sanitizer (detected via the existing under_compute_sanitizer() helper). The unused CHILD_TIMEOUT_SEC declaration in test_workerpool.py is removed.
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
This comment has been minimized.
This comment has been minimized.
Reword the inline comment in ci/tools/setup-sanitizer and the docstrings in helpers/child_processes.py and memory_ipc/conftest.py to make clear that the underlying problem is insufficient guards in the IPC tests when child processes spawn slowly (>30s under compute-sanitizer). The sanitizer change and the new helpers are mitigations / safety nets; the durable fix is making the tests handle slow children correctly.
Make every IPC test responsible for its own child-process cleanup, rather
than leaning on the fixture-level tracker as the primary mechanism.
helpers.child_processes.kill_subprocesses(*processes) returns the list of
processes that were still alive when called and kills them. Tests pair this
with an "assert not survivors" check, so a timeout produces a clean failure
message ("timed out waiting on: ['Process-3']") instead of "assert None ==
0", and the held IPC handles are released before any further test code
runs.
Every join+exitcode pattern in tests/memory_ipc/ is converted to the new
shape:
process.join(timeout=CHILD_TIMEOUT_SEC)
survivors = kill_subprocesses(process)
assert not survivors, "child did not exit within timeout"
assert process.exitcode == 0
For test_send_buffers.py:TestIpcReexport (the test from NVIDIA#2121), the
event.wait return value is also captured and asserted on so a timeout there
is reported explicitly rather than dropped on the floor.
The autouse track_child_processes() context manager in the ipc_device
fixture remains as defense in depth for any future test that forgets the
pattern.
memory_ipc/test_errors.py adds a meta-test test_outer_timeout_marker_is_applied
that verifies tests/memory_ipc/conftest.py is loaded and applies the
pytest.mark.timeout(300) marker. This catches the "nested conftest silently
not picked up" failure mode at test time with a clear error message.
Without __init__.py in tests/memory_ipc/, pytest imports
tests/memory_ipc/conftest.py under module name `conftest`, which collides
with the top-level tests/conftest.py. Tests under tests/ that use
`from conftest import X` then resolve to the wrong file at import time,
yielding ImportErrors like:
ImportError: cannot import name 'skipif_need_cuda_headers' from
'conftest' (.../tests/memory_ipc/conftest.py)
The fix mirrors tests/system/ (which already has both __init__.py and
conftest.py): adding tests/memory_ipc/__init__.py makes memory_ipc a
proper package, so its conftest is imported as memory_ipc.conftest and
the top-level conftest stays under name `conftest`.
Replace the fixed 300 s pytest.mark.timeout with 3 * child_timeout_sec(). This is the minimum that lets the worst-case test in the suite (currently TestIpcReexport, with three sequential CHILD_TIMEOUT_SEC waits in the failure path) reach its own asserts before the outer guard fires. Resulting budgets: - Without compute-sanitizer: 90 s (down from 300 s). - Under compute-sanitizer: 360 s (up from 300 s -- the previous value would have killed TestIpcReexport mid-second-join before its "process C did not signal completion within timeout" assertion could produce a clean failure message). The meta-test test_outer_timeout_marker_is_applied now computes the expected timeout from the same formula so the two stay in sync.
|
/ok to test |
The previous 3 * CHILD_TIMEOUT_SEC scaling assumed worst-case wall-clock where every sequential join/wait hits its full timeout. In practice the children run concurrently, so expected wall-clock is ~CHILD_TIMEOUT_SEC regardless of how many joins the test chains -- once a child is done its join returns immediately. Exceeding CHILD_TIMEOUT_SEC + slack already means something is genuinely stuck, in which case the outer guard firing is the right outcome; the autouse track_child_processes() context manager still cleans up survivors, and the per-test diagnostic message would not be more informative than "test exceeded its budget". New budgets: - Without compute-sanitizer: 60 s (was 90 s). - Under compute-sanitizer: 150 s (was 360 s). The meta-test computes its expected value from the same formula.
|
/ok to test |
rwgk
left a comment
There was a problem hiding this comment.
One caveat identified by GPT-5.5. But the CI succeeds now, so it can wait until later, or if we actually see problems.
|
|
||
| return _require_ipc_mempool_devices((device,))[0] | ||
| device = _require_ipc_mempool_devices((device,))[0] | ||
| with track_child_processes(): |
There was a problem hiding this comment.
Non-blocking: since ipc_memory_resource depends on ipc_device, its teardown still runs before this track_child_processes() context exits. So this fixture-level guard may not protect mr.close() if a test fails before reaching its in-test kill_subprocesses() cleanup. The per-test cleanup and pytest-timeout still cover the known hang, so I don’t think this should block.
|
I merged: I'll update #2118 next. |
|
Summary
Implements layered fixes that defend the IPC test suite against failures that wedge CI. With these layers in place, no IPC test can wedge the GHA runner for hours, even if a new failure mode emerges.
Deadlock was observed due to the following sequence:
mr.close()because the subprocess holds an open IPC handle.This is a bug in the test source: the result of the parent's wait needs to be checked. #2121 fixes some of these tests; this PR generalizes the fix to every IPC test.
This was observed in Python 3.12 test runs, which run under compute-sanitizer, presumably because compute-sanitizer slows down process spawning.
Changes
CI layer.
ci/tools/setup-sanitizernow passes--target-processes=application-onlyso spawnedmultiprocessing.Processchildren run without compute-sanitizer attached. This should improve the start-up time for spawned processes.In-test layer. New helper
helpers.child_processes.kill_subprocesses(*processes)kills any Process objects still alive at call time and returns them. Everyprocess.join(timeout=...)site intests/memory_ipc/now follows the canonical pattern below, so a timeout produces a clean failure message ("timed out waiting on: ['Process-3']") instead ofassert None == 0, and held IPC handles are released before any further test code runs:For
test_send_buffers.py:TestIpcReexport(the test from #2121), theevent_c.waitreturn value is also captured and asserted on, so a timeout there is reported explicitly instead of being dropped on the floor.Fixture layer.
ipc_deviceandipc_mempool_device_x2now wrap theiryieldintrack_child_processes(), a context manager inhelpers.child_processesthat patchesmultiprocessing.process.BaseProcess.__init__to record everyProcessinstance constructed during the test and kills any survivor at teardown. This is defense in depth for tests that may still slip through the in-test layer — it protectsipc_memory_resource.mr.close()from blocking on IPC handles held by a stuck child, regardless of the original failure mode.Outer-guard layer.
pytest-timeoutis added to the test dep group, andtests/memory_ipc/conftest.pyappliespytest.mark.timeout(child_timeout_sec() + 30)to every test in the directory (60 s without compute-sanitizer, 150 s under it). IPC test children run concurrently, so a well-behaved test finishes in ~CHILD_TIMEOUT_SECregardless of how many joins it chains; exceeding that already means something is genuinely stuck, at which point the outer guard firing is the right outcome (the autousetrack_child_processes()context manager still cleans up survivors). A meta-testtest_outer_timeout_marker_is_appliedintests/memory_ipc/test_errors.pyverifies viarequest.node.get_closest_marker("timeout")that the per-directory conftest is being loaded and applies the marker with the expected value; if a future change breaks conftest discovery or the formula drifts, this test fails immediately with a clear message.(
tests/memory_ipc/also gains an empty__init__.pyso its conftest is imported asmemory_ipc.conftestrather than colliding with the top-levelconftestmodule that other tests reference viafrom conftest import ....)Timeout helper. The hardcoded
CHILD_TIMEOUT_SEC = 30in every IPC test module is replaced with a call tohelpers.child_processes.child_timeout_sec(), which returns 30 by default and 120 under compute-sanitizer. The unusedCHILD_TIMEOUT_SECdeclaration intest_workerpool.pyis removed.Test Coverage
track_child_processes()context manager viaipc_device(andipc_mempool_device_x2), and every join/exitcode pair now goes throughkill_subprocesses. Happy-path tests verify no regression; any test that hits the original hang will fail fast with a clear timeout message instead of wedging.test_outer_timeout_marker_is_appliedwill fail iftests/memory_ipc/conftest.pyis not loaded or does not apply the timeout marker.kill_subprocessesreturns a list of still-alive processes (after killing them with SIGKILL, exitcode-9), an empty list for cleanly-exited or never-started Process objects, and never raises on either kind.track_child_processes()correctly kills a child still alive at scope exit and leaves cleanly-exiting children untouched.child_timeout_sec()returns 30 by default and 120 whenCUDA_PYTHON_UNDER_SANITIZER=1(or any of the other signals checked byunder_compute_sanitizer()).Related Work
kill_subprocessesand adds an autouse cleanup tracker inipc_device).