tests: kill zombie IPC child processes after join timeout#2121
tests: kill zombie IPC child processes after join timeout#2121aryanputta wants to merge 5 commits into
Conversation
When Python 3.12 CI runs, env-vars enables compute-sanitizer with --target-processes=all, which attaches to every mp.Process child the tests spawn. On CUDA 12.9.1 the sanitizer analysis of IPC buffer teardown gets stuck, so child processes never exit. The existing join(timeout=CHILD_TIMEOUT_SEC) returns but leaves the child alive. That zombie keeps its IPC handle open. When pytest teardown runs ipc_memory_resource's mr.close(), it blocks waiting for the handle to be released — tying up the runner for hours until GitHub Actions force-cancels the job. This is the exact pattern in issue NVIDIA#2004 (always Python 3.12 + CUDA 12.9.1 local). Fix: after join(timeout=...), kill any process still alive so the IPC handle is released before fixture teardown. Tests still fail (exit code is non-zero or completed is False), just in seconds rather than hours. Fixes NVIDIA#2004
72ccbe7 to
df61e49
Compare
|
Thanks @aryanputta for jumping in! I'm running a Cursor code review right now... |
When Python 3.12 CI runs, env-vars enables compute-sanitizer with --target-processes=all, which attaches to every mp.Process child the tests spawn. On CUDA 12.9.1 the sanitizer analysis of IPC buffer teardown gets stuck, so child processes never exit. The existing join(timeout=CHILD_TIMEOUT_SEC) returns but leaves the child alive. That zombie keeps its IPC handle open. When pytest teardown runs ipc_memory_resource's mr.close(), it blocks waiting for the handle to be released -- tying up the runner for hours until GitHub Actions force-cancels the job. This is the exact pattern in issue NVIDIA#2004 (always Python 3.12 + CUDA 12.9.1 local). Fix: after join(timeout=...), kill any process still alive so the IPC handle is released before fixture teardown. A stderr warning is printed when kill() fires so the failure is clearly attributable to the sanitizer/IPC deadlock rather than appearing as a generic exitcode != 0. Tests still fail (exit code is non-zero or completed is False), just in seconds rather than hours. Fixes NVIDIA#2004
948d30c to
5b802a0
Compare
rwgk
left a comment
There was a problem hiding this comment.
PR 2121 Initial Review
No blocking findings.
The change matches the issue #2004 observed failure mode: after a child process
exceeds join(timeout=30), it is now killed before parent-side teardown can
block on IPC handles. The TestIpcReexport path also now asserts completed,
so a timed-out child becomes a real fast failure instead of quietly continuing
into teardown.
Residual risks
- Other
memory_ipctests also usejoin(timeout=CHILD_TIMEOUT_SEC)without
killing children. The evidence in issue 2004 points specifically at
test_send_buffers.py, so I would not block this PR on broadening the fix,
but it may be worth a follow-up helper if hangs reappear elsewhere. - Full CI has not run yet because the external-contributor PR is awaiting
validation. Current checks are only metadata/pre-commit. - I could not run the focused test locally because
TestVenvappears stale for
this branch and fails importingcuda.core._memory._managed_buffer.
git diff --checkpassed.
|
Oh - @aryanputta please don't force push. |
|
/ok to test 036a8f5 |
sorry!! |
No worries. It was only very slightly confusing. Holding my breath that the two troublesome jobs (under PR #2118) will sail through now. |
haha, me too, I always test locally aswell, i got no issues so hopefully its the same.. edit, did i stop it by accdient, i clicked comment and it think i clicked stop.. I am so Sorry |
|
|
CI / Test linux-64 / Python 3.12, CUDA 13.2.1 (wheels), GPU rtx4090, wsl (push) seems to be a flake. I'll keep an eye out here and will rerun when the rest has finished. |
|
This still seems to hang unfortunately: CI / Test linux-64 / Python 3.12, CUDA 12.9.1 (local), GPU l4 (push) @Andy-Jost for visibility @aryanputta FYI: Andy is experimenting under #2124 with broader fixes. |
|
I cancelled the hanging job. Waiting for @Andy-Jost's experiments under #2124. |
The fix in the initial commit only covered test_send_buffers.py. The 63-minute CI hang confirmed the deadlock was occurring in one of the other test files. Applies the same kill() after join(timeout=...) pattern to all remaining memory_ipc tests: - test_errors.py (1 join) - test_event_ipc.py (2 joins) - test_ipc_duplicate_import.py (1 join) - test_leaks.py (3 joins) - test_memory_ipc.py (7 joins across 4 test classes) - test_peer_access.py (2 joins) - test_serialize.py (3 joins) See issue NVIDIA#2004.
|
hey @rwgk @Andy-Jost -- after seeing the 63min hang in ci (step 29 died in the cuda.core test run) i checked the rest of the memory_ipc suite and found 7 more files with bare just pushed a second commit that extends the same kill-after-join pattern to all remaining files:
andy's layered approach in #2124 is the cleaner fix so please prioritize that -- but if you need a fallback while #2124 is still in draft, this pr now covers the full suite. no need to run ci here unless #2124 doesn't work out |
The kill() calls print to sys.stderr -- these three files were missing the import and would have raised NameError at runtime under the hang path.
Make every IPC test responsible for its own child-process cleanup, rather
than leaning on the fixture-level tracker as the primary mechanism.
helpers.child_processes.kill_subprocesses(*processes) returns the list of
processes that were still alive when called and kills them. Tests pair this
with an "assert not survivors" check, so a timeout produces a clean failure
message ("timed out waiting on: ['Process-3']") instead of "assert None ==
0", and the held IPC handles are released before any further test code
runs.
Every join+exitcode pattern in tests/memory_ipc/ is converted to the new
shape:
process.join(timeout=CHILD_TIMEOUT_SEC)
survivors = kill_subprocesses(process)
assert not survivors, "child did not exit within timeout"
assert process.exitcode == 0
For test_send_buffers.py:TestIpcReexport (the test from NVIDIA#2121), the
event.wait return value is also captured and asserted on so a timeout there
is reported explicitly rather than dropped on the floor.
The autouse track_child_processes() context manager in the ipc_device
fixture remains as defense in depth for any future test that forgets the
pattern.
memory_ipc/test_errors.py adds a meta-test test_outer_timeout_marker_is_applied
that verifies tests/memory_ipc/conftest.py is loaded and applies the
pytest.mark.timeout(300) marker. This catches the "nested conftest silently
not picked up" failure mode at test time with a clear error message.
|
@aryanputta Thank you very much for diagnosing the root cause of this. This led straight to #2124, which it appears will resolve the issue. Once that merges, we can close this PR. |
glad to help!! |
* tests: layered defense against IPC child-process hangs (#2004) Replace the per-test kill-on-timeout approach with three layered fixes that defend the IPC test suite against compute-sanitizer + CUDA driver / toolkit combinations that wedge child processes during IPC teardown. CI layer: ci/tools/setup-sanitizer now passes --target-processes=application-only so spawned multiprocessing.Process children run without the sanitizer attached. This eliminates the root cause of the hang reported in issue #2004 (compute-sanitizer's IPC teardown analysis getting stuck under Python 3.12 + CUDA 12.9.1). The parent pytest process is still fully sanitized. Fixture layer: ipc_device and ipc_mempool_device_x2 fixtures now wrap their yield in track_child_processes(), a context manager that patches multiprocessing.process.BaseProcess.__init__ to record every Process instance constructed during the test and kills any survivor at teardown. This protects ipc_memory_resource.mr.close() from blocking on IPC handles held by a stuck child, regardless of the original failure mode. Outer-guard layer: pytest-timeout is added to the test dep group, and tests/memory_ipc/conftest.py applies pytest.mark.timeout(300) to every test in the directory. This is the final fallback so no IPC test can wedge the GHA runner for hours if a new failure mode defeats the earlier layers. The hardcoded CHILD_TIMEOUT_SEC = 30 in every IPC test module is replaced with a call to helpers.child_processes.child_timeout_sec(), which returns 30 by default and 120 under compute-sanitizer (detected via the existing under_compute_sanitizer() helper). The unused CHILD_TIMEOUT_SEC declaration in test_workerpool.py is removed. * tests: reframe IPC hang docs as test-side bug rather than CS bug Reword the inline comment in ci/tools/setup-sanitizer and the docstrings in helpers/child_processes.py and memory_ipc/conftest.py to make clear that the underlying problem is insufficient guards in the IPC tests when child processes spawn slowly (>30s under compute-sanitizer). The sanitizer change and the new helpers are mitigations / safety nets; the durable fix is making the tests handle slow children correctly. * tests: add kill_subprocesses helper and apply to IPC tests Make every IPC test responsible for its own child-process cleanup, rather than leaning on the fixture-level tracker as the primary mechanism. helpers.child_processes.kill_subprocesses(*processes) returns the list of processes that were still alive when called and kills them. Tests pair this with an "assert not survivors" check, so a timeout produces a clean failure message ("timed out waiting on: ['Process-3']") instead of "assert None == 0", and the held IPC handles are released before any further test code runs. Every join+exitcode pattern in tests/memory_ipc/ is converted to the new shape: process.join(timeout=CHILD_TIMEOUT_SEC) survivors = kill_subprocesses(process) assert not survivors, "child did not exit within timeout" assert process.exitcode == 0 For test_send_buffers.py:TestIpcReexport (the test from #2121), the event.wait return value is also captured and asserted on so a timeout there is reported explicitly rather than dropped on the floor. The autouse track_child_processes() context manager in the ipc_device fixture remains as defense in depth for any future test that forgets the pattern. memory_ipc/test_errors.py adds a meta-test test_outer_timeout_marker_is_applied that verifies tests/memory_ipc/conftest.py is loaded and applies the pytest.mark.timeout(300) marker. This catches the "nested conftest silently not picked up" failure mode at test time with a clear error message. * tests: add memory_ipc/__init__.py to avoid conftest module collision Without __init__.py in tests/memory_ipc/, pytest imports tests/memory_ipc/conftest.py under module name `conftest`, which collides with the top-level tests/conftest.py. Tests under tests/ that use `from conftest import X` then resolve to the wrong file at import time, yielding ImportErrors like: ImportError: cannot import name 'skipif_need_cuda_headers' from 'conftest' (.../tests/memory_ipc/conftest.py) The fix mirrors tests/system/ (which already has both __init__.py and conftest.py): adding tests/memory_ipc/__init__.py makes memory_ipc a proper package, so its conftest is imported as memory_ipc.conftest and the top-level conftest stays under name `conftest`. * tests: scale memory_ipc outer timeout with child_timeout_sec Replace the fixed 300 s pytest.mark.timeout with 3 * child_timeout_sec(). This is the minimum that lets the worst-case test in the suite (currently TestIpcReexport, with three sequential CHILD_TIMEOUT_SEC waits in the failure path) reach its own asserts before the outer guard fires. Resulting budgets: - Without compute-sanitizer: 90 s (down from 300 s). - Under compute-sanitizer: 360 s (up from 300 s -- the previous value would have killed TestIpcReexport mid-second-join before its "process C did not signal completion within timeout" assertion could produce a clean failure message). The meta-test test_outer_timeout_marker_is_applied now computes the expected timeout from the same formula so the two stay in sync. * tests: tighten memory_ipc outer timeout to CHILD_TIMEOUT_SEC + 30 The previous 3 * CHILD_TIMEOUT_SEC scaling assumed worst-case wall-clock where every sequential join/wait hits its full timeout. In practice the children run concurrently, so expected wall-clock is ~CHILD_TIMEOUT_SEC regardless of how many joins the test chains -- once a child is done its join returns immediately. Exceeding CHILD_TIMEOUT_SEC + slack already means something is genuinely stuck, in which case the outer guard firing is the right outcome; the autouse track_child_processes() context manager still cleans up survivors, and the per-test diagnostic message would not be more informative than "test exceeded its budget". New budgets: - Without compute-sanitizer: 60 s (was 90 s). - Under compute-sanitizer: 150 s (was 360 s). The meta-test computes its expected value from the same formula.
Removed preview folders for the following PRs: - PR #2121
What
TestIpcSendBuffersandTestIpcReexportcallprocess.join(timeout=30)to wait for child processes but never kill them if the join times out. This causes multi-hour CI hangs.Root cause
ci/tools/env-varsenables compute-sanitizer (--target-processes=all) on Python 3.12 + local CTK + Linux — exactly the combination in every hanging run from issue #2004. With--target-processes=all, compute-sanitizer attaches to everymp.Processchild. On CUDA 12.9.1 its IPC teardown analysis gets stuck, so the child never exits.The parent's
join(timeout=30)returns after 30 seconds, but the child is still alive. That zombie holds an open IPC memory handle. When pytest runs fixture teardown (ipc_memory_resource'smr.close()), it blocks on the held handle indefinitely. The runner gets tied up for 4–6 hours until GitHub Actions force-cancels the job.Fix
After
join(timeout=...), kill any process that is still alive. This forces IPC handle release before fixture teardown runs. The test still fails (exit code is non-zero, orcompletedis False), just in seconds instead of hours.No behavior change when children exit cleanly.
Related
Closes #2004