tests: kill zombie IPC child processes after join timeout by aryanputta · Pull Request #2121 · NVIDIA/cuda-python

aryanputta · 2026-05-21T16:01:41Z

What

TestIpcSendBuffers and TestIpcReexport call process.join(timeout=30) to wait for child processes but never kill them if the join times out. This causes multi-hour CI hangs.

Root cause

ci/tools/env-vars enables compute-sanitizer (--target-processes=all) on Python 3.12 + local CTK + Linux — exactly the combination in every hanging run from issue #2004. With --target-processes=all, compute-sanitizer attaches to every mp.Process child. On CUDA 12.9.1 its IPC teardown analysis gets stuck, so the child never exits.

The parent's join(timeout=30) returns after 30 seconds, but the child is still alive. That zombie holds an open IPC memory handle. When pytest runs fixture teardown (ipc_memory_resource's mr.close()), it blocks on the held handle indefinitely. The runner gets tied up for 4–6 hours until GitHub Actions force-cancels the job.

Fix

After join(timeout=...), kill any process that is still alive. This forces IPC handle release before fixture teardown runs. The test still fails (exit code is non-zero, or completed is False), just in seconds instead of hours.

No behavior change when children exit cleanly.

When Python 3.12 CI runs, env-vars enables compute-sanitizer with --target-processes=all, which attaches to every mp.Process child the tests spawn. On CUDA 12.9.1 the sanitizer analysis of IPC buffer teardown gets stuck, so child processes never exit. The existing join(timeout=CHILD_TIMEOUT_SEC) returns but leaves the child alive. That zombie keeps its IPC handle open. When pytest teardown runs ipc_memory_resource's mr.close(), it blocks waiting for the handle to be released — tying up the runner for hours until GitHub Actions force-cancels the job. This is the exact pattern in issue NVIDIA#2004 (always Python 3.12 + CUDA 12.9.1 local). Fix: after join(timeout=...), kill any process still alive so the IPC handle is released before fixture teardown. Tests still fail (exit code is non-zero or completed is False), just in seconds rather than hours. Fixes NVIDIA#2004

rwgk · 2026-05-21T16:24:34Z

Thanks @aryanputta for jumping in! I'm running a Cursor code review right now...

When Python 3.12 CI runs, env-vars enables compute-sanitizer with --target-processes=all, which attaches to every mp.Process child the tests spawn. On CUDA 12.9.1 the sanitizer analysis of IPC buffer teardown gets stuck, so child processes never exit. The existing join(timeout=CHILD_TIMEOUT_SEC) returns but leaves the child alive. That zombie keeps its IPC handle open. When pytest teardown runs ipc_memory_resource's mr.close(), it blocks waiting for the handle to be released -- tying up the runner for hours until GitHub Actions force-cancels the job. This is the exact pattern in issue NVIDIA#2004 (always Python 3.12 + CUDA 12.9.1 local). Fix: after join(timeout=...), kill any process still alive so the IPC handle is released before fixture teardown. A stderr warning is printed when kill() fires so the failure is clearly attributable to the sanitizer/IPC deadlock rather than appearing as a generic exitcode != 0. Tests still fail (exit code is non-zero or completed is False), just in seconds rather than hours. Fixes NVIDIA#2004

rwgk

PR 2121 Initial Review

No blocking findings.

The change matches the issue #2004 observed failure mode: after a child process
exceeds join(timeout=30), it is now killed before parent-side teardown can
block on IPC handles. The TestIpcReexport path also now asserts completed,
so a timed-out child becomes a real fast failure instead of quietly continuing
into teardown.

Residual risks

Other memory_ipc tests also use join(timeout=CHILD_TIMEOUT_SEC) without
killing children. The evidence in issue 2004 points specifically at
test_send_buffers.py, so I would not block this PR on broadening the fix,
but it may be worth a follow-up helper if hangs reappear elsewhere.
Full CI has not run yet because the external-contributor PR is awaiting
validation. Current checks are only metadata/pre-commit.
I could not run the focused test locally because TestVenv appears stale for
this branch and fails importing cuda.core._memory._managed_buffer.
git diff --check passed.

rwgk · 2026-05-21T16:31:42Z

Oh - @aryanputta please don't force push.

rwgk · 2026-05-21T16:32:03Z

/ok to test 036a8f5

aryanputta · 2026-05-21T16:38:36Z

Oh - @aryanputta please don't force push.

sorry!!

rwgk · 2026-05-21T16:42:43Z

Oh - @aryanputta please don't force push.

sorry!!

No worries. It was only very slightly confusing.

Holding my breath that the two troublesome jobs (under PR #2118) will sail through now.

aryanputta · 2026-05-21T16:51:06Z

Oh - @aryanputta please don't force push.

sorry!!

No worries. It was only very slightly confusing.

Holding my breath that the two troublesome jobs (under PR #2118) will sail through now.

haha, me too, I always test locally aswell, i got no issues so hopefully its the same..

edit, did i stop it by accdient, i clicked comment and it think i clicked stop.. I am so Sorry

github-actions · 2026-05-21T16:51:13Z

Doc Preview CI
🚀 View preview at https://nvidia.github.io/cuda-python/pr-preview/pr-2121/
https://nvidia.github.io/cuda-python/pr-preview/pr-2121/cuda-core/
https://nvidia.github.io/cuda-python/pr-preview/pr-2121/cuda-bindings/
https://nvidia.github.io/cuda-python/pr-preview/pr-2121/cuda-pathfinder/
Preview will be ready when the GitHub Pages deployment is complete.

rwgk · 2026-05-21T17:02:08Z

CI / Test linux-64 / Python 3.12, CUDA 13.2.1 (wheels), GPU rtx4090, wsl (push)

seems to be a flake. I'll keep an eye out here and will rerun when the rest has finished.

rwgk · 2026-05-21T17:45:28Z

This still seems to hang unfortunately:

CI / Test linux-64 / Python 3.12, CUDA 12.9.1 (local), GPU l4 (push)

@Andy-Jost for visibility

@aryanputta FYI: Andy is experimenting under #2124 with broader fixes.

rwgk · 2026-05-21T17:51:26Z

I cancelled the hanging job. Waiting for @Andy-Jost's experiments under #2124.

The fix in the initial commit only covered test_send_buffers.py. The 63-minute CI hang confirmed the deadlock was occurring in one of the other test files. Applies the same kill() after join(timeout=...) pattern to all remaining memory_ipc tests: - test_errors.py (1 join) - test_event_ipc.py (2 joins) - test_ipc_duplicate_import.py (1 join) - test_leaks.py (3 joins) - test_memory_ipc.py (7 joins across 4 test classes) - test_peer_access.py (2 joins) - test_serialize.py (3 joins) See issue NVIDIA#2004.

aryanputta · 2026-05-21T18:01:01Z

hey @rwgk @Andy-Jost -- after seeing the 63min hang in ci (step 29 died in the cuda.core test run) i checked the rest of the memory_ipc suite and found 7 more files with bare join(timeout=CHILD_TIMEOUT_SEC) and no kill after. the original commit here only covered test_send_buffers.py so the hang just moved to the next unprotected test.

just pushed a second commit that extends the same kill-after-join pattern to all remaining files:

test_memory_ipc.py (7 joins across 4 classes)
test_leaks.py (3 joins)
test_serialize.py (3 joins)
test_event_ipc.py (2 joins)
test_peer_access.py (2 joins)
test_errors.py (1 join)
test_ipc_duplicate_import.py (1 join)

andy's layered approach in #2124 is the cleaner fix so please prioritize that -- but if you need a fallback while #2124 is still in draft, this pr now covers the full suite. no need to run ci here unless #2124 doesn't work out

The kill() calls print to sys.stderr -- these three files were missing the import and would have raised NameError at runtime under the hang path.

Make every IPC test responsible for its own child-process cleanup, rather than leaning on the fixture-level tracker as the primary mechanism. helpers.child_processes.kill_subprocesses(*processes) returns the list of processes that were still alive when called and kills them. Tests pair this with an "assert not survivors" check, so a timeout produces a clean failure message ("timed out waiting on: ['Process-3']") instead of "assert None == 0", and the held IPC handles are released before any further test code runs. Every join+exitcode pattern in tests/memory_ipc/ is converted to the new shape: process.join(timeout=CHILD_TIMEOUT_SEC) survivors = kill_subprocesses(process) assert not survivors, "child did not exit within timeout" assert process.exitcode == 0 For test_send_buffers.py:TestIpcReexport (the test from NVIDIA#2121), the event.wait return value is also captured and asserted on so a timeout there is reported explicitly rather than dropped on the floor. The autouse track_child_processes() context manager in the ipc_device fixture remains as defense in depth for any future test that forgets the pattern. memory_ipc/test_errors.py adds a meta-test test_outer_timeout_marker_is_applied that verifies tests/memory_ipc/conftest.py is loaded and applies the pytest.mark.timeout(300) marker. This catches the "nested conftest silently not picked up" failure mode at test time with a clear error message.

Andy-Jost · 2026-05-21T20:57:54Z

@aryanputta Thank you very much for diagnosing the root cause of this. This led straight to #2124, which it appears will resolve the issue. Once that merges, we can close this PR.

aryanputta · 2026-05-21T21:14:23Z

@aryanputta Thank you very much for diagnosing the root cause of this. This led straight to #2124, which it appears will resolve the issue. Once that merges, we can close this PR.

glad to help!!

* tests: layered defense against IPC child-process hangs (#2004) Replace the per-test kill-on-timeout approach with three layered fixes that defend the IPC test suite against compute-sanitizer + CUDA driver / toolkit combinations that wedge child processes during IPC teardown. CI layer: ci/tools/setup-sanitizer now passes --target-processes=application-only so spawned multiprocessing.Process children run without the sanitizer attached. This eliminates the root cause of the hang reported in issue #2004 (compute-sanitizer's IPC teardown analysis getting stuck under Python 3.12 + CUDA 12.9.1). The parent pytest process is still fully sanitized. Fixture layer: ipc_device and ipc_mempool_device_x2 fixtures now wrap their yield in track_child_processes(), a context manager that patches multiprocessing.process.BaseProcess.__init__ to record every Process instance constructed during the test and kills any survivor at teardown. This protects ipc_memory_resource.mr.close() from blocking on IPC handles held by a stuck child, regardless of the original failure mode. Outer-guard layer: pytest-timeout is added to the test dep group, and tests/memory_ipc/conftest.py applies pytest.mark.timeout(300) to every test in the directory. This is the final fallback so no IPC test can wedge the GHA runner for hours if a new failure mode defeats the earlier layers. The hardcoded CHILD_TIMEOUT_SEC = 30 in every IPC test module is replaced with a call to helpers.child_processes.child_timeout_sec(), which returns 30 by default and 120 under compute-sanitizer (detected via the existing under_compute_sanitizer() helper). The unused CHILD_TIMEOUT_SEC declaration in test_workerpool.py is removed. * tests: reframe IPC hang docs as test-side bug rather than CS bug Reword the inline comment in ci/tools/setup-sanitizer and the docstrings in helpers/child_processes.py and memory_ipc/conftest.py to make clear that the underlying problem is insufficient guards in the IPC tests when child processes spawn slowly (>30s under compute-sanitizer). The sanitizer change and the new helpers are mitigations / safety nets; the durable fix is making the tests handle slow children correctly. * tests: add kill_subprocesses helper and apply to IPC tests Make every IPC test responsible for its own child-process cleanup, rather than leaning on the fixture-level tracker as the primary mechanism. helpers.child_processes.kill_subprocesses(*processes) returns the list of processes that were still alive when called and kills them. Tests pair this with an "assert not survivors" check, so a timeout produces a clean failure message ("timed out waiting on: ['Process-3']") instead of "assert None == 0", and the held IPC handles are released before any further test code runs. Every join+exitcode pattern in tests/memory_ipc/ is converted to the new shape: process.join(timeout=CHILD_TIMEOUT_SEC) survivors = kill_subprocesses(process) assert not survivors, "child did not exit within timeout" assert process.exitcode == 0 For test_send_buffers.py:TestIpcReexport (the test from #2121), the event.wait return value is also captured and asserted on so a timeout there is reported explicitly rather than dropped on the floor. The autouse track_child_processes() context manager in the ipc_device fixture remains as defense in depth for any future test that forgets the pattern. memory_ipc/test_errors.py adds a meta-test test_outer_timeout_marker_is_applied that verifies tests/memory_ipc/conftest.py is loaded and applies the pytest.mark.timeout(300) marker. This catches the "nested conftest silently not picked up" failure mode at test time with a clear error message. * tests: add memory_ipc/__init__.py to avoid conftest module collision Without __init__.py in tests/memory_ipc/, pytest imports tests/memory_ipc/conftest.py under module name `conftest`, which collides with the top-level tests/conftest.py. Tests under tests/ that use `from conftest import X` then resolve to the wrong file at import time, yielding ImportErrors like: ImportError: cannot import name 'skipif_need_cuda_headers' from 'conftest' (.../tests/memory_ipc/conftest.py) The fix mirrors tests/system/ (which already has both __init__.py and conftest.py): adding tests/memory_ipc/__init__.py makes memory_ipc a proper package, so its conftest is imported as memory_ipc.conftest and the top-level conftest stays under name `conftest`. * tests: scale memory_ipc outer timeout with child_timeout_sec Replace the fixed 300 s pytest.mark.timeout with 3 * child_timeout_sec(). This is the minimum that lets the worst-case test in the suite (currently TestIpcReexport, with three sequential CHILD_TIMEOUT_SEC waits in the failure path) reach its own asserts before the outer guard fires. Resulting budgets: - Without compute-sanitizer: 90 s (down from 300 s). - Under compute-sanitizer: 360 s (up from 300 s -- the previous value would have killed TestIpcReexport mid-second-join before its "process C did not signal completion within timeout" assertion could produce a clean failure message). The meta-test test_outer_timeout_marker_is_applied now computes the expected timeout from the same formula so the two stay in sync. * tests: tighten memory_ipc outer timeout to CHILD_TIMEOUT_SEC + 30 The previous 3 * CHILD_TIMEOUT_SEC scaling assumed worst-case wall-clock where every sequential join/wait hits its full timeout. In practice the children run concurrently, so expected wall-clock is ~CHILD_TIMEOUT_SEC regardless of how many joins the test chains -- once a child is done its join returns immediately. Exceeding CHILD_TIMEOUT_SEC + slack already means something is genuinely stuck, in which case the outer guard firing is the right outcome; the autouse track_child_processes() context manager still cleans up survivors, and the per-test diagnostic message would not be more informative than "test exceeded its budget". New budgets: - Without compute-sanitizer: 60 s (was 90 s). - Under compute-sanitizer: 150 s (was 360 s). The meta-test computes its expected value from the same formula.

Removed preview folders for the following PRs: - PR #2121

github-actions Bot added the cuda.core Everything related to the cuda.core module label May 21, 2026

aryanputta force-pushed the fix/ipc-test-zombie-hang branch from 72ccbe7 to df61e49 Compare May 21, 2026 16:10

rwgk requested a review from Andy-Jost May 21, 2026 16:17

rwgk assigned aryanputta May 21, 2026

rwgk added the P0 High priority - Must do! label May 21, 2026

rwgk added this to the cuda.core next milestone May 21, 2026

rwgk added the bug Something isn't working label May 21, 2026

rwgk self-requested a review May 21, 2026 16:23

aryanputta force-pushed the fix/ipc-test-zombie-hang branch from 948d30c to 5b802a0 Compare May 21, 2026 16:27

rwgk approved these changes May 21, 2026

View reviewed changes

Merge branch 'main' into fix/ipc-test-zombie-hang

036a8f5

rwgk mentioned this pull request May 21, 2026

nvbug-6193808: Work around mojibake in nvml.system_get_process_name on WSL #2118

Merged

aryanputta closed this May 21, 2026

aryanputta reopened this May 21, 2026

Andy-Jost mentioned this pull request May 21, 2026

tests: layered defense against IPC child-process hangs (#2004) #2124

Merged

tests: add missing sys import to event_ipc, dup_import, peer_access

4cbc531

The kill() calls print to sys.stderr -- these three files were missing the import and would have raised NameError at runtime under the hang path.

leofang closed this May 22, 2026

github-actions Bot pushed a commit that referenced this pull request May 22, 2026

Clean up PR preview folders for 1 closed/merged PRs

c945a7c

Removed preview folders for the following PRs: - PR #2121

Conversation

aryanputta commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Root cause

Fix

Related

Uh oh!

copy-pr-bot Bot commented May 21, 2026

Uh oh!

rwgk commented May 21, 2026

Uh oh!

rwgk left a comment

Choose a reason for hiding this comment

PR 2121 Initial Review

Residual risks

Uh oh!

rwgk commented May 21, 2026

Uh oh!

rwgk commented May 21, 2026

Uh oh!

aryanputta commented May 21, 2026

Uh oh!

rwgk commented May 21, 2026

Uh oh!

aryanputta commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 21, 2026

Preview will be ready when the GitHub Pages deployment is complete.

Uh oh!

rwgk commented May 21, 2026

Uh oh!

rwgk commented May 21, 2026

Uh oh!

rwgk commented May 21, 2026

Uh oh!

aryanputta commented May 21, 2026

Uh oh!

Andy-Jost commented May 21, 2026

Uh oh!

aryanputta commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

aryanputta commented May 21, 2026 •

edited

Loading

aryanputta commented May 21, 2026 •

edited

Loading