Skip to content

tests: kill zombie IPC child processes after join timeout#2121

Closed
aryanputta wants to merge 5 commits into
NVIDIA:mainfrom
aryanputta:fix/ipc-test-zombie-hang
Closed

tests: kill zombie IPC child processes after join timeout#2121
aryanputta wants to merge 5 commits into
NVIDIA:mainfrom
aryanputta:fix/ipc-test-zombie-hang

Conversation

@aryanputta
Copy link
Copy Markdown
Contributor

@aryanputta aryanputta commented May 21, 2026

What

TestIpcSendBuffers and TestIpcReexport call process.join(timeout=30) to wait for child processes but never kill them if the join times out. This causes multi-hour CI hangs.

Root cause

ci/tools/env-vars enables compute-sanitizer (--target-processes=all) on Python 3.12 + local CTK + Linux — exactly the combination in every hanging run from issue #2004. With --target-processes=all, compute-sanitizer attaches to every mp.Process child. On CUDA 12.9.1 its IPC teardown analysis gets stuck, so the child never exits.

The parent's join(timeout=30) returns after 30 seconds, but the child is still alive. That zombie holds an open IPC memory handle. When pytest runs fixture teardown (ipc_memory_resource's mr.close()), it blocks on the held handle indefinitely. The runner gets tied up for 4–6 hours until GitHub Actions force-cancels the job.

Fix

After join(timeout=...), kill any process that is still alive. This forces IPC handle release before fixture teardown runs. The test still fails (exit code is non-zero, or completed is False), just in seconds instead of hours.

No behavior change when children exit cleanly.

Related

Closes #2004

@copy-pr-bot
Copy link
Copy Markdown
Contributor

copy-pr-bot Bot commented May 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the cuda.core Everything related to the cuda.core module label May 21, 2026
When Python 3.12 CI runs, env-vars enables compute-sanitizer with
--target-processes=all, which attaches to every mp.Process child the
tests spawn. On CUDA 12.9.1 the sanitizer analysis of IPC buffer teardown
gets stuck, so child processes never exit. The existing
join(timeout=CHILD_TIMEOUT_SEC) returns but leaves the child alive.

That zombie keeps its IPC handle open. When pytest teardown runs
ipc_memory_resource's mr.close(), it blocks waiting for the handle to
be released — tying up the runner for hours until GitHub Actions
force-cancels the job. This is the exact pattern in issue NVIDIA#2004
(always Python 3.12 + CUDA 12.9.1 local).

Fix: after join(timeout=...), kill any process still alive so the IPC
handle is released before fixture teardown. Tests still fail (exit code
is non-zero or completed is False), just in seconds rather than hours.

Fixes NVIDIA#2004
@aryanputta aryanputta force-pushed the fix/ipc-test-zombie-hang branch from 72ccbe7 to df61e49 Compare May 21, 2026 16:10
@rwgk rwgk requested a review from Andy-Jost May 21, 2026 16:17
@rwgk rwgk added the P0 High priority - Must do! label May 21, 2026
@rwgk rwgk added this to the cuda.core next milestone May 21, 2026
@rwgk rwgk added the bug Something isn't working label May 21, 2026
@rwgk rwgk self-requested a review May 21, 2026 16:23
@rwgk
Copy link
Copy Markdown
Contributor

rwgk commented May 21, 2026

Thanks @aryanputta for jumping in! I'm running a Cursor code review right now...

When Python 3.12 CI runs, env-vars enables compute-sanitizer with
--target-processes=all, which attaches to every mp.Process child the
tests spawn. On CUDA 12.9.1 the sanitizer analysis of IPC buffer teardown
gets stuck, so child processes never exit. The existing
join(timeout=CHILD_TIMEOUT_SEC) returns but leaves the child alive.

That zombie keeps its IPC handle open. When pytest teardown runs
ipc_memory_resource's mr.close(), it blocks waiting for the handle to
be released -- tying up the runner for hours until GitHub Actions
force-cancels the job. This is the exact pattern in issue NVIDIA#2004
(always Python 3.12 + CUDA 12.9.1 local).

Fix: after join(timeout=...), kill any process still alive so the IPC
handle is released before fixture teardown. A stderr warning is printed
when kill() fires so the failure is clearly attributable to the
sanitizer/IPC deadlock rather than appearing as a generic exitcode != 0.
Tests still fail (exit code is non-zero or completed is False), just in
seconds rather than hours.

Fixes NVIDIA#2004
@aryanputta aryanputta force-pushed the fix/ipc-test-zombie-hang branch from 948d30c to 5b802a0 Compare May 21, 2026 16:27
Copy link
Copy Markdown
Contributor

@rwgk rwgk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR 2121 Initial Review

No blocking findings.

The change matches the issue #2004 observed failure mode: after a child process
exceeds join(timeout=30), it is now killed before parent-side teardown can
block on IPC handles. The TestIpcReexport path also now asserts completed,
so a timed-out child becomes a real fast failure instead of quietly continuing
into teardown.

Residual risks

  • Other memory_ipc tests also use join(timeout=CHILD_TIMEOUT_SEC) without
    killing children. The evidence in issue 2004 points specifically at
    test_send_buffers.py, so I would not block this PR on broadening the fix,
    but it may be worth a follow-up helper if hangs reappear elsewhere.
  • Full CI has not run yet because the external-contributor PR is awaiting
    validation. Current checks are only metadata/pre-commit.
  • I could not run the focused test locally because TestVenv appears stale for
    this branch and fails importing cuda.core._memory._managed_buffer.
    git diff --check passed.

@rwgk
Copy link
Copy Markdown
Contributor

rwgk commented May 21, 2026

Oh - @aryanputta please don't force push.

@rwgk
Copy link
Copy Markdown
Contributor

rwgk commented May 21, 2026

/ok to test 036a8f5

@aryanputta
Copy link
Copy Markdown
Contributor Author

Oh - @aryanputta please don't force push.

sorry!!

@rwgk
Copy link
Copy Markdown
Contributor

rwgk commented May 21, 2026

Oh - @aryanputta please don't force push.

sorry!!

No worries. It was only very slightly confusing.

Holding my breath that the two troublesome jobs (under PR #2118) will sail through now.

@aryanputta
Copy link
Copy Markdown
Contributor Author

aryanputta commented May 21, 2026

Oh - @aryanputta please don't force push.

sorry!!

No worries. It was only very slightly confusing.

Holding my breath that the two troublesome jobs (under PR #2118) will sail through now.

haha, me too, I always test locally aswell, i got no issues so hopefully its the same..

edit, did i stop it by accdient, i clicked comment and it think i clicked stop.. I am so Sorry

@aryanputta aryanputta closed this May 21, 2026
@github-actions
Copy link
Copy Markdown

@aryanputta aryanputta reopened this May 21, 2026
@rwgk
Copy link
Copy Markdown
Contributor

rwgk commented May 21, 2026

CI / Test linux-64 / Python 3.12, CUDA 13.2.1 (wheels), GPU rtx4090, wsl (push)

seems to be a flake. I'll keep an eye out here and will rerun when the rest has finished.

@rwgk
Copy link
Copy Markdown
Contributor

rwgk commented May 21, 2026

This still seems to hang unfortunately:

CI / Test linux-64 / Python 3.12, CUDA 12.9.1 (local), GPU l4 (push)

@Andy-Jost for visibility

@aryanputta FYI: Andy is experimenting under #2124 with broader fixes.

@rwgk
Copy link
Copy Markdown
Contributor

rwgk commented May 21, 2026

I cancelled the hanging job. Waiting for @Andy-Jost's experiments under #2124.

The fix in the initial commit only covered test_send_buffers.py. The
63-minute CI hang confirmed the deadlock was occurring in one of the
other test files. Applies the same kill() after join(timeout=...) pattern
to all remaining memory_ipc tests:

- test_errors.py (1 join)
- test_event_ipc.py (2 joins)
- test_ipc_duplicate_import.py (1 join)
- test_leaks.py (3 joins)
- test_memory_ipc.py (7 joins across 4 test classes)
- test_peer_access.py (2 joins)
- test_serialize.py (3 joins)

See issue NVIDIA#2004.
@aryanputta
Copy link
Copy Markdown
Contributor Author

hey @rwgk @Andy-Jost -- after seeing the 63min hang in ci (step 29 died in the cuda.core test run) i checked the rest of the memory_ipc suite and found 7 more files with bare join(timeout=CHILD_TIMEOUT_SEC) and no kill after. the original commit here only covered test_send_buffers.py so the hang just moved to the next unprotected test.

just pushed a second commit that extends the same kill-after-join pattern to all remaining files:

  • test_memory_ipc.py (7 joins across 4 classes)
  • test_leaks.py (3 joins)
  • test_serialize.py (3 joins)
  • test_event_ipc.py (2 joins)
  • test_peer_access.py (2 joins)
  • test_errors.py (1 join)
  • test_ipc_duplicate_import.py (1 join)

andy's layered approach in #2124 is the cleaner fix so please prioritize that -- but if you need a fallback while #2124 is still in draft, this pr now covers the full suite. no need to run ci here unless #2124 doesn't work out

The kill() calls print to sys.stderr -- these three files were missing
the import and would have raised NameError at runtime under the hang path.
Andy-Jost added a commit to Andy-Jost/cuda-python that referenced this pull request May 21, 2026
Make every IPC test responsible for its own child-process cleanup, rather
than leaning on the fixture-level tracker as the primary mechanism.

helpers.child_processes.kill_subprocesses(*processes) returns the list of
processes that were still alive when called and kills them. Tests pair this
with an "assert not survivors" check, so a timeout produces a clean failure
message ("timed out waiting on: ['Process-3']") instead of "assert None ==
0", and the held IPC handles are released before any further test code
runs.

Every join+exitcode pattern in tests/memory_ipc/ is converted to the new
shape:

    process.join(timeout=CHILD_TIMEOUT_SEC)
    survivors = kill_subprocesses(process)
    assert not survivors, "child did not exit within timeout"
    assert process.exitcode == 0

For test_send_buffers.py:TestIpcReexport (the test from NVIDIA#2121), the
event.wait return value is also captured and asserted on so a timeout there
is reported explicitly rather than dropped on the floor.

The autouse track_child_processes() context manager in the ipc_device
fixture remains as defense in depth for any future test that forgets the
pattern.

memory_ipc/test_errors.py adds a meta-test test_outer_timeout_marker_is_applied
that verifies tests/memory_ipc/conftest.py is loaded and applies the
pytest.mark.timeout(300) marker. This catches the "nested conftest silently
not picked up" failure mode at test time with a clear error message.
@Andy-Jost
Copy link
Copy Markdown
Contributor

@aryanputta Thank you very much for diagnosing the root cause of this. This led straight to #2124, which it appears will resolve the issue. Once that merges, we can close this PR.

@aryanputta
Copy link
Copy Markdown
Contributor Author

@aryanputta Thank you very much for diagnosing the root cause of this. This led straight to #2124, which it appears will resolve the issue. Once that merges, we can close this PR.

glad to help!!

rwgk pushed a commit that referenced this pull request May 21, 2026
* tests: layered defense against IPC child-process hangs (#2004)

Replace the per-test kill-on-timeout approach with three layered fixes that
defend the IPC test suite against compute-sanitizer + CUDA driver / toolkit
combinations that wedge child processes during IPC teardown.

CI layer: ci/tools/setup-sanitizer now passes
--target-processes=application-only so spawned multiprocessing.Process
children run without the sanitizer attached. This eliminates the root cause
of the hang reported in issue #2004 (compute-sanitizer's IPC teardown
analysis getting stuck under Python 3.12 + CUDA 12.9.1). The parent pytest
process is still fully sanitized.

Fixture layer: ipc_device and ipc_mempool_device_x2 fixtures now wrap their
yield in track_child_processes(), a context manager that patches
multiprocessing.process.BaseProcess.__init__ to record every Process
instance constructed during the test and kills any survivor at teardown.
This protects ipc_memory_resource.mr.close() from blocking on IPC handles
held by a stuck child, regardless of the original failure mode.

Outer-guard layer: pytest-timeout is added to the test dep group, and
tests/memory_ipc/conftest.py applies pytest.mark.timeout(300) to every test
in the directory. This is the final fallback so no IPC test can wedge the
GHA runner for hours if a new failure mode defeats the earlier layers.

The hardcoded CHILD_TIMEOUT_SEC = 30 in every IPC test module is replaced
with a call to helpers.child_processes.child_timeout_sec(), which returns
30 by default and 120 under compute-sanitizer (detected via the existing
under_compute_sanitizer() helper). The unused CHILD_TIMEOUT_SEC declaration
in test_workerpool.py is removed.

* tests: reframe IPC hang docs as test-side bug rather than CS bug

Reword the inline comment in ci/tools/setup-sanitizer and the docstrings in
helpers/child_processes.py and memory_ipc/conftest.py to make clear that the
underlying problem is insufficient guards in the IPC tests when child
processes spawn slowly (>30s under compute-sanitizer). The sanitizer change
and the new helpers are mitigations / safety nets; the durable fix is making
the tests handle slow children correctly.

* tests: add kill_subprocesses helper and apply to IPC tests

Make every IPC test responsible for its own child-process cleanup, rather
than leaning on the fixture-level tracker as the primary mechanism.

helpers.child_processes.kill_subprocesses(*processes) returns the list of
processes that were still alive when called and kills them. Tests pair this
with an "assert not survivors" check, so a timeout produces a clean failure
message ("timed out waiting on: ['Process-3']") instead of "assert None ==
0", and the held IPC handles are released before any further test code
runs.

Every join+exitcode pattern in tests/memory_ipc/ is converted to the new
shape:

    process.join(timeout=CHILD_TIMEOUT_SEC)
    survivors = kill_subprocesses(process)
    assert not survivors, "child did not exit within timeout"
    assert process.exitcode == 0

For test_send_buffers.py:TestIpcReexport (the test from #2121), the
event.wait return value is also captured and asserted on so a timeout there
is reported explicitly rather than dropped on the floor.

The autouse track_child_processes() context manager in the ipc_device
fixture remains as defense in depth for any future test that forgets the
pattern.

memory_ipc/test_errors.py adds a meta-test test_outer_timeout_marker_is_applied
that verifies tests/memory_ipc/conftest.py is loaded and applies the
pytest.mark.timeout(300) marker. This catches the "nested conftest silently
not picked up" failure mode at test time with a clear error message.

* tests: add memory_ipc/__init__.py to avoid conftest module collision

Without __init__.py in tests/memory_ipc/, pytest imports
tests/memory_ipc/conftest.py under module name `conftest`, which collides
with the top-level tests/conftest.py. Tests under tests/ that use
`from conftest import X` then resolve to the wrong file at import time,
yielding ImportErrors like:

    ImportError: cannot import name 'skipif_need_cuda_headers' from
    'conftest' (.../tests/memory_ipc/conftest.py)

The fix mirrors tests/system/ (which already has both __init__.py and
conftest.py): adding tests/memory_ipc/__init__.py makes memory_ipc a
proper package, so its conftest is imported as memory_ipc.conftest and
the top-level conftest stays under name `conftest`.

* tests: scale memory_ipc outer timeout with child_timeout_sec

Replace the fixed 300 s pytest.mark.timeout with 3 * child_timeout_sec().
This is the minimum that lets the worst-case test in the suite (currently
TestIpcReexport, with three sequential CHILD_TIMEOUT_SEC waits in the
failure path) reach its own asserts before the outer guard fires.

Resulting budgets:
- Without compute-sanitizer: 90 s (down from 300 s).
- Under compute-sanitizer:  360 s (up from 300 s -- the previous value
  would have killed TestIpcReexport mid-second-join before its
  "process C did not signal completion within timeout" assertion could
  produce a clean failure message).

The meta-test test_outer_timeout_marker_is_applied now computes the
expected timeout from the same formula so the two stay in sync.

* tests: tighten memory_ipc outer timeout to CHILD_TIMEOUT_SEC + 30

The previous 3 * CHILD_TIMEOUT_SEC scaling assumed worst-case wall-clock
where every sequential join/wait hits its full timeout. In practice the
children run concurrently, so expected wall-clock is ~CHILD_TIMEOUT_SEC
regardless of how many joins the test chains -- once a child is done its
join returns immediately. Exceeding CHILD_TIMEOUT_SEC + slack already
means something is genuinely stuck, in which case the outer guard firing
is the right outcome; the autouse track_child_processes() context
manager still cleans up survivors, and the per-test diagnostic message
would not be more informative than "test exceeded its budget".

New budgets:
- Without compute-sanitizer:  60 s (was 90 s).
- Under compute-sanitizer:   150 s (was 360 s).

The meta-test computes its expected value from the same formula.
@leofang leofang closed this May 22, 2026
github-actions Bot pushed a commit that referenced this pull request May 22, 2026
Removed preview folders for the following PRs:
- PR #2121
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working cuda.core Everything related to the cuda.core module P0 High priority - Must do!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

memory_ipc/test_send_buffers.py::TestIpcReexport::test_main[DeviceMR] hanging in CI

4 participants