Skip to content

tests: layered defense against IPC child-process hangs (#2004)#2124

Merged
rwgk merged 6 commits into
NVIDIA:mainfrom
Andy-Jost:ajost/ipc-test-layered-hang-guard
May 21, 2026
Merged

tests: layered defense against IPC child-process hangs (#2004)#2124
rwgk merged 6 commits into
NVIDIA:mainfrom
Andy-Jost:ajost/ipc-test-layered-hang-guard

Conversation

@Andy-Jost
Copy link
Copy Markdown
Contributor

@Andy-Jost Andy-Jost commented May 21, 2026

Summary

Implements layered fixes that defend the IPC test suite against failures that wedge CI. With these layers in place, no IPC test can wedge the GHA runner for hours, even if a new failure mode emerges.

Deadlock was observed due to the following sequence:

  1. The parent process spawns a child and waits with a 30 second timeout.
  2. The child process fails to complete in 30 seconds and eventually blocks while attempting communication with the parent.
  3. The parent moves on, eventually blocking on mr.close() because the subprocess holds an open IPC handle.

This is a bug in the test source: the result of the parent's wait needs to be checked. #2121 fixes some of these tests; this PR generalizes the fix to every IPC test.

This was observed in Python 3.12 test runs, which run under compute-sanitizer, presumably because compute-sanitizer slows down process spawning.

Changes

CI layer. ci/tools/setup-sanitizer now passes --target-processes=application-only so spawned multiprocessing.Process children run without compute-sanitizer attached. This should improve the start-up time for spawned processes.

In-test layer. New helper helpers.child_processes.kill_subprocesses(*processes) kills any Process objects still alive at call time and returns them. Every process.join(timeout=...) site in tests/memory_ipc/ now follows the canonical pattern below, so a timeout produces a clean failure message ("timed out waiting on: ['Process-3']") instead of assert None == 0, and held IPC handles are released before any further test code runs:

process.join(timeout=CHILD_TIMEOUT_SEC)
survivors = kill_subprocesses(process)
assert not survivors, "child did not exit within timeout"
assert process.exitcode == 0

For test_send_buffers.py:TestIpcReexport (the test from #2121), the event_c.wait return value is also captured and asserted on, so a timeout there is reported explicitly instead of being dropped on the floor.

Fixture layer. ipc_device and ipc_mempool_device_x2 now wrap their yield in track_child_processes(), a context manager in helpers.child_processes that patches multiprocessing.process.BaseProcess.__init__ to record every Process instance constructed during the test and kills any survivor at teardown. This is defense in depth for tests that may still slip through the in-test layer — it protects ipc_memory_resource.mr.close() from blocking on IPC handles held by a stuck child, regardless of the original failure mode.

Outer-guard layer. pytest-timeout is added to the test dep group, and tests/memory_ipc/conftest.py applies pytest.mark.timeout(child_timeout_sec() + 30) to every test in the directory (60 s without compute-sanitizer, 150 s under it). IPC test children run concurrently, so a well-behaved test finishes in ~CHILD_TIMEOUT_SEC regardless of how many joins it chains; exceeding that already means something is genuinely stuck, at which point the outer guard firing is the right outcome (the autouse track_child_processes() context manager still cleans up survivors). A meta-test test_outer_timeout_marker_is_applied in tests/memory_ipc/test_errors.py verifies via request.node.get_closest_marker("timeout") that the per-directory conftest is being loaded and applies the marker with the expected value; if a future change breaks conftest discovery or the formula drifts, this test fails immediately with a clear message.

(tests/memory_ipc/ also gains an empty __init__.py so its conftest is imported as memory_ipc.conftest rather than colliding with the top-level conftest module that other tests reference via from conftest import ....)

Timeout helper. The hardcoded CHILD_TIMEOUT_SEC = 30 in every IPC test module is replaced with a call to helpers.child_processes.child_timeout_sec(), which returns 30 by default and 120 under compute-sanitizer. The unused CHILD_TIMEOUT_SEC declaration in test_workerpool.py is removed.

Test Coverage

  • The helpers are exercised by the existing IPC suite: every IPC test runs inside the track_child_processes() context manager via ipc_device (and ipc_mempool_device_x2), and every join/exitcode pair now goes through kill_subprocesses. Happy-path tests verify no regression; any test that hits the original hang will fail fast with a clear timeout message instead of wedging.
  • The meta-test test_outer_timeout_marker_is_applied will fail if tests/memory_ipc/conftest.py is not loaded or does not apply the timeout marker.
  • Local smoke tests confirm: kill_subprocesses returns a list of still-alive processes (after killing them with SIGKILL, exitcode -9), an empty list for cleanly-exited or never-started Process objects, and never raises on either kind. track_child_processes() correctly kills a child still alive at scope exit and leaves cleanly-exiting children untouched.
  • child_timeout_sec() returns 30 by default and 120 when CUDA_PYTHON_UNDER_SANITIZER=1 (or any of the other signals checked by under_compute_sanitizer()).

Related Work

Replace the per-test kill-on-timeout approach with three layered fixes that
defend the IPC test suite against compute-sanitizer + CUDA driver / toolkit
combinations that wedge child processes during IPC teardown.

CI layer: ci/tools/setup-sanitizer now passes
--target-processes=application-only so spawned multiprocessing.Process
children run without the sanitizer attached. This eliminates the root cause
of the hang reported in issue NVIDIA#2004 (compute-sanitizer's IPC teardown
analysis getting stuck under Python 3.12 + CUDA 12.9.1). The parent pytest
process is still fully sanitized.

Fixture layer: ipc_device and ipc_mempool_device_x2 fixtures now wrap their
yield in track_child_processes(), a context manager that patches
multiprocessing.process.BaseProcess.__init__ to record every Process
instance constructed during the test and kills any survivor at teardown.
This protects ipc_memory_resource.mr.close() from blocking on IPC handles
held by a stuck child, regardless of the original failure mode.

Outer-guard layer: pytest-timeout is added to the test dep group, and
tests/memory_ipc/conftest.py applies pytest.mark.timeout(300) to every test
in the directory. This is the final fallback so no IPC test can wedge the
GHA runner for hours if a new failure mode defeats the earlier layers.

The hardcoded CHILD_TIMEOUT_SEC = 30 in every IPC test module is replaced
with a call to helpers.child_processes.child_timeout_sec(), which returns
30 by default and 120 under compute-sanitizer (detected via the existing
under_compute_sanitizer() helper). The unused CHILD_TIMEOUT_SEC declaration
in test_workerpool.py is removed.
@Andy-Jost Andy-Jost added this to the cuda.core next milestone May 21, 2026
@Andy-Jost Andy-Jost added bug Something isn't working P0 High priority - Must do! cuda.core Everything related to the cuda.core module labels May 21, 2026
@Andy-Jost Andy-Jost self-assigned this May 21, 2026
@github-actions github-actions Bot added the CI/CD CI/CD infrastructure label May 21, 2026
@Andy-Jost Andy-Jost marked this pull request as draft May 21, 2026 17:28
@copy-pr-bot
Copy link
Copy Markdown
Contributor

copy-pr-bot Bot commented May 21, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions

This comment has been minimized.

Andy-Jost added 4 commits May 21, 2026 11:05
Reword the inline comment in ci/tools/setup-sanitizer and the docstrings in
helpers/child_processes.py and memory_ipc/conftest.py to make clear that the
underlying problem is insufficient guards in the IPC tests when child
processes spawn slowly (>30s under compute-sanitizer). The sanitizer change
and the new helpers are mitigations / safety nets; the durable fix is making
the tests handle slow children correctly.
Make every IPC test responsible for its own child-process cleanup, rather
than leaning on the fixture-level tracker as the primary mechanism.

helpers.child_processes.kill_subprocesses(*processes) returns the list of
processes that were still alive when called and kills them. Tests pair this
with an "assert not survivors" check, so a timeout produces a clean failure
message ("timed out waiting on: ['Process-3']") instead of "assert None ==
0", and the held IPC handles are released before any further test code
runs.

Every join+exitcode pattern in tests/memory_ipc/ is converted to the new
shape:

    process.join(timeout=CHILD_TIMEOUT_SEC)
    survivors = kill_subprocesses(process)
    assert not survivors, "child did not exit within timeout"
    assert process.exitcode == 0

For test_send_buffers.py:TestIpcReexport (the test from NVIDIA#2121), the
event.wait return value is also captured and asserted on so a timeout there
is reported explicitly rather than dropped on the floor.

The autouse track_child_processes() context manager in the ipc_device
fixture remains as defense in depth for any future test that forgets the
pattern.

memory_ipc/test_errors.py adds a meta-test test_outer_timeout_marker_is_applied
that verifies tests/memory_ipc/conftest.py is loaded and applies the
pytest.mark.timeout(300) marker. This catches the "nested conftest silently
not picked up" failure mode at test time with a clear error message.
Without __init__.py in tests/memory_ipc/, pytest imports
tests/memory_ipc/conftest.py under module name `conftest`, which collides
with the top-level tests/conftest.py. Tests under tests/ that use
`from conftest import X` then resolve to the wrong file at import time,
yielding ImportErrors like:

    ImportError: cannot import name 'skipif_need_cuda_headers' from
    'conftest' (.../tests/memory_ipc/conftest.py)

The fix mirrors tests/system/ (which already has both __init__.py and
conftest.py): adding tests/memory_ipc/__init__.py makes memory_ipc a
proper package, so its conftest is imported as memory_ipc.conftest and
the top-level conftest stays under name `conftest`.
Replace the fixed 300 s pytest.mark.timeout with 3 * child_timeout_sec().
This is the minimum that lets the worst-case test in the suite (currently
TestIpcReexport, with three sequential CHILD_TIMEOUT_SEC waits in the
failure path) reach its own asserts before the outer guard fires.

Resulting budgets:
- Without compute-sanitizer: 90 s (down from 300 s).
- Under compute-sanitizer:  360 s (up from 300 s -- the previous value
  would have killed TestIpcReexport mid-second-join before its
  "process C did not signal completion within timeout" assertion could
  produce a clean failure message).

The meta-test test_outer_timeout_marker_is_applied now computes the
expected timeout from the same formula so the two stay in sync.
@Andy-Jost
Copy link
Copy Markdown
Contributor Author

/ok to test

The previous 3 * CHILD_TIMEOUT_SEC scaling assumed worst-case wall-clock
where every sequential join/wait hits its full timeout. In practice the
children run concurrently, so expected wall-clock is ~CHILD_TIMEOUT_SEC
regardless of how many joins the test chains -- once a child is done its
join returns immediately. Exceeding CHILD_TIMEOUT_SEC + slack already
means something is genuinely stuck, in which case the outer guard firing
is the right outcome; the autouse track_child_processes() context
manager still cleans up survivors, and the per-test diagnostic message
would not be more informative than "test exceeded its budget".

New budgets:
- Without compute-sanitizer:  60 s (was 90 s).
- Under compute-sanitizer:   150 s (was 360 s).

The meta-test computes its expected value from the same formula.
@Andy-Jost
Copy link
Copy Markdown
Contributor Author

/ok to test

@Andy-Jost Andy-Jost marked this pull request as ready for review May 21, 2026 19:23
@Andy-Jost Andy-Jost requested review from leofang and rwgk May 21, 2026 19:24
Copy link
Copy Markdown
Contributor

@rwgk rwgk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One caveat identified by GPT-5.5. But the CI succeeds now, so it can wait until later, or if we actually see problems.


return _require_ipc_mempool_devices((device,))[0]
device = _require_ipc_mempool_devices((device,))[0]
with track_child_processes():
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking: since ipc_memory_resource depends on ipc_device, its teardown still runs before this track_child_processes() context exits. So this fixture-level guard may not protect mr.close() if a test fails before reaching its in-test kill_subprocesses() cleanup. The per-test cleanup and pytest-timeout still cover the known hang, so I don’t think this should block.

@rwgk rwgk merged commit ae34e4c into NVIDIA:main May 21, 2026
100 checks passed
@rwgk
Copy link
Copy Markdown
Contributor

rwgk commented May 21, 2026

I merged: I'll update #2118 next.

@github-actions
Copy link
Copy Markdown

Doc Preview CI
Preview removed because the pull request was closed or merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working CI/CD CI/CD infrastructure cuda.core Everything related to the cuda.core module P0 High priority - Must do!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

memory_ipc/test_send_buffers.py::TestIpcReexport::test_main[DeviceMR] hanging in CI

2 participants