tests: layered defense against IPC child-process hangs (#2004) #2124
Merged: rwgk merged 6 commits into NVIDIA:main from Andy-Jost:ajost/ipc-test-layered-hang-guard on May 21, 2026
a026a05 tests: layered defense against IPC child-process hangs (#2004) (Andy-Jost)
d65186e tests: reframe IPC hang docs as test-side bug rather than CS bug (Andy-Jost)
d1dbdff tests: add kill_subprocesses helper and apply to IPC tests (Andy-Jost)
63e17a9 tests: add memory_ipc/__init__.py to avoid conftest module collision (Andy-Jost)
fe5b914 tests: scale memory_ipc outer timeout with child_timeout_sec (Andy-Jost)
a476a2c tests: tighten memory_ipc outer timeout to CHILD_TIMEOUT_SEC + 30 (Andy-Jost)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,103 @@ | ||
| # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| """Helpers for tests that spawn ``multiprocessing.Process`` children. | ||
|
|
||
| These exist primarily to defend IPC tests against a class of CI hang where a | ||
| child process spawns too slowly and the parent does not implement proper guards | ||
| for that (see issue #2004). Without intervention, a zombie child holds an IPC | ||
| memory handle and blocks the parent's ``mr.close()`` in fixture teardown, | ||
| leading to deadlock and wedging the test runner for hours. | ||
| """ | ||
|
|
||
| import contextlib | ||
| import multiprocessing.process | ||
| import weakref | ||
|
|
||
| from cuda_python_test_helpers import under_compute_sanitizer | ||
|
|
||
| CHILD_TIMEOUT_SEC_DEFAULT = 30 | ||
| CHILD_TIMEOUT_SEC_SANITIZER = 120 | ||
|
|
||
|
|
||
| def child_timeout_sec() -> int: | ||
| """Return the per-process join/wait timeout for IPC-style tests. | ||
|
|
||
| Compute-sanitizer significantly slows process startup and CUDA context | ||
| teardown, so we use a larger budget when it is active. | ||
| """ | ||
| return CHILD_TIMEOUT_SEC_SANITIZER if under_compute_sanitizer() else CHILD_TIMEOUT_SEC_DEFAULT | ||
|
|
||
|
|
||
| def kill_subprocesses(*processes): | ||
| """Kill any of the given Process objects that are still alive. | ||
|
|
||
| Returns the list of processes that were killed (i.e. that were still alive | ||
| when the call was made). Callers should ``assert not survivors`` to convert | ||
| a non-empty return value into a clean test failure, e.g.:: | ||
|
|
||
| proc_a.join(timeout=CHILD_TIMEOUT_SEC) | ||
| proc_b.join(timeout=CHILD_TIMEOUT_SEC) | ||
| survivors = kill_subprocesses(proc_a, proc_b) | ||
| assert not survivors, f"timed out waiting on: {[p.name for p in survivors]}" | ||
| assert proc_a.exitcode == 0 | ||
| assert proc_b.exitcode == 0 | ||
|
|
||
| Killing survivors before the subsequent asserts prevents a zombie child | ||
| from holding IPC handles past the test body and blocking fixture | ||
| teardown. | ||
| """ | ||
| killed = [] | ||
| for proc in processes: | ||
| try: | ||
| alive = proc.is_alive() | ||
| except (ValueError, AssertionError): | ||
| # is_alive() raises if the Process was never started or has | ||
| # already been closed; nothing to clean up. | ||
| continue | ||
| if not alive: | ||
| continue | ||
| with contextlib.suppress(ValueError, AssertionError): | ||
| proc.kill() | ||
| proc.join() | ||
| killed.append(proc) | ||
| return killed | ||
|
|
||
|
|
||
| @contextlib.contextmanager | ||
| def track_child_processes(): | ||
| """Context manager that kills any ``multiprocessing.Process`` children still | ||
| alive at exit. | ||
|
|
||
| Patches ``multiprocessing.process.BaseProcess.__init__`` to record every | ||
| ``Process`` instance constructed inside the ``with`` block. This covers | ||
| the delegating ``mp.Process`` class as well as direct ``SpawnProcess`` / | ||
| ``ForkProcess`` instances (including those created by ``mp.Pool``), since | ||
| all of them inherit from ``BaseProcess``. On exit, any tracked process | ||
| that is still alive is killed and joined. | ||
|
|
||
| This protects fixture teardown (e.g. ``ipc_memory_resource``'s | ||
| ``mr.close()``) from blocking on IPC handles held by a stuck child -- | ||
| see issue #2004. | ||
| """ | ||
| tracked = weakref.WeakSet() | ||
| base = multiprocessing.process.BaseProcess | ||
| original_init = base.__init__ | ||
|
|
||
| def tracking_init(self, *args, **kwargs): | ||
| original_init(self, *args, **kwargs) | ||
| tracked.add(self) | ||
|
|
||
| base.__init__ = tracking_init | ||
| try: | ||
| yield | ||
| finally: | ||
| base.__init__ = original_init | ||
| for proc in list(tracked): | ||
| # is_alive() / kill() raise ValueError if the Process was never | ||
| # started or has already been closed; nothing to clean up in that | ||
| # case. | ||
| with contextlib.suppress(ValueError): | ||
| if proc.is_alive(): | ||
| proc.kill() | ||
| proc.join() |
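
The `BaseProcess.__init__` patching described in the `track_child_processes()` docstring can be tried in isolation. The sketch below is a stripped-down variant, not the code from this PR: the name `record_processes` is hypothetical, the kill-and-join step is left out, and the tracked set is yielded so the caller can inspect what was recorded. No child process is started.

```python
import contextlib
import multiprocessing
import multiprocessing.process
import weakref


@contextlib.contextmanager
def record_processes():
    """Yield a WeakSet of every Process constructed inside the block.

    Hypothetical helper for illustration only; it records processes but
    does not kill them.
    """
    tracked = weakref.WeakSet()
    base = multiprocessing.process.BaseProcess
    original_init = base.__init__

    def tracking_init(self, *args, **kwargs):
        original_init(self, *args, **kwargs)
        tracked.add(self)

    base.__init__ = tracking_init
    try:
        yield tracked
    finally:
        # Always undo the patch, even if the block raised.
        base.__init__ = original_init


def _noop():
    pass


with record_processes() as seen:
    # Constructed inside the block, never started.
    inside = multiprocessing.Process(target=_noop, name="inside")

# Constructed after the patch was removed, so it is not recorded.
outside = multiprocessing.Process(target=_noop, name="outside")

recorded_names = sorted(p.name for p in seen)
```

Because the set holds weak references, a process that the test no longer references can drop out of it; only objects that are still reachable at exit are candidates for cleanup.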
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # | ||
| # SPDX-License-Identifier: Apache-2.0 |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,41 @@ | ||
| # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| """Per-directory conftest for memory IPC tests. | ||
|
|
||
| Applies an outer-guard ``pytest.mark.timeout`` to every test in this directory. | ||
| Individual tests still drive their own per-process waits using | ||
| ``child_timeout_sec()`` from ``helpers.child_processes``; this marker is the | ||
| final fallback so that no IPC test can wedge the CI runner for hours if | ||
| deadlock occurs. | ||
| """ | ||
|
|
||
| import pathlib | ||
|
|
||
| import pytest | ||
| from helpers.child_processes import child_timeout_sec | ||
|
|
||
| _HERE = pathlib.Path(__file__).parent.resolve() | ||
|
|
||
|
|
||
| def _outer_timeout_sec() -> int: | ||
| # IPC tests spawn children that run concurrently, so expected wall-clock | ||
| # is ~CHILD_TIMEOUT_SEC regardless of how many subsequent join/wait | ||
| # timeouts the test chains together (each subsequent join returns | ||
| # immediately once its child is already done). Exceeding that already | ||
| # means something is genuinely stuck, at which point the outer guard | ||
| # firing is the right outcome -- the per-test asserts wouldn't add | ||
| # useful diagnostic value over "test exceeded its budget", and the | ||
| # autouse track_child_processes() context manager still cleans up. | ||
| return child_timeout_sec() + 30 | ||
|
|
||
|
|
||
| def pytest_collection_modifyitems(config, items): | ||
| marker = pytest.mark.timeout(_outer_timeout_sec()) | ||
| for item in items: | ||
| try: | ||
| item_path = pathlib.Path(str(item.fspath)).resolve() | ||
| except OSError: | ||
| continue | ||
| if _HERE in item_path.parents: | ||
| item.add_marker(marker) |
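
The budget reasoning in `_outer_timeout_sec()` assumes that later joins return immediately once their child has finished. If several children are stuck, sequential `join(timeout=...)` calls each wait their full timeout and the total can exceed the outer guard. One possible way to keep the total bounded, which is not part of this PR, is to join against a single shared deadline. The sketch below assumes objects with the `join(timeout)` and `is_alive()` methods of `multiprocessing.Process`; the name `join_with_deadline` is hypothetical.

```python
import time


def join_with_deadline(processes, budget_sec):
    """Join every process against one shared deadline.

    Returns the processes that are still alive once the budget is spent.
    Hypothetical helper; it does not kill anything itself.
    """
    deadline = time.monotonic() + budget_sec
    for proc in processes:
        # Each join gets only the time left in the shared budget.
        remaining = max(0.0, deadline - time.monotonic())
        proc.join(timeout=remaining)
    return [proc for proc in processes if proc.is_alive()]
```

The returned list could then be handed to a cleanup helper such as `kill_subprocesses()` from the diff above before asserting on exit codes.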
Non-blocking: since `ipc_memory_resource` depends on `ipc_device`, its teardown still runs before this `track_child_processes()` context exits. So this fixture-level guard may not protect `mr.close()` if a test fails before reaching its in-test `kill_subprocesses()` cleanup. The per-test cleanup and `pytest-timeout` still cover the known hang, so I don’t think this should block.
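
The ordering concern in this comment follows from pytest tearing fixtures down in the reverse of their setup order. The sketch below models that with plain nested context managers rather than real fixtures: `guard` stands in for whatever fixture wraps `track_child_processes()`, `memory_resource` stands in for `ipc_memory_resource`, and both bodies are placeholders that only log what would happen.

```python
import contextlib

events = []


@contextlib.contextmanager
def guard():
    # Placeholder for the fixture that wraps track_child_processes().
    events.append("guard: setup")
    try:
        yield
    finally:
        events.append("guard: kill leftover children")


@contextlib.contextmanager
def memory_resource():
    # Placeholder for ipc_memory_resource, set up after the guard.
    events.append("mr: setup")
    try:
        yield
    finally:
        # In the real fixture this is where mr.close() could block if a
        # child still held an IPC handle.
        events.append("mr: close()")


with guard():
    with memory_resource():
        events.append("test body")
```

In this model `mr: close()` is logged before `guard: kill leftover children`, which is the sequence the reviewer describes: a guard set up earlier than the resource cannot clean up children before the resource's own teardown runs.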