
Harden pinned NUMA mempool tests against constructor OOM flakes#2096

Merged
rwgk merged 1 commit into NVIDIA:main from rwgk:nvbugs5815123_addl_create_pinned_memory_resource_or_xfail on May 16, 2026

Conversation

@rwgk
Contributor

@rwgk rwgk commented May 16, 2026

This is a small follow-up to PR #2084.

The cuda_core_5.15.log attached to nvbug 5815123 shows two remaining intermittent cuda_core failures after the PR #2084 patch:

  • tests/test_memory.py::test_pinned_mr_numa_id_default_no_ipc
  • tests/test_memory.py::test_pinned_mr_numa_id_explicit

Both tests are meant to validate PinnedMemoryResource.numa_id behavior, but they instantiate PinnedMemoryResource(PinnedMemoryResourceOptions(...)) directly. Those constructor paths create a real pinned memory pool immediately, so they can still hit the same constructor-time CUDA_ERROR_OUT_OF_MEMORY failure mode that is already handled elsewhere in the Windows mempool workaround coverage.

This change routes those constructor sites through the existing create_pinned_memory_resource_or_xfail(...) helper. It also updates the nearby test_pinned_mr_numa_id_default_with_ipc case for consistency, since it exercises the same pool-creation path.

The goal is to keep these tests focused on NUMA-ID semantics instead of failing on the known Windows MCDM mempool-constructor flake.
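The pattern described above can be sketched in a few lines of pytest. This is a hypothetical illustration of the idea, not the actual cuda.core code: `CudaOOMError`, `FakePinnedPool`, and `create_resource_or_xfail` are stand-in names (the real helper is `create_pinned_memory_resource_or_xfail`, and the real error is a constructor-time `CUDA_ERROR_OUT_OF_MEMORY`).

```python
import pytest

# Illustrative stand-ins (NOT the real cuda.core API): a fake OOM error
# and a fake pinned pool whose constructor may raise it, mirroring how
# PinnedMemoryResource(...) creates a real pool at construction time.
class CudaOOMError(RuntimeError):
    pass

class FakePinnedPool:
    def __init__(self, numa_id=0, simulate_oom=False):
        if simulate_oom:
            raise CudaOOMError("CUDA_ERROR_OUT_OF_MEMORY")
        self.numa_id = numa_id

def create_resource_or_xfail(factory, *args, **kwargs):
    """Construct a resource; convert the known constructor-time OOM
    flake into an xfail instead of a hard test failure."""
    try:
        return factory(*args, **kwargs)
    except CudaOOMError as exc:
        pytest.xfail(f"known mempool-constructor OOM flake: {exc}")

# Usage in a test: the assertion now covers only NUMA-ID semantics;
# an OOM during pool creation no longer fails the test.
def test_numa_id_semantics():
    mr = create_resource_or_xfail(FakePinnedPool, numa_id=0)
    assert mr.numa_id == 0
```

The point of routing construction through the helper is that `pytest.xfail(...)` aborts the test with an expected-failure outcome, so only a genuine NUMA-ID mismatch can turn the test red.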


xref: nvbug 5815123

Route pinned NUMA-ID constructor tests through the existing Windows MCDM mempool OOM helper so they stay focused on NUMA semantics instead of failing on the known constructor flake.
@rwgk rwgk added this to the cuda.core next milestone May 16, 2026
@rwgk rwgk self-assigned this May 16, 2026
@rwgk rwgk added P0 High priority - Must do! test Improvements or additions to tests cuda.core Everything related to the cuda.core module labels May 16, 2026
@copy-pr-bot
Contributor

copy-pr-bot Bot commented May 16, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@rwgk
Contributor Author

rwgk commented May 16, 2026

/ok to test


@rwgk rwgk marked this pull request as ready for review May 16, 2026 03:17
@rwgk rwgk requested a review from leofang May 16, 2026 03:21
@rwgk rwgk enabled auto-merge (squash) May 16, 2026 03:21
@leofang
Member

leofang commented May 16, 2026

btw nightly also failed for the OOM issue (which I don't quite understand why it would happen): https://github.com/NVIDIA/cuda-python/actions/runs/25947528269/job/76279804206#step:26:17561

Merging not because I agree we need to address cuda-core test issues during CTK bring-up, but because this actually blocks our CI and thus development, as seen in #2087.

@rwgk rwgk merged commit e481335 into NVIDIA:main May 16, 2026
177 of 178 checks passed
@github-actions

Doc Preview CI
Preview removed because the pull request was closed or merged.

@rwgk rwgk deleted the nvbugs5815123_addl_create_pinned_memory_resource_or_xfail branch May 16, 2026 14:48
