build: record physical plugin split experiment #643

Draft
voltjia wants to merge 1 commit into build/manifest-cmake-entry from build/external-plugin-source-paths

voltjia commented Jun 4, 2026

Summary

Record a physical split experiment using an external plugin root outside the main repository.

This PR is stacked on #642 (build/manifest-cmake-entry).

Changes

  • Add docs/plugin_physical_split_experiment.md with the exact external plugin layout, configure/build commands, generated registry evidence, and findings.
  • Demonstrate a split external-cpu-add plugin under /tmp/infini-ops-physical-split/plugins that contributes an external ops/add/add.h implementation to one libinfiniops.so build.
  • Capture the current limitation that generated sources include absolute external paths, which works for local builds but should be hardened before long-term physical split support.
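
The absolute-path limitation above could be surfaced with a small check over generated sources. The following is a minimal sketch under stated assumptions, not something this PR ships: the helper name `find_absolute_includes` and the sample text are hypothetical, and the exact file layout under the external plugin root is assumed for illustration.

```python
import re
from pathlib import PurePosixPath

# Matches `#include "path"` or `#include <path>` and captures the path.
_INCLUDE_RE = re.compile(r'^\s*#\s*include\s*["<]([^">]+)[">]', re.MULTILINE)


def find_absolute_includes(source_text):
    """Return include paths in `source_text` that are absolute POSIX paths."""
    return [
        path
        for path in _INCLUDE_RE.findall(source_text)
        if PurePosixPath(path).is_absolute()
    ]


# Hypothetical generated snippet. The plugin root comes from this PR; the
# directory layout beneath it is an assumption made for the example.
sample = '''
#include "ops/add/add.h"
#include "/tmp/infini-ops-physical-split/plugins/external-cpu-add/ops/add/add.h"
'''

absolute = find_absolute_includes(sample)
print(absolute)
```

A check like this only reports the problem; hardening would still need the generator to emit paths relative to a configured plugin root or include directory.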

Validation

  • .venv/bin/python -m pytest -s -q tests/test_plugin_registry.py tests/test_plugin_test_matrix.py tests/test_generate_wrappers_plugins.py
  • .venv/bin/ruff check scripts/infini_ops_plugin_registry.py scripts/infini_ops_plugin_test_matrix.py scripts/generate_wrappers.py tests/test_plugin_registry.py tests/test_plugin_test_matrix.py tests/test_generate_wrappers_plugins.py
  • .venv/bin/python -m py_compile scripts/infini_ops_plugin_registry.py scripts/infini_ops_plugin_test_matrix.py scripts/generate_wrappers.py
  • git diff --check
  • Physical split external plugin configure/codegen/build smoke

Full-platform validation

Full-stack validation was run on the top branch build/external-plugin-source-paths at the pre-style-rebase commit 907eff70; see the validation comment on #643. The later rebase only updates PR metadata and diagnostic wording for CONTRIBUTING.md compliance.

  • NVIDIA: build passed; pytest failed (1 failed, 3687 passed, 4404 skipped). The single failure is a CUDA OOM in tests/test_torch_ops.py::test_op[..., svd] on the PyTorch reference path.
  • MetaX: 3183 passed, 3400 skipped.
  • Iluvatar: 2689 passed, 3894 skipped.
  • Moore: passed after preloading the container OpenMP runtime (2949 passed, 3643 skipped).
  • Cambricon: 1781 passed, 4694 skipped.
  • Ascend: pytest passed (3359 passed, 3233 skipped), but the container exited with code 137; treated as not fully green under the quality gate.

voltjia commented Jun 4, 2026

Full-Platform Validation for build/external-plugin-source-paths

Validated commit 907eff70 from PR #643 across the requested platform set. Source was synced to remote scratch directories, and each platform ran a fresh wheel build plus a full pytest -v run with JUnit output in the platform CI container. The PyTorch backend was enabled by the platform build's auto-detection, and tests/test_torch_ops.py was collected and executed on all platforms.

| Platform | Build | pytest result | Time | Notes |
| --- | --- | --- | --- | --- |
| NVIDIA | Passed | Failed: 1 failed, 3687 passed, 4404 skipped | build 992s, test log summary 25s | Failure is tests/test_torch_ops.py::test_op[..., svd]; log shows CUDA OOM during the PyTorch reference path. |
| MetaX | Passed | Passed: 3183 passed, 3400 skipped | build 1377s, test 87s | Full pytest completed with exit code 0. |
| Iluvatar | Passed | Passed: 2689 passed, 3894 skipped | build 704s, test 40s | First attempt hit multi-backend auto-detect (WITH_NVIDIA + WITH_ILUVATAR); rerun used explicit WITH_ILUVATAR=ON and passed. |
| Moore | Passed | Passed: 2949 passed, 3643 skipped | build 2070s, test 45s | First attempt hit missing OpenMP runtime at import (__kmpc_for_static_fini); rerun preloaded the MUSA OpenMP runtime and passed. |
| Cambricon | Passed | Passed: 1781 passed, 4694 skipped | build 2209s, test 33s | Full pytest completed with exit code 0. |
| Ascend | Passed | Pytest passed: 3359 passed, 3233 skipped; container exit abnormal | build 1053s, test 61s | Reproduced twice: pytest completed and wrote JUnit, but Docker returned exit code 137 after the successful pytest summary, including a rerun without Docker memory limit. Not marked fully green because the process exit code is not 0. |
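
The per-platform counts can be recovered from the JUnit files that each run wrote. The snippet below is a rough sketch of that tally, assuming the usual pytest JUnit layout with `tests`, `failures`, `errors`, and `skipped` attributes on `<testsuite>`; the function name `summarize_junit` is hypothetical and the sample XML is illustrative, built only so its totals match the NVIDIA counts rather than copied from a real log.

```python
import xml.etree.ElementTree as ET


def summarize_junit(xml_text):
    """Return (passed, failed, skipped) totals from pytest-style JUnit XML."""
    root = ET.fromstring(xml_text)
    # pytest usually wraps one <testsuite> in <testsuites>; accept both shapes.
    suites = [root] if root.tag == "testsuite" else root.findall("testsuite")
    tests = failed = skipped = 0
    for suite in suites:
        tests += int(suite.get("tests", 0))
        # Count assertion failures and errors together as failed.
        failed += int(suite.get("failures", 0)) + int(suite.get("errors", 0))
        skipped += int(suite.get("skipped", 0))
    # Passed is whatever remains after failed and skipped are removed.
    return tests - failed - skipped, failed, skipped


# Illustrative input only: totals chosen to match the NVIDIA counts.
sample = (
    '<testsuites><testsuite name="pytest" tests="8092" '
    'failures="1" errors="0" skipped="4404"/></testsuites>'
)
print(summarize_junit(sample))
```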

Notes:

  • No exact accelerator IDs are included here.
  • NVIDIA and Ascend are findings under the quality gate: NVIDIA has a real pytest failure; Ascend has a successful pytest summary but non-zero container exit code.
  • The plugin stack itself built on all platforms that reached pytest; no manifest/plugin-registry/codegen-specific failure was observed in these logs.
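
The quality gate described in these notes treats a run as green only when pytest reports no failures and the process exits with code 0. A minimal sketch of that rule follows; the `PlatformRun` type and `is_fully_green` helper are hypothetical names, and the NVIDIA exit code is an assumption because this report does not state it.

```python
from dataclasses import dataclass


@dataclass
class PlatformRun:
    name: str
    failed: int      # number of failed tests in the pytest summary
    exit_code: int   # exit code of the container or process


def is_fully_green(run):
    """Green requires both zero pytest failures and a zero exit code."""
    return run.failed == 0 and run.exit_code == 0


runs = [
    # Exit code 1 is assumed here; the report only records the pytest failure.
    PlatformRun("NVIDIA", failed=1, exit_code=1),
    PlatformRun("MetaX", failed=0, exit_code=0),
    # pytest summary was clean, but Docker returned 137.
    PlatformRun("Ascend", failed=0, exit_code=137),
]

findings = [run.name for run in runs if not is_fully_green(run)]
print(findings)
```

Checking the exit code separately from the test counts is what keeps a case like Ascend from being reported as passing on the pytest summary alone.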

voltjia changed the title from "Record physical plugin split experiment" to "build: record physical plugin split experiment" on Jun 5, 2026
voltjia force-pushed the build/manifest-cmake-entry branch from e7ed421 to 9129246 on June 5, 2026 08:26
voltjia force-pushed the build/external-plugin-source-paths branch from 907eff7 to b3b7d2c on June 5, 2026 08:30