
fix: fix various icclrun bugs and enhance its stability #35

Merged
Ziminli merged 6 commits into master from fix/fix-icclrun-bugs on Jun 5, 2026

Ziminli (Collaborator) commented on Jun 5, 2026

Summary

This PR improves the stability and behavior of icclrun, ensures entries in cluster.yaml such as environment variables and CMake flags are handled correctly, and cleans up obsolete scripts. It also updates the cluster.yaml template to clarify usage.

Changes

  • icclrun Stability and Environment Handling

    • Conditional logic in the generated scripts is now keyed on IP address/host rather than platform;
    • Path-like environment variables are now merged by appending instead of overriding;
    • install_dir is now handled consistently between the build phase and runtime;
    • ~ in path variables is now expanded correctly;
    • A global cmake_flags entry in cluster.yaml is now supported; it applies to all nodes unless overridden per node.
  • Refactor / Cleanup

    • Removed the obsolete scripts/run_wrapper.sh, which is no longer useful.
  • Documentation Update

    • Updated the examples/cluster.yaml template:
      • Clarified the behavior of install_dir, common_user, and backend_env;
      • Moved the nodes section below the global entries;
      • Added a global cmake_flags entry;
      • Added a node-specific user example.
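
The append-based merging and `~` expansion listed above could look roughly like the following Python sketch. It is an illustration only, not the code in this PR: `PATH_LIKE_VARS`, `expand_tilde`, `merge_env`, and the `LOG_LEVEL` variable are hypothetical names, and the set of variables treated as path-like, the `:` separator, and the append order are all assumptions.

```python
# Hypothetical set of variables treated as path-like; the actual list used by
# icclrun is not shown in this PR.
PATH_LIKE_VARS = {"PATH", "LD_LIBRARY_PATH"}


def expand_tilde(value: str, home: str) -> str:
    """Expand a leading `~` in each `:`-separated entry of `value`."""
    expanded = []

    for entry in value.split(":"):
        if entry == "~" or entry.startswith("~/"):
            entry = home + entry[1:]

        expanded.append(entry)

    return ":".join(expanded)


def merge_env(base: dict[str, str], extra: dict[str, str], home: str) -> dict[str, str]:
    """Merge `extra` into `base`, appending path-like values instead of replacing them."""
    merged = dict(base)

    for name, value in extra.items():
        if name in PATH_LIKE_VARS:
            value = expand_tilde(value, home)
            existing = merged.get(name)

            # Assumed order: the existing value stays first and the new one is appended.
            merged[name] = f"{existing}:{value}" if existing else value
        else:
            # Variables that are not path-like are simply overridden.
            merged[name] = value

    return merged


node_env = merge_env(
    {"PATH": "/usr/bin", "LOG_LEVEL": "info"},
    {"PATH": "~/iccl/bin", "LOG_LEVEL": "debug"},
    home="/home/alice",
)
print(node_env["PATH"])  # /usr/bin:/home/alice/iccl/bin
```

The home directory is passed in explicitly in this sketch because `~` on a remote node need not resolve to the same directory as on the machine that launches the job.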

Platform and Backend Affected

Platform

  • N/A - CPU
  • N/A - NVIDIA GPU
  • N/A - Iluvatar GPU
  • N/A - MetaX GPU
  • N/A - Moore Threads GPU
  • N/A - Cambricon MLU

Backend

  • OpenMPI
  • MPICH

Performance Impact

  • No performance impact
  • Performance improved
  • Performance regression possible

N/A.

Known Issues & Future Work

  • cluster.yaml does not yet support custom SSH port numbers. Each node is assumed to be reachable over passwordless SSH (e.g., via ssh <IP>), configured in .ssh/config. Port support may be added in future work.

Test Results

Test Involved Platform

  • CPU
  • NVIDIA GPU
  • Iluvatar GPU
  • MetaX GPU
  • Moore Threads GPU
  • Cambricon MLU

Test Involved Backend

  • OpenMPI
  • MPICH

NVIDIA + MetaX Heterogeneous Cluster:
all_gather.log
all_reduce.log
all_to_all.log
broadcast.log
gather.log
reduce.log
reduce_scatter.log
send_recv.log


Checklist

Every contributor must verify every item below before requesting
review. Tick each box only after the check has actually been performed —
do not tick speculatively. If an item truly does not apply, replace the
checkbox with N/A and briefly explain why in an inline comment.

Title, Branch, and Commits

  • PR title follows Conventional Commits (e.g. feat: …, fix(nccl): …).
  • Branch name follows <type>/xxx-yyyy-zzzz where <type> matches the PR title's Conventional Commits type and words are joined with hyphens (see CONTRIBUTING.md §Branches).
  • Each commit message follows Conventional Commits.
  • Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable (see CONTRIBUTING.md §Pull Requests).
  • No stray merge commits from master — the branch is rebased cleanly on top of the current master.
  • No fixup! / squash! / wip commits remain.

Scope and Design

  • Changes are minimal — no unrelated modifications were introduced (CONTRIBUTING.md §Code/General).
  • No dead code, commented-out blocks, debug prints, printf/std::cout/print(...) left behind, or TODO without an owner and issue link.
  • No unrelated formatting churn that would obscure the diff.
  • Public API changes (if any) are intentional, documented, and reflected in affected callers/tests.

General Code Hygiene

  • The code is self-explanatory; comments were added only where the intent or rationale is non-obvious (CONTRIBUTING.md §Code/General).
  • Every modified or added file ends with a single trailing newline (CONTRIBUTING.md §Code/General).
  • No trailing whitespace, inconsistent indentation, or mixed formatting styles remain.
  • Identifiers referenced in comments or error messages are wrapped in Markdown backticks (e.g. the `AllReduce` implementation) (CONTRIBUTING.md §Code/General).
  • All comments and error messages are in English (CONTRIBUTING.md §Code/General).
  • Comments and error messages are complete sentences — capitalized first letter, terminal punctuation — unless the language/framework convention says otherwise (CONTRIBUTING.md §Code/General; §Python).

C++ Specific (if C++ files changed)

  • N/A - Code follows the Google C++ Style Guide strictly.
  • N/A - clang-format (version 16, per .github/workflows/clang-format.yml) has been run against all modified applicable files; the diff is clean.
  • N/A - No exceptions are thrown. Error paths use assert with messages that include at least __FILE__, __LINE__, and __func__ (CONTRIBUTING.md §C++).
  • N/A - Error and warning message wording follows the LLVM Coding Standards (CONTRIBUTING.md §C++).
  • N/A - Constructor initializer list order matches member declaration order (CONTRIBUTING.md §C++).
  • N/A - Exactly one blank line between classes, between classes and functions, and between functions (CONTRIBUTING.md §C++).
  • N/A - Exactly one blank line between members (functions and variables) within a class (CONTRIBUTING.md §C++).
  • N/A - Exactly one blank line before and after the contents of a namespace (CONTRIBUTING.md §C++).

Python Specific (if Python files changed)

  • Code is PEP 8 compliant; ruff check passes cleanly on CI (see .github/workflows/ruff.yml).
  • ruff format --check passes cleanly — if not, run ruff format and commit the result.
  • Comments are complete English sentences, starting with a capital letter and ending with punctuation; Markdown backticks are used for code references (CONTRIBUTING.md §Python).
  • Framework-specific conventions (e.g. lowercase pytest.skip messages without terminal period) are honored where applicable (CONTRIBUTING.md §Python).
  • No blank line between the function signature and the body when there is no docstring or comment (CONTRIBUTING.md §Python).
  • A blank line is present before and after if, for, and similar control-flow statements (CONTRIBUTING.md §Python).
  • A blank line appears before each return, except when it directly follows a control-flow statement (CONTRIBUTING.md §Python).
  • Docstrings (if any) follow PEP 257 (CONTRIBUTING.md §Python).
  • Type hints are added / kept consistent with the surrounding code.

Testing

  • All applicable example programs have been built and tested successfully on at least one supported heterogeneous cluster setup.

Build, CI, and Tooling

  • N/A - New backends or devices have been added to auto-detection in CMakeLists.txt under if(AUTO_DETECT_DEVICES) or to if(AUTO_DETECT_BACKENDS) if applicable.
  • Both CI workflows (clang-format.yml, ruff.yml) are green locally (or expected to be green on CI).

Documentation

  • README.md, CONTRIBUTING.md, or inline docs updated when behavior, build flags, or developer workflow changed.
  • Any user-visible breaking change is called out explicitly under "Summary" and in the commit/PR title with a ! or BREAKING CHANGE: footer.

Security and Safety

  • No secrets, access tokens, internal URLs, customer data, or personal hardware identifiers have been committed.
  • N/A - Third-party code is license-compatible and attributed.
  • No unsafe pointer arithmetic, uninitialized reads, or missing bounds checks were introduced.

Ziminli added 4 commits June 5, 2026 02:05
…which is applied to all the nodes if there's no node-specific overrides
 - change the conditional logic in the generated `run_wrapper.sh` from platform-based to IP-based
 - fix the `install_dir` inconsistency between build phase and `run_wrapper.sh` env export
 - fix the handling of `~` in path variables
 - change path-like env merging from an override-based to an append-based strategy
…e the entries and usage

 - update the explanations about the behavior of `install_dir`, `common_user`, and `backend_env`
 - move the `nodes` section below all the global entries
 - add the global `cmake_flags` entry
 - add the `user` entry under `nodes` to illustrate how to specify a node-specific user
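
As a rough illustration of the template entries described in the commit message above, the sketch below resolves the effective user and CMake flags for each node. Only the key names `common_user`, `cmake_flags`, `nodes`, and `user` come from this PR. The `ip` key, the string type of `cmake_flags`, the rule that a node-level value replaces the global one outright, and the `resolve_node` helper itself are assumptions made for the example.

```python
def resolve_node(config: dict, node: dict) -> dict:
    """Return the effective settings for one node (hypothetical helper)."""
    return {
        # `ip` is an assumed key name for the node address.
        "ip": node["ip"],
        # A node-specific `user` takes precedence over the global `common_user`.
        "user": node.get("user", config.get("common_user")),
        # Assumed semantics: node-level `cmake_flags` replace the global entry.
        "cmake_flags": node.get("cmake_flags", config.get("cmake_flags", "")),
    }


# A dictionary standing in for a parsed cluster.yaml; the values are made up.
cluster = {
    "common_user": "alice",
    "cmake_flags": "-DCMAKE_BUILD_TYPE=Release",
    "nodes": [
        {"ip": "10.0.0.1"},
        {"ip": "10.0.0.2", "user": "bob", "cmake_flags": "-DCMAKE_BUILD_TYPE=Debug"},
    ],
}

resolved = [resolve_node(cluster, node) for node in cluster["nodes"]]

for entry in resolved:
    print(entry["ip"], entry["user"], entry["cmake_flags"])
```

In this sketch the first node inherits both global entries, while the second uses its own `user` and `cmake_flags`.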
Ziminli self-assigned this on Jun 5, 2026
Ziminli merged commit f3a92e3 into master on Jun 5, 2026
2 checks passed
Ziminli deleted the fix/fix-icclrun-bugs branch on June 5, 2026 14:09