
feat(fm9g): integrate fused FFN op into csrc decoder layer #409

Open
ZhouBencheng wants to merge 1 commit into InfiniTensor:main from ZhouBencheng:feat/fused-ffn-fm9g


@ZhouBencheng

Substitute the post-attention `rms_norm` + MLP block on FM9G with the InfiniCore fused-FFN op. The substitution is gated per `forward()` call by the `INFINILM_USE_FUSED_FFN` env var, so a single process can interleave fused and non-fused passes, which is required for clean benchmarks. When MuP scaling on `down_proj` is active (`alpha != 1.0`), the per-op path is taken instead, because the fused kernel does not model that multiplier.
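The path selection described above can be sketched in Python as follows. This is an illustration only: the actual check lives in the C++ decoder layer, the helper name `use_fused_ffn` is hypothetical, and the set of strings treated as "enabled" is an assumption, since the description does not say which values the env var accepts.

```python
import os

ENV_VAR = "INFINILM_USE_FUSED_FFN"  # Name taken from the PR description.


def use_fused_ffn(down_alpha: float) -> bool:
    """Decide, for one forward() call, whether the fused path applies."""
    # Assumption: these strings enable the flag; the accepted values are
    # not specified in the PR description.
    raw = os.environ.get(ENV_VAR, "0")
    enabled = raw.strip().lower() in {"1", "true", "on"}
    # The fused kernel does not model the MuP multiplier, so the per-op
    # path is kept whenever alpha differs from 1.0.
    return enabled and down_alpha == 1.0
```

Because the variable is read on every call rather than once at start-up, flipping it between two `forward()` calls switches paths within the same process.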

The fused kernel accepts only rank-2 `[ntok, hidden]` tensors, while the engine carries `hidden_states` as `[batch, seq_len, hidden]`; the call site views the tensor as 2-D before the call and restores the 3-D shape afterwards.
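The shape handling can be illustrated with plain nested lists, assuming `ntok = batch * seq_len`. The helper names `flatten_tokens` and `unflatten_tokens` are hypothetical, and unlike a tensor view this sketch builds new outer lists (the rows themselves are shared, not copied).

```python
def flatten_tokens(hidden_states):
    """Collapse [batch, seq_len, hidden] into [ntok, hidden]."""
    batch = len(hidden_states)
    seq_len = len(hidden_states[0]) if batch else 0
    # Rows keep their order: all tokens of sequence 0, then sequence 1, ...
    rows = [row for sequence in hidden_states for row in sequence]
    return rows, (batch, seq_len)


def unflatten_tokens(rows, shape):
    """Restore [batch, seq_len, hidden] from [ntok, hidden]."""
    batch, seq_len = shape
    return [rows[b * seq_len:(b + 1) * seq_len] for b in range(batch)]
```

A rank-2 op can then run on `rows`, and `unflatten_tokens` puts its output back into the layout the rest of the engine expects.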

Expose `gate_up_weight()`, `down_weight()`, and `down_alpha()` on the shared `MLP` module so the FM9G decoder layer no longer needs to reach into protected members.

Add `test/bench/bench_fused_ffn_csrc.py`: meaningful Chinese prompts, interleaved non-fused/fused (NF/F) rounds, a `time.perf_counter` timing window around `engine.generate()`, optional correctness verification, a markdown report, and a `--no-chat-template` flag for the very-short-input regime where the fused path matters most.
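A minimal sketch of the interleaved timing idea, not the benchmark script itself: `bench_interleaved` is a hypothetical helper, `generate` stands in for `engine.generate()`, and the `"0"`/`"1"` flag values are assumptions. Alternating NF and F inside each round means slow drift (thermal state, cache warm-up) affects both paths roughly equally.

```python
import os
import statistics
import time

ENV_VAR = "INFINILM_USE_FUSED_FFN"  # Name taken from the PR description.


def bench_interleaved(generate, prompts, rounds=3):
    """Time generate(prompts) in alternating NF/F rounds; return medians."""
    previous = os.environ.get(ENV_VAR)
    timings = {"NF": [], "F": []}
    try:
        for _ in range(rounds):
            # Assumption: "0" selects the per-op path, "1" the fused path.
            for label, flag in (("NF", "0"), ("F", "1")):
                os.environ[ENV_VAR] = flag
                start = time.perf_counter()
                generate(prompts)
                timings[label].append(time.perf_counter() - start)
    finally:
        # Leave the environment as it was before the benchmark.
        if previous is None:
            os.environ.pop(ENV_VAR, None)
        else:
            os.environ[ENV_VAR] = previous
    return {label: statistics.median(v) for label, v in timings.items()}
```

The median is used here only as one reasonable way to summarize a handful of rounds; the statistic reported by the actual script is not stated in the description.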

Summary

Motivation

Closes #

Type of Change

  • feat — new feature / new model
  • fix — bug fix
  • perf — performance improvement (no behavioral change)
  • refactor — code restructuring without behavior change
  • test — adding or fixing tests only
  • docs — documentation only
  • build / ci — build system or CI configuration
  • chore — tooling, formatting, or other non-code changes
  • Breaking change

Test Results of Involved Models on Supported Platforms (Please attach screenshots)

Benchmark / Performance Impact

Notes for Reviewers


Checklist

Every contributor must verify every item below before requesting
review. Tick each box only after the check has actually been performed —
do not tick speculatively. If an item truly does not apply, replace the
checkbox with N/A and briefly explain why in an inline comment.

Title, Branch, and Commits

  • PR title follows Conventional Commits (e.g. feat(nvidia): …, fix(cuda/gemm): …).
  • Branch name follows <type>/xxx-yyyy-zzzz where <type> matches the PR title's Conventional Commits type and words are joined with hyphens (see CONTRIBUTING.md §Branches).
  • Each commit message follows Conventional Commits.
  • Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable (see CONTRIBUTING.md §Pull Requests).
  • No stray merge commits from main — the branch is rebased cleanly on top of the current main.
  • No fixup! / squash! / wip commits remain.
  • Existing PR/branch/commit that followed the legacy issue format.

Scope and Design

  • Changes are minimal — nothing unrelated to the stated motivation was added (CONTRIBUTING.md §Code/General).
  • No dead code, commented-out blocks, debug prints, printf/std::cout/print(...) left behind, or TODO without an owner and issue link.
  • No unrelated formatting churn that would obscure the diff.
  • Public API changes (if any) are intentional, documented, and reflected in affected callers/tests.

General Code Hygiene (applies to all languages)

  • The code is self-explanatory; comments were added only where the why is non-obvious (CONTRIBUTING.md §Code/General).
  • Every modified or added file ends with a single trailing newline (CONTRIBUTING.md §Code/General).
  • No trailing whitespace, tab/space mixing, or stray BOMs.
  • Identifiers in comments and error messages are wrapped in backticks (e.g. the `seqlens_k` tensor) (CONTRIBUTING.md §Code/General).
  • All comments and error messages are in English (CONTRIBUTING.md §Code/General).
  • Comments and error messages are complete sentences — capitalized first letter, terminal punctuation — unless the language/framework convention says otherwise (CONTRIBUTING.md §Code/General; §Python).

C++ Specific (if C++ files changed)

  • Code follows the Google C++ Style Guide strictly.
  • Error and warning message wording follows the LLVM Coding Standards (CONTRIBUTING.md §C++).
  • Constructor initializer list order matches member declaration order (CONTRIBUTING.md §C++).
  • No raw new/delete; RAII / smart pointers / existing allocators are used.
  • Changed files are formatted by scripts/format.py.
  • No changes/reference to csrc/models/llama_legacy/.

Python Specific (if Python files changed)

  • Code is PEP 8 compliant.
  • Comments are complete English sentences, starting with a capital letter and ending with punctuation; Markdown backticks are used for code references (CONTRIBUTING.md §Python).
  • Docstrings (if any) follow PEP 257 (CONTRIBUTING.md §Python).
  • Changed files are formatted by scripts/format.py.
  • No changes/reference to python/infinilm/auto_config.py.

Testing

  • For any platform that could not be tested, an explicit reason is given in the table and a reviewer with access has been tagged.
  • Passed single request test (examples/test_infer.py), or specify the reason for skipping.
  • Passed offline performance test (examples/bench.py), or specify the reason for skipping.
  • Passed sanity test (test/bench/test_benchmark.py), or specify the reason for skipping.
  • Passed service test (python/infinilm/server/inference_server.py + scripts/test_perf.py), or specify the reason for skipping.

Build, CI, and Tooling

  • The project builds cleanly from a fresh directory on at least one affected platform.

Documentation

  • README.md, CONTRIBUTING.md, or inline docs updated when behavior, build flags, or developer workflow changed.
  • Any user-visible breaking change is called out explicitly under "Motivation" and in the commit/PR title with a ! or BREAKING CHANGE: footer.

Security and Safety

  • No secrets, access tokens, internal URLs, customer data, or personal hardware identifiers have been committed.
  • Third-party code is license-compatible and attributed.
  • No unsafe pointer arithmetic, uninitialized reads, or missing bounds checks were introduced.

@ZhouBencheng ZhouBencheng requested a review from a team June 3, 2026 13:50
Collaborator

@wooway777 wooway777 left a comment


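Please provide the following screenshots first:

  1. Normal conversation output
  2. Distributed run
  3. Performance comparison

Also, regarding the condition `this->mlp_->down_alpha() != 1.0f`: can Jiuge (FM9G) satisfy it? For Jiuge, alpha is probably not 1.0 in most cases.

We can look at how to restructure the code later. I think this could be made into a general-purpose module, ideally with the check done in the text decoder layers.

I would also suggest using an xmake option as the switch (see how KV caching does it) rather than reading an environment variable.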
