feat(fm9g): integrate fused FFN op into csrc decoder layer #409
Open
ZhouBencheng wants to merge 1 commit into
Substitute the post-attention `rms_norm` + MLP block on FM9G with the InfiniCore fused-FFN op. The substitution is gated per `forward()` call by the `INFINILM_USE_FUSED_FFN` env var so a single process can interleave fused and non-fused passes, which is required for clean benchmarks. When MuP scaling on `down_proj` is active (`alpha != 1.0`), the per-op path is taken so the multiplier the fused kernel does not model is preserved.

The fused kernel accepts only rank-2 `[ntok, hidden]` tensors, while the engine carries `hidden_states` as `[batch, seq_len, hidden]`; the call site views to 2-D and back.

Expose `gate_up_weight()`, `down_weight()`, and `down_alpha()` on the shared `MLP` module so the FM9G decoder layer no longer needs to reach into protected members.

Add `test/bench/bench_fused_ffn_csrc.py`: meaningful Chinese prompts, interleaved NF/F rounds, `time.perf_counter` window around `engine.generate()`, optional correctness verification, markdown report, and a `--no-chat-template` flag for the very-short-input regime where the fused path matters most.
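The gating and benchmarking described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the real gate lives in the C++ decoder layer, and the helper names (`use_fused_ffn`, `flatten_shape`, `timed_round`, `interleaved_rounds`) are hypothetical. The sketch also assumes that the value `"1"` enables the fused path, which the description does not specify.

```python
import os
import time

ENV_FLAG = "INFINILM_USE_FUSED_FFN"


def use_fused_ffn(down_alpha: float) -> bool:
    """Pick the path for one forward() call (hypothetical helper)."""
    # Assumption: "1" means enabled; the PR does not state the accepted values.
    enabled = os.environ.get(ENV_FLAG, "0") == "1"
    # Fall back to the per-op path when MuP scaling is active (alpha != 1.0),
    # because the fused kernel does not model that multiplier.
    return enabled and down_alpha == 1.0


def flatten_shape(shape):
    """Map [batch, seq_len, hidden] to the rank-2 [ntok, hidden] shape."""
    batch, seq_len, hidden = shape
    return (batch * seq_len, hidden)


def timed_round(generate, fused: bool) -> float:
    """Time one call to generate() with the flag set for this round."""
    os.environ[ENV_FLAG] = "1" if fused else "0"
    start = time.perf_counter()
    generate()
    return time.perf_counter() - start


def interleaved_rounds(generate, rounds: int):
    """Alternate non-fused (NF) and fused (F) rounds in a single process."""
    results = {"NF": [], "F": []}
    for _ in range(rounds):
        results["NF"].append(timed_round(generate, fused=False))
        results["F"].append(timed_round(generate, fused=True))
    return results
```

Because the flag is read on every call rather than once at start-up, alternating rounds like this is possible without restarting the process, which keeps both variants under the same warm-up and cache conditions.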
wooway777 (Collaborator) reviewed on Jun 4, 2026 and left a comment:
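Please provide the following screenshots first:
- Normal conversation output
- Distributed run
- Performance comparison
Also, can FM9G (九格) satisfy the condition `this->mlp_->down_alpha() != 1.0f`? For FM9G the value should mostly not be 1.0.
We can look at how to adjust the code later. I think this could be made into a generic module, ideally with the check done in the text decoder layers.
I also suggest using an xmake option as the switch, following the KV caching precedent, rather than reading an environment variable.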