Avoid small MatMul batch parameter heap allocations #29085
Open
GopalakrishnanN wants to merge 1 commit into
Pull request overview
This PR reduces per-invocation heap allocation overhead in the CPU MatMul<float>::Compute() path by using InlinedVector for very small batched GEMM parameter arrays, keeping the common 1–2 GEMM cases allocation-free while preserving the existing std::vector behavior for larger batches.
Changes:
- Use InlinedVector<..., 2> for MLAS_SGEMM_DATA_PARAMS when helper.OutputOffsets().size() <= 2.
- Apply the same small-batch inlining strategy for the aarch64/Linux fast-math MLAS_SBGEMM_DATA_PARAMS path.
- Refactor batch setup into a shared lambda to avoid duplicating the fill+dispatch logic across container types.
// Licensed under the MIT License.

#include "core/providers/cpu/math/matmul.h"
#include "core/common/inlined_containers.h"
Comment on lines +351 to 357
if (max_len <= 2) {
  InlinedVector<MLAS_SBGEMM_DATA_PARAMS, 2> data(max_len);
  gemm_batch(data);
} else {
  std::vector<MLAS_SBGEMM_DATA_PARAMS> data(max_len);
  gemm_batch(data);
}
Comment on lines +377 to 383
if (max_len <= 2) {
  InlinedVector<MLAS_SGEMM_DATA_PARAMS, 2> data(max_len);
  gemm_batch(data);
} else {
  std::vector<MLAS_SGEMM_DATA_PARAMS> data(max_len);
  gemm_batch(data);
}
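The two dispatch sites above share one pattern: a generic lambda fills the parameter array and runs the batch, and the caller only chooses the container. Below is a minimal, self-contained sketch of that pattern. It is not the PR's code: `GemmParams`, `StackArray`, `RunBatch`, and `ComputeBatch` are hypothetical stand-ins (for the MLAS parameter struct, for the small case of InlinedVector, and for the batched GEMM call), and a single offsets vector replaces the helper's separate left/right/output offsets.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for MLAS_SGEMM_DATA_PARAMS: a trivially copyable
// bundle of pointers and scalars (the real struct has more fields).
struct GemmParams {
  const float* A = nullptr;
  const float* B = nullptr;
  float* C = nullptr;
  float alpha = 1.0f;
  float beta = 0.0f;
};

// Hypothetical fixed-capacity container that models only the inline
// (no-heap) case of InlinedVector<T, N>: storage lives inside the object.
template <typename T, std::size_t N>
class StackArray {
 public:
  explicit StackArray(std::size_t count) : size_(count) { assert(count <= N); }
  T* data() { return items_.data(); }
  std::size_t size() const { return size_; }
  T& operator[](std::size_t i) { return items_[i]; }

 private:
  std::array<T, N> items_{};
  std::size_t size_;
};

// Stand-in for the batched GEMM call: reports how many entries are fully
// populated instead of doing any arithmetic.
inline std::size_t RunBatch(const GemmParams* params, std::size_t count) {
  std::size_t ready = 0;
  for (std::size_t i = 0; i < count; ++i) {
    if (params[i].A && params[i].B && params[i].C) ++ready;
  }
  return ready;
}

inline std::size_t ComputeBatch(const float* a, const float* b, float* c,
                                const std::vector<std::size_t>& offsets) {
  const std::size_t max_len = offsets.size();

  // Generic lambda (C++14): the fill + dispatch logic is written once and
  // instantiated for whichever container type the caller passes in.
  auto gemm_batch = [&](auto& data) {
    for (std::size_t i = 0; i < max_len; ++i) {
      data[i].A = a + offsets[i];
      data[i].B = b + offsets[i];
      data[i].C = c + offsets[i];
    }
    return RunBatch(data.data(), max_len);
  };

  if (max_len <= 2) {
    StackArray<GemmParams, 2> data(max_len);  // small batch: no heap allocation
    return gemm_batch(data);
  }
  std::vector<GemmParams> data(max_len);  // larger batch: one heap allocation
  return gemm_batch(data);
}
```

Because the lambda takes `auto&`, each branch instantiates its own copy of the body, so the choice of container costs one size comparison per call and no virtual dispatch.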
Description
Reduce per-call heap allocation overhead in the CPU MatMul<float> kernel for the small batched-GEMM case.

MatMul<float>::Compute() builds an array of MLAS batch-parameter structs before calling MlasGemmBatch / MlasSBGemmBatch. Previously this always used std::vector, so even the common single-GEMM path performed one heap allocation per Compute() call.

This change uses stack-backed InlinedVector storage when helper.OutputOffsets().size() <= 2, and keeps the existing std::vector path for larger batches:
- MLAS_SGEMM_DATA_PARAMS and MLAS_SBGEMM_DATA_PARAMS (aarch64/Linux fast-math path) both use InlinedVector<..., 2> for max_len <= 2.
- Larger batches (max_len > 2) keep std::vector, preserving prior behavior (and avoiding the InlinedVector overflow path, which I measured to regress for larger batches).

The struct is ~64 bytes and trivially copyable, so the net effect is removing exactly one malloc/free pair per Compute() call in the small-batch case.
Motivation and Context
Many real CPU MatMul shapes flatten to helper.OutputOffsets().size() == 1. MatMulComputeHelper special-cases an activation × 2D-weight matrix into a single GEMM (offsets {0}), so shapes like the following all hit max_len == 1:

These appear in transformer projections, MLP layers, and classifier heads. Removing a heap allocation from that path is a small, low-risk latency improvement.
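The flattening rule described here can be sketched as a small counting function. This is a simplified, hypothetical model and not the real MatMulComputeHelper: `GemmCount` is an invented name, and the model assumes both operands have rank at least 2 and that batch dimensions match exactly (no broadcasting).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: how many independent GEMMs a MatMul needs.
// Assumes rank >= 2 on both sides and no broadcasting of batch dimensions.
inline std::size_t GemmCount(const std::vector<std::int64_t>& left,
                             const std::vector<std::int64_t>& right) {
  assert(left.size() >= 2 && right.size() >= 2);

  // A rank-2 right operand (a plain weight matrix) lets every leading
  // dimension of the left operand fold into M, giving one GEMM.
  if (right.size() == 2) return 1;

  // Otherwise one GEMM per batch element.
  assert(left.size() == right.size());
  std::size_t count = 1;
  for (std::size_t i = 0; i + 2 < right.size(); ++i) {
    assert(left[i] == right[i]);
    count *= static_cast<std::size_t>(right[i]);
  }
  return count;
}
```

Under this model a rank-3 activation multiplied by a rank-2 weight yields a count of 1, which corresponds to the max_len == 1 case, while two rank-3 operands with a batch dimension of 8 yield 8 and would stay on the std::vector path.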
Performance Measurements
Measured on Windows (MSVC, RelWithDebInfo) with onnxruntime_perf_test, using 256-node square MatMul chains ([N,N]·[N,N], which flatten to max_len == 1, the case this PR optimizes). 5 trials each; medians shown. per-node saving = (base_p50 − opt_p50) / 256.

Interpretation (honest scope):
- The saving per Compute() call is clearly and consistently visible at N=8/16/32: exactly one malloc/free pair removed.

In short: this is a targeted micro-optimization that helps tiny-GEMM-heavy CPU workloads; for mainstream MatMul sizes the effect is below the measurement noise floor. It is intentionally low-risk rather than a broad throughput win.
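The per-node saving formula quoted above can be written out as a short helper. This is a sketch under one assumption: that "medians shown" means the median of each configuration's per-trial p50 latency is taken first, and the difference is then divided by the node count. `Median` and `PerNodeSaving` are hypothetical names, and no measured values from the PR appear here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of a non-empty sample; takes a copy so the caller's data is untouched.
inline double Median(std::vector<double> values) {
  assert(!values.empty());
  std::sort(values.begin(), values.end());
  const std::size_t mid = values.size() / 2;
  return values.size() % 2 == 1 ? values[mid]
                                : 0.5 * (values[mid - 1] + values[mid]);
}

// per-node saving = (base_p50 - opt_p50) / node_count, where each p50 is
// taken as the median over that configuration's trials (assumption).
// The result is in whatever time unit the inputs use.
inline double PerNodeSaving(const std::vector<double>& base_p50_trials,
                            const std::vector<double>& opt_p50_trials,
                            std::size_t node_count) {
  assert(node_count > 0);
  return (Median(base_p50_trials) - Median(opt_p50_trials)) /
         static_cast<double>(node_count);
}
```

For the chains described here, `node_count` would be 256. Dividing by the node count attributes the whole latency difference to the MatMul nodes, which is reasonable only because the model is a pure MatMul chain.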
Validation
- Ran onnxruntime_perf_test separately for the baseline (std::vector) and optimized (InlinedVector) code for an apples-to-apples comparison.
- Ran onnxruntime_provider_test with the final change.