Avoid small MatMul batch parameter heap allocations by GopalakrishnanN · Pull Request #29085 · microsoft/onnxruntime

GopalakrishnanN · 2026-06-17T01:40:44Z

Description

Reduce per-call heap allocation overhead in the CPU MatMul<float> kernel for the small batched-GEMM case.

MatMul<float>::Compute() builds an array of MLAS batch-parameter structs before calling MlasGemmBatch / MlasSBGemmBatch. Previously this always used std::vector, so even the common single-GEMM path performed one heap allocation per Compute() call.

This change uses stack-backed InlinedVector storage when helper.OutputOffsets().size() <= 2, and keeps the existing std::vector path for larger batches:

MLAS_SGEMM_DATA_PARAMS and MLAS_SBGEMM_DATA_PARAMS (aarch64/Linux fast-math path) both use InlinedVector<..., 2> when max_len (that is, helper.OutputOffsets().size()) is at most 2.
Larger batches fall back to std::vector, preserving prior behavior (and avoiding the InlinedVector overflow path, which I measured to regress for larger batches).

The struct is ~64 bytes and trivially copyable, so the net effect is removing exactly one malloc/free pair per Compute() call in the small-batch case.
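
For illustration, the storage-selection pattern described above can be sketched in isolation. This is a minimal sketch under stated assumptions, not the code in this PR: GemmParams, RunGemmBatch, ComputeBatch, and the used_heap flag are hypothetical names, and std::array stands in for InlinedVector so that the example compiles without onnxruntime headers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an MLAS batch-parameter struct; the real
// MLAS_SGEMM_DATA_PARAMS has different fields.
struct GemmParams {
  const float* A = nullptr;
  const float* B = nullptr;
  float* C = nullptr;
};

// Hypothetical stand-in for the batched GEMM entry point. It only reports
// how many parameter structs it was handed.
inline std::size_t RunGemmBatch(const GemmParams* params, std::size_t count) {
  return params != nullptr ? count : 0;
}

// Fills `count` parameter structs and dispatches them. Stack storage is used
// for count <= 2; larger batches fall back to std::vector.
// `used_heap` is an illustration-only output so callers can see which path ran.
inline std::size_t ComputeBatch(std::size_t count, bool* used_heap = nullptr) {
  // Shared fill + dispatch logic, independent of where the storage lives.
  auto fill_and_run = [count](GemmParams* data) {
    assert(data != nullptr);
    for (std::size_t i = 0; i < count; ++i) {
      data[i] = GemmParams{};  // a real kernel would set pointers and strides here
    }
    return RunGemmBatch(data, count);
  };

  constexpr std::size_t kInlineCapacity = 2;
  if (count <= kInlineCapacity) {
    std::array<GemmParams, kInlineCapacity> data{};  // stack storage, no allocation
    if (used_heap != nullptr) *used_heap = false;
    return fill_and_run(data.data());
  }

  std::vector<GemmParams> data(count);  // one heap allocation
  if (used_heap != nullptr) *used_heap = true;
  return fill_and_run(data.data());
}
```

Calling ComputeBatch(1) or ComputeBatch(2) in this sketch takes the stack path, while ComputeBatch(5) takes the std::vector path. Keeping the fill and dispatch logic in one lambda means the two branches differ only in the container they construct.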

Motivation and Context

Many real CPU MatMul shapes flatten to helper.OutputOffsets().size() == 1. MatMulComputeHelper special-cases an activation × 2D-weight matrix into a single GEMM (offsets {0}), so shapes like the following all hit max_len == 1:

[M, K] x [K, N]
[B, S, K] x [K, N]
[B, H, S, K] x [K, N]

These appear in transformer projections, MLP layers, and classifier heads. Removing a heap allocation from that path is a small, low-risk latency improvement.
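
To make the flattening concrete, here is a rough sketch of how an activation of any rank multiplied by a 2D weight can be treated as a single GEMM. It assumes the leading activation dimensions collapse into one M dimension; FlattenedGemm and FlattenAgainst2DWeight are hypothetical names for this example and are not the MatMulComputeHelper API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Result of flattening: one [M, K] x [K, N] GEMM, plus how many GEMMs are needed.
struct FlattenedGemm {
  std::int64_t M;
  std::int64_t K;
  std::int64_t N;
  std::size_t num_gemms;
};

// Hypothetical helper covering only the "activation x 2D weight" case:
// all leading activation dims are multiplied into M, so one GEMM suffices.
inline FlattenedGemm FlattenAgainst2DWeight(const std::vector<std::int64_t>& lhs_shape,
                                            std::int64_t K, std::int64_t N) {
  assert(!lhs_shape.empty() && lhs_shape.back() == K);
  std::int64_t M = 1;
  for (std::size_t i = 0; i + 1 < lhs_shape.size(); ++i) {
    M *= lhs_shape[i];  // collapse every dim except the trailing K
  }
  return FlattenedGemm{M, K, N, std::size_t{1}};
}
```

Under this assumption, [B, S, K] x [K, N] with B = 2 and S = 3 becomes a single GEMM with M = 6, which is the max_len == 1 case.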

Performance Measurements

Measured on Windows (MSVC, RelWithDebInfo) with onnxruntime_perf_test, using 256-node square MatMul chains ([N,N]·[N,N]). These shapes flatten to max_len == 1, the case this PR optimizes. 5 trials each; medians shown. Per-node saving = (base_p50 − opt_p50) / 256, with the p50 difference converted from µs to ns.

N	base avg (ms)	opt avg (ms)	base p50 (µs)	opt p50 (µs)	per-node saving
1	0.178246	0.177635	159.7	153.4	24.6 ns
8	0.189659	0.182407	167.7	158.4	36.3 ns
16	0.209381	0.194601	178.4	168.4	39.1 ns
32	0.307178	0.293141	279.4	269.7	37.9 ns
64	1.389432	1.362902	1371.2	1333.6	146.9 ns
128	3.403018	3.145609	3305.9	3123.3	713.3 ns

Interpretation (honest scope):

The benefit is a fixed ~36–39 ns per Compute() call, clearly and consistently visible at N=8/16/32. This corresponds to exactly one malloc/free pair being removed.
Because the saving is fixed, its true contribution shrinks as the GEMM grows: ~5–6% at N≤16, ~3.5% at N=32, but well under 1% for N≥64.
The larger apparent deltas at N=64/128 (147 ns / 713 ns per node) exceed what a fixed allocation can account for and are dominated by threaded-GEMM run-to-run variance (the baseline/optimized ranges overlap), so they should not be read as a real ~5% win.
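
The arithmetic behind the table and the percentages above can be reproduced with a short sketch. The function names are hypothetical; the inputs are the p50 values from the table and the 256-node chain length.

```cpp
#include <cassert>

constexpr int kNodesPerChain = 256;  // each benchmark model is a 256-node MatMul chain

// Per-node saving in nanoseconds, from whole-chain p50 latencies in microseconds.
inline double PerNodeSavingNs(double base_p50_us, double opt_p50_us,
                              int nodes = kNodesPerChain) {
  assert(nodes > 0);
  return (base_p50_us - opt_p50_us) * 1000.0 / nodes;
}

// Share (in percent) of the baseline per-node time that a fixed saving represents.
inline double FixedSavingSharePct(double fixed_saving_ns, double base_p50_us,
                                  int nodes = kNodesPerChain) {
  assert(nodes > 0 && base_p50_us > 0.0);
  const double base_per_node_ns = base_p50_us * 1000.0 / nodes;
  return 100.0 * fixed_saving_ns / base_per_node_ns;
}
```

For N=16, PerNodeSavingNs(178.4, 168.4) gives about 39 ns, and FixedSavingSharePct(39.1, 178.4) gives roughly 5.6%. For N=128, a fixed saving of about 38 ns against a 3305.9 µs chain is roughly 0.3% of per-node time, which is why the measured 713 ns delta cannot be attributed to the allocation alone.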

In short: this is a targeted micro-optimization that helps tiny-GEMM-heavy CPU workloads; for mainstream MatMul sizes the effect is below the measurement noise floor. It is intentionally low-risk rather than a broad throughput win.

Validation

Built onnxruntime_perf_test separately for the baseline (std::vector) and optimized (InlinedVector) code for an apples-to-apples comparison.
Built onnxruntime_provider_test with the final change.
Focused MatMul tests pass:

onnxruntime_provider_test.exe --gtest_filter=*MatMul*
[==========] 430 tests from 16 test suites ran.
[  PASSED  ] 430 tests.
YOU HAVE 1 DISABLED TEST

Copilot

Pull request overview

This PR reduces per-invocation heap allocation overhead in the CPU MatMul<float>::Compute() path by using InlinedVector for very small batched GEMM parameter arrays, keeping the common 1–2 GEMM cases allocation-free while preserving the existing std::vector behavior for larger batches.

Changes:

Use InlinedVector<..., 2> for MLAS_SGEMM_DATA_PARAMS when helper.OutputOffsets().size() <= 2.
Apply the same small-batch inlining strategy for the aarch64/Linux fast-math MLAS_SBGEMM_DATA_PARAMS path.
Refactor batch setup into a shared lambda to avoid duplicating the fill+dispatch logic across container types.

 // Licensed under the MIT License.

 #include "core/providers/cpu/math/matmul.h"
+#include "core/common/inlined_containers.h"


+    if (max_len <= 2) {
+      InlinedVector<MLAS_SBGEMM_DATA_PARAMS, 2> data(max_len);
+      gemm_batch(data);
+    } else {
+      std::vector<MLAS_SBGEMM_DATA_PARAMS> data(max_len);
+      gemm_batch(data);
    }


+    if (max_len <= 2) {
+      InlinedVector<MLAS_SGEMM_DATA_PARAMS, 2> data(max_len);
+      gemm_batch(data);
+    } else {
+      std::vector<MLAS_SGEMM_DATA_PARAMS> data(max_len);
+      gemm_batch(data);
    }


Avoid small MatMul batch parameter heap allocations

5a6196f

GopalakrishnanN requested a review from Copilot June 17, 2026 01:43

Copilot started reviewing on behalf of GopalakrishnanN June 17, 2026 01:44

Copilot AI reviewed Jun 17, 2026

GopalakrishnanN requested a review from hariharans29 June 17, 2026 21:12

GopalakrishnanN wants to merge 1 commit into main from GopalakrishnanN/OptimizeAllocations
