Avoid small MatMul batch parameter heap allocations#29085

Open
GopalakrishnanN wants to merge 1 commit into main from GopalakrishnanN/OptimizeAllocations

@GopalakrishnanN GopalakrishnanN commented Jun 17, 2026

Contributor

Description

Reduce per-call heap allocation overhead in the CPU MatMul<float> kernel for the small batched-GEMM case.

MatMul<float>::Compute() builds an array of MLAS batch-parameter structs before calling MlasGemmBatch / MlasSBGemmBatch. Previously this always used std::vector, so even the common single-GEMM path performed one heap allocation per Compute() call.

This change uses stack-backed InlinedVector storage when helper.OutputOffsets().size() <= 2, and keeps the existing std::vector path for larger batches:

  • MLAS_SGEMM_DATA_PARAMS and MLAS_SBGEMM_DATA_PARAMS (aarch64/Linux fast-math path) both use InlinedVector<..., 2> when max_len (that is, helper.OutputOffsets().size()) is at most 2.
  • Larger batches fall back to std::vector, preserving prior behavior (and avoiding the InlinedVector overflow path, which I measured to regress for larger batches).

The struct is ~64 bytes and trivially copyable, so the net effect is removing exactly one malloc/free pair per Compute() call in the small-batch case.
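
The storage selection described above can be sketched with the standard library alone. This is an illustration, not the code in the PR: `GemmParams`, `RunBatch`, and `DispatchBatch` are hypothetical names, `std::array` stands in for `InlinedVector<..., 2>`, and the struct fields are placeholders rather than the real MLAS layout.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an MLAS batch-parameter struct (placeholder fields).
struct GemmParams {
  const float* A = nullptr;
  const float* B = nullptr;
  float* C = nullptr;
};

// Hypothetical stand-in for the batched GEMM entry point: it only reports
// how many parameter structs it was handed.
inline std::size_t RunBatch(const GemmParams* params, std::size_t count) {
  return params != nullptr ? count : 0;
}

// Fills `max_len` parameter structs and dispatches them.
// Returns true when the stack-backed (small) path was taken.
inline bool DispatchBatch(std::size_t max_len, const float* a, const float* b,
                          float* c, std::size_t* ran) {
  // Shared fill + dispatch logic, written once for both storage types.
  auto gemm_batch = [&](GemmParams* data) {
    for (std::size_t i = 0; i < max_len; ++i) {
      data[i].A = a;
      data[i].B = b;
      data[i].C = c;
    }
    *ran = RunBatch(data, max_len);
  };

  if (max_len <= 2) {
    std::array<GemmParams, 2> data{};  // fixed-size stack storage, no heap allocation
    gemm_batch(data.data());
    return true;
  }
  std::vector<GemmParams> data(max_len);  // heap storage for larger batches
  gemm_batch(data.data());
  return false;
}
```

Calling `DispatchBatch(1, ...)` or `DispatchBatch(2, ...)` takes the stack path, while any larger `max_len` falls back to `std::vector`, mirroring the threshold the description states.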

Motivation and Context

Many real CPU MatMul shapes flatten to helper.OutputOffsets().size() == 1. MatMulComputeHelper special-cases an activation × 2D-weight matrix into a single GEMM (offsets {0}), so shapes like the following all hit max_len == 1:

[M, K] x [K, N]
[B, S, K] x [K, N]
[B, H, S, K] x [K, N]

These appear in transformer projections, MLP layers, and classifier heads. Removing a heap allocation from that path is a small, low-risk latency improvement.
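
To make the flattening concrete, here is a simplified model of how many GEMM calls a pair of shapes needs. It is not the real MatMulComputeHelper: `CountGemms` is a hypothetical helper, it assumes both inputs have rank 2 or higher, and it assumes NumPy-style broadcasting of the leading batch dimensions for the general case.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: number of GEMM calls (output offsets) for left x right.
// Precondition (assumption of this sketch): both shapes have rank >= 2.
inline std::size_t CountGemms(const std::vector<int64_t>& left,
                              const std::vector<int64_t>& right) {
  assert(left.size() >= 2 && right.size() >= 2);

  // Activation x 2D weight: all leading dims of `left` fold into M,
  // so the whole MatMul is a single GEMM.
  if (right.size() == 2) {
    return 1;
  }

  // General case (assumed): one GEMM per broadcast batch index.
  const std::size_t left_batch = left.size() - 2;
  const std::size_t right_batch = right.size() - 2;
  const std::size_t batch_rank = std::max(left_batch, right_batch);

  std::size_t count = 1;
  for (std::size_t i = 0; i < batch_rank; ++i) {
    // Align batch dims from the right; a missing dim behaves like 1.
    const int64_t l = i < left_batch ? left[left_batch - 1 - i] : 1;
    const int64_t r = i < right_batch ? right[right_batch - 1 - i] : 1;
    count *= static_cast<std::size_t>(std::max(l, r));
  }
  return count;
}
```

Under this model all three shapes listed above give a count of 1, while a batched right-hand side such as `[2, 3, 5, 8] x [2, 3, 8, 16]` gives 6 and would take the `std::vector` path.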

Performance Measurements

Measured on Windows (MSVC, RelWithDebInfo) with onnxruntime_perf_test, using 256-node square MatMul chains ([N,N]·[N,N], which flatten to max_len == 1, the case this PR optimizes). Each configuration ran 5 trials; medians are shown. Per-node saving = (base_p50 − opt_p50) / 256.

| N | base avg (ms) | opt avg (ms) | base p50 (µs) | opt p50 (µs) | per-node saving (ns) |
|---|---|---|---|---|---|
| 1 | 0.178246 | 0.177635 | 159.7 | 153.4 | 24.6 |
| 8 | 0.189659 | 0.182407 | 167.7 | 158.4 | 36.3 |
| 16 | 0.209381 | 0.194601 | 178.4 | 168.4 | 39.1 |
| 32 | 0.307178 | 0.293141 | 279.4 | 269.7 | 37.9 |
| 64 | 1.389432 | 1.362902 | 1371.2 | 1333.6 | 146.9 |
| 128 | 3.403018 | 3.145609 | 3305.9 | 3123.3 | 713.3 |

Interpretation (honest scope):

  • The benefit is a fixed ~36–39 ns per Compute() call, clearly and consistently visible at N=8/16/32 — exactly one malloc/free pair removed.
  • Because the saving is fixed, its true contribution shrinks as the GEMM grows: ~5–6% at N≤16, ~3.5% at N=32, but well under 1% for N≥64.
  • The larger apparent deltas at N=64/128 (147 ns / 713 ns per node) exceed what a fixed allocation can account for and are dominated by threaded-GEMM run-to-run variance (the baseline/optimized ranges overlap), so they should not be read as a real ~5% win.

In short: this is a targeted micro-optimization that helps tiny-GEMM-heavy CPU workloads; for mainstream MatMul sizes the effect is below the measurement noise floor. It is intentionally low-risk rather than a broad throughput win.
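
The arithmetic behind the table and the percentages can be reproduced with a few lines. This is a sketch for checking the numbers, with hypothetical function names; the 38 ns figure used in the note below is simply a value inside the ~36–39 ns range quoted above.

```cpp
#include <cassert>
#include <cmath>

// Per-node saving in nanoseconds, from whole-chain p50 latencies in microseconds.
inline double PerNodeSavingNs(double base_p50_us, double opt_p50_us, int nodes) {
  return (base_p50_us - opt_p50_us) * 1000.0 / nodes;
}

// Percentage of the whole-chain baseline latency that a fixed per-node
// saving (in nanoseconds) accounts for.
inline double FixedSavingSharePct(double fixed_ns, double base_p50_us, int nodes) {
  return fixed_ns * nodes / (base_p50_us * 1000.0) * 100.0;
}
```

For example, `PerNodeSavingNs(167.7, 158.4, 256)` reproduces the N=8 row, and `FixedSavingSharePct(38.0, 1371.2, 256)` shows that a fixed 38 ns saving explains less than 1% of the N=64 baseline, which is why the larger deltas at N=64/128 are attributed to variance.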

Validation

  • Built onnxruntime_perf_test separately for the baseline (std::vector) and optimized (InlinedVector) code for an apples-to-apples comparison.
  • Built onnxruntime_provider_test with the final change.
  • Focused MatMul tests pass:
onnxruntime_provider_test.exe --gtest_filter=*MatMul*
[==========] 430 tests from 16 test suites ran.
[  PASSED  ] 430 tests.
YOU HAVE 1 DISABLED TEST

Copilot AI left a comment

Contributor

Pull request overview

This PR reduces per-invocation heap allocation overhead in the CPU MatMul<float>::Compute() path by using InlinedVector for very small batched GEMM parameter arrays, keeping the common 1–2 GEMM cases allocation-free while preserving the existing std::vector behavior for larger batches.

Changes:

  • Use InlinedVector<..., 2> for MLAS_SGEMM_DATA_PARAMS when helper.OutputOffsets().size() <= 2.
  • Apply the same small-batch inlining strategy for the aarch64/Linux fast-math MLAS_SBGEMM_DATA_PARAMS path.
  • Refactor batch setup into a shared lambda to avoid duplicating the fill+dispatch logic across container types.

// Licensed under the MIT License.

#include "core/providers/cpu/math/matmul.h"
#include "core/common/inlined_containers.h"
Comment on lines +351 to 357
if (max_len <= 2) {
  InlinedVector<MLAS_SBGEMM_DATA_PARAMS, 2> data(max_len);
  gemm_batch(data);
} else {
  std::vector<MLAS_SBGEMM_DATA_PARAMS> data(max_len);
  gemm_batch(data);
}
Comment on lines +377 to 383
if (max_len <= 2) {
  InlinedVector<MLAS_SGEMM_DATA_PARAMS, 2> data(max_len);
  gemm_batch(data);
} else {
  std::vector<MLAS_SGEMM_DATA_PARAMS> data(max_len);
  gemm_batch(data);
}