Skip to content

[MLAS] Fix NHWC conv support gating#29127

Open
JonathanC-ARM wants to merge 5 commits into
microsoft:mainfrom
JonathanC-ARM:jonclo01/disable_dw_conv_layout_transform
Open

[MLAS] Fix NHWC conv support gating#29127
JonathanC-ARM wants to merge 5 commits into
microsoft:mainfrom
JonathanC-ARM:jonclo01/disable_dw_conv_layout_transform

Conversation

@JonathanC-ARM

Copy link
Copy Markdown
Contributor

Description

This PR fixes a performance regression introduced by overly broad Arm® KleidiAI™ NHWC Conv support checks.

The NHWC transformer was routing depthwise/grouped convolutions into the KleidiAI NHWC NhwcFusedConv path whenever the generic channels-last support predicate returned true. However, many depthwise shapes do not have a dedicated fast KleidiAI kernel and were falling back to the generic NHWC/im2col-style path, which is significantly slower than the existing NCHW execution path.

This change splits the support checks into separate predicates for:

  • generic KleidiAI NHWC imatmul convolution
  • KleidiAI NHWC depthwise convolution

The depthwise predicate is intentionally disabled for now, so unsupported depthwise/grouped convolutions remain on the existing NCHW path. A follow-up PR can enable the depthwise predicate for shapes covered by a dedicated KleidiAI depthwise kernel.

Summary

  • Adds explicit KleidiAI NHWC support predicates.
  • Prevents depthwise/grouped Conv nodes from being transformed into NHWC unless explicitly supported.
  • Keeps regular supported KleidiAI NHWC Conv routing unchanged.
  • Avoids severe regressions on depthwise-heavy models.

Performance impact

Measured with:

onnxruntime_perf_test -x 1 -y 1 -e cpu -r 100 -I <model.onnx>

Representative improvements vs the regressed KleidiAI NHWC baseline:

Model Regressed KleidiAI baseline After this fix Improvement
de_efficientnetlitev3_f32.onnx ~299 ms ~61 ms ~80% faster
de_efficientnetlitev3_f16.onnx ~299 ms ~62 ms ~79% faster
deeplabv3_mobilenetv2_f32.onnx ~783 ms ~49 ms ~94% faster
deeplabv3_mobilenetv2_f16.onnx ~783 ms ~48 ms ~94% faster
mobilenet_v1_f32.onnx ~55 ms ~4 ms ~93% faster
mobilenetv1_ssd_f32.onnx ~92 ms ~9 ms ~90% faster
retinaface_f32.onnx ~289 ms ~38 ms ~87% faster

The fix also restores performance to at or better than the pre-NHWC-routing reference for several affected models.

JonathanC-ARM and others added 2 commits June 17, 2026 12:49
Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
Comment thread onnxruntime/core/mlas/lib/convolve.cpp Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refines the CPU NHWC Conv fast-path gating for Arm KleidiAI by splitting the previous “channels-last supported” predicate into distinct checks for (1) KleidiAI NHWC imatmul Conv and (2) KleidiAI NHWC depthwise/grouped Conv, with depthwise/grouped intentionally disabled for now to avoid routing regressions. This integrates into ORT’s NHWC transformer and CPU Conv kernel selection logic, ensuring only shapes with a known fast KleidiAI kernel are transformed/executed in NHWC.

Changes:

  • Introduces explicit MLAS support predicates for KleidiAI NHWC imatmul Conv and KleidiAI NHWC depthwise Conv (currently stubbed to false).
  • Updates NHWC transformer + CPU Conv fast-path gating to use the new predicates, preventing depthwise/grouped Conv from being routed to the KleidiAI NHWC path.
  • Updates related unit tests and MLAS benchmarks to use the new predicate split.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
onnxruntime/test/optimizer/nhwc_transformer_test.cc Updates NHWC capability helper to use new KleidiAI imatmul/depthwise predicates.
onnxruntime/test/mlas/bench/bench_sconv.cpp Updates KleidiAI NHWC benchmark gating to accept imatmul or depthwise predicate.
onnxruntime/test/contrib_ops/fused_conv_test.cc Updates NHWC fused-conv support helper to use new predicates.
onnxruntime/core/providers/cpu/nn/conv.cc Refines NHWC fast-path selection to require explicit KleidiAI imatmul/depthwise support.
onnxruntime/core/optimizer/nhwc_transformer.cc Updates NHWC transformer Conv filter to use split KleidiAI support predicates.
onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp Switches KleidiAI override capability check to the imatmul-specific predicate.
onnxruntime/core/mlas/lib/convolve.cpp Adds new KleidiAI imatmul predicate, introduces depthwise predicate stub (disabled), and rewraps legacy symmetric predicate.
onnxruntime/core/mlas/inc/mlas.h Declares the new KleidiAI NHWC support predicate APIs.

Comment thread onnxruntime/core/mlas/inc/mlas.h Outdated
Comment thread onnxruntime/core/mlas/lib/convolve.cpp Outdated
Comment thread onnxruntime/test/optimizer/nhwc_transformer_test.cc
Comment thread onnxruntime/test/contrib_ops/fused_conv_test.cc Outdated
@hariharans29

Copy link
Copy Markdown
Member

Review of PR #29127Fix NHWC conv support gating

Verdict: approve with optional follow-ups. Real bug, real fix, strong perf numbers, clean diff. A few naming/cleanup points worth raising as nits but none are blocking.

What the PR does

Splits the old single predicate MlasConvSupportsSymmetricChannelsLast2DFloatKernel into two:

  • MlasConvSupportsKleidiAIImatmulChannelsLast2DFloatKernel — keeps the regular-conv branch only (GroupCount == 1 && FilterCount > 1, kernel ≥ 3×3, etc.).
  • MlasConvSupportsKleidiAIDepthwiseChannelsLast2DFloatKernelunconditionally returns false for now; placeholder for a future dedicated KleidiAI depthwise kernel.

The retained MlasConvSupportsSymmetricChannelsLast2DFloatKernel is reduced to a thin forwarder to the imatmul predicate (ABI-compat shim).

All five call sites are rewired to call the imatmul predicate first and, where the depthwise path is also a candidate (Conv kernel + NHWC transformer + fused conv test + bench + nhwc_transformer test), the depthwise predicate via short-circuit OR. The KleidiAI SME CheckCapabilitiesSme only routes through the imatmul predicate.

Correctness — fix is sound

Old behavior in onnxruntime/core/mlas/lib/convolve.cpp:1395-1405 (pre-PR):

const bool is_depthwise = GroupCount > 1;
if (is_depthwise) {
  if (FilterCount != 1) { return false; }
} else if (FilterCount <= 1) {
  return false;
}

This accepted depthwise shapes (group>1, filter==1, kernel≥3×3) as "NHWC fast-path supported." The NHWC transformer then converted those nodes to NHWC layout, and at runtime the KleidiAI SME path lacked a dedicated depthwise kernel — execution fell through to a generic im2col-style NHWC path that was substantially slower than the pre-existing NCHW path. Author's perf numbers confirm 80–94% regressions on depthwise-heavy models (efficientnet-lite, deeplabv3+mobilenetv2, mobilenet, retinaface).

New imatmul predicate flatly rejects GroupCount != 1 at onnxruntime/core/mlas/lib/convolve.cpp:1384-1386, and the depthwise predicate is a perma-false stub. So the NHWC transformer leaves grouped/depthwise Conv nodes on the NCHW path. That's exactly the right fix.

The behavioral change vs main: any shape with GroupCount > 1 is now rejected from NHWC. Pre-PR, only GroupCount > 1 && FilterCount != 1 was rejected; pre-PR depthwise with FilterCount == 1 would be accepted. Post-PR, that subset is also rejected. This is the intended regression fix. No regular (group=1) conv case changes behavior.

Concerns / nits (none blocking)

1. The retained MlasConvSupportsSymmetricChannelsLast2DFloatKernel shim has zero in-tree callers post-merge. All five callsites are rewired in this PR. Keep it as a public ABI guarantee for downstream consumers of mlas.h, or remove it as dead code? If MLAS public API is treated as stable across versions (include/onnxruntime/core/session/onnxruntime_c_api.h is the firm boundary, but core/mlas/inc/mlas.h is internal-to-ORT IIRC), I'd lean toward removing the shim and the wrapper symbol — it's a misleading name now (forwards to imatmul only, not "symmetric"). Optional.

2. Naming bakes "KleidiAI" into a generic-looking MLAS predicate name. MlasConvSupportsKleidiAIImatmulChannelsLast2DFloatKernel is descriptive but couples the public-ish API to a specific Arm backend. If a non-Arm NHWC fast path ever appears, the name becomes a lie. Cleaner alternative:

MlasConvSupportsImatmulChannelsLast2DFloatKernel       // implementation = currently KleidiAI on Arm
MlasConvSupportsDepthwiseChannelsLast2DFloatKernel

Or, since "imatmul" is itself a KleidiAI internal term, just describe the capability:

MlasConvSupportsNonDepthwiseChannelsLast2DFloatKernel
MlasConvSupportsDepthwiseChannelsLast2DFloatKernel

The #if defined(USE_KLEIDIAI) && defined(MLAS_TARGET_ARM64) guard inside the body already encodes the platform coupling; the symbol name doesn't need to. Bikeshed-territory but worth flagging — public MLAS symbols are hard to rename later.

3. The depthwise predicate has no #if defined(USE_KLEIDIAI) && defined(MLAS_TARGET_ARM64) guard. Currently fine because the body is a single return false. But when the follow-up PR wires the real depthwise kernel, the author needs to remember to mirror the imatmul predicate's guard at onnxruntime/core/mlas/lib/convolve.cpp:1360 — otherwise the new body will try to compile on non-Arm/non-KleidiAI builds. Worth a // TODO(kleidiai-depthwise): add #if defined(USE_KLEIDIAI) ... breadcrumb in the stub body.

4. The unconditional return false stub uses ten MLAS_UNREFERENCED_PARAMETER lines. Acceptable for a soon-to-be-real function. If the follow-up isn't imminent, consider removing the parameters entirely and reintroducing them when the body materializes. Mild preference; either works.

5. No negative-path unit test in onnxruntime/test/optimizer/nhwc_transformer_test.cc. The perf numbers cover behavior end-to-end, but a focused regression test asserting that a depthwise/grouped Conv input graph is not transformed to NHWC would lock the fix in. Otherwise, someone restoring the old single-predicate behavior in a future refactor wouldn't be caught at PR time. Add one (e.g., NhwcTransformerTests.GroupedConvStaysOnNchwPath) — cheap insurance, given the original bug was a silent gating regression.

6. Formatting in the shim (already flagged by @hariharans29 inline): the continuation lines under the multi-line forwarder at onnxruntime/core/mlas/lib/convolve.cpp:1456-1465 look off in the GitHub diff view. Running lintrunner -a should fix it; standard clang-format style for MLAS aligns continuations under the open-paren column.

Sanity checks done

Bottom line

This is the right fix and the perf numbers are compelling. The split-predicate scaffolding for a future depthwise kernel is forward-looking and worth carrying. Land it after the formatting nit is addressed; the naming and test-coverage points are optional and can be deferred to the follow-up PR that fills in the depthwise predicate body.

Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
@hariharans29

Copy link
Copy Markdown
Member

Re-review of PR #29127Fix NHWC conv support gating

Verdict: approve once the formatting nit on the shim is addressed and the shim's semantic change is acknowledged. Substantively unchanged since the prior review; commit 595d7d7 is just a main-merge to refresh CI. CI is currently in-flight (24/74 reported green at fetch time; no regressions yet). The performance numbers in the PR description (~80–94 % wins on EfficientNet-Lite-v3, DeepLab-v3, MobileNet-v1/v2, RetinaFace) more than justify the gating split.

Things that look right

1. Shape of the API split is correct. Two predicates (MlasConvSupportsKleidiAIImatmulChannelsLast2DFloatKernel for the dense imatmul kernel, MlasConvSupportsKleidiAIDepthwiseChannelsLast2DFloatKernel as a placeholder for the future depthwise kernel) is the clean factoring. Callers OR them together to compute the "NHWC fastpath is accepted" boolean. When a real depthwise kernel ships, only the depthwise predicate body changes — callers don't need updating.

2. Tightened imatmul predicate. Replaces the prior is_depthwise/else mux with a single hard-reject at the top of convolve.cpp:1384-1386 (if (GroupCount != 1 || FilterCount <= 1) return false;). Functionally equivalent to the old "non-depthwise + FilterCount > 1" branch but reads as a single intent. The depthwise branch's previous predicate is intentionally not preserved — that's the regression fix.

3. Backward-compat shim preserves the public symbol. convolve.cpp:1442-1466 keeps MlasConvSupportsSymmetricChannelsLast2DFloatKernel as a delegating shim instead of deleting it. Right call — it's in mlas.h and removing it would be an ABI break for downstream consumers of MLAS as a static-library API.

4. InputChannels plumbed into the depthwise predicate everywhere. All five call sites consistently pass the per-group input channels (weight_shape[1] from [M, C/group, kH, kW] Conv weight layout, or C / group_count). Even though the predicate's body currently MLAS_UNREFERENCED_PARAMETERs the arg, callers are already supplying the right value so the depthwise kernel hookup will be a body-only edit.

5. Performance numbers are convincing. A 5-10× speedup on multiple production-class models (DeepLab-v3, MobileNet-v1/v2, RetinaFace) by reverting depthwise back to NCHW is decisive — it strongly suggests this regression should also be backported to any release branches that picked up the over-broad NHWC routing.

Concerns

1. Formatting nit on the shim is still standing. convolve.cpp:1456-1465 — the continuation args after Dimensions, aren't aligned to the open paren the way the rest of the file does it. The function-name length is making it visually ambiguous. Either reformat as:

return MlasConvSupportsKleidiAIImatmulChannelsLast2DFloatKernel(
    Dimensions,
    BatchCount,
    GroupCount,
    InputShape,
    KernelShape,
    DilationShape,
    Padding,
    StrideShape,
    FilterCount,
    Beta);

(break-after-open-paren, single 4-space indent — matches the predominant style in convolve.cpp), or run lintrunner -a so clang-format picks it up consistently.

2. The shim's accept set has silently narrowed. Before this PR, MlasConvSupportsSymmetricChannelsLast2DFloatKernel accepted depthwise convs with FilterCount == 1 (the old is_depthwise branch eventually returned true if other gates passed). After the PR, the shim delegates to the new imatmul-only predicate, which hard-rejects GroupCount != 1. So any external caller still using the legacy symbol silently loses depthwise NHWC routing.

In-tree callers were all updated, so this is invisible inside the repo — but the public-API symbol's behavior changed without the symbol itself changing. Two options:

  • Recommended: document this in the function comment in include/onnxruntime/core/mlas/inc/mlas.h — e.g., // Equivalent to MlasConvSupportsKleidiAIImatmulChannelsLast2DFloatKernel; the depthwise NHWC kernel is selected via MlasConvSupportsKleidiAIDepthwiseChannelsLast2DFloatKernel. Pre-#29127 behavior also accepted depthwise convs; that path now stays on the NCHW kernel until the dedicated depthwise NHWC kernel ships.
  • Alternative: deprecate the shim outright with [[deprecated("Use MlasConvSupportsKleidiAI{Imatmul,Depthwise}ChannelsLast2DFloatKernel")]]. Heavy-handed but signals intent.

Either way, the silent semantic shift on a still-exported public symbol shouldn't go in undocumented.

3. The 5-way imatmul || depthwise duplication is a maintenance hazard. The pattern

MlasConvSupportsKleidiAIImatmulChannelsLast2DFloatKernel(...)
  || MlasConvSupportsKleidiAIDepthwiseChannelsLast2DFloatKernel(..., input_channels_per_group, ...)

now appears in nhwc_transformer.cc, conv.cc, fused_conv_test.cc, bench_sconv.cpp, and nhwc_transformer_test.cc. A single helper would be cleaner:

// In mlas.h or a kleidiai-local header
bool MLASCALL MlasConvSupportsKleidiAIChannelsLast2DFloatKernel(
    size_t Dimensions, size_t BatchCount, size_t GroupCount,
    size_t InputChannels,
    const size_t* InputShape, const size_t* KernelShape,
    const size_t* DilationShape, const size_t* Padding,
    const size_t* StrideShape, size_t FilterCount, float Beta);

…implemented as the OR of the two specialized predicates. The five call sites become one-line dispatches. This also gives you a natural place to document "this is the umbrella predicate; the two specialized variants exist for callers that need to distinguish dense imatmul from depthwise."

Could be deferred to a follow-up if you want this PR to stay tight on the regression fix.

4. Depthwise predicate exposed in mlas.h while still a no-op. mlas.h:957-970 puts MlasConvSupportsKleidiAIDepthwiseChannelsLast2DFloatKernel on the public MLAS API surface despite the body being a hard return false. Acceptable because callers need a stable name to compile against pre-implementation, but the doc-comment should be explicit that "this predicate currently returns false unconditionally and will be enabled once the Arm KleidiAI depthwise NHWC kernel is integrated; signature is stable." Otherwise a downstream consumer reading the header sees a fully-spec'd predicate with no indication that it's a stub.

Nits

5. The TODO has no tracking pointer. convolve.cpp:1437-1439:

// Regression fix: keep depthwise/grouped convolutions out of the Arm® KleidiAI™
// NHWC path until the dedicated depthwise kernel is integrated.
return false;

If there's a GitHub issue tracking the depthwise-kernel integration work, link it here so a future reader can find the status quickly.

Bottom line

The substantive design is good and the perf wins are real and large. Three things to address before merge:

  1. Fix the formatting on the legacy-shim body (or run lintrunner -a).
  2. Document — in mlas.h — that MlasConvSupportsSymmetricChannelsLast2DFloatKernel's accept set has narrowed (no longer accepts depthwise).
  3. Document that MlasConvSupportsKleidiAIDepthwiseChannelsLast2DFloatKernel is currently a stub.

The 5-way OR duplication consolidation is suggested for a follow-up — fine to defer if you want to keep this PR scoped to the regression fix.

Approve once 1–3 are addressed.

@hariharans29

Copy link
Copy Markdown
Member

Re-review of PR #29127Fix NHWC conv support gating (updated)

Verdict: approve, with one minor stylistic suggestion. The author has addressed every substantive concern from the prior review and the changes are cleaner than the alternatives I suggested.

What changed since my last pass

1. The "silent semantic narrowing" of the shim is gone — the symbol was renamed instead. mlas.h:945 and convolve.cpp:1348 rename MlasConvSupportsSymmetricChannelsLast2DFloatKernelMlasConvSupportsDenseChannelsLast2DFloatKernel. This is the right move:

  • No backward-compat shim with surprising behavior.
  • Any external caller using the old name gets a clean compile error rather than a silent regression on depthwise routing.
  • The name Dense is also more accurate — the old Symmetric was a misleading hangover (the predicate has nothing to do with symmetric quantization; it's about the dense imatmul kernel).
  • Yes this is a soft ABI break for the public MLAS symbol, but MLAS isn't typically consumed as a stable stand-alone library and the explicit-compile-error path is far safer than a behavioral change under the same symbol.

2. The formatting nit is moot. No shim, no awkward continuation indentation. Resolved by removal.

3. The depthwise predicate's parameter is now InputChannelsPerGroup. Better than the earlier InputChannels — disambiguates against the total channel count.

4. New regression-locking test. nhwc_transformer_test.cc renames ConvDepthwiseFloat_UsesHelperCapabilityConvDepthwiseFloat_SkipNhwcUntilDepthwiseKernelEnabled and replaces the previous conditional capability check with hardcoded:

EXPECT_FALSE(HasFloatNhwcNoTransposeSupport({1, 8, 7, 7}, {8, 1, 3, 3}, {}, {}, {}, 8));
EXPECT_EQ(op_to_count["Conv"], 1);
EXPECT_EQ(op_to_count["com.microsoft.NhwcFusedConv"], 0);
EXPECT_EQ(op_to_count["Transpose"], 0);

This locks in the "depthwise stays on NCHW" invariant at the graph-transformer level. Bonus: when the depthwise kernel eventually ships and this test starts failing, the test's name will guide the future maintainer to the right follow-up. Self-documenting failure mode.

5. New weight_shape[1] <= 0 guards in fused_conv_test.cc and nhwc_transformer_test.cc before extracting input_channels_per_group. Defensive against unknown-dim sentinels (-1) and zero. Reasonable.

6. The depthwise predicate doc-comment is more explicit. The TODO at convolve.cpp now reads:

// TODO: enable only for shapes supported by the dedicated
// depthwise kernel. Until then, keep depthwise/grouped convolutions out of
// the Arm® KleidiAI™ NHWC path.
return false;

Clear enough for a future implementer.

One small remaining nit

MlasConvSupportsDepthwiseChannelsLast2DFloatKernel body has two #if/#else branches that both return false. Currently functionally identical — the split exists purely as scaffolding for the eventual KleidiAI body in the #else branch. Suggest collapsing for now:

bool
MLASCALL
MlasConvSupportsDepthwiseChannelsLast2DFloatKernel(...) {
    MLAS_UNREFERENCED_PARAMETER(Dimensions);
    // ... all the other UNREFERENCED_PARAMETERs
    // TODO: enable only for shapes supported by the dedicated KleidiAI
    // depthwise NHWC kernel. Until then (and on non-KleidiAI builds), keep
    // depthwise/grouped convolutions out of the NHWC path.
    return false;
}

When the kernel ships, add #if defined(USE_KLEIDIAI) && defined(MLAS_TARGET_ARM64) around the new logic only. Less duplicated unreferenced-param noise, same end state.

This is genuinely stylistic — feel free to leave as-is if you prefer the scaffolding-now form.

Things still worth tracking as follow-ups (not blocking)

  • The 5-way dense || depthwise OR pattern is still duplicated across nhwc_transformer.cc, conv.cc, fused_conv_test.cc, bench_sconv.cpp, and nhwc_transformer_test.cc. A MlasConvSupportsAnyKleidiAIChannelsLast2DFloatKernel(...) umbrella would consolidate. Defer; the OR shape is clear enough.

  • When the depthwise kernel lands, the ConvDepthwiseFloat_SkipNhwcUntilDepthwiseKernelEnabled test name should be re-thought (probably split into "shapes the new kernel covers → NHWC" and "shapes it doesn't → stays NCHW"). Out of scope here.

Bottom line

Both substantive concerns from my prior pass — the shim's silent semantic narrowing and the formatting on the delegation — are gone, addressed by the cleaner rename approach. The added regression-locking test is a nice extra. Only the optional #if/#else collapse is left, and it's a stylistic choice.

LGTM. Approve.

@hariharans29 hariharans29 changed the title Fix NHWC conv support gating [MLAS] Fix NHWC conv support gating Jun 18, 2026
@hariharans29 hariharans29 added release:1.27.1 regression issues that demonstrate a regression in ORT functionality and need to be addressed immediately labels Jun 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

regression issues that demonstrate a regression in ORT functionality and need to be addressed immediately release:1.27.1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants