QMoE: fail loudly when weights_prepacked=0 but PrePack did not run #28965
Merged
justinchuby merged 4 commits on Jun 11, 2026
Copilot AI changed the title from "[WIP] Fix silent wrong output when weights_prepacked is set to 0" to "QMoE: fail loudly when weights_prepacked=0 but PrePack did not run" on Jun 9, 2026
Pull request overview
This PR hardens the CUDA QMoE execution path to avoid silent wrong outputs when a model relies on PrePack()-produced CUTLASS layouts (the int path with weights_prepacked=0, and native wfp4afp8) but prepacking never ran (e.g., when session.disable_prepacking is set).
Changes:
- Add an INVALID_ARGUMENT guard for quant_type='int' && weights_prepacked=0 when the required prepacked int-weight buffers are missing.
- Add an INVALID_ARGUMENT guard for native wfp4afp8 when the repacked FP4 weight buffers are missing, instead of falling back to raw initializer bytes.
tianleiwu approved these changes on Jun 11, 2026
Verdict: Approve.
This is a clean, well-scoped defensive fix that converts a silent wrong-output path into a loud, actionable error.
Correctness
- The int guard is placed right after int_weights_consumed_by_prepack is computed and fires exactly when is_int && !weights_prepacked_ but one of the prepack buffers is null, i.e. the negation of int_weights_consumed_by_prepack within that subset. After the guard, the weights_prepacked=0 int path is guaranteed to have both buffers, which also closes the partial-prepack concern noted in the surrounding comment.
- The wfp4afp8 native guard is correct: PrePackRepackFP4Weights is the only producer of packed_fp4_fc{1,2}_weights_, and the weights_prepacked attribute is int-only, so a null repacked buffer on the native path can only mean PrePack did not run. The previous fall-through to raw initializer bytes was genuinely unsafe.
- The is_packed handling on those paths keeps the source initializer alive, so failing loudly here does not risk dangling pointers.
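
The "negation within that subset" claim in the first bullet can be checked exhaustively. The sketch below is not the ONNX Runtime source: the State struct and both functions are hypothetical, and the definition of int_weights_consumed_by_prepack (both buffers present on the weights_prepacked=0 int path) is an assumption inferred from the review text.

```cpp
#include <cassert>

// Hypothetical model of the four flags discussed above; not ORT code.
struct State {
  bool is_int;             // quant_type == "int"
  bool weights_prepacked;  // the weights_prepacked attribute
  bool has_packed_fc1;     // packed_fc1_weights_ != nullptr
  bool has_packed_fc2;     // packed_fc2_weights_ != nullptr
};

// Assumed definition: PrePack consumed the int weights only if both buffers exist.
bool IntWeightsConsumedByPrepack(const State& s) {
  return s.is_int && !s.weights_prepacked && s.has_packed_fc1 && s.has_packed_fc2;
}

// The guard condition as described in the review.
bool GuardFires(const State& s) {
  return s.is_int && !s.weights_prepacked && (!s.has_packed_fc1 || !s.has_packed_fc2);
}

// Enumerates all 16 flag combinations and counts those where the guard differs
// from "in the subset and not consumed by prepack".
int CountMismatches() {
  int mismatches = 0;
  for (int bits = 0; bits < 16; ++bits) {
    const State s{(bits & 1) != 0, (bits & 2) != 0, (bits & 4) != 0, (bits & 8) != 0};
    const bool in_subset = s.is_int && !s.weights_prepacked;
    const bool expected = in_subset && !IntWeightsConsumedByPrepack(s);
    if (GuardFires(s) != expected) {
      ++mismatches;
    }
  }
  return mismatches;
}
```

Under these assumptions, CountMismatches() returns 0, and the partial-prepack case (exactly one buffer present) is rejected by the guard rather than passed through.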
Tests
- The new QMoETest_CUDA_Int4_DisablePrepackingFailsLoudly exercises the int branch via session.disable_prepacking=1 and asserts the actionable failure message. Because the guard returns before any kernel launch, it runs on any SM ≥ 700 and doesn't depend on CUTLASS shape constraints.
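
The shape of that check (failure status, actionable message, no kernel reached) can be illustrated without ONNX Runtime. Everything in this sketch is a hypothetical stand-in: Status, FakeQMoE, and the error text are invented for illustration and do not reproduce the real test or the real message.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for illustration; these are not ONNX Runtime types.
struct Status {
  bool ok;
  std::string message;
};

struct FakeQMoE {
  bool prepack_ran = false;  // false models session.disable_prepacking=1
  int kernel_launches = 0;   // counts how often the "kernel" was reached

  Status Compute() {
    // Guard first: bail out before any kernel work when PrePack did not run.
    if (!prepack_ran) {
      // Made-up message text; the real wording lives in the PR diff.
      return Status{false, "QMoE: weights_prepacked=0 requires PrePack() to have run"};
    }
    ++kernel_launches;
    return Status{true, ""};
  }
};

// Mirrors the regression check: failure, expected text present, no launch.
bool FailsLoudlyWithoutLaunch(FakeQMoE& op, const std::string& expected_substring) {
  const Status status = op.Compute();
  return !status.ok &&
         status.message.find(expected_substring) != std::string::npos &&
         op.kernel_launches == 0;
}
```

Because the guard returns before the launch counter is touched, a check of this form does not depend on what the kernel would have required, which is the property the review relies on for hardware independence.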
One optional maintainability suggestion left inline.
Description
When a QMoE model sets weights_prepacked=0 (raw [E, N, K/pack] int weights) and the session has session.disable_prepacking set, PrePack() never runs, so packed_fc{1,2}_weights_ stay null and int_weights_consumed_by_prepack is false. The code then falls through to the raw initializer pointers, but those bytes are not in CUTLASS layout, so the runner consumes them as if they were prepacked and produces silently wrong output with no diagnostic.
Changes in onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc (QMoE::ComputeInternal):
- Int path: an INVALID_ARGUMENT guard. When is_int && !weights_prepacked_ but either prepack buffer is null, return a clear error instead of feeding non-CUTLASS bytes to the runner.
- wfp4afp8 path: the fall-through (packed_fp4_fc{1,2}_weights_ ? ... : raw) is replaced with an explicit guard that errors when the repacked FP4 buffers were not produced.
Also added a focused regression test in onnxruntime/test/contrib_ops/moe_test.cc covering quant_type='int' with weights_prepacked=0 and session.disable_prepacking=1, asserting that QMoE fails with an actionable error instead of producing output.
Merged the branch with the latest main.
Motivation and Context
A prior fix removed the null-pointer crash on this path but left a misleading-success outcome that is newly user-reachable via the weights_prepacked=0 contract, the exact silent-failure mode the offline-path work set out to eliminate. These guards convert that into a loud, actionable error. The wfp4afp8 branch shares the same fall-through and is hardened for consistency.
The added regression test ensures this fail-loudly behavior remains covered going forward, especially when prepacking is disabled at the session level.
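
To make the difference between the old fall-through and the new guard concrete, here is a minimal sketch of the two patterns. It is not the code from moe_quantization.cc: the Status struct, both function names, and the error text are hypothetical, and plain byte pointers stand in for the real weight buffers.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical status type for illustration; not an ONNX Runtime type.
struct Status {
  bool ok;
  std::string message;
};

// Fall-through pattern: silently substitutes the raw initializer bytes when
// the repacked buffer is missing, so the caller cannot tell anything is wrong.
const std::uint8_t* SelectWithFallThrough(const std::uint8_t* repacked,
                                          const std::uint8_t* raw) {
  return repacked ? repacked : raw;
}

// Guarded pattern: a missing repacked buffer becomes an explicit error, and
// no pointer is handed out in that case.
Status SelectWithGuard(const std::uint8_t* repacked, const std::uint8_t** selected) {
  *selected = nullptr;
  if (repacked == nullptr) {
    // Made-up message text for this sketch.
    return Status{false, "repacked FP4 weights are missing; PrePack did not run"};
  }
  *selected = repacked;
  return Status{true, ""};
}
```

With the fall-through, a null repacked pointer yields the raw pointer and execution continues on bytes in the wrong layout. With the guard, the same input yields a failed status and a null output, so the caller has to stop.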