[CUDA] Fix QMoE int4/int8 weight prepack to always use SM80 layout #28978
Merged
justinchuby
previously approved these changes
Jun 10, 2026
justinchuby
left a comment
Contributor
Thanks for catching this!
Contributor
Pull request overview
This PR fixes incorrect INT4/INT8 QMoE results on Hopper by ensuring weight prepacking always produces the SM80 (column-interleaved) fpA_intB layout that the grouped MoE GEMM actually consumes (since the runtime dispatch always routes to the SM80 kernel even on SM90). It also centralizes the “arch → layout group” clamping logic and updates tests/docs to reflect the SM80-always behavior.
Changes:
- Add a shared get_arch_for_mixed_gemm_weight_preprocess() helper and route layout selection through it.
- Force CUDA QMoE INT4/INT8 expert-weight PrePack to emit SM80 layout irrespective of runtime SM.
- Update Python tests and documentation to pin/describe SM80 layout usage and tighten parity tolerances.
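For orientation, the clamping rule that the PR description gives for the shared helper (<80→75, 90→90, else 80) can be modeled in a few lines of Python. This is only a sketch of the behavior, not the C++ implementation: clamp_arch_to_layout_group, kernel_layout_group, packed_layout_before_fix, and packed_layout_after_fix are hypothetical names, and the "before" function is an assumed simplification of the old prepack path.

```python
def clamp_arch_to_layout_group(arch: int) -> int:
    """Model of the clamping rule: <80 -> 75, 90 -> 90, anything else -> 80."""
    if arch < 80:
        return 75
    if arch == 90:
        return 90
    return 80


def kernel_layout_group(runtime_sm: int) -> int:
    """Layout group the grouped MoE GEMM consumes for int4/int8 weights.

    Per the PR, dispatch always routes to the SM80 kernel, so the
    runtime SM is deliberately ignored in this model.
    """
    return 80


def packed_layout_before_fix(runtime_sm: int) -> int:
    """Assumed model of the old behavior: pack for the runtime SM (SM90 passes through)."""
    return clamp_arch_to_layout_group(runtime_sm)


def packed_layout_after_fix(runtime_sm: int) -> int:
    """New behavior: always pack the SM80 layout, whatever the device is."""
    return clamp_arch_to_layout_group(80)


if __name__ == "__main__":
    for sm in (80, 86, 89, 90):
        before_ok = packed_layout_before_fix(sm) == kernel_layout_group(sm)
        after_ok = packed_layout_after_fix(sm) == kernel_layout_group(sm)
        print(f"SM{sm}: before fix match={before_ok}, after fix match={after_ok}")
```

In this model only SM90 packs a layout that differs from what the kernel consumes, which is the case the PR addresses.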
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| onnxruntime/test/python/transformers/test_qmoe_cuda.py | Removes dead mixed-GEMM prepack helper; minor comment text update. |
| onnxruntime/test/python/transformers/test_moe_cuda.py | Pins offline packing to arch=80 and tightens FP16 INT4/INT8 tolerances. |
| onnxruntime/python/onnxruntime_pybind_quant.cc | Replaces ad-hoc SM allowlist with centralized arch-clamping helper. |
| onnxruntime/core/graph/contrib_ops/contrib_defs.cc | Updates QMoE weights_prepacked attribute docs to be EP-determined. |
| onnxruntime/contrib_ops/cuda/moe/moe_quantization.h | Expands/clarifies CUDA EP semantics around weights_prepacked. |
| onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc | Forces SM80 layout packing for int weights; updates comments and gating text. |
| onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors.h | Declares shared get_arch_for_mixed_gemm_weight_preprocess() helper. |
| onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors_impl.h | Uses the shared helper to select layout details for transforms. |
| onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors_impl.cu | Implements the shared helper that clamps SM to layout groups. |
| docs/ContribOperators.md | Regenerates schema docs for updated weights_prepacked description. |
| docs/contrib_ops/cuda/moe_qmoe.md | Updates CUDA MoE/QMoE documentation for weights_prepacked and SM80-always rationale. |
justinchuby
approved these changes
Jun 10, 2026
Summary
The CUDA QMoE INT4/INT8 grouped GEMM always dispatches to the Ampere (SM80) CUTLASS kernel, even on Hopper (SM90), because mixed int-weight + fp16/bf16 activation is not a valid Hopper TMA warp-specialized specialisation. This PR makes weight prepacking always emit the SM80 (column-interleaved) fpA_intB layout regardless of the runtime device SM, fixing silently-wrong output on Hopper, and centralizes the arch-clamping logic in a single shared helper. It also cleans up the related tests and tightens MoE parity tolerances that were too loose to catch the layout bug.

Motivation
#28749 uses 90 for sm90 weight prepacking.
On SM90, isValidHopperMOESpecialisation<half_t, uint4b_t/uint8_t>() is false, so the grouped MoE GEMM falls back to the SM80 kernel. The weight preprocessor, however, skips column interleaving for arch == 90, so an auto-detected (force_arch=-1) pack on an H200 produced the non-interleaved SM90 layout that the SM80 kernel cannot consume, yielding wrong results. The previous PrePackIntExpertWeights logic clamped to sm_ (passing SM90 through), and the test that exercised the offline packer used auto-detect, so both could emit the wrong layout.

Key Changes
| File | Change |
|---|---|
| fpA_intB_gemm_preprocessors{.h,_impl.cu} | Added get_arch_for_mixed_gemm_weight_preprocess(int arch) as a shared, declared helper (clamps SM to the layout group: <80→75, 90→90, else 80). |
| fpA_intB_gemm_preprocessors_impl.h | getLayoutDetailsForTransform now routes through the shared helper instead of duplicating the arch-range logic. |
| moe_quantization.cc (PrePackIntExpertWeights) | Forces SM80 layout packing (get_arch_for_mixed_gemm_weight_preprocess(80)) instead of clamping to the runtime sm_, since the SM80 kernel runs on every GPU. |
| onnxruntime_pybind_quant.cc (PackWeightsForMixedGemm) | Replaced the ad-hoc {75,80,90} allowlist with the shared helper, so force_arch is clamped consistently with the runtime dispatch (removes the now-unused <set> include). |
| contrib_defs.cc / moe_quantization.h | Updated the weights_prepacked schema/field docs: layouts for -1/1 are EP-determined; for the CUDA EP, -1 and 1 are equivalent today (both SM80), with 1 reserved for a future Hopper-specific layout. |
| test_qmoe_cuda.py | Removed the dead preprocess_weights_for_mixed_gemm helper; the real path (quant_dequant_blockwise) already pins sm=80. |
| test_moe_cuda.py | Pinned offline packing to arch=80, and tightened FP16 QMoE parity tolerance from atol 3.0 (4-bit) / 2.0 (8-bit) to 0.5 now that the layout is correct. |
| Docs | Regenerated docs/ContribOperators.md and updated moe_qmoe.md to match the new schema docs and SM80-always packing rationale. |

Testing Notes
On an H200 (SM90), with the CUDA 12.x/13.x Python wheel:
    python -m pytest onnxruntime/test/python/transformers/test_qmoe_cuda.py
    python -m pytest onnxruntime/test/python/transformers/test_moe_cuda.py -k "PhiQMoE or qmoe"

- test_qmoe_cuda.py SwiGLU parity: SM80 layout → max diff ~0.001 (pass, tol 0.1); the prior SM90 layout produced max diff ~1.2 (fail), confirming the fix.
- test_moe_cuda.py TestPhiQMoE (4-bit and 8-bit, all batch/seq combinations): worst observed max_diff ≈ 0.375 with the fixed layout, comfortably under the new atol=0.5.
- ruff check passes on both edited test files.
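
To illustrate why the tolerance change matters, here is a small absolute-tolerance check in plain Python. It is a sketch under stated assumptions: max_abs_diff and passes_atol are hypothetical helpers, the vectors are made up so that their errors equal the magnitudes quoted above (0.375, and roughly 1.2 borrowed from the test_qmoe_cuda.py run), and the real tests may compare differently, for example by also applying a relative tolerance.

```python
def max_abs_diff(actual, expected):
    """Largest element-wise absolute difference between two equal-length sequences."""
    return max(abs(a - e) for a, e in zip(actual, expected))


def passes_atol(actual, expected, atol):
    """True when every element is within the absolute tolerance."""
    return max_abs_diff(actual, expected) <= atol


if __name__ == "__main__":
    expected = [0.0, 1.0, -2.0]
    small_error = [0.0, 1.375, -2.0]  # illustrative: max diff 0.375
    large_error = [0.0, 1.0, -0.8]    # illustrative: max diff about 1.2
    for atol in (3.0, 2.0, 0.5):
        print(
            f"atol={atol}: small error passes={passes_atol(small_error, expected, atol)}, "
            f"large error passes={passes_atol(large_error, expected, atol)}"
        )
```

With these illustrative inputs, an error of about 1.2 is accepted at atol 3.0 and 2.0 but rejected at 0.5, while the 0.375 error passes at all three.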