[CUDA] Enable XQA by default for FP16/BF16 GQA by namgyu-youn · Pull Request #29046 · microsoft/onnxruntime

namgyu-youn · 2026-06-14T10:38:28Z

Description

Enable XQA by default for non-quantized FP16/BF16 GQA. XQA is a fused decode kernel (seq_len=1) derived from TensorRT-LLM that also fuses RoPE and KV-append into a single pass. It requires SM80+, a shared KV buffer, and no softcap.
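The conditions above can be read as a single eligibility predicate. The following is a minimal sketch in Python, not the dispatch code in this PR (which lives in the CUDA provider). `DecodeConfig`, its field names, and `xqa_eligible` are hypothetical, and treating `softcap == 0.0` as "softcap disabled" is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeConfig:
    """Hypothetical description of one GQA call; not an ONNX Runtime type."""
    sm_version: int         # e.g. 80 for SM80
    dtype: str              # "fp16", "bf16", or a quantized type name
    query_seq_len: int      # 1 during decode, >1 during prompt/prefill
    shares_kv_buffer: bool  # past and present KV live in the same buffer
    softcap: float          # assumption: 0.0 means softcap is disabled


def xqa_eligible(cfg: DecodeConfig) -> bool:
    """Return True only when every condition from the description holds."""
    return (
        cfg.sm_version >= 80
        and cfg.dtype in ("fp16", "bf16")  # non-quantized FP16/BF16 only
        and cfg.query_seq_len == 1         # decode step, not prefill
        and cfg.shares_kv_buffer
        and cfg.softcap == 0.0
    )
```

In this sketch a prefill call (`query_seq_len > 1`) is never eligible, which matches the claim that the prompt path is unaffected.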

Also add group_size=5 XQA support (e.g. Qwen3-14B: 40 Q-heads / 8 KV-heads) for the Qwen series.

Affected models

Model	group_size	Before	After
Qwen3-8B	4	FlashDecode (opt-in XQA)	XQA default
Qwen3-14B	5	No XQA	XQA default
Qwen2-72B	8	FlashDecode (opt-in XQA)	XQA default
Qwen2-7B	7	No XQA	No XQA (unsupported tile width)
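For reference, group_size in the table is the number of Q-heads divided by the number of KV-heads. The sketch below reproduces the table's last column. Only the Qwen3-14B head counts (40/8) come from this PR; the other head counts are assumed from the public model configs. `XQA_GROUP_SIZES_IN_TABLE` is a hypothetical name that lists only the group sizes shown above; the kernel may accept others.

```python
# Group sizes the table above shows on the XQA path after this PR.
# Assumption: taken from the table only, not from the kernel source.
XQA_GROUP_SIZES_IN_TABLE = {4, 5, 8}


def group_size(num_q_heads: int, num_kv_heads: int) -> int:
    """Number of query heads that share one KV head."""
    if num_kv_heads <= 0 or num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a positive multiple of num_kv_heads")
    return num_q_heads // num_kv_heads


# (Q-heads, KV-heads); only the Qwen3-14B pair is stated in this PR.
MODELS = {
    "Qwen3-8B": (32, 8),
    "Qwen3-14B": (40, 8),
    "Qwen2-72B": (64, 8),
    "Qwen2-7B": (28, 4),
}

for name, (q_heads, kv_heads) in MODELS.items():
    g = group_size(q_heads, kv_heads)
    path = "XQA" if g in XQA_GROUP_SIZES_IN_TABLE else "no XQA"
    print(f"{name}: group_size={g} -> {path}")
```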

Performance Result

Repro:

LD_PRELOAD=/usr/local/cuda-12.8/lib64/libcudart.so.12 \
  LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:/workspace/onnxruntime/.venv/lib/python3.12/site-packages/nvidia/cudnn/lib \
  PYTHONPATH=build/Release:onnxruntime/test/python/transformers \
  python onnxruntime/test/python/transformers/benchmark_gqa.py

Benchmark: Qwen3-14B (40 Q-heads / 8 KV-heads, head_dim=128, fp16, batch=1, SM80, CUDA 12.8)

past_seq_len	XQA off	XQA on	Speedup
256	0.124 ms	0.092 ms	1.30×
512	0.122 ms	0.089 ms	1.33×
1024	0.133 ms	0.094 ms	1.42×
2048	0.136 ms	0.095 ms	1.43×
4096	0.152 ms	0.103 ms	1.42×
8191	0.182 ms	0.116 ms	1.57×

Speedup grows with context length (1.30× → 1.57×). No regression in the prompt/prefill path.

Testing

pytest onnxruntime/test/python/transformers/test_gqa.py -v
ORT_ENABLE_XQA=0 pytest onnxruntime/test/python/transformers/test_gqa.py -v  # opt-out smoke test
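The second command relies on `ORT_ENABLE_XQA` now being a default-on flag: leaving it unset keeps XQA enabled, and `0` opts out. A minimal Python sketch of that behaviour follows. The real check is in the C++ CUDA provider and may parse the variable differently; `xqa_enabled` is a hypothetical helper.

```python
import os


def xqa_enabled(environ=None) -> bool:
    """Default-on flag: in this sketch only an explicit "0" disables XQA."""
    env = os.environ if environ is None else environ
    # An unset variable falls back to "1", i.e. XQA stays enabled.
    return env.get("ORT_ENABLE_XQA", "1").strip() != "0"


print(xqa_enabled({}))                       # variable unset
print(xqa_enabled({"ORT_ENABLE_XQA": "0"}))  # explicit opt-out
```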

Signed-off-by: namgyu-youn <namgyu.dev@gmail.com>

namgyu-youn · 2026-06-14T10:39:41Z

@microsoft-github-policy-service agree

namgyu-youn · 2026-06-18T05:22:47Z

Update via 1959cec (this PR): XQA no longer requires checking whether the input is quantized, so I dropped the dead code (is_quantized). CI should be green now.

namgyu-youn added 2 commits June 14, 2026 19:18

enable xqa for fp16/bf16 gqa

4ffcc31

Signed-off-by: namgyu-youn <namgyu.dev@gmail.com>

fix patch

a9164da

Signed-off-by: namgyu-youn <namgyu.dev@gmail.com>

fix

1959cec

namgyu-youn wants to merge 3 commits into microsoft:main from namgyu-youn:xqa-perf