[CUDA] Enable XQA by default for FP16/BF16 GQA #29046

Open
namgyu-youn wants to merge 3 commits into microsoft:main from namgyu-youn:xqa-perf


@namgyu-youn namgyu-youn commented Jun 14, 2026


Description

Enable XQA by default for non-quantized FP16/BF16 GQA. XQA is a TensorRT-LLM-derived fused decode kernel (seq_len=1) that also fuses RoPE and KV-append into a single pass. It requires SM80+, a shared KV buffer, and no softcap.
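The conditions above can be read as a single eligibility predicate. The following is a minimal Python sketch of that reading, not the dispatch code in this PR: `DecodeConfig`, its field names, and `xqa_eligible` are hypothetical, and treating `softcap == 0.0` as "softcap disabled" is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeConfig:
    """Hypothetical summary of one GQA call; field names are illustrative only."""
    dtype: str              # "fp16", "bf16", "fp32", ...
    q_seq_len: int          # number of new query tokens in this call
    sm_version: int         # CUDA compute capability, e.g. 80 for SM80
    shared_kv_buffer: bool  # past and present KV live in one shared buffer
    softcap: float          # assumption: 0.0 means softcap is disabled


def xqa_eligible(cfg: DecodeConfig) -> bool:
    """Return True when every condition listed in the description holds."""
    return (
        cfg.dtype in ("fp16", "bf16")   # non-quantized FP16/BF16 only
        and cfg.q_seq_len == 1          # decode step, not prompt/prefill
        and cfg.sm_version >= 80        # SM80+
        and cfg.shared_kv_buffer        # shared KV buffer
        and cfg.softcap == 0.0          # no softcap
    )


decode = DecodeConfig("fp16", 1, 80, True, 0.0)
prefill = DecodeConfig("fp16", 512, 80, True, 0.0)
print(xqa_eligible(decode))   # True
print(xqa_eligible(prefill))  # False
```

Under this reading, a prefill call (more than one query token) falls through to the existing path, which matches the claim that only decode is affected.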

Also add group_size=5 XQA support (e.g. Qwen3-14B: 40 Q-heads / 8 KV-heads) for the Qwen series.

Affected models

| Model | group_size | Before | After |
| --- | --- | --- | --- |
| Qwen3-8B | 4 | FlashDecode (opt-in XQA) | XQA default |
| Qwen3-14B | 5 | No XQA | XQA default |
| Qwen2-72B | 8 | FlashDecode (opt-in XQA) | XQA default |
| Qwen2-7B | 7 | No XQA | No XQA (unsupported tile width) |
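For reference, group_size is the number of query heads per KV head. The sketch below recomputes the table's group_size column in Python. Only the Qwen3-14B head counts (40/8) come from this PR; the other head counts are assumptions taken from the public model configs, and `XQA_GROUP_SIZES` lists only the sizes this table shows as supported, so the kernel's real set may be larger.

```python
# (num_q_heads, num_kv_heads); only the Qwen3-14B entry is stated in this PR,
# the rest are assumptions based on the public model configs.
MODELS = {
    "Qwen3-8B": (32, 8),
    "Qwen3-14B": (40, 8),
    "Qwen2-72B": (64, 8),
    "Qwen2-7B": (28, 4),
}

# Hypothetical: just the group sizes the table marks as "XQA default".
XQA_GROUP_SIZES = frozenset({4, 5, 8})


def group_size(num_q_heads: int, num_kv_heads: int) -> int:
    """Query heads per KV head; GQA requires an exact division."""
    if num_kv_heads <= 0 or num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a positive multiple of num_kv_heads")
    return num_q_heads // num_kv_heads


for name, (q_heads, kv_heads) in MODELS.items():
    g = group_size(q_heads, kv_heads)
    status = "XQA" if g in XQA_GROUP_SIZES else "no XQA"
    print(f"{name}: group_size={g} -> {status}")
```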

Performance Result

repro:

LD_PRELOAD=/usr/local/cuda-12.8/lib64/libcudart.so.12 \
  LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:/workspace/onnxruntime/.venv/lib/python3.12/site-packages/nvidia/cudnn/lib \
  PYTHONPATH=build/Release:onnxruntime/test/python/transformers \
  python onnxruntime/test/python/transformers/benchmark_gqa.py

Benchmark: Qwen3-14B (40 Q / 8 KV heads, head_dim=128, fp16, batch=1, SM80, CUDA 12.8)

| past_seq_len | XQA off | XQA on | Speedup |
| --- | --- | --- | --- |
| 256 | 0.124 ms | 0.092 ms | 1.30× |
| 512 | 0.122 ms | 0.089 ms | 1.33× |
| 1024 | 0.133 ms | 0.094 ms | 1.42× |
| 2048 | 0.136 ms | 0.095 ms | 1.43× |
| 4096 | 0.152 ms | 0.103 ms | 1.42× |
| 8191 | 0.182 ms | 0.116 ms | 1.57× |

Speedup grows with context length (1.30× → 1.57×). No regression in the prompt/prefill path.

Testing

  • pytest onnxruntime/test/python/transformers/test_gqa.py -v
  • ORT_ENABLE_XQA=0 pytest onnxruntime/test/python/transformers/test_gqa.py -v (opt-out smoke test)
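The opt-out test relies on `ORT_ENABLE_XQA=0` turning the new default off. A minimal Python sketch of default-on flag semantics follows. It is illustrative only: the real flag is read inside ONNX Runtime's native code, `xqa_enabled` is a hypothetical helper, and how values other than `0` are treated is an assumption.

```python
import os
from typing import Mapping, Optional


def xqa_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Default-on switch: XQA stays enabled unless ORT_ENABLE_XQA is exactly "0".

    Assumption: any value other than "0" (including unset) leaves XQA on.
    """
    env = os.environ if environ is None else environ
    return env.get("ORT_ENABLE_XQA", "1").strip() != "0"


print(xqa_enabled({}))                       # True  (unset: new default)
print(xqa_enabled({"ORT_ENABLE_XQA": "0"}))  # False (explicit opt-out)
print(xqa_enabled({"ORT_ENABLE_XQA": "1"}))  # True
```

Passing the environment as a mapping keeps the helper testable without mutating `os.environ`.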

Signed-off-by: namgyu-youn <namgyu.dev@gmail.com>
Signed-off-by: namgyu-youn <namgyu.dev@gmail.com>
@namgyu-youn


@microsoft-github-policy-service agree

namgyu-youn commented Jun 18, 2026


Update via 1959cec (this PR): XQA no longer requires checking whether it's quantized, so I dropped the dead code (is_quantized). CI should be green now.
