Fix CPU GQA NaN output for right-padded batched prompts with rotary embeddings by Copilot · Pull Request #29069 · microsoft/onnxruntime

Copilot · 2026-06-16T07:20:38Z

Description

Fixes NaN output in the CPU GQA kernel when running batched right-padded prefill. For padding token positions where seq_causal_length > total_seqlen, the softmax loop was reading beyond the GEMM-filled region of the attention probs buffer into uninitialized memory, producing NaN values that propagated through the V GEMM to the output.

Root cause: In ComputeAttentionProbs, seq_causal_length = causal_past_seqlen + seq + 1 grows with each query position. For right-padded batches, a batch item with real_len < sequence_length has total_seqlen = real_len, but padding positions still iterate up to sequence_length, giving seq_causal_length > total_seqlen. The QK GEMM only fills columns [0, total_seqlen) — positions beyond that are uninitialized.
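The arithmetic behind the root cause can be sketched as follows. This is only an illustration: `OverrunningPositions` is a hypothetical helper, not a function in onnxruntime, and the usage below assumes `causal_past_seqlen` is 0 for a first-prompt prefill.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of onnxruntime): returns the query positions
// whose causal length exceeds the number of columns the QK GEMM filled.
std::vector<std::size_t> OverrunningPositions(std::size_t causal_past_seqlen,
                                              std::size_t sequence_length,
                                              std::size_t total_seqlen) {
  assert(total_seqlen <= causal_past_seqlen + sequence_length);
  std::vector<std::size_t> overrun;
  for (std::size_t seq = 0; seq < sequence_length; ++seq) {
    // Same formula as quoted in the root-cause description.
    const std::size_t seq_causal_length = causal_past_seqlen + seq + 1;
    if (seq_causal_length > total_seqlen) {
      overrun.push_back(seq);  // the softmax window would extend past the GEMM output
    }
  }
  return overrun;
}
```

Under those assumptions, `OverrunningPositions(0, 6, 2)` yields positions 2 through 5, which are the padding positions of a batch item with real_len=2 padded to sequence_length=6. Whether a given overrun actually produces NaN depends on what the uninitialized memory happens to contain.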

Fix: Cap the effective causal length at total_seqlen before computing the softmax window:

// gqa_attention_base.h - both float and quantized paths
const size_t effective_causal_length = std::min(seq_causal_length, total_seqlen);
// use effective_causal_length for: local window check, start_offset, window_size, masking loops

Applied to both the non-quantized float path (~line 1097) and the quantized MLAS path (~line 436).
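A minimal sketch of the effect of the cap, under simplifying assumptions: one row of the attention probs buffer is modeled as a `std::vector<float>`, the local-window logic is omitted, and zeroing the tail of the row is this sketch's stand-in for the masking loops. `SoftmaxPrefixInPlace` and `SoftmaxRowCapped` are hypothetical names, not the kernel's actual functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one softmax row: normalizes row[0, window)
// and zeroes everything from window to the end of the row.
void SoftmaxPrefixInPlace(std::vector<float>& row, std::size_t window) {
  assert(window > 0 && window <= row.size());
  const auto end = row.begin() + static_cast<std::ptrdiff_t>(window);
  const float max_val = *std::max_element(row.begin(), end);
  float sum = 0.0f;
  for (std::size_t i = 0; i < window; ++i) {
    row[i] = std::exp(row[i] - max_val);  // subtract the max for numerical stability
    sum += row[i];
  }
  for (std::size_t i = 0; i < window; ++i) {
    row[i] /= sum;
  }
  for (std::size_t i = window; i < row.size(); ++i) {
    row[i] = 0.0f;  // masked-out columns
  }
}

// Applies the cap from the fix before choosing the softmax window.
std::vector<float> SoftmaxRowCapped(std::vector<float> row,
                                    std::size_t seq_causal_length,
                                    std::size_t total_seqlen) {
  const std::size_t effective_causal_length = std::min(seq_causal_length, total_seqlen);
  SoftmaxPrefixInPlace(row, effective_causal_length);
  return row;
}
```

If the columns at and beyond `total_seqlen` hold NaN (simulating uninitialized memory), the capped version never reads them, while calling `SoftmaxPrefixInPlace` directly with the uncapped `seq_causal_length` pulls a NaN into the sum and contaminates every entry of the row.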

Motivation and Context

The new test GroupQueryAttentionTest.BatchedRightPaddedRotaryPrefill_CPU (added in this PR) exercises batched GQA with heterogeneous real sequence lengths {4, 2, 6} padded to sequence_length=6. Batch item 1 (real_len=2) has padding tokens at positions 2–5; position 3 triggered the NaN via uninitialized attention probs memory.
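To make the padded layout concrete, the sketch below picks out the row of the last real token for one batch item. It assumes the output is a flat row-major buffer shaped [batch, sequence_length, hidden_size]; `RealLastTokenRow` is a hypothetical helper and is not taken from the test file.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copies the hidden_size values of the last real
// (non-padding) token of one batch item out of a padded output buffer.
// Assumes row-major layout [batch, sequence_length, hidden_size].
std::vector<float> RealLastTokenRow(const std::vector<float>& output,
                                    std::size_t batch_index,
                                    std::size_t real_len,
                                    std::size_t sequence_length,
                                    std::size_t hidden_size) {
  assert(real_len >= 1 && real_len <= sequence_length);
  const std::size_t offset =
      (batch_index * sequence_length + (real_len - 1)) * hidden_size;
  assert(offset + hidden_size <= output.size());
  const auto first = output.begin() + static_cast<std::ptrdiff_t>(offset);
  return std::vector<float>(first, first + static_cast<std::ptrdiff_t>(hidden_size));
}
```

For real lengths {4, 2, 6} padded to sequence_length=6, batch item 1 would be read at token index 1; token indices 2 through 5 of that item are padding and are not compared.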

qjia7 · 2026-06-16T13:59:41Z

+    GTEST_SKIP() << "WebGPU EP not available";
+  }
+  RunBatchedRightPaddedRotaryPrefillForEP(/*use_cuda=*/false, /*use_webgpu=*/true);
+}


@copilot Please remove BatchedRightPaddedRotaryPrefill_WebGPU test since I will fix it in PR #29002

Done: removed the BatchedRightPaddedRotaryPrefill_WebGPU test and cleaned up the now-unused use_webgpu parameter from the helper functions in commit "Remove BatchedRightPaddedRotaryPrefill_WebGPU test per review feedback" (9c692b3).

Copilot

Pull request overview

This PR fixes a CPU GroupQueryAttention (GQA) correctness bug for batched right-padded prefill with rotary embeddings, where the softmax step could read beyond the QK GEMM-initialized region and propagate NaNs into the output. The fix caps the causal length used to compute the softmax window to the per-batch total_seqlen, and adds a regression test that validates per-batch consistency versus single-prompt execution.

Changes:

Cap per-token causal length at total_seqlen in both the float and quantized/MLAS softmax+masking paths to prevent out-of-bounds/uninitialized reads.
Add a regression test for batched right-padded packed-QKV rotary prefill, comparing each batch item’s real-last-token output to a batch=1 reference (CPU and CUDA EPs).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `onnxruntime/contrib_ops/cpu/bert/gqa_attention_base.h` | Caps the effective causal length used by the softmax/masking window to stay within the GEMM-written `[0, total_seqlen)` region (float + quantized paths). |
| `onnxruntime/test/contrib_ops/group_query_attention_op_test.cc` | Adds a right-padded batched rotary prefill regression test that compares each prompt’s real-last-token output to a single-prompt reference. |

+  // Each batch's real-last-token output (used to predict next token) must match
+  // its single-prompt reference. The tolerance is loose enough for fp16 rounding
+  // while still catching the underflow bug (which produces values that differ
+  // by orders of magnitude or are NaN/Inf).
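A comparison of the kind this test comment describes could look roughly like the sketch below. The helper name `AllCloseAndFinite` and the absolute-tolerance form are assumptions for illustration; the actual test may use a different checker and tolerance.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical checker: true only if every value is finite and within
// abs_tol of the reference. The explicit finiteness check matters because
// (NaN - x) > abs_tol evaluates to false, so NaN would otherwise slip through.
bool AllCloseAndFinite(const std::vector<float>& actual,
                       const std::vector<float>& expected,
                       float abs_tol) {
  if (actual.size() != expected.size()) {
    return false;
  }
  for (std::size_t i = 0; i < actual.size(); ++i) {
    if (!std::isfinite(actual[i])) {
      return false;  // NaN or Inf in the batched output
    }
    if (std::fabs(actual[i] - expected[i]) > abs_tol) {
      return false;  // differs from the single-prompt reference
    }
  }
  return true;
}
```

A small rounding difference passes, while a NaN entry or a value that is off by orders of magnitude fails regardless of how loose the tolerance is within reason.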


Initial plan

ae30cf3

Copilot AI assigned Copilot and qjia7 Jun 16, 2026

Copilot started work on behalf of qjia7 June 16, 2026 07:20

Copilot AI added 2 commits June 16, 2026 07:45

Initial plan: Fix CPU GQA NaN for right-padded batched prefill

48e7455

Fix CPU GQA NaN softmax bug for right-padded batched prompts

17ceadc

Copilot AI changed the title ~~[WIP] Fix failing GitHub Actions job for wasm_Debug~~ Fix CPU GQA NaN output for right-padded batched prompts with rotary embeddings Jun 16, 2026

Copilot finished work on behalf of qjia7 June 16, 2026 08:18

Copilot AI requested a review from qjia7 June 16, 2026 08:18

qjia7 reviewed Jun 16, 2026

Copilot started work on behalf of qjia7 June 16, 2026 14:00

Remove BatchedRightPaddedRotaryPrefill_WebGPU test per review feedback

9c692b3

Copilot AI requested a review from qjia7 June 16, 2026 14:03

Copilot finished work on behalf of qjia7 June 16, 2026 14:03

yuslepukhin requested a review from Copilot June 17, 2026 20:30

Copilot started reviewing on behalf of yuslepukhin June 17, 2026 20:31

Copilot AI reviewed Jun 17, 2026

Copilot wants to merge 4 commits into main from copilot/fix-wasm-debug-build-job

3 participants