Add 2-bit quantization support to WebGPU GatherBlockQuantized operator#29074

Open
Shivani767 wants to merge 2 commits into
microsoft:mainfrom
Shivani767:webgpu-gather-2bit-support
@Shivani767 Shivani767 commented Jun 16, 2026

Description

Adds 2‑bit quantization support to two WebGPU operators:

  • GatherBlockQuantized: Handles signed 2‑bit quantized data (2's complement, range [-2, 1]) and 2‑bit zero points (including when the packed zero point dimension isn't a multiple of 4). Follows the same patterns as the existing CPU implementation and includes WebGPU‑specific test coverage.
  • QMoE: Extends the constructor to accept expert_weight_bits_ == 2.
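The unpacking and sign handling described for GatherBlockQuantized can be illustrated with a small host-side sketch. This is not the PR's shader code: the helper names (`ExtractPacked2Bit`, `SignExtend2Bit`, `PackedZeroPointRowBytes`) are hypothetical, and the bit order (four values per byte, element 0 in the lowest two bits) is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper. Assumption: four 2-bit values per byte,
// with element 0 stored in the lowest two bits of the byte.
inline uint32_t ExtractPacked2Bit(const std::vector<uint8_t>& packed, size_t index) {
  assert(index / 4 < packed.size());
  const uint8_t byte = packed[index / 4];
  const unsigned shift = static_cast<unsigned>(index % 4) * 2;
  return (byte >> shift) & 0x3u;
}

// Interprets a raw 2-bit field as two's complement.
// Raw 0, 1, 2, 3 map to 0, 1, -2, -1, giving the range [-2, 1].
inline int32_t SignExtend2Bit(uint32_t raw) {
  int32_t value = static_cast<int32_t>(raw & 0x3u);
  if ((value & 0x2) != 0) {
    value -= 4;
  }
  return value;
}

// Bytes needed to hold `count` packed 2-bit zero points in one row.
// Rounds up, so a count that is not a multiple of 4 still gets a
// final, partially used byte.
inline size_t PackedZeroPointRowBytes(size_t count) {
  return (count + 3) / 4;
}
```

Under these assumptions, a row of 5 zero points occupies 2 bytes, and the last byte carries only one meaningful 2-bit field.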

Motivation and Context

Resolves #28895. INT2 quantization is an active research and industry topic for LLM serving, and this change enables 2‑bit quantized weights with the WebGPU GatherBlockQuantized and QMoE operators.

@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Jun 16, 2026
@guschmue guschmue requested a review from Copilot June 16, 2026 16:56

Copilot AI left a comment

Pull request overview

This PR extends WebGPU-side quantized operator support to include 2-bit weights, primarily by updating the WebGPU GatherBlockQuantized shader generation and relaxing QMoE’s constructor validation to accept 2-bit expert weights.

Changes:

  • Add 2-bit extraction and (attempted) signed handling branches in GatherBlockQuantized WGSL generation (including 2-bit zero-point sign handling).
  • Allow expert_weight_bits == 2 in the WebGPU QMoE kernel constructor.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| onnxruntime/contrib_ops/webgpu/quantization/gather_block_quantized.cc | Adds 2-bit-specific shader logic for reading packed 2-bit values and adjusts signed zero-point handling logic. |
| onnxruntime/contrib_ops/webgpu/moe/qmoe.h | Expands accepted expert_weight_bits values to include 2-bit in the constructor validation. |

Comment on lines +77 to +80
    if (is_signed_) {
      shader.MainFunctionBody()
          << " if((quantized_data & 0x2) != 0) { quantized_data = quantized_data - 4 ;};\n";
    }
Comment on lines +156 to +162
    if (is_2bit) {
      shader.MainFunctionBody()
          << " if((zero_point & 0x2) != 0) { zero_point = zero_point - 4 ;};\n";
    } else if (is_4bit) {
      shader.MainFunctionBody()
          << " if((zero_point & 0x8) != 0) { zero_point = zero_point - 16 ;};\n";
    }
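Both branches in the excerpt above follow one pattern: test the top bit of the field and, if it is set, subtract 2^bits. A minimal generic sketch of that pattern, using a hypothetical `SignExtend` helper that is not part of the PR, might look like this:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: two's-complement sign extension of a `bits`-wide field.
// bits == 2 tests 0x2 and subtracts 4; bits == 4 tests 0x8 and subtracts 16,
// matching the constants in the excerpt above.
inline int32_t SignExtend(uint32_t raw, unsigned bits) {
  assert(bits >= 1 && bits <= 16);  // kept small to stay clear of overflow
  const uint32_t sign_bit = 1u << (bits - 1);
  const uint32_t mask = (1u << bits) - 1u;
  int32_t value = static_cast<int32_t>(raw & mask);
  if ((raw & sign_bit) != 0) {
    value -= static_cast<int32_t>(1u << bits);
  }
  return value;
}
```

Writing it this way makes it easy to check that the 2-bit constants (0x2, 4) and the 4-bit constants (0x8, 16) are instances of the same rule.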
Comment on lines +23 to +24
    ORT_ENFORCE(expert_weight_bits_ == 8 || expert_weight_bits_ == 4 || expert_weight_bits_ == 2,
                "expert_weight_bits must be 2, 4, or 8, but got ", expert_weight_bits_);
Development

Successfully merging this pull request may close these issues.

[WebGPU] Support 2-bit quantization in GatherBlockQuantized