
[Web] Pre-allocate TypedArray views for pod args in WebGPU dispatch#18961

Open
gnguralnick wants to merge 3 commits into apache:main from gnguralnick:webgpu-pod-args-prealloc

Conversation

@gnguralnick
Contributor

Summary

  • Hoists Int32Array/Uint32Array/Float32Array allocation out of the per-dispatch submitShader closure into the per-shader createShaderFunc scope, eliminating 3 typed array allocations + 1 ArrayBuffer per GPU kernel dispatch.
  • podArgIndices.length is fixed per shader, so the cached views have the correct size for every invocation. Every slot 0..podArgIndices.length is unconditionally written before writeBuffer copies the data out, so no stale values can leak between dispatches.
  • Builds on top of the batched dispatch architecture from #18871 (Batched GPU dispatch and object caching for WebGPU runtime) — the uniform buffer pool already gives each dispatch its own GPU-side buffer, so reusing the CPU-side staging array is safe.
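The hoisting described above can be sketched as follows. This is a minimal illustration, not the actual TVM webgpu.ts code: createShaderFunc, submitShader, and podArgIndices are names taken from the PR description, and the argument handling is simplified.

```typescript
// Sketch: allocate the staging buffer and views once per shader, reuse per dispatch.
function createShaderFunc(podArgIndices: number[]) {
  // Allocated once per shader: one backing ArrayBuffer plus three views,
  // sized for the pod args plus the trailing packDimX slot.
  const numPodSlots = podArgIndices.length + 1;
  const podArgsBuffer = new ArrayBuffer(numPodSlots * Int32Array.BYTES_PER_ELEMENT);
  const i32View = new Int32Array(podArgsBuffer);
  const u32View = new Uint32Array(podArgsBuffer);
  const f32View = new Float32Array(podArgsBuffer);

  return function submitShader(
    values: number[],
    dtypes: string[],
    packDimX: number
  ): ArrayBuffer {
    // Every slot 0..podArgIndices.length is written on each dispatch,
    // so stale values from a previous dispatch cannot leak through.
    for (let i = 0; i < podArgIndices.length; ++i) {
      const dtype = dtypes[i];
      if (dtype.startsWith("uint")) {
        u32View[i] = values[i];
      } else if (dtype.startsWith("int")) {
        i32View[i] = values[i];
      } else if (dtype.startsWith("float")) {
        f32View[i] = values[i];
      }
    }
    // Last slot carries the packed grid size.
    u32View[podArgIndices.length] = packDimX;
    // Handed to device.queue.writeBuffer, which copies the bytes out
    // synchronously, so the buffer can be reused by the next dispatch.
    return podArgsBuffer;
  };
}
```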

Motivation

In workloads with many small dispatches (e.g. LLM token generation), the per-dispatch typed array allocations become a measurable source of GC pressure. Pre-allocating and reusing the views avoids this overhead.

Test plan

  • Verify npm run lint passes in web/
  • Run WebGPU model inference (e.g. via MLC-LLM web demo) and confirm correct output
  • Profile dispatch-heavy workload to confirm reduced allocation rate

Hoist Int32Array/Uint32Array/Float32Array allocation out of the
per-dispatch submitShader closure into the per-shader scope. Since
podArgIndices.length is fixed for each shader, the views can be
safely reused: every slot (0..podArgIndices.length) is written on
each dispatch before writeBuffer copies the data, so no stale
values can leak between invocations.

This avoids 3 heap allocations + 1 ArrayBuffer per GPU kernel
dispatch, which adds up in workloads with many small dispatches
(e.g. LLM token generation).

@gemini-code-assist (bot) left a comment


Code Review

This pull request optimizes WebGPU shader dispatches by pre-allocating and reusing typed array views for POD arguments, effectively reducing per-dispatch memory allocation overhead. The review feedback suggests further refining this by using Int32Array.BYTES_PER_ELEMENT instead of magic numbers and pre-calculating argument types to avoid string comparison overhead within the hot dispatch loop.

Comment on lines +774 to +778
      i32ViewCached[i] = value;
    } else if (dtype.startsWith("uint")) {
    -  u32View[i] = value;
    +  u32ViewCached[i] = value;
    } else if (dtype.startsWith("float")) {
    -  f32View[i] = value;
    +  f32ViewCached[i] = value;


medium

The dtype.startsWith string operations are executed for every POD argument on every dispatch. Since the argument types are fixed for each shader, consider pre-calculating an array of type indicators (e.g., an enum or numeric constants) in the createShadeInternal scope. This would allow replacing the string operations with a faster numeric check in the submitShader loop, which is beneficial for workloads with many small dispatches.
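The bot's suggestion, classifying each dtype once per shader and branching on a numeric tag per dispatch, could look roughly like this. All names below are hypothetical, not the actual TVM identifiers:

```typescript
// Numeric tags computed once per shader, replacing string checks per dispatch.
const POD_INT = 0;
const POD_UINT = 1;
const POD_FLOAT = 2;

// Runs once, in the per-shader scope.
function classifyDtypes(dtypes: string[]): number[] {
  return dtypes.map((dtype) => {
    if (dtype.startsWith("uint")) return POD_UINT;
    if (dtype.startsWith("int")) return POD_INT;
    if (dtype.startsWith("float")) return POD_FLOAT;
    throw new Error(`unsupported pod dtype: ${dtype}`);
  });
}

// Runs per dispatch: the hot loop switches on a number instead of strings.
function writePodArgs(
  kinds: number[],
  values: number[],
  i32View: Int32Array,
  u32View: Uint32Array,
  f32View: Float32Array
): void {
  for (let i = 0; i < kinds.length; ++i) {
    switch (kinds[i]) {
      case POD_INT: i32View[i] = values[i]; break;
      case POD_UINT: u32View[i] = values[i]; break;
      case POD_FLOAT: f32View[i] = values[i]; break;
    }
  }
}
```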

Contributor Author

@gnguralnick Apr 1, 2026


This seems like a good suggestion, though I'm not sure it needs to be in scope for this PR.


@JiwaniZakir JiwaniZakir left a comment


The optimization is sound since device.queue.writeBuffer synchronously copies the ArrayBuffer contents into the GPU command queue, so reusing podArgsArrayBuffer across submitShader invocations is safe in JavaScript's single-threaded model.

The variable name maxPodArgs on line 700 is misleading — it's not a maximum but an exact count. podArgCount or numPodArgs would be more accurate and consistent with the rest of the codebase's naming style.

The removal of the comment // always pass in dim z launching grid size in (previously above the u32View[podArgIndices.length] = packDimX line) is a minor regression in documentation; the assignment of packDimX at the last index is non-obvious and worth explaining, especially since it sits outside the for loop iterating over podArgIndices.

One edge case worth verifying: if podArgIndices.length === 0 (a kernel with no pod args, only buffer args), podArgBytes will be 1 * 4 = 4 bytes, and getUniformFromPool will be called with that size. Confirm that getUniformFromPool and the WebGPU uniform buffer binding handle a 4-byte buffer correctly (some implementations have minimum binding size constraints), though this likely already worked before the change since the size calculation is equivalent.
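The zero-pod-arg edge case above can be checked arithmetically. A minimal sketch, assuming the size calculation described in the review (podArgBytes and the slot layout are taken from the review text; the helper function itself is hypothetical):

```typescript
// With no pod args, only the trailing packDimX slot remains, so the staging
// buffer is a single 4-byte uniform, matching the pre-change behavior.
function podArgBytes(numPodArgs: number): number {
  const numPodSlots = numPodArgs + 1; // pod args plus the packDimX slot
  return numPodSlots * Int32Array.BYTES_PER_ELEMENT;
}
```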

Rename maxPodArgs to numPodSlots for clarity (it's a count, not a
maximum) and restore an explanatory comment for the packDimX uniform
slot assignment.
@gnguralnick
Contributor Author

Addressed:

  1. maxPodArgs naming: makes sense; renamed to numPodSlots to better convey that it's a count of uniform buffer slots, not a maximum
  2. packDimX comment: agreed, restored an explanatory comment
  3. podArgIndices.length === 0 edge case: my updated calculation is equivalent to the original one here, so there's no new risk introduced. The 4-byte uniform buffer would have been the behavior before this change as well

@gnguralnick gnguralnick requested a review from JiwaniZakir April 2, 2026 17:54
