
[Web] Pre-allocate TypedArray views for pod args in WebGPU dispatch#18961

Open
gnguralnick wants to merge 3 commits into apache:main from gnguralnick:webgpu-pod-args-prealloc

Conversation

@gnguralnick
Contributor

Summary

  • Hoists Int32Array/Uint32Array/Float32Array allocation out of the per-dispatch submitShader closure into the per-shader createShaderFunc scope, eliminating 3 typed array allocations + 1 ArrayBuffer per GPU kernel dispatch.
  • podArgIndices.length is fixed per shader, so the cached views have the correct size for every invocation. Every slot 0..podArgIndices.length is unconditionally written before writeBuffer copies the data out, so no stale values can leak between dispatches.
  • Builds on top of the batched dispatch architecture from #18871 (Batched GPU dispatch and object caching for WebGPU runtime) — the uniform buffer pool already gives each dispatch its own GPU-side buffer, so reusing the CPU-side staging array is safe.
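The hoisting described above can be sketched as follows. This is a minimal illustration, not the actual TVM webgpu.ts code: createShaderFunc, submitShader, and podArgIndices are names taken from the PR description, and the argument handling is simplified.

```typescript
// Sketch: allocate the staging buffer and views once per shader, reuse per dispatch.
function createShaderFunc(podArgIndices: number[]) {
  // Allocated once per shader: one backing ArrayBuffer plus three views,
  // sized for the pod args plus the trailing packDimX slot.
  const numPodSlots = podArgIndices.length + 1;
  const podArgsBuffer = new ArrayBuffer(numPodSlots * Int32Array.BYTES_PER_ELEMENT);
  const i32View = new Int32Array(podArgsBuffer);
  const u32View = new Uint32Array(podArgsBuffer);
  const f32View = new Float32Array(podArgsBuffer);

  return function submitShader(
    values: number[],
    dtypes: string[],
    packDimX: number
  ): ArrayBuffer {
    // Every slot 0..podArgIndices.length is written on each dispatch,
    // so stale values from a previous dispatch cannot leak through.
    for (let i = 0; i < podArgIndices.length; ++i) {
      const dtype = dtypes[i];
      if (dtype.startsWith("uint")) {
        u32View[i] = values[i];
      } else if (dtype.startsWith("int")) {
        i32View[i] = values[i];
      } else if (dtype.startsWith("float")) {
        f32View[i] = values[i];
      }
    }
    // Last slot carries the packed grid size.
    u32View[podArgIndices.length] = packDimX;
    // Handed to device.queue.writeBuffer, which copies the bytes out
    // synchronously, so the buffer can be reused by the next dispatch.
    return podArgsBuffer;
  };
}
```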

Motivation

In workloads with many small dispatches (e.g. LLM token generation), the per-dispatch typed array allocations become a measurable source of GC pressure. Pre-allocating and reusing the views avoids this overhead.

Test plan

  • Verify npm run lint passes in web/
  • Run WebGPU model inference (e.g. via MLC-LLM web demo) and confirm correct output
  • Profile dispatch-heavy workload to confirm reduced allocation rate

Hoist Int32Array/Uint32Array/Float32Array allocation out of the
per-dispatch submitShader closure into the per-shader scope. Since
podArgIndices.length is fixed for each shader, the views can be
safely reused: every slot (0..podArgIndices.length) is written on
each dispatch before writeBuffer copies the data, so no stale
values can leak between invocations.

This avoids 3 heap allocations + 1 ArrayBuffer per GPU kernel
dispatch, which adds up in workloads with many small dispatches
(e.g. LLM token generation).

@gemini-code-assist (bot) left a comment


Code Review

This pull request optimizes WebGPU shader dispatches by pre-allocating and reusing typed array views for POD arguments, effectively reducing per-dispatch memory allocation overhead. The review feedback suggests further refining this by using Int32Array.BYTES_PER_ELEMENT instead of magic numbers and pre-calculating argument types to avoid string comparison overhead within the hot dispatch loop.

Comment on lines +774 to +778
      i32ViewCached[i] = value;
    } else if (dtype.startsWith("uint")) {
    -  u32View[i] = value;
    +  u32ViewCached[i] = value;
    } else if (dtype.startsWith("float")) {
    -  f32View[i] = value;
    +  f32ViewCached[i] = value;


medium

The dtype.startsWith string operations are executed for every POD argument on every dispatch. Since the argument types are fixed for each shader, consider pre-calculating an array of type indicators (e.g., an enum or numeric constants) in the createShadeInternal scope. This would allow replacing the string operations with a faster numeric check in the submitShader loop, which is beneficial for workloads with many small dispatches.
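The bot's suggestion, classifying each dtype once per shader and branching on a numeric tag per dispatch, could look roughly like this. All names below are hypothetical, not the actual TVM identifiers:

```typescript
// Numeric tags computed once per shader, replacing string checks per dispatch.
const POD_INT = 0;
const POD_UINT = 1;
const POD_FLOAT = 2;

// Runs once, in the per-shader scope.
function classifyDtypes(dtypes: string[]): number[] {
  return dtypes.map((dtype) => {
    if (dtype.startsWith("uint")) return POD_UINT;
    if (dtype.startsWith("int")) return POD_INT;
    if (dtype.startsWith("float")) return POD_FLOAT;
    throw new Error(`unsupported pod dtype: ${dtype}`);
  });
}

// Runs per dispatch: the hot loop switches on a number instead of strings.
function writePodArgs(
  kinds: number[],
  values: number[],
  i32View: Int32Array,
  u32View: Uint32Array,
  f32View: Float32Array
): void {
  for (let i = 0; i < kinds.length; ++i) {
    switch (kinds[i]) {
      case POD_INT: i32View[i] = values[i]; break;
      case POD_UINT: u32View[i] = values[i]; break;
      case POD_FLOAT: f32View[i] = values[i]; break;
    }
  }
}
```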

Contributor Author

@gnguralnick Apr 1, 2026


This seems like a good suggestion, though I'm not sure it needs to be in scope for this PR.


@JiwaniZakir JiwaniZakir left a comment


The optimization is sound since device.queue.writeBuffer synchronously copies the ArrayBuffer contents into the GPU command queue, so reusing podArgsArrayBuffer across submitShader invocations is safe in JavaScript's single-threaded model.

The variable name maxPodArgs on line 700 is misleading — it's not a maximum but an exact count. podArgCount or numPodArgs would be more accurate and consistent with the rest of the codebase's naming style.

The removal of the comment // always pass in dim z launching grid size in (previously above the u32View[podArgIndices.length] = packDimX line) is a minor regression in documentation; the assignment of packDimX at the last index is non-obvious and worth explaining, especially since it sits outside the for loop iterating over podArgIndices.

One edge case worth verifying: if podArgIndices.length === 0 (a kernel with no pod args, only buffer args), podArgBytes will be 1 * 4 = 4 bytes, and getUniformFromPool will be called with that size. Confirm that getUniformFromPool and the WebGPU uniform buffer binding handle a 4-byte buffer correctly (some implementations have minimum binding size constraints), though this likely already worked before the change since the size calculation is equivalent.
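The zero-pod-arg edge case above can be checked arithmetically. A minimal sketch, assuming the size calculation described in the review (podArgBytes and the slot layout are taken from the review text; the helper function itself is hypothetical):

```typescript
// With no pod args, only the trailing packDimX slot remains, so the staging
// buffer is a single 4-byte uniform, matching the pre-change behavior.
function podArgBytes(numPodArgs: number): number {
  const numPodSlots = numPodArgs + 1; // pod args plus the packDimX slot
  return numPodSlots * Int32Array.BYTES_PER_ELEMENT;
}
```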

Rename maxPodArgs to numPodSlots for clarity (it's a count, not a
maximum) and restore an explanatory comment for the packDimX uniform
slot assignment.
@gnguralnick
Contributor Author

Addressed:

  1. maxPodArgs naming: makes sense; renamed to numPodSlots to better convey that it's a count of uniform buffer slots, not a maximum
  2. packDimX comment: agreed, restored an explanatory comment
  3. podArgIndices.length === 0 edge case: my updated calculation is equivalent to the original one here, so there's no new risk introduced. The 4-byte uniform buffer would have been the behavior before this change as well

@gnguralnick gnguralnick requested a review from JiwaniZakir April 2, 2026 17:54
