[Web] Pre-allocate TypedArray views for pod args in WebGPU dispatch#18961
gnguralnick wants to merge 3 commits into `apache:main`
Conversation
Hoist `Int32Array`/`Uint32Array`/`Float32Array` allocation out of the per-dispatch `submitShader` closure into the per-shader scope. Since `podArgIndices.length` is fixed for each shader, the views can be safely reused: every slot (`0..podArgIndices.length`) is written on each dispatch before `writeBuffer` copies the data, so no stale values can leak between invocations. This avoids 3 heap allocations + 1 `ArrayBuffer` per GPU kernel dispatch, which adds up in workloads with many small dispatches (e.g. LLM token generation).
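The hoisting can be sketched as follows. This is a minimal TypeScript sketch, not the actual TVM source: `createShaderFunc`, `podArgIndices`, and `submitShader` follow the PR's naming, but the body is simplified (the real code selects a view per argument dtype).

```typescript
// Per-shader setup: views are allocated ONCE here, not on every dispatch.
function createShaderFunc(podArgIndices: number[]) {
  // One extra slot is reserved for the packed grid dimension (packDimX).
  const numSlots = podArgIndices.length + 1;
  const podArgsArrayBuffer = new ArrayBuffer(
    numSlots * Int32Array.BYTES_PER_ELEMENT
  );
  // Three views over the same buffer; the real code picks one by dtype.
  const i32View = new Int32Array(podArgsArrayBuffer);
  const u32View = new Uint32Array(podArgsArrayBuffer);
  const f32View = new Float32Array(podArgsArrayBuffer);
  void f32View; // unused in this simplified int-only sketch

  // Per-dispatch closure: reuses the views instead of reallocating.
  return function submitShader(args: number[], packDimX: number): ArrayBuffer {
    // Every slot is overwritten on each dispatch, so reuse is safe:
    for (let i = 0; i < podArgIndices.length; ++i) {
      i32View[i] = args[podArgIndices[i]];
    }
    // Last slot carries the packed grid dimension.
    u32View[podArgIndices.length] = packDimX;
    // device.queue.writeBuffer copies this synchronously, so the views
    // can be reused on the next dispatch.
    return podArgsArrayBuffer;
  };
}
```

Because `writeBuffer` copies the buffer contents at call time, nothing retains a reference to the reused views between dispatches.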
Code Review
This pull request optimizes WebGPU shader dispatches by pre-allocating and reusing typed array views for POD arguments, effectively reducing per-dispatch memory allocation overhead. The review feedback suggests further refining this by using Int32Array.BYTES_PER_ELEMENT instead of magic numbers and pre-calculating argument types to avoid string comparison overhead within the hot dispatch loop.
```diff
-          i32View[i] = value;
+          i32ViewCached[i] = value;
        } else if (dtype.startsWith("uint")) {
-          u32View[i] = value;
+          u32ViewCached[i] = value;
        } else if (dtype.startsWith("float")) {
-          f32View[i] = value;
+          f32ViewCached[i] = value;
```
The dtype.startsWith string operations are executed for every POD argument on every dispatch. Since the argument types are fixed for each shader, consider pre-calculating an array of type indicators (e.g., an enum or numeric constants) in the createShadeInternal scope. This would allow replacing the string operations with a faster numeric check in the submitShader loop, which is beneficial for workloads with many small dispatches.
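The reviewer's suggestion could look roughly like this sketch (the constant names and `precomputePodTypes` helper are hypothetical, not part of the TVM codebase):

```typescript
// Hypothetical numeric type tags, computed once per shader.
const POD_INT = 0;
const POD_UINT = 1;
const POD_FLOAT = 2;

// Runs once in the per-shader scope, outside the dispatch hot path.
function precomputePodTypes(dtypes: string[]): number[] {
  return dtypes.map((dtype) => {
    // "uint" must be tested before "int" would match it? No:
    // "uint32".startsWith("int") is false, so order is not critical,
    // but testing uint first keeps intent explicit.
    if (dtype.startsWith("uint")) return POD_UINT;
    if (dtype.startsWith("int")) return POD_INT;
    if (dtype.startsWith("float")) return POD_FLOAT;
    throw new Error(`unsupported pod dtype: ${dtype}`);
  });
}

// In the submitShader loop the per-dispatch string test then becomes a
// numeric comparison, e.g.:
//   if (podTypes[i] === POD_INT) i32ViewCached[i] = value;
```

The dtype strings are classified once when the shader is created, so the dispatch loop does only integer comparisons.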
Not sure if this needs to be in scope here, though it does seem like a good suggestion.
JiwaniZakir left a comment
The optimization is sound since `device.queue.writeBuffer` synchronously copies the `ArrayBuffer` contents into the GPU command queue, so reusing `podArgsArrayBuffer` across `submitShader` invocations is safe in JavaScript's single-threaded model.

The variable name `maxPodArgs` on line 700 is misleading: it's not a maximum but an exact count. `podArgCount` or `numPodArgs` would be more accurate and consistent with the rest of the codebase's naming style.

The removal of the comment `// always pass in dim z launching grid size in` (previously above the `u32View[podArgIndices.length] = packDimX` line) is a minor regression in documentation; the assignment of `packDimX` at the last index is non-obvious and worth explaining, especially since it sits outside the for loop iterating over `podArgIndices`.

One edge case worth verifying: if `podArgIndices.length === 0` (a kernel with no pod args, only buffer args), `podArgBytes` will be `1 * 4 = 4` bytes, and `getUniformFromPool` will be called with that size. Confirm that `getUniformFromPool` and the WebGPU uniform buffer binding handle a 4-byte buffer correctly (some implementations have minimum binding size constraints), though this likely already worked before the change since the size calculation is equivalent.
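The size arithmetic for that edge case can be written out explicitly (an illustrative sketch; the variable names mirror the review comment, not the actual source):

```typescript
// A kernel with no pod args still reserves one slot for the packed
// grid dimension, so the uniform buffer is never zero-sized.
const podArgIndices: number[] = []; // no pod args, only buffer args
const podArgBytes =
  (podArgIndices.length + 1) * Int32Array.BYTES_PER_ELEMENT;
// podArgBytes is 1 * 4 = 4, which is what getUniformFromPool would
// be asked for; any minimum-binding-size constraint applies to this value.
```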
Rename `maxPodArgs` to `numPodSlots` for clarity (it's a count, not a maximum) and restore an explanatory comment for the `packDimX` uniform slot assignment.
Addressed:
Summary
Hoists `Int32Array`/`Uint32Array`/`Float32Array` allocation out of the per-dispatch `submitShader` closure into the per-shader `createShaderFunc` scope, eliminating 3 typed array allocations + 1 `ArrayBuffer` per GPU kernel dispatch. `podArgIndices.length` is fixed per shader, so the cached views have the correct size for every invocation. Every slot `0..podArgIndices.length` is unconditionally written before `writeBuffer` copies the data out, so no stale values can leak between dispatches.

Motivation
In workloads with many small dispatches (e.g. LLM token generation), the per-dispatch typed array allocations become a measurable source of GC pressure. Pre-allocating and reusing the views avoids this overhead.
Test plan
`npm run lint` passes in `web/`