The innermost (vectorized) loop dimension steps the parent indices by the
arrays' strides, which are runtime values. Even when the data is contiguous,
the compiler cannot prove unit stride and auto-vectorizes the loop with
gather/scatter instructions, which do not stream memory. For a contiguous
400x400 Float64 `copy!` this runs at ~8.5 GB/s (~300 us) instead of the
~33 GB/s a contiguous SIMD loop achieves.
Add a runtime branch: when every array is contiguous along loop dimension 1
(all innermost strides == 1), step the indices by the literal `1` so the
compiler emits streaming SIMD loads/stores. The post-loop index correction
reuses the existing return-stride expressions, which are numerically identical
because the stride equals 1.
Measured (contiguous 400x400 Float64, single thread): `copy!` 300 us -> 91 us
(~3.3x), matching the compile-time-constant-stride ideal. Non-contiguous
(e.g. transposed) inputs are unaffected and keep the existing path. Full test
suite passes (single- and multi-threaded).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
_mapreduce_kernel!steps the parent indices of the innermost (vectorized) loop dimension by the arrays' strides, which are runtime values. Even when the data is contiguous, the compiler cannot prove unit stride, so LLVM auto-vectorizes the inner loop with gather/scatter instructions (vgatherqpd/vscatterqpd). Gather/scatter address each lane individually and do not stream memory.The effect is severe for memory-bound contiguous ops. A contiguous
400×400Float64copy!(map!(identity, …)) runs at ~8.5 GB/s (~300 µs) instead of the ~33 GB/s a contiguous SIMD loop achieves. Minimal reproduction — a hand-written@simdcopy loop:C[i]=A[i], runtime stride (=1)C[i]=A[i], compile-time stride1Pure-copy /
map!bodies are hit hardest because LLVM's cost model judges gather/scatter SIMD "profitable" for them, whereas heavier accumulate bodies are left as (faster) scalar loops.Change
Add a runtime branch in the innermost loop: when every array is contiguous along loop dimension 1 (all innermost strides
== 1), step the parent indices by the literal1instead of the runtime stride. This lets the compiler emit streaming SIMD loads/stores. The post-loop index correction reuses the existing return-stride expressions, which are numerically identical because the stride equals1.The change is confined to the generated expression for the non-reduction innermost loop; non-contiguous (e.g. transposed) inputs are unaffected and take the existing path.
Results
Contiguous
400×400Float64, single thread:copy!(contiguous)copy!(transposed input)The contiguous result now matches the compile-time-constant-stride ideal.
Testing
JULIA_NUM_THREADS=4):map/scale!/axpy!/axpby!,copy,broadcasting,mapreduce,reduce,mul!.copy!/conj!/permuted copies/scaledmap!/binarymap!/reductions/axpby): max error0.0.Notes / possible follow-ups
iszero(stride)hoist branch) still gather contiguous inputs (e.g.sumover a contiguous array). An analogous unit-stride branch there would help; left out to keep this change focused.🤖 Generated with Claude Code