Add unit-stride fast path to _mapreduce_kernel! #70

Merged
lkdvos merged 3 commits into main from ld-unit-stride-kernel on Jun 26, 2026

@lkdvos commented Jun 25, 2026 (Member)

Problem

_mapreduce_kernel! steps the parent indices of the innermost (vectorized) loop dimension by the arrays' strides, which are runtime values. Even when the data is contiguous, the compiler cannot prove unit stride, so LLVM auto-vectorizes the inner loop with gather/scatter instructions (vgatherqpd/vscatterqpd). Gather/scatter address each lane individually and do not stream memory.

The effect is severe for memory-bound contiguous ops. A contiguous 400×400 Float64 copy! (map!(identity, …)) runs at ~8.5 GB/s (~300 µs) instead of the ~33 GB/s a contiguous SIMD loop achieves. A minimal reproduction with a hand-written @simd copy loop shows the same gap:

| inner loop (160 000 elems) | time | asm |
|---|---|---|
| C[i]=A[i], runtime stride (=1) | 298 µs | gather/scatter |
| C[i]=A[i], compile-time stride 1 | 91 µs | contiguous |
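
A minimal sketch of what such a reproduction could look like is given here. The function names `copy_runtime_stride!` and `copy_literal_stride!` are hypothetical and not part of Strided.jl, and the exact loop used for the measurements above is not shown in this PR. Timings and the instructions emitted will depend on the CPU and Julia/LLVM version.

```julia
# Illustrative reproduction; these function names are hypothetical, not Strided.jl API.

# Runtime strides: the compiler must handle any sC/sA, even if both equal 1 at call time.
function copy_runtime_stride!(C::Vector{Float64}, A::Vector{Float64}, n::Int, sC::Int, sA::Int)
    (sC >= 1 && sA >= 1) || throw(ArgumentError("strides must be positive"))
    # Validate up front because the loop below disables bounds checks.
    (n <= 0 || (1 + (n - 1) * sC <= length(C) && 1 + (n - 1) * sA <= length(A))) ||
        throw(ArgumentError("arrays too short for n elements at the given strides"))
    ic, ia = 1, 1
    @inbounds @simd for k in 1:n
        C[ic] = A[ia]
        ic += sC   # step by a value only known at run time
        ia += sA
    end
    return C
end

# Literal stride: the step is the constant 1, so contiguity is visible to the compiler.
function copy_literal_stride!(C::Vector{Float64}, A::Vector{Float64}, n::Int)
    n <= min(length(C), length(A)) || throw(ArgumentError("arrays too short"))
    ic, ia = 1, 1
    @inbounds @simd for k in 1:n
        C[ic] = A[ia]
        ic += 1
        ia += 1
    end
    return C
end

A = rand(160_000)
C = similar(A)

# Call each function once first so compilation is not included in the timing.
copy_runtime_stride!(C, A, length(A), 1, 1)
copy_literal_stride!(C, A, length(A))

t_runtime = @elapsed copy_runtime_stride!(C, A, length(A), 1, 1)
t_literal = @elapsed copy_literal_stride!(C, A, length(A))
println("runtime stride: ", t_runtime, " s; literal stride: ", t_literal, " s")
```

A single `@elapsed` call is noisy; a benchmarking package would give more reliable numbers.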

Pure-copy / map! bodies are hit hardest because LLVM's cost model judges gather/scatter SIMD "profitable" for them, whereas heavier accumulate bodies are left as (faster) scalar loops.

Change

Add a runtime branch in the innermost loop: when every array is contiguous along loop dimension 1 (all innermost strides == 1), step the parent indices by the literal 1 instead of the runtime stride. This lets the compiler emit streaming SIMD loads/stores. The post-loop index correction reuses the existing return-stride expressions, which are numerically identical because the stride equals 1.

The change is confined to the generated expression for the non-reduction innermost loop; non-contiguous (e.g. transposed) inputs are unaffected and take the existing path.
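
The idea can be sketched outside the generated kernel as follows. This is a simplified, hand-written illustration under stated assumptions: `map_inner!` is a hypothetical name, it handles only two flat vectors, and the real change lives in the expression generated by `_mapreduce_kernel!`, which is not reproduced here. The rewind after the loop is an assumed stand-in for the post-loop index correction described above.

```julia
# Hypothetical sketch of a unit-stride fast path; not the actual Strided.jl kernel.
function map_inner!(f, C::Vector, A::Vector, n::Int, sC::Int, sA::Int)
    n >= 0 || throw(ArgumentError("n must be non-negative"))
    (sC >= 1 && sA >= 1) || throw(ArgumentError("strides must be positive"))
    # Validate up front because the loops below disable bounds checks.
    (n == 0 || (1 + (n - 1) * sC <= length(C) && 1 + (n - 1) * sA <= length(A))) ||
        throw(ArgumentError("arrays too short for n elements at the given strides"))
    ic, ia = 1, 1
    if sC == 1 && sA == 1
        # Fast path: every array is contiguous, so step by the literal 1.
        @inbounds @simd for k in 1:n
            C[ic] = f(A[ia])
            ic += 1
            ia += 1
        end
    else
        # Existing path: step by the runtime strides.
        @inbounds @simd for k in 1:n
            C[ic] = f(A[ia])
            ic += sC
            ia += sA
        end
    end
    # Post-loop correction, written once with the runtime strides. On the fast
    # path sC == sA == 1, so n * sC equals the n literal steps taken in the loop.
    ic -= n * sC
    ia -= n * sA
    return ic, ia
end

A = collect(1.0:8.0)
C = zeros(8)
map_inner!(identity, C, A, 8, 1, 1)   # takes the unit-stride branch
D = zeros(8)
map_inner!(x -> 2x, D, A, 4, 2, 1)    # takes the general strided branch
```

Both branches compute the same result; the branch only changes what the compiler can prove about the memory access pattern.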

Results

Contiguous 400×400 Float64, single thread:

| op | before | after |
|---|---|---|
| copy! (contiguous) | ~300 µs | ~91 µs (~3.3×) |
| copy! (transposed input) | ~293 µs | ~259 µs (unchanged path) |

The contiguous result now matches the compile-time-constant-stride ideal.

Testing

  • Full Strided test suite passes, single- and multi-threaded (JULIA_NUM_THREADS=4): map/scale!/axpy!/axpby!, copy, broadcasting, mapreduce, reduce, mul!.
  • Additional correctness sweep (81 cases over Float32/Float64/ComplexF64 × ndims 1–4, covering copy!/conj!/permuted copies/scaled map!/binary map!/reductions/axpby): max error 0.0.

Notes / possible follow-ups

  • Reduction loops (the iszero(stride) hoist branch) still gather contiguous inputs (e.g. sum over a contiguous array). An analogous unit-stride branch there would help; left out to keep this change focused.
  • The gather/scatter code path remains in the binary as the fallback for the genuinely strided case; only the runtime branch taken for contiguous data changes.
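
For the reduction follow-up mentioned in the first note, an analogous branch might look roughly like the sketch below. This is not implemented in this PR; `strided_sum` is a hypothetical helper on a flat vector, and whether the same gain appears for reductions is an untested assumption.

```julia
# Hypothetical sketch of a unit-stride branch for a reduction; not part of this PR.
function strided_sum(A::Vector{Float64}, n::Int, sA::Int)
    n >= 0 || throw(ArgumentError("n must be non-negative"))
    sA >= 1 || throw(ArgumentError("stride must be positive"))
    # Validate up front because the loops below disable bounds checks.
    (n == 0 || 1 + (n - 1) * sA <= length(A)) ||
        throw(ArgumentError("array too short for n elements at the given stride"))
    acc = 0.0
    ia = 1
    if sA == 1
        # Contiguous input: step by the literal 1.
        @inbounds @simd for k in 1:n
            acc += A[ia]
            ia += 1
        end
    else
        # Genuinely strided input: step by the runtime stride.
        @inbounds @simd for k in 1:n
            acc += A[ia]
            ia += sA
        end
    end
    return acc
end

A = collect(1.0:100.0)
total_contiguous = strided_sum(A, 100, 1)  # sums all 100 elements
total_strided = strided_sum(A, 50, 2)      # sums every second element, starting at A[1]
```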

🤖 Generated with Claude Code

lkdvos and others added 2 commits June 24, 2026 20:41
The innermost (vectorized) loop dimension steps the parent indices by the
arrays' strides, which are runtime values. Even when the data is contiguous,
the compiler cannot prove unit stride and auto-vectorizes the loop with
gather/scatter instructions, which do not stream memory. For a contiguous
400x400 Float64 `copy!` this runs at ~8.5 GB/s (~300 us) instead of the
~33 GB/s a contiguous SIMD loop achieves.

Add a runtime branch: when every array is contiguous along loop dimension 1
(all innermost strides == 1), step the indices by the literal `1` so the
compiler emits streaming SIMD loads/stores. The post-loop index correction
reuses the existing return-stride expressions, which are numerically identical
because the stride equals 1.

Measured (contiguous 400x400 Float64, single thread): `copy!` 300 us -> 91 us
(~3.3x), matching the compile-time-constant-stride ideal. Non-contiguous
(e.g. transposed) inputs are unaffected and keep the existing path. Full test
suite passes (single- and multi-threaded).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Refactor comments for clarity and conciseness.
@lkdvos commented Jun 25, 2026 (Member, Author)

TL;DR: if a kernel hits a case where all accesses happen to be stride 1, the compiler cannot detect this because the strides are runtime values, and (at least on my machine) it emits gather/scatter machine instructions instead of contiguous loads and stores. These are significantly slower, and for many of our use cases we deliberately lay out our tensors so that we hit this contiguous case as often as possible.

As a side note, I found this while experimenting with map! vs _mapreducedim! with an init-op, where C[I1] = A[I2] was somehow slower than C[I1] = C[I1] + A[I2], which really didn't make sense to me.
Inspecting the machine code, it turns out that in the former case my compiler decided to emit SIMD instructions (requiring gather/scatter because the stride is not known to be 1), while in the latter it did not, because it determined that scalar instructions are more efficient than two gathers plus a single scatter.
This cost model is slightly inaccurate: on my machine the scalar instructions actually end up faster for the plain copy as well, which made copy! slower than _mapreducedim! with a beta.
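
One way to repeat this kind of inspection is sketched below, using `code_llvm` from the InteractiveUtils standard library. The two loop functions are hypothetical stand-ins for the copy and accumulate bodies, not the Strided.jl kernel, and searching the IR for `masked.gather` is only a heuristic. Whether either loop is vectorized with gathers depends on the CPU and the Julia/LLVM version, so no particular outcome is assumed.

```julia
using InteractiveUtils  # provides code_llvm

# Shared argument validation, needed because the loops disable bounds checks.
function check_args(C, A, n::Int, sC::Int, sA::Int)
    (sC >= 1 && sA >= 1) || throw(ArgumentError("strides must be positive"))
    (n <= 0 || (1 + (n - 1) * sC <= length(C) && 1 + (n - 1) * sA <= length(A))) ||
        throw(ArgumentError("arrays too short for n elements at the given strides"))
    return nothing
end

# Hypothetical stand-in for the copy body: C[I1] = A[I2].
function copy_strided!(C::Vector{Float64}, A::Vector{Float64}, n::Int, sC::Int, sA::Int)
    check_args(C, A, n, sC, sA)
    ic, ia = 1, 1
    @inbounds @simd for k in 1:n
        C[ic] = A[ia]
        ic += sC
        ia += sA
    end
    return C
end

# Hypothetical stand-in for the accumulate body: C[I1] = C[I1] + A[I2].
function add_strided!(C::Vector{Float64}, A::Vector{Float64}, n::Int, sC::Int, sA::Int)
    check_args(C, A, n, sC, sA)
    ic, ia = 1, 1
    @inbounds @simd for k in 1:n
        C[ic] = C[ic] + A[ia]
        ic += sC
        ia += sA
    end
    return C
end

# Capture the optimized LLVM IR for a method as a string.
llvm_ir(f, types) = sprint(io -> code_llvm(io, f, types; debuginfo = :none))

# Heuristic: LLVM expresses vector gathers via the llvm.masked.gather intrinsic.
uses_gather(ir::AbstractString) = occursin("masked.gather", ir)

argtypes = (Vector{Float64}, Vector{Float64}, Int, Int, Int)
ir_copy = llvm_ir(copy_strided!, argtypes)
ir_add = llvm_ir(add_strided!, argtypes)

println("copy body uses gather: ", uses_gather(ir_copy))
println("add body uses gather:  ", uses_gather(ir_add))
```

`code_native` can be used the same way to look at the final machine instructions instead of the IR.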

codecov (bot) commented Jun 25, 2026

Codecov Report

❌ Patch coverage is 93.33333% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/mapreduce.jl | 93.33% | 1 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| src/mapreduce.jl | 81.13% <93.33%> (+0.51%) ⬆️ |

@lkdvos lkdvos requested a review from Jutho June 25, 2026 15:29
Comment thread src/mapreduce.jl Outdated
@lkdvos lkdvos merged commit 5f4cf01 into main Jun 26, 2026
10 of 13 checks passed
@lkdvos lkdvos deleted the ld-unit-stride-kernel branch June 26, 2026 12:25
2 participants