
perf: speed up exact diff common cases #24

Merged: knaeckeKami merged 2 commits into knaeckeKami:master from bernaferrari:perf/optimizations on Jun 6, 2026

@bernaferrari bernaferrari commented Dec 22, 2025

Contributor

Summary

This replaces the earlier broad optimization experiment with a smaller exact-Myers optimization pass aimed at common diff workloads and reviewability.

Changes:

  • trims common suffixes before running Myers, while intentionally not prefix-trimming so duplicate anchoring stays compatible with issue #15 ("Removing 2 consecutive elements from a list returns incorrect updates")
  • reuses one typed-data backing buffer for forward/backward k-lines and result status arrays
  • shares the middle-snake range traversal between the delegate and interned paths, while keeping the hot midpoint/snake loops specialized
  • keeps the normal delegate path direct, so common exact diffs do not pay an indirect comparator/interner cost
  • only interns list items for larger, non-aligned middle ranges where integer ID comparison can help
  • skips interning when a custom equalityChecker is supplied, because HashMap equality may not match the caller's equality semantics
  • adds hash-collision regression coverage for both update APIs
  • adds an AOT/JIT benchmark harness under tool/bench/bench.dart

The implementation preserves exact diff behavior. There is no heuristic cutoff or intentionally non-minimal mode.
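As a rough illustration of the suffix-trimming item above, here is a minimal sketch. `commonSuffixLength` is a hypothetical helper written for this example, not the code in this PR; it only shows how a shared tail can be peeled off so that Myers runs on the remaining ranges.

```dart
/// Hypothetical helper (not the package's actual code): counts how many
/// trailing items the two lists have in common.
int commonSuffixLength<T>(List<T> oldList, List<T> newList) {
  var oldEnd = oldList.length;
  var newEnd = newList.length;
  var count = 0;
  // Walk backwards from both ends while the items are equal.
  while (oldEnd > 0 &&
      newEnd > 0 &&
      oldList[oldEnd - 1] == newList[newEnd - 1]) {
    oldEnd--;
    newEnd--;
    count++;
  }
  return count;
}

void main() {
  final oldList = [1, 2, 3, 4, 5];
  final newList = [1, 9, 3, 4, 5];
  final suffix = commonSuffixLength(oldList, newList);
  // Only the ranges in front of the shared suffix still need a real diff.
  final oldMiddle = oldList.sublist(0, oldList.length - suffix);
  final newMiddle = newList.sublist(0, newList.length - suffix);
  print('suffix=$suffix old=$oldMiddle new=$newMiddle');
  // prints: suffix=3 old=[1, 2] new=[1, 9]
}
```

Note that the prefix (`1` in this example) is deliberately left in the middle range, matching the "no prefix-trimming" choice described in the list.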

Validation

Local checks on the pushed branch:

dart analyze
dart test
dart compile exe tool/bench/bench.dart -o build/diff_bench_current
./build/diff_bench_current
dart run tool/bench/bench.dart

I also ran the same benchmark harness against master by copying tool/bench/bench.dart into a clean master worktree before compiling, so both binaries used the same measurement code.

Benchmark settings:

  • detectMoves: false
  • warmups: 3
  • samples: 10
  • target: 20000us
  • table values are median microseconds per iteration

Speedup is master / branch, so values below 1.00x are regressions. These are local microbenchmarks and the smallest rows are noisy.
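For readers checking the tables by hand, the speedup column can be reproduced with a few lines. This is only a sketch of the arithmetic; `median` and `speedup` are hypothetical helper names and are not taken from tool/bench/bench.dart.

```dart
/// Median of a non-empty list of samples (hypothetical helper).
double median(List<double> samples) {
  final sorted = [...samples]..sort();
  final mid = sorted.length ~/ 2;
  return sorted.length.isOdd
      ? sorted[mid]
      : (sorted[mid - 1] + sorted[mid]) / 2;
}

/// Speedup as defined above: master time divided by branch time.
/// Values below 1.00 mean the branch is slower.
double speedup(double masterUs, double branchUs) => masterUs / branchUs;

void main() {
  // Medians copied from the AOT "object 1000 many" row.
  print(speedup(12735.00, 8416.50).toStringAsFixed(2)); // 1.51
  // Medians copied from the AOT "int 10 many" row (a regression).
  print(speedup(2.29, 2.51).toStringAsFixed(2)); // 0.91
}
```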

AOT Benchmarks

| type | size | diffs | master median (us) | branch median (us) | speedup |
|---|---|---|---|---|---|
| int | 10 | none | 0.48 | 0.48 | 1.00x |
| int | 10 | few | 0.70 | 0.70 | 1.00x |
| int | 10 | many | 2.29 | 2.51 | 0.91x |
| object | 10 | none | 0.72 | 0.60 | 1.20x |
| object | 10 | few | 0.94 | 0.98 | 0.96x |
| object | 10 | many | 2.18 | 3.13 | 0.70x |
| int | 100 | none | 2.40 | 1.75 | 1.37x |
| int | 100 | few | 3.26 | 3.42 | 0.95x |
| int | 100 | many | 109.59 | 98.41 | 1.11x |
| object | 100 | none | 5.10 | 3.22 | 1.58x |
| object | 100 | few | 6.41 | 3.09 | 2.07x |
| object | 100 | many | 123.66 | 123.54 | 1.00x |
| int | 1000 | none | 21.12 | 13.71 | 1.54x |
| int | 1000 | few | 46.70 | 39.07 | 1.20x |
| int | 1000 | many | 9047.00 | 8572.50 | 1.06x |
| object | 1000 | none | 45.20 | 24.21 | 1.87x |
| object | 1000 | few | 101.40 | 90.44 | 1.12x |
| object | 1000 | many | 12735.00 | 8416.50 | 1.51x |
| int | 10000 | none | 207.28 | 137.78 | 1.50x |
| int | 10000 | few | 1097.50 | 1179.22 | 0.93x |
| int | 10000 | many | 935030.00 | 1141486.00 | 0.82x |
| object | 10000 | none | 503.58 | 252.95 | 1.99x |
| object | 10000 | few | 1769.63 | 1699.56 | 1.04x |
| object | 10000 | many | 1869459.00 | 1132685.00 | 1.65x |

JIT Benchmarks

| type | size | diffs | master median (us) | branch median (us) | speedup |
|---|---|---|---|---|---|
| int | 10 | none | 0.80 | 0.48 | 1.67x |
| int | 10 | few | 1.57 | 1.02 | 1.54x |
| int | 10 | many | 4.22 | 2.83 | 1.49x |
| object | 10 | none | 1.03 | 0.63 | 1.63x |
| object | 10 | few | 1.46 | 1.15 | 1.27x |
| object | 10 | many | 4.42 | 3.27 | 1.35x |
| int | 100 | none | 3.80 | 2.77 | 1.37x |
| int | 100 | few | 5.62 | 3.17 | 1.77x |
| int | 100 | many | 164.26 | 133.38 | 1.23x |
| object | 100 | none | 7.75 | 2.95 | 2.63x |
| object | 100 | few | 9.18 | 4.34 | 2.12x |
| object | 100 | many | 261.44 | 209.27 | 1.25x |
| int | 1000 | none | 39.65 | 17.87 | 2.22x |
| int | 1000 | few | 52.17 | 67.46 | 0.77x |
| int | 1000 | many | 13706.50 | 12507.00 | 1.10x |
| object | 1000 | none | 82.90 | 33.10 | 2.50x |
| object | 1000 | few | 122.88 | 118.20 | 1.04x |
| object | 1000 | many | 20288.00 | 10895.00 | 1.86x |
| int | 10000 | none | 233.22 | 176.02 | 1.32x |
| int | 10000 | few | 895.56 | 1063.09 | 0.84x |
| int | 10000 | many | 1235363.00 | 899683.00 | 1.37x |
| object | 10000 | none | 619.02 | 341.70 | 1.81x |
| object | 10000 | few | 2135.50 | 2731.13 | 0.78x |
| object | 10000 | many | 2077002.00 | 1249991.00 | 1.66x |

Notes

This does not claim a universal speedup. It materially improves no-change and many object-list workloads, keeps the algorithm exact, and preserves the historical duplicate anchoring covered by issue #15. A few int-heavy few/many rows are neutral to slower; accepting that is preferable to adding an imara-style non-minimal cutoff or changing duplicate matching semantics.

@knaeckeKami

Owner

Interning now maps items to integer IDs using only hashCode, so distinct items with the same hash are treated as identical. Dart allows hash collisions, which changes correctness: the diff can return no updates when items actually changed (e.g., CollisionPair(1,2) vs CollisionPair(2,1) both hash to 3 via xor). I pushed failing regression tests on this PR branch to demonstrate the issue. A fix likely needs collision handling (bucket by hash + verify ==) or a guard/option to disable interning when collisions are possible.
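To make the collision concrete, here is a small self-contained sketch. `CollisionPair` is reconstructed from the description above (fields `a` and `b` are assumed names), and both interner functions are hypothetical illustrations rather than the code in this PR. The second one keys the map on the item itself, so Dart's hash map buckets by `hashCode` and then verifies with `==`.

```dart
class CollisionPair {
  final int a;
  final int b;
  const CollisionPair(this.a, this.b);

  @override
  bool operator ==(Object other) =>
      other is CollisionPair && other.a == a && other.b == b;

  // XOR is symmetric, so (1, 2) and (2, 1) both hash to 3.
  @override
  int get hashCode => a ^ b;
}

/// Buggy variant: IDs are keyed on hashCode only, so colliding items merge.
List<int> internByHashOnly(List<Object> items) {
  final ids = <int, int>{};
  return [
    for (final item in items)
      ids.putIfAbsent(item.hashCode, () => ids.length),
  ];
}

/// Collision-safe variant: the map is keyed on the item, so a hash match
/// is always confirmed with == before an ID is reused.
List<int> internByEquality(List<Object> items) {
  final ids = <Object, int>{};
  return [
    for (final item in items) ids.putIfAbsent(item, () => ids.length),
  ];
}

void main() {
  final items = [const CollisionPair(1, 2), const CollisionPair(2, 1)];
  print(internByHashOnly(items)); // [0, 0] (distinct items share an ID)
  print(internByEquality(items)); // [0, 1]
}
```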

@knaeckeKami

knaeckeKami commented Dec 27, 2025

Owner

Or: add a caller-supplied key (e.g., keyOf/idOf) to calculateListDiff so interning can use stable IDs instead of raw hashes.
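A rough sketch of what key-based interning could look like. `internByKey`, `keyOf`, and the `Todo` class are hypothetical names for illustration only; `calculateListDiff` does not have such a parameter today.

```dart
class Todo {
  final int id;
  final String title;
  const Todo(this.id, this.title);
}

/// Hypothetical: assigns dense integer IDs based on a caller-supplied key
/// instead of the item's own hashCode.
List<int> internByKey<T, K>(List<T> items, K Function(T item) keyOf) {
  final ids = <K, int>{};
  return [
    for (final item in items) ids.putIfAbsent(keyOf(item), () => ids.length),
  ];
}

void main() {
  final todos = [
    const Todo(7, 'write tests'),
    const Todo(9, 'ship release'),
    const Todo(7, 'write more tests'),
  ];
  print(internByKey(todos, (Todo t) => t.id)); // [0, 1, 0]
}
```

Two items with the same key get the same ID even if their contents differ, so a diff built on this would presumably still need a separate content comparison to detect changes.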

@bernaferrari

Contributor Author

Very very good catch! Fixed!

@knaeckeKami

knaeckeKami commented Dec 29, 2025

Owner

I added an AOT benchmark harness (tool/bench/bench.dart) following Dart microbenchmarking guidance (AOT compile, warmups, calibration to a target runtime, fixed inputs; refs: https://mrale.ph/blog/2021/01/21/microbenchmarking-dart-part-1.html and https://mrale.ph/blog/2024/11/27/microbenchmarks-are-experiments.html). I ran it against:

  1. master
  2. the initial PR head (8bd5a66, hash-collision bug)
  3. current head (collision fix)

It reports median us/iter for sizes 10/100/1000/10000, diff patterns none/few/many, for both int lists and object lists (8-field class with standard ==/hashCode).
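For context, the measurement loop described here (warmups, calibration to a target runtime, median per iteration) can be sketched roughly as below. This is not the contents of tool/bench/bench.dart; `measureMedianUs` is a hypothetical name, and the defaults simply mirror the settings listed in the PR description (3 warmups, 10 samples, 20000us target).

```dart
/// Hypothetical sketch: returns the median microseconds per call of [body].
double measureMedianUs(
  void Function() body, {
  int warmups = 3,
  int samples = 10,
  int targetUs = 20000,
}) {
  final sw = Stopwatch();

  // Calibration: double the iteration count until one batch takes at
  // least targetUs, so timer resolution does not dominate small workloads.
  var iterations = 1;
  while (true) {
    sw
      ..reset()
      ..start();
    for (var i = 0; i < iterations; i++) {
      body();
    }
    sw.stop();
    if (sw.elapsedMicroseconds >= targetUs) break;
    iterations *= 2;
  }

  // Times one batch and returns microseconds per iteration.
  double timeBatch() {
    sw
      ..reset()
      ..start();
    for (var i = 0; i < iterations; i++) {
      body();
    }
    sw.stop();
    return sw.elapsedMicroseconds / iterations;
  }

  for (var i = 0; i < warmups; i++) {
    timeBatch(); // results discarded
  }
  final results = [for (var i = 0; i < samples; i++) timeBatch()]..sort();
  final mid = results.length ~/ 2;
  return results.length.isOdd
      ? results[mid]
      : (results[mid - 1] + results[mid]) / 2;
}

void main() {
  final data = List<int>.generate(1000, (i) => i);
  var sink = 0; // keeps the work observable so it is not optimized away
  final us = measureMedianUs(() {
    sink += data.reduce((a, b) => a + b);
  });
  print('median: ${us.toStringAsFixed(2)} us/iter (sink=$sink)');
}
```

Compiling with `dart compile exe` and running the binary gives AOT numbers; `dart run` gives JIT numbers, which is why the two can disagree.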

Summary:

  • For none/few diffs the PR is materially slower than master (often ~3–12x).
  • For many diffs the PR is faster (~10–15%).
  • The collision fix adds ~1–2% overhead vs the buggy interner.

Full tables (median us/iter, AOT):

int

| size | diffs | master (us) | bug 8bd5a66 (us) | after (us) |
|---|---|---|---|---|
| 10 | none | 0.29 | 0.45 | 0.96 |
| 10 | few | 0.45 | 0.59 | 1.10 |
| 10 | many | 1.07 | 1.08 | 1.88 |
| 100 | none | 1.17 | 6.12 | 5.89 |
| 100 | few | 1.64 | 6.68 | 6.35 |
| 100 | many | 53.55 | 49.93 | 55.33 |
| 1000 | none | 9.97 | 57.38 | 57.45 |
| 1000 | few | 22.21 | 69.63 | 67.61 |
| 1000 | many | 4927.75 | 4238.63 | 4314.00 |
| 10000 | none | 98.77 | 1124.00 | 1124.19 |
| 10000 | few | 425.30 | 1168.28 | 1175.91 |
| 10000 | many | 486125.00 | 418311.00 | 419665.00 |

object

| size | diffs | master (us) | bug 8bd5a66 (us) | after (us) |
|---|---|---|---|---|
| 10 | none | 0.32 | 0.72 | 1.39 |
| 10 | few | 0.49 | 0.87 | 1.52 |
| 10 | many | 1.34 | 1.43 | 2.31 |
| 100 | none | 1.34 | 8.91 | 8.54 |
| 100 | few | 1.89 | 9.30 | 8.87 |
| 100 | many | 71.59 | 56.79 | 62.90 |
| 1000 | none | 11.47 | 88.04 | 86.46 |
| 1000 | few | 26.47 | 99.44 | 98.09 |
| 1000 | many | 6736.00 | 4311.38 | 4378.88 |
| 10000 | none | 117.26 | 1454.06 | 1443.38 |
| 10000 | few | 528.05 | 1476.00 | 1474.06 |
| 10000 | many | 666148.00 | 419549.00 | 423003.00 |

This seems at odds with the PR description claiming ~20% to 10x speedups. Can you clarify how those measurements were obtained (workload, inputs, tooling, JIT vs AOT, warmups/samples)? I want to align the benchmark methodology so we compare apples-to-apples.

@knaeckeKami knaeckeKami self-assigned this Dec 29, 2025
@bernaferrari bernaferrari changed the title from "perf: optimize diff algorithm with 4x speedup" to "perf: speed up exact diff common cases" on May 29, 2026
@bernaferrari

bernaferrari commented May 29, 2026

Contributor Author

Sorry for the long wait, I forgot about this. I asked Codex to rewrite everything, make sure every benchmark had speedups, and also test JIT vs AOT. It should be 1.5~2x faster. Updated here.

To go even faster (I didn't push this), it is possible to take inspiration from imara-diff (https://docs.rs/imara-diff/latest/imara_diff/), which is the fastest diff out there. It adds a 256 cutoff to Myers, which makes it up to 36x faster on large changes (10000 | many). However, it doesn't guarantee minimal diffs, so it is not 100% equivalent to Myers. If you think that is interesting, I can push it for you to see.

@knaeckeKami

Owner

Thank you! LGTM.

I noticed that in some cases this can change the edit script when duplicates are involved, e.g. for old=[0], new=[0,0]:
before this PR: Insert(position: 1)
with this PR: Insert(position: 0)

Both are valid, but I'll release this with a major version bump so use cases that depend on the old order don't break.
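A quick way to see that both scripts are valid is to apply each one to the old list. The sketch below uses a hypothetical `applyInsert` helper rather than the package's update callbacks.

```dart
/// Hypothetical helper: returns a copy of [list] with [value] inserted
/// at [position].
List<int> applyInsert(List<int> list, int position, int value) =>
    [...list]..insert(position, value);

void main() {
  final oldList = [0];
  print(applyInsert(oldList, 1, 0)); // [0, 0]
  print(applyInsert(oldList, 0, 0)); // [0, 0]
}
```

Both positions produce the same final list; what differs is which of the two new items is treated as the inserted one, and that can matter to callers who animate or track individual items.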

> For even faster, I didn't push, but it is possible to be inspired by https://docs.rs/imara-diff/latest/imara_diff/ which is the fastest diff out there. It adds a 256 cutoff to meyers which makes it up to 36x faster on large changes (10000 | many), however it doesn't guarantee minimal diffs, so it is not 100% equivalent to meyers. If you think that is interesting I can push for you to see.

Yes! But I think we should make this opt-in and document it, not the default behaviour.

@knaeckeKami knaeckeKami merged commit e5e7c97 into knaeckeKami:master Jun 6, 2026
1 check passed
@bernaferrari

Contributor Author

Yay!! Thanks a lot for your patience. Feel free to make a test for that duplication issue so this is known in the future (if behavior ever changes again accidentally)

@knaeckeKami

Owner

I don't know if it makes sense to lock in particular sequences of operations out of the many possible valid ones. I just think that if we release an update that might change them, we should do it with a major version bump, especially after so many years without updates, to be conservative and avoid breaking users who accidentally depend on one particular order.
