perf(pattern): cache compiled patterns behind a bounded ETS table#348
perf(pattern): cache compiled patterns behind a bounded ETS table#348davydog187 wants to merge 5 commits into
Conversation
Pattern.compile/1 ran on every string.find/match/gsub call. Add compile_cached/1, a memoized front for the pure compile/1, backed by a lazily-created bounded ETS set keyed on the raw pattern binary. Repeated -pattern loops now compile once and read thereafter; an adversarial varied-pattern stream is capped by clear-and-restart eviction. gmatch keeps compiling directly since it already compiles once per iterator. Plan: F
The bounded compiled-pattern cache regressed trivial repeated patterns
(string.match(s, "%d+")): a single ETS lookup (~60-95 ns) costs more
than recompiling a short pattern (~35 ns), so the cache was pure overhead
on the exact hot loop it targeted, and taxed miss-heavy varied-pattern
workloads by ~25%.
Two guards keep the cache where it earns its keep:
1. Patterns <= 8 bytes bypass the cache entirely and compile inline
(one byte_size guard, no ETS call). Trivial patterns now match the
bare compile/1 baseline.
2. Eligible patterns are inserted only on their second sighting; the
first sighting records a cheap sentinel. One-shot patterns never get
promoted to a full compiled entry, keeping the all-miss path closer
to bare compile/1.
The hit path now addresses the named table directly via
:ets.lookup(@cache_table, ...) instead of resolving it through
:ets.whereis on every call; the slow path falls back to whereis/create.
Re-benchmarked on the interpreter path (100k loops, full mode, median):
repeated %d+ 58.9 -> 60.1 ms (parity, within +-5% noise), repeated
expensive datetime pattern 136.2 -> 125.3 ms (-8% time) and 771 -> 606 MB
(-21% memory) -- the real win -- and the all-distinct miss path 141.4 ->
176.5 ms (+25%, now documented honestly, not 'negligible'). Cache stays
bounded at 512 entries.
Tightens the eviction comment (true max resident is exactly 512, cleared
before insert) and notes the benign racy-but-safe TOCTOU under concurrent
writers.
| case :ets.whereis(@cache_table) do | ||
| :undefined -> | ||
| try do | ||
| :ets.new(@cache_table, [ |
There was a problem hiding this comment.
tables are owned by their creating processes. I'm not so sure that we want to create a table like this. Maybe this is an opt in feature and the user must put Lua in their supervision tree or something
https://www.erlang.org/doc/apps/stdlib/ets.html#new/2
I think there's also a problem there if the table gets too big, there's a resource leak issue, so we'd want a max size or something on the cache
🤖 Ultracode review — round 1All three findings are minor (no blockers). The pattern cache is correctness-transparent — a wrong threshold or a lost creation race changes speed, never results — so these are confidence/coverage gaps rather than defects.
|
🤖 Ultracode review — round 1Major
Minor
|
… benchmark Address review feedback on the compiled-pattern cache: - maybe_evict_and_insert/3 now rescues ArgumentError (table deleted out from under it), recreates the table, and retries the insert once, mirroring the read path's :no_table handling. The moduledoc's "never a correctness dependency" / "transparently recreated" guarantee now holds on writes, not just reads. - Add benchmarks/pattern_ops.exs: a length sweep that exposes the @cache_min_len=8 compile-vs-cached crossover plus trivial / expensive / miss-stream gsub workloads exercising the @cache_max_entries=512 cap, so the tuning constants are reproducible instead of prose-only. Moduledoc now points at the harness. - Transparency test: split into bypass (<=8 byte) and cached (>8 byte) lists; the cached list has a >8-byte variant of every element type (anchored, set, capture, %b, quantifiers, position capture) so the sentinel -> promotion -> hit transitions are asserted tuple-identical through ETS, not just on the bypass path. - Cache-table tests: assert the specific key is absent rather than that the whole global named table is :undefined, so concurrent recreation by async suites can't flake the whereis assertions.
🤖 Ultracode review — round 2[MAJOR]
[MINOR]
[MINOR]
|
…ace-tolerant Make the one-shot/repeat ETS asserts tolerate a concurrent async-suite eviction (global :public table can clear-and-restart at the cap between write and lookup), matching the sibling race-tolerant tests. Add a 50-process concurrency test stressing the lazy-creation race and size-check/insert TOCTOU, plus targeted write-path-loss tests that drive maybe_evict_and_insert against a vanished table, asserting correct compiled results and table recreation. Add a 9-byte (first cache-eligible) row adjacent to the 8-byte bypass row in the crossover sweep so the threshold crossover is actually observable, and note the <=8-byte rows measure the bypass as a control.
🤖 Ultracode review — round 3
|
The two through-the-cache tests used the 1-byte pattern ",", which falls under the @cache_min_len bypass and never touches ETS. Switch them to an 11-byte pattern called three times so the third call is a genuine cached hit, and keep separate bypass-path cases for the short pattern.
What changed and why
Adds
compile_cached/1so repeatedstring.match/string.gsubcalls reuse compiled patterns instead of recompiling on every call. Backed by a bounded ETS table (max 512 entries) so the cache can't grow unbounded under adversarial pattern churn.Benchmark delta
Scenario: Bounded compiled-pattern cache —
string.match(s, "%d+")in a 100k loop.origin/main@ 016fa35 (no compile cache). Benchee full mode, 2 runs: ips 15.45 / 16.69; avg 64.74 / 59.91 ms; median 60.38 / 57.16 ms; memory 184.77 MB.perf/pattern-cache@ d13ca6e (compile_cached/1behind bounded ETS cache). Benchee full mode, 2 runs: ips 18.09 / 16.49; avg 55.29 / 60.66 ms; median 53.08 / 57.61 ms; memory 178.99 MB.How it was measured
origin/main016fa35 (exact parent of the branch;git merge-base origin/main perf/pattern-cache== 016fa35). Branch =perf/pattern-cached13ca6e (single commit on top of baseline).benchmarks/pattern_match_cache.exs(NOT committed) using the existingbenchmarks/helpers.exsharness. It evals one Lua chunkreturn run_match(100000)whererun_matchloopsstring.match(s, "%d+")100000 times overs="abc12345def"(constant pattern -> cache hits every iteration).string.match->Pattern.find->compile_cachedon the branch.LUA_BENCH_MODE=full MIX_ENV=benchmark mix run benchmarks/pattern_match_cache.exs(Benchee full profile: 10s time, 2s warmup, 1s memory_time), twice per ref, same machine (darwin, Elixir 1.20/OTP 29).Caveat
Noisy / weakly conclusive — per-run deviation was ±9-37% and the two refs' run ranges overlap, so the ~6% speedup is directional, not decisive. The cleaner signal is memory: a consistent 184.77 -> 178.99 MB (-3.1%) reduction across runs.
Review verdict
External review (the motivation for opening this PR) returned ready, with three nits and no blocking findings:
maybe_evict_and_insertclaims "@cache_max_entries compiled patterns plus the one being inserted" are resident, but it clears BEFORE inserting, so the true max resident is exactly 512, not 513. Doc imprecision only; the bound test (size <= 512) passes.cache_table/0rescue branch re-resolves a lost ETS-creation race via:ets.whereis, which has a vanishingly small theoretical window of returning:undefinedif the racing owner process died instantly betweenArgumentErrorandwhereis. Would crash the next:ets.lookup. Acceptable for a pure-optimization cache; not worth guarding.:ets.whereis+:ets.lookupmiss + insert) versus a barecompile/1. Negligible per-call overhead and a documented trade-off; net win for the repeated-pattern hot loops this targets.No fixes were applied; tests pass after review; final state: ready.
Merge gate
Human-gated merge. This PR is intentionally a draft and must not be auto-merged — a human reviews and merges.
Stabilization + independent verification
Verified on
perf/pattern-cache@ 2a5a3e9 against baselinemain@ 016fa35. Interpreter path measured via the bytecode-strip mechanism (proto.bytecode=nil,:lua_closuretag asserted), 100k loops,MIX_ENV=benchmark LUA_BENCH_MODE=full, back-to-back swap ofpattern.exin one worktree.What changed since first review
<= 8bytes ("%d+",",","%s") compile inline and never touch ETS — verified the table is not even created. This eliminates the original headline inversion.:__seen_once__sentinel and recompiles; only a repeat graduates to a compiled entry, so one-shot/all-distinct patterns are never promoted.:ets.whereis— direct:ets.lookup(@cache_table, ...)with anArgumentErrorrescue; table create only on a miss.Verified numbers (median, interpreter path)
%d+(%d+)-(%d+)-(%d+)T(%d+):(%d+):(%d+)Cache stays bounded: 1024 distinct eligible patterns (2× cap) → table cleared and refilled, ≤512 resident (~64 KB observed).
Honest framing
The win is memory + repeated expensive patterns, not a repeated-trivial-pattern time win (that inverts structurally and is unfixable — a ~35 ns compile cannot be beaten by a ~60–95 ns ETS lookup; the gate now sidesteps it). The all-distinct / miss-heavy path carries a ~20–25% tax, documented in code (not "negligible").
Validation
Full suite 2256 passed / 19 skipped / 1 excluded · lua53 suite 17 passed / 12 skipped · pattern tests 19 passed ·
mix format --check-formattedclean.