Skip to content

perf(pattern): cache compiled patterns behind a bounded ETS table#348

Open
davydog187 wants to merge 5 commits into
mainfrom
perf/pattern-cache
Open

perf(pattern): cache compiled patterns behind a bounded ETS table#348
davydog187 wants to merge 5 commits into
mainfrom
perf/pattern-cache

Conversation

@davydog187

@davydog187 davydog187 commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

What changed and why

Adds compile_cached/1 so repeated string.match/string.gsub calls reuse compiled patterns instead of recompiling on every call. Backed by a bounded ETS table (max 512 entries) so the cache can't grow unbounded under adversarial pattern churn.

Benchmark delta

Scenario: Bounded compiled-pattern cache — string.match(s, "%d+") in a 100k loop.

  • Baseline: origin/main @ 016fa35 (no compile cache). Benchee full mode, 2 runs: ips 15.45 / 16.69; avg 64.74 / 59.91 ms; median 60.38 / 57.16 ms; memory 184.77 MB.
  • Branch: perf/pattern-cache @ d13ca6e (compile_cached/1 behind bounded ETS cache). Benchee full mode, 2 runs: ips 18.09 / 16.49; avg 55.29 / 60.66 ms; median 53.08 / 57.61 ms; memory 178.99 MB.
  • Delta: -5.8% on the median (per-ref median avg 58.77 ms -> 55.35 ms); average-based delta is -7.0%.

How it was measured

  • Refs: baseline = origin/main 016fa35 (exact parent of the branch; git merge-base origin/main perf/pattern-cache == 016fa35). Branch = perf/pattern-cache d13ca6e (single commit on top of baseline).
  • Wrote benchmarks/pattern_match_cache.exs (NOT committed) using the existing benchmarks/helpers.exs harness. It evals one Lua chunk return run_match(100000) where run_match loops string.match(s, "%d+") 100000 times over s="abc12345def" (constant pattern -> cache hits every iteration). string.match -> Pattern.find -> compile_cached on the branch.
  • Ran on each ref with LUA_BENCH_MODE=full MIX_ENV=benchmark mix run benchmarks/pattern_match_cache.exs (Benchee full profile: 10s time, 2s warmup, 1s memory_time), twice per ref, same machine (darwin, Elixir 1.20/OTP 29).

Caveat

Noisy / weakly conclusive — per-run deviation was ±9-37% and the two refs' run ranges overlap, so the ~6% speedup is directional, not decisive. The cleaner signal is memory: a consistent 184.77 -> 178.99 MB (-3.1%) reduction across runs.

Review verdict

External review (the motivation for opening this PR) returned ready, with three nits and no blocking findings:

  • nit: Eviction comment in maybe_evict_and_insert claims "@cache_max_entries compiled patterns plus the one being inserted" are resident, but it clears BEFORE inserting, so the true max resident is exactly 512, not 513. Doc imprecision only; the bound test (size <= 512) passes.
  • nit: cache_table/0 rescue branch re-resolves a lost ETS-creation race via :ets.whereis, which has a vanishingly small theoretical window of returning :undefined if the racing owner process died instantly between ArgumentError and whereis. Would crash the next :ets.lookup. Acceptable for a pure-optimization cache; not worth guarding.
  • nit: Non-target / single-use patterns now pay two extra ETS ops (:ets.whereis + :ets.lookup miss + insert) versus a bare compile/1. Negligible per-call overhead and a documented trade-off; net win for the repeated-pattern hot loops this targets.

No fixes were applied; tests pass after review; final state: ready.

Merge gate

Human-gated merge. This PR is intentionally a draft and must not be auto-merged — a human reviews and merges.

Stabilization + independent verification

Verified on perf/pattern-cache @ 2a5a3e9 against baseline main @ 016fa35. Interpreter path measured via the bytecode-strip mechanism (proto.bytecode=nil, :lua_closure tag asserted), 100k loops, MIX_ENV=benchmark LUA_BENCH_MODE=full, back-to-back swap of pattern.ex in one worktree.

What changed since first review

  • Trivial patterns bypass the cache. Patterns <= 8 bytes ("%d+", ",", "%s") compile inline and never touch ETS — verified the table is not even created. This eliminates the original headline inversion.
  • Second-sighting insertion. First sighting records a cheap :__seen_once__ sentinel and recompiles; only a repeat graduates to a compiled entry, so one-shot/all-distinct patterns are never promoted.
  • Hit path drops the per-call :ets.whereis — direct :ets.lookup(@cache_table, ...) with an ArgumentError rescue; table create only on a miss.
  • Comment fixes: true max resident is exactly 512; TOCTOU under concurrent writers is documented as benign (worst case = extra recompile).

Verified numbers (median, interpreter path)

Workload (100k) Baseline Branch Time Δ Mem Δ
Repeated trivial %d+ 59.22 ms / 0.40 GB 59.73 ms / 411 MB +0.9% (parity) parity
Repeated expensive datetime (%d+)-(%d+)-(%d+)T(%d+):(%d+):(%d+) 234.31 ms / 1117 MB 155.18 ms / 959 MB −34% time −14% mem
All-distinct (miss path) 179.00 ms / 1.06 GB 214.09 ms / 1089 MB +20% time parity

Cache stays bounded: 1024 distinct eligible patterns (2× cap) → table cleared and refilled, ≤512 resident (~64 KB observed).

Honest framing

The win is memory + repeated expensive patterns, not a repeated-trivial-pattern time win (that inverts structurally and is unfixable — a ~35 ns compile cannot be beaten by a ~60–95 ns ETS lookup; the gate now sidesteps it). The all-distinct / miss-heavy path carries a ~20–25% tax, documented in code (not "negligible").

Validation

Full suite 2256 passed / 19 skipped / 1 excluded · lua53 suite 17 passed / 12 skipped · pattern tests 19 passed · mix format --check-formatted clean.

Pattern.compile/1 ran on every string.find/match/gsub call. Add
compile_cached/1, a memoized front for the pure compile/1, backed by a
lazily-created bounded ETS set keyed on the raw pattern binary. Repeated
-pattern loops now compile once and read thereafter; an adversarial
varied-pattern stream is capped by clear-and-restart eviction. gmatch
keeps compiling directly since it already compiles once per iterator.

Plan: F
The bounded compiled-pattern cache regressed trivial repeated patterns
(string.match(s, "%d+")): a single ETS lookup (~60-95 ns) costs more
than recompiling a short pattern (~35 ns), so the cache was pure overhead
on the exact hot loop it targeted, and taxed miss-heavy varied-pattern
workloads by ~25%.

Two guards keep the cache where it earns its keep:

  1. Patterns <= 8 bytes bypass the cache entirely and compile inline
     (one byte_size guard, no ETS call). Trivial patterns now match the
     bare compile/1 baseline.
  2. Eligible patterns are inserted only on their second sighting; the
     first sighting records a cheap sentinel. One-shot patterns never get
     promoted to a full compiled entry, keeping the all-miss path closer
     to bare compile/1.

The hit path now addresses the named table directly via
:ets.lookup(@cache_table, ...) instead of resolving it through
:ets.whereis on every call; the slow path falls back to whereis/create.

Re-benchmarked on the interpreter path (100k loops, full mode, median):
repeated %d+ 58.9 -> 60.1 ms (parity, within +-5% noise), repeated
expensive datetime pattern 136.2 -> 125.3 ms (-8% time) and 771 -> 606 MB
(-21% memory) -- the real win -- and the all-distinct miss path 141.4 ->
176.5 ms (+25%, now documented honestly, not 'negligible'). Cache stays
bounded at 512 entries.

Tightens the eviction comment (true max resident is exactly 512, cleared
before insert) and notes the benign racy-but-safe TOCTOU under concurrent
writers.
@davydog187 davydog187 marked this pull request as ready for review June 10, 2026 22:41
case :ets.whereis(@cache_table) do
:undefined ->
try do
:ets.new(@cache_table, [

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tables are owned by their creating processes. I'm not so sure that we want to create a table like this. Maybe this is an opt in feature and the user must put Lua in their supervision tree or something

https://www.erlang.org/doc/apps/stdlib/ets.html#new/2

I think there's also a problem there if the table gets too big, there's a resource leak issue, so we'd want a max size or something on the cache

@davydog187

Copy link
Copy Markdown
Contributor Author

🤖 Ultracode review — round 1

All three findings are minor (no blockers). The pattern cache is correctness-transparent — a wrong threshold or a lost creation race changes speed, never results — so these are confidence/coverage gaps rather than defects.

  • [MINOR] lib/lua/vm/stdlib/pattern.ex:19-55,374 — Pattern-cache perf claims and @cache_min_len=8 are unbacked by any committed benchmark

    • The cost-gating design rests on numbers asserted only in code comments (ETS lookup ~60-95 ns, trivial compile ~35 ns), and no committed benchmark exercises pattern compile vs ETS lookup. benchmarks/string_ops.exs only mentions compile in a comment (line 8) and does not exercise find/gsub/gmatch/compile. The PR's own stated risk is that caching can regress a hot loop of trivial patterns, yet nothing measures the chosen threshold.
    • Suggestion: Capture the compile-vs-lookup measurement in the PR description so the @cache_min_len=8 choice is auditable, or add a Benchee scenario under benchmarks/ exercising compile/1 vs compile_cached/1 for short/long and hit/miss/one-shot workloads.
  • [MINOR] test/lua/vm/stdlib/pattern_test.exs:2,75-105async: true test writes the global cache table that the async: false cache test asserts on

    • pattern_test.exs is use ExUnit.Case, async: true (line 2), and its new cache-transparency block compiles cache-eligible (>8 byte) patterns such as "%d+%.%d+" and "(%a+),(%d+)", populating and leaving entries in the shared global :lua_pattern_cache table. PatternCacheTest is deliberately async: false because it asserts on that same global table's size/eviction. It stays green only because ExUnit runs all async modules before sync ones and the sync tests drop_table() first — the isolation is implicit and relies on phase ordering rather than being enforced.
    • Suggestion: Confine global-table mutation to the non-async module: move the cache-transparency block into the async: false PatternCacheTest, or add a drop_table/on_exit cleanup to pattern_test.exs's cache block so the isolation is explicit.
  • [MINOR] test/lua/vm/stdlib/pattern_cache_test.exs:57-68 — Startup table-creation race and concurrent-writer path the PR claims safe are untested

    • cache_table/0 (pattern.ex ~425) has a rescue branch re-resolving :ets.whereis after losing the :ets.new race, and maybe_evict_and_insert (pattern.ex ~470) carries an acknowledged benign TOCTOU under concurrent writers. The PR explicitly names multi-process concurrency and a startup table-creation race as central risks, yet the only table-loss test (57-68) deletes/recreates from a single process. The race-loss rescue branch and concurrent insert path are entirely uncovered.
    • Suggestion: Add a test that drops the table then spawns N processes (e.g. Task.async_stream) all calling compile_cached/1 on the same long pattern, asserting every result equals compile/1 and the table ends up created exactly once — exercising both the creation race and the concurrent insert path.

@davydog187

Copy link
Copy Markdown
Contributor Author

🤖 Ultracode review — round 1

Major

  • [MAJOR] lib/lua/vm/stdlib/pattern.ex (@cache_min_len / @cache_max_entries + moduledoc cache notes) — Performance claims and the @cache_min_len 8 threshold have no reproducible benchmark backing them

    • This is a performance PR whose central tuning constant (@cache_min_len 8) and headline numbers exist only in prose. The moduledoc cites ~60-95 ns ETS lookups vs ~35 ns trivial compiles and a "crossover around a handful of bytes"; the commit bodies cite 58.9 -> 60.1 ms, 136.2 -> 125.3 ms (-8%), 771 -> 606 MB (-21%), +25% on the miss path. None is reproducible: grepping every file in benchmarks/ for find|gsub|gmatch|match|pattern yields only Enum.find and string.format specifiers (%d in format strings), never a Lua pattern operation. No committed harness produced these numbers or lets a reviewer/CI re-validate the 8-byte crossover or the 512 cap, so the magic constants are unverifiable and will silently rot. The repo has a run-bench skill pointing at benchmarks/, which makes the absence of a pattern benchmark more notable.
    • Suggestion: Add benchmarks/pattern_ops.exs in the existing Benchee style covering the three workloads the commit names: repeated trivial pattern (%d+), repeated expensive pattern (the datetime example), and an all-distinct miss stream; sweep pattern length around the threshold so 8 is demonstrably the crossover on this engine. At minimum, move the measurement methodology into a committed file rather than leaving it only in commit prose.
  • [MAJOR] test/lua/vm/stdlib/pattern_test.exs (describe "compiled-pattern cache transparency", the @patterns list) — "compile_cached returns the same tuple" transparency test exercises only the cache-bypass path, never the cached path

    • The byte lengths of @patterns (["^a","[%w]","(%a+)","%bxy","a*b-c?","()",",","%d+%.%d+"]) are 2,4,5,4,6,2,1,8 — every one is <= @cache_min_len (8), so compile_cached/1 always hits the first clause (byte_size <= 8 -> compile inline) and never touches ETS. The test comment claims it crosses "first-sighting -> second-sighting -> cached-hit transitions", but for these inputs there are zero transitions: the sentinel/promotion/hit machinery for anchored ^, balanced %bxy, position capture (), and quantifiers is never validated through the cache. The only elements that actually reach the cached path anywhere in the PR are the >8-byte comma-capture and synthetic pattern_number_N_%d+/@long strings — so anchored, %b, and position-capture semantics are never asserted tuple-identical on a genuine cache hit, which is exactly what this test claims to prove.
    • Suggestion: Add >8-byte variants of each element type to the transparency list (e.g. "^abcdefgh", "[%w][%w][%w]", "%bxy_padding", "()abcdef()", "a*b-c?d+e") so the second and third calls actually return a sentinel-promoted then a cached entry, and assert tuple-identity across all three calls.

Minor

  • [MINOR] lib/lua/vm/stdlib/pattern.ex (compile_cached/1 + maybe_evict_and_insert/3 + cache_table/0, commit 2a5a3e9) — Write path is not crash-safe against concurrent table deletion, contradicting the documented "never a correctness dependency" / "transparently recreated" guarantee

    • Verified against the diff and the live Erlang runtime. The READ in compile_cached/1 is guarded (:ets.lookup wrapped in try/rescue ArgumentError -> :no_table), but the WRITE is not. On a sentinel/miss branch the code calls maybe_evict_and_insert(cache_table(), ...), which does :ets.info(table, :size) >= @cache_max_entries then :ets.delete_all_objects / :ets.insert with no rescue. If the table is deleted between cache_table/0 resolving it and these calls, they raise. Confirmed on this runtime: :ets.info(missing, :size) returns :undefined, :undefined >= 512 is true (term ordering), so it proceeds to :ets.delete_all_objects/:ets.insert on a dead table, both of which raise ArgumentError. The moduledoc asserts the cache is "a pure optimization, never a correctness dependency" and that a lost table is "transparently recreated", but that resilience only holds on reads. Production never deletes the table, but pattern_cache_test.exs deletes it directly (:ets.delete) and other async suites write concurrently, so the window is reachable under the suite's own behavior.
    • Suggestion: Either (a) wrap the body of maybe_evict_and_insert/3 in try/rescue ArgumentError that recreates the table and retries the insert once (mirroring the read-path treatment) so the advertised "never a correctness dependency" invariant actually holds, or (b) narrow the moduledoc to say table loss is tolerated on reads only and that no code path may delete the table concurrently with use.
  • [MINOR] test/lua/vm/stdlib/pattern_cache_test.exs ("trivial patterns ... never touch ETS" and "recreates the table after it is lost") — Cache-table tests assert on a global named ETS table that async suites write concurrently, so the whereis == :undefined assertions can flake

    • The cache is a single process-agnostic named table :lua_pattern_cache. pattern_cache_test.exs is async: false, but test/lua/vm/string_test.exs and test/language/stdlib/string_test.exs are async: true and drive string.find/match/gsub through compile_cached with patterns >8 bytes (e.g. "(%d+)-(%d+)-(%d+)"), which create/populate that same table. (gmatch is exempt — it bypasses the cache via compile/1.) The cache tests do drop_table() then assert :ets.whereis(@cache_table) == :undefined; an async test that compiles a >8-byte pattern in that window recreates the table and breaks the assertion. async: false only serializes against other async: false tests, not the async pool. The bounded-stream size assertion is similarly exposed to concurrent inserts.
    • Suggestion: Make the assertions robust to concurrent recreation — assert the specific key is absent / the table does not contain it rather than that the whole named table is :undefined — or isolate these tests (e.g. a tag run outside the async pool). For the bounded test, assert the count of the test's own keys rather than total table size.

… benchmark

Address review feedback on the compiled-pattern cache:

- maybe_evict_and_insert/3 now rescues ArgumentError (table deleted
  out from under it), recreates the table, and retries the insert once,
  mirroring the read path's :no_table handling. The moduledoc's "never a
  correctness dependency" / "transparently recreated" guarantee now holds
  on writes, not just reads.
- Add benchmarks/pattern_ops.exs: a length sweep that exposes the
  @cache_min_len=8 compile-vs-cached crossover plus trivial / expensive /
  miss-stream gsub workloads exercising the @cache_max_entries=512 cap, so
  the tuning constants are reproducible instead of prose-only. Moduledoc
  now points at the harness.
- Transparency test: split into bypass (<=8 byte) and cached (>8 byte)
  lists; the cached list has a >8-byte variant of every element type
  (anchored, set, capture, %b, quantifiers, position capture) so the
  sentinel -> promotion -> hit transitions are asserted tuple-identical
  through ETS, not just on the bypass path.
- Cache-table tests: assert the specific key is absent rather than that
  the whole global named table is :undefined, so concurrent recreation by
  async suites can't flake the whereis assertions.
@davydog187

Copy link
Copy Markdown
Contributor Author

🤖 Ultracode review — round 2

[MAJOR] test/lua/vm/stdlib/pattern_cache_test.exs:1 — Concurrency and crash-safe write paths have zero test coverage despite being the PR's headline risk

Every test runs in a single process (async:false). pattern.ex documents and codes for three concurrent hazards that are never exercised: the table-creation race rescue (cache_table/0, rescue ArgumentError -> re-resolve via :ets.whereis), the benign size-check/insert TOCTOU under write_concurrency, and the write-path crash-safe rescue in maybe_evict_and_insert. The only table-loss test routes exclusively through the READ path: it deletes the table then calls compile_cached, which raises in :ets.lookup -> :no_table branch -> maybe_mark_seen(cache_table(), ...), and cache_table/0 recreates the table BEFORE :ets.insert, so the write-path rescue (the change the "make cache write path crash-safe" commit advertises) never sees a dead table and is dead from the suite's perspective. A regression flipping :public to :protected, dropping a rescue, or mishandling the lost-creation-race re-resolve would fail no test.

Suggestion: add (a) a multi-process test: spawn ~50 Tasks calling Pattern.compile_cached/1 on a mix of shared-hot and per-task-unique patterns after dropping the table, asserting every result equals Pattern.compile/1 and :ets.info(size) stays <= @cache_max_entries (stresses the creation race + TOCTOU); and (b) a targeted write-path-loss test that puts a pattern into the @seen_once state, deletes the table, then drives a second compile_cached so the insert encounters a missing table, asserting it returns the correct compiled tuple and recreates the table (covers maybe_evict_and_insert's rescue).

[MINOR] test/lua/vm/stdlib/pattern_cache_test.exs:45-63 — One-shot / repeat cache tests assert exact single-key ETS contents that a concurrent async writer can evict mid-test

:lua_pattern_cache is global, :public, and shared suite-wide. 86 async:true modules exist and several (string_test, pattern_test, etc.) drive >8-byte patterns through compile_cached via string.find/gsub/match. ExUnit runs async modules concurrently and they CAN overlap a sync (async:false) module's tests, so a concurrent eviction (delete_all_objects at the 512 cap) can fire between this test's write and its :ets.lookup. The "one-shot" test (assert [{^p, :__seen_once__}] = :ets.lookup) and "repeated" test (assert [{^p, compiled}] = :ets.lookup) assert the table holds exactly one specific tuple for their key; the sibling bounded-cache and table-loss tests were deliberately written race-tolerant, these two were not. This is a flakiness window only — the cache returns correct results regardless.

Suggestion: assert the observable contract rather than internal ETS state, matching the sibling tests: tolerate a vanished entry, e.g. case :ets.lookup(@cache_table, p) do [{^p, v}] -> assert v == expected; [] -> :ok end, and rely on the existing tuple-identity assertions (compile_cached(p) == compile(p)) for the promotion contract.

[MINOR] benchmarks/pattern_ops.exs:38 — Benchmark crossover sweep compares compile/1 against itself for the four rows at/below @cache_min_len

The bypass guard is byte_size(pattern) <= @cache_min_len (8, inclusive), so len=2/4/6/8 all bypass the cache and compile_cached/1 runs the identical compile/1 code as the "no cache" job; the two warming calls are no-ops for these rows. Only len=12/24/48 exercise a genuine cached hit. The file header states the sweep makes the crossover "observable on this engine instead of asserted" and labels the second column "compile_cached/1 (warm hit)", but for the rows bracketing the threshold below it there is no warm hit and no crossover can appear — the first genuinely-cached length (12) sits far from the last bypassed length (8). Benchmark-only; no product impact.

Suggestion: add a row just above the threshold (e.g. a 9- or 10-byte cache-eligible pattern) so the first genuinely-cached length is adjacent to the last bypassed one and the crossover is actually visible, and/or note in the header that the <=8-byte rows intentionally measure the bypass (compile == compile) as a control.

…ace-tolerant

Make the one-shot/repeat ETS asserts tolerate a concurrent async-suite
eviction (global :public table can clear-and-restart at the cap between
write and lookup), matching the sibling race-tolerant tests.

Add a 50-process concurrency test stressing the lazy-creation race and
size-check/insert TOCTOU, plus targeted write-path-loss tests that drive
maybe_evict_and_insert against a vanished table, asserting correct
compiled results and table recreation.

Add a 9-byte (first cache-eligible) row adjacent to the 8-byte bypass row
in the crossover sweep so the threshold crossover is actually observable,
and note the <=8-byte rows measure the bypass as a control.
@davydog187

Copy link
Copy Markdown
Contributor Author

🤖 Ultracode review — round 3

  • [MINOR] test/lua/vm/stdlib/pattern_test.exs:124-131 — "through the cache" tests use a cache-bypassing 1-byte pattern, so the cached find/gsub hit path is never exercised
    • Both string.find through the cache matches the uncached engine (lines 124-126) and string.gsub through the cache matches the uncached engine (lines 129-131) use the pattern "," (1 byte). compile_cached/1's first clause guards when byte_size(pattern) <= @cache_min_len (8), so "," takes the inline compile/1 bypass and never touches ETS. These tests assert correctness of the bypass path, not "through the cache" as their names claim. The genuinely-cached path — a >8-byte pattern served from a third-sighting ETS hit via find/gsub — is not covered end-to-end here. Risk is low because the compile_cached is tuple-identical across the cache transitions test already proves compile_cached returns a tuple-identical result to compile for long patterns and find/gsub consume that tuple directly, but the named integration tests give false confidence and a regression in the ETS hit path would not be caught by them.
    • Suggestion: Change these two tests to use a >8-byte pattern (e.g. a multi-char capture or longer literal) and call find/gsub at least three times so the third call is served from a real ETS hit, asserting the match result equals the uncached engine on every call. If you want bypass coverage too, keep a short-pattern case but rename it to reflect that it exercises the bypass path.

The two through-the-cache tests used the 1-byte pattern ",", which falls
under the @cache_min_len bypass and never touches ETS. Switch them to an
11-byte pattern called three times so the third call is a genuine cached
hit, and keep separate bypass-path cases for the short pattern.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant