Skip to content

perf(vm): rebuild call_stack lazily on the Lua->Lua call path#352

Open
davydog187 wants to merge 5 commits into
mainfrom
perf/executor-lazy-callinfo
Open

perf(vm): rebuild call_stack lazily on the Lua->Lua call path#352
davydog187 wants to merge 5 commits into
mainfrom
perf/executor-lazy-callinfo

Conversation

@davydog187

@davydog187 davydog187 commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

What changed and why

Rebuilds state.call_stack lazily at native-call boundaries instead of materializing call_info + tuple frames on every Lua->Lua call. The hot recursive Lua path drops two list ops and a 4-key map allocation per call; the cost moves to the (comparatively rare) native boundary via rebuild_call_stack/1. Adds call_iterator/6 to cover the generic_for materialization gap.

Benchmark — interpreter path (the real signal)

The original fib(30) measurement below routed through the dispatcher, which does not exercise this branch's changed code path, so it showed nothing. Re-measured on the tree-walking interpreter (fib stripped to :lua_closure via the benchmarks/dispatcher_vs_interpreter.exs bytecode-strip mechanism), the allocation reduction is clear and deterministic:

ref fib(28) ips fib(28) mem/iter fib(30) mem/iter
baseline 016fa35 2.79 / 2.68 2.84 GB 7.43 GB
this branch 2.83 / 2.81 2.78 GB (-2.1%) 7.29 GB (-1.9%)
  • Memory per iteration: -2% on the hot Lua->Lua path, deterministic (0.00% deviation across runs) — exactly this branch's intent (no per-call call_info map). This is the noise-immune proof.
  • Wall-clock: this branch was faster in both repeat rounds (~1.5-5%), tracking the allocation win rather than thermal luck. Modest but consistently in-favor.

Config: Benchee time:10 warmup:2 memory_time:3, sequential runs, 2 repeats per ref, fib forced through Lua.VM.Executor, correctness asserted (fib(28)==317811, fib(30)==832040). Bench script not committed.

Original (dispatcher-path) measurement — superseded, retained for the record

Scenario: fib(30) recursive via pre-compiled chunk. delta_pct +0.4%, INCONCLUSIVE — this routed through the dispatcher and never touched the changed call path. Superseded by the interpreter-path numbers above.

Review

Motivated by an external code review of the perf branch. Verdict: ready (tests pass after; no fixes required). Two nits, both non-blocking:

  1. nit — The __call metamethod dispatch path (executor.ex ~line 1237) invokes call_function(call_mm, ...) without materializing the lazily-tracked Lua frames into state.call_stack, so a __call callback that reads debug.traceback/getinfo or raises sees a truncated stack (missing in-flight executor frames). Differential testing against origin/main confirmed this is PRE-EXISTING behavior, byte-identical on both branches, NOT a regression. The author added call_iterator/6 to cover the analogous generic_for gap, but the __call path has the same gap on main too; out of scope for this perf change. Worth a follow-up but not a blocker.
  2. nitrebuild_call_stack/1 is O(depth) and is invoked at every native-call boundary (and prepended via ++ at three error raise sites). For native-call-heavy workloads at deep stacks this is more per-native-call work than the old design (which only read the already-materialized list). The net win still holds because the hot Lua->Lua path drops two list ops + a 4-key map alloc per call, and native calls are comparatively rare; but the cost moved from the common path to the native boundary rather than disappearing. Recommend running the run-bench harness on a call-heavy + debug-heavy workload to confirm no regression at the native boundary.

Merge gating

Human-gated merge. Marked ready for review; do NOT merge automatically — merge on maintainer go-ahead after reviewing the diff and the interpreter-path numbers above.

Stabilization + independent verification

Final verified numbers (interpreter path, bytecode-stripped, :lua_closure asserted; MIX_ENV=benchmark LUA_BENCH_MODE=full, Apple M4)

Allocation win (holds, independently reproduced vs main @016fa35):

workload main alloc branch alloc delta main time branch time
fib(28) 2.84 GB 2.78 GB -2.1% 380 ms 359 ms
fib(30) 7.43 GB 7.29 GB -1.9% 997 ms 979 ms

Memory deviation ±0.00% on both (deterministic).

Note: the perf claim was re-measured against a freshly built 016fa35. An earlier "does-not-reproduce" reading was a baseline mismeasurement (the reported main figures were actually the branch's). A clean main build lands at 2.84 / 7.43 GB.

Native-boundary trade-off (documented, accepted): debug.traceback x50 at depth pays for the lazy approach — rebuild_call_stack/1 is O(depth) on demand.

workload main branch
traceback x50 depth40 148 µs / 595 KB 203 µs / 730 KB
traceback x50 depth60 230 µs / 826 KB 302 µs / 1030 KB

This native-boundary cost funds the hot-path allocation win on the Lua→Lua call path.

__call boundary fix (commit cbf7068)

Materializes the lazily-tracked executor frames into state.call_stack at the __call metamethod dispatch site, mirroring the native-dispatch / dispatcher / generic_for boundaries, and adds __call traceback/getinfo/balance coverage to lazy_call_info_test.exs.

Independent verification note: differential probing of the pre-fix commit found the __call path already byte-identical to main (debug.traceback line 7, debug.getinfo(2,'nl').currentline 7, including a 4-deep __call chain), so this change is behavior-neutral / defensive rather than correcting an observed divergence. It is harmless and keeps the lazy bookkeeping balanced across all stack-reading boundaries.

Suite

Full unit: 2259 passed, 19 skipped, 1 excluded. lua53 (--only lua53, errors/db guardrails): 17 passed, 12 skipped. lazy_call_info_test.exs: 12 passed. mix format clean.

The executor's hot :lua_closure call path eagerly allocated a 4-key
call_info map per call and pushed it onto state.call_stack, even though
that map is only read when the stack is actually inspected (debug.getinfo,
debug.traceback, and error tracebacks). Drop the eager push: each frame now
records the call-site line and name_hint (the only bits not derivable from
the caller's proto), and rebuild_call_stack/1 synthesizes the identical
4-key entries on demand at the boundaries that read the stack — native-call
dispatch, the dispatcher hand-off, generic_for iterator calls, and the
:call error raise sites. The O(1) call_depth counter remains the sole
per-call bookkeeping cost.

Traceback and debug.* output are byte-identical to the eager implementation
(verified against main); the lua53 errors.lua/db.lua suites are the guardrail.

Plan: C
@davydog187 davydog187 marked this pull request as ready for review June 10, 2026 22:10
The lazy-call_info refactor covered every boundary that reads
state.call_stack -- native dispatch, dispatcher hand-off, generic_for
iterator calls, and the :call error raise sites -- except the __call
metamethod branch of the :call opcode. That site dispatches through
call_function/3, which can recurse into a Lua closure or run a native
callback (debug.traceback / debug.getinfo) that reads the live stack,
but the executor's in-flight Lua frames were never materialized.

The result was a truncated traceback and a currentline of -1 inside a
__call callback on the interpreter path, diverging from the eager
implementation. Mirror the sibling boundaries: rebuild the frames into
state.call_stack for the duration of the call and restore the inherited
stack on return so the lazy bookkeeping stays balanced.

Verified byte-identical to the pre-refactor executor (currentline 7,
traceback line 7 for the regression probe). The lazy_call_info suite
gains __call traceback/getinfo/balance coverage.

Plan: C
@davydog187

Copy link
Copy Markdown
Contributor Author

🤖 Ultracode review — round 1

  • [MAJOR] lib/lua/vm/executor.ex:1136 (also :1077) — Stack-overflow traceback is empty on the lazy :lua_closure call path

    Confirmed by direct reproduction against the PR head (cbf7068). State.check_call_depth!/1 raises the "stack overflow" error with call_stack: state.call_stack (state.ex:82). The PR stops pushing per-call frames on the :lua_closure path (the frames now live only in the lazy frames argument), so when check_call_depth! fires at executor.ex:1136 state.call_stack is empty. A 50-deep recursion forced down the pure-interpreter :lua_closure path (bytecode stripped — exactly the bytecode-fallback case from compiler/bytecode.ex:370) rendered call_stack length: 0 and emitted no "Stack trace:" section, raised from executor.ex:1136. The identical recursion through the normal compiled/dispatcher path rendered the full 50-frame traceback. The :compiled_closure branch shares the defect: check_call_depth! at line 1077 runs before the frames are materialized into state.call_stack at lines 1085-1091. The error still carries the correct message and value, so this is a traceback-fidelity regression, not a wrong result — but it is reachable by real Lua whenever a recursive function body contains any bytecode-unsupported instruction. The PR's own overflow test only matches /stack overflow/ and never inspects the traceback, so it does not catch this.

    Suggestion: Materialize the lazy frames into the stack used for the overflow report at both depth-check sites. E.g. at line 1136 build overflow_stack = rebuild_call_stack(frames) ++ state.call_stack and have the depth check raise with that (pass it to a check_call_depth! arity that accepts an explicit stack); at line 1077 move/replicate the materialization before the check so the compiled-closure overflow also includes the rebuilt frames. Add a regression test on the pure-interpreter path that asserts the traceback contents (frame count / an in function 'rec' line), not just the message regex.

The executor's hot :lua_closure path tracks in-flight Lua frames lazily
in its frames argument, not state.call_stack. When check_call_depth!
fired on overflow it reported state.call_stack, which on that path is
empty — the rendered stack-overflow error carried no traceback at all,
unlike the eager dispatcher/compiled path which renders the full stack.

Materialize the lazy frames into the reported stack at both depth-check
sites: add check_call_depth!/2 taking an explicit overflow stack, and
call_depth_ok?/1 so the :lua_closure hot path only rebuilds the stack on
the rare failing branch (no success-path cost). The :compiled_closure
path now checks against its already-materialized stack.

Add a regression test on the pure-interpreter path asserting traceback
contents (frame count and 'in function rec' lines), not just the message
regex.
@davydog187

Copy link
Copy Markdown
Contributor Author

🤖 Ultracode review — round 2

  • [MAJOR] test/lua/vm/lazy_call_info_test.exs:1-216 — lazy_call_info_test.exs exercises the :compiled_closure dispatcher branch, not the lazy :lua_closure path it documents

    The moduledoc claims these tests guard the lazily-rebuilt call_stack on the executor's Lua→Lua path where state.call_stack is left untouched — that is the :lua_closure clause at executor.ex:1101, the branch this PR optimizes (frame now carries :line/:name_hint). But every program here uses plain function f()...end/setmetatable closures, and those sub-prototypes are emitted with bytecode ([true, true]) per Compiler.compile. Per executor.ex:1029-1033 a non-nil-bytecode closure is tagged :compiled_closure, so nested calls route through the :compiled_closure hand-off at executor.ex:1068 — a different materialization site. Proof: the suite asserts debug.getinfo currentline == 0 (line 48, compiled path strips line info), whereas forcing the genuine lazy path with a bitwise op (3 & 1, which can't be bytecode-encoded) yields currentline == 3. So getinfo/traceback/__call fidelity and the call_stack-balance assertions on the real lazy :lua_closure branch are unguarded. The path is correct today (forced-lazy getinfo returns ["outer","global",3,"test.lua","Lua"] and call_stack balances to []), so this is a test-coverage gap rather than a live bug. recursion_depth_test.exs already uses the bitwise trick to cover the lazy overflow traceback, so that sub-case is genuinely guarded; the gap is the getinfo/traceback/__call/native-callback assertions. A future change to rebuild_call_stack/1 or the frame shape could break the lazy path with every test in this file still green.

    Suggestion: Add the bitwise-op trick (as in recursion_depth_test.exs's @lazy_recursion) to each scenario so outer/inner/the __call callback/the native-callback deep/outer (lines 169-183) fall onto the :lua_closure path, OR add parallel lazy-variant tests. At minimum cover on the forced-lazy path: debug.getinfo(2,"nSl") fields for a nested chain, debug.traceback line content, the __call callback currentline/traceback, the native-dispatch save/restore (with state.call_stack == [] post-condition), and call_stack balance. Update the moduledoc to match what is actually exercised.

  • [MINOR] lib/lua/vm/executor.ex:1085-1089:compiled_closure hand-off materializes the full call_stack on every success, contradicting the rebuild_call_stack/1 "never on the hot success path" doc

    Verified against PR head 9a50814. The :compiled_closure branch builds materialized_call_stack = [call_info | rebuild_call_stack(frames) ++ inherited_call_stack] unconditionally on every interpreter→dispatcher hand-off (lines 1085-1089), then passes it to check_call_depth!/2. The parallel :lua_closure branch (lines 1144-1151) instead gates rebuild_call_stack(frames) behind State.call_depth_ok?/1, so the O(depth) Enum.map walk only runs on overflow. Meanwhile rebuild_call_stack/1's own doc comment (executor.ex:555-557) states it is "Called only at the boundaries that actually read the stack ... never on the hot success path," and the frame-building comment claims lazy frames are "only materialized into a traceback when the limit is actually hit." Both statements are inaccurate for the :compiled_closure clause, which pays the walk eagerly on every hand-off success. Not a correctness bug — the stack is restored to inherited_call_stack on return (line 1098) and the suite is green — but it is a real per-hand-off cost the doc says is avoided.

    Suggestion: Either gate the materialization behind State.call_depth_ok?/1 the way the :lua_closure clause does (only building materialized_call_stack when actually needed), OR — if the dispatcher genuinely needs a live stack for its whole duration because it can run native callbacks mid-execution — tighten the rebuild_call_stack/1 doc and the frame-building comment to acknowledge that the interpreter→dispatcher boundary materializes eagerly, so the "never on the hot success path" wording is not contradicted.

The existing lazy_call_info tests used plain function/setmetatable
closures, whose sub-prototypes compile WITH bytecode and so route
through the :compiled_closure dispatcher hand-off, not the lazy
:lua_closure interpreter path the module name refers to. Add
forced-lazy variants (sprinkling a non-bytecode-encodable `3 & 1` op
into each function) for the getinfo, traceback, __call, native-callback
balance, and error-raise scenarios, and update the moduledoc to name
both materialization sites.

Also tighten the rebuild_call_stack/1 doc: the interpreter -> dispatcher
hand-off materializes the stack eagerly on every successful call (the
dispatcher pushes its own entries and may run native callbacks that read
it live), so the "never on the hot success path" wording was inaccurate
for that one boundary.
@davydog187

Copy link
Copy Markdown
Contributor Author

🤖 Ultracode review — round 3

Blockers

  • [BLOCKER] lib/lua/vm/executor.ex:2429 (index_value __index), :2520 (table_index __index), :2577 (table_newindex __newindex), :2707/2711/2715 (invoke_metamethod) — Metamethod dispatch (__index/__newindex/arithmetic/concat/__len/comparison) drops enclosing Lua frames from state.call_stack on the lazy path
    • Empirically confirmed regression, not speculation. The PR materializes lazy frames into state.call_stack at the __call boundary (line 1247), native dispatch (1176), the dispatcher hand-off (1090), and call_iterator/6 — but the other metamethod dispatch helpers that re-enter Lua via call_function/3 were missed. index_value/6 (2429), table_index (2520), table_newindex (2577), and invoke_metamethod/4 (2707/2711/2715) are all invoked from inside do_execute opcode clauses (e.g. index_value from 1859/1885/1892/1959/1966/2047) where frames is in scope and holds the live executor stack, but none thread frames in or materialize it. call_function/3 then recurses with frames=[] (executor.ex:157), so a metamethod that reads state.call_stack sees a truncated stack. Reproduced on the PR head (9a50814): a __index metamethod calling debug.traceback('idx') from a forced-lazy Lua function returns "idx\nstack traceback:" (enclosing driver frame GONE) and debug.getinfo(2,'nl').currentline returns -1. On merge-base 016fa35 the same program returns "idx\nstack traceback:\n\ttest.lua:9: in ?" and getinfo(2) returns 9. This breaks Lua 5.3 debug semantics and corrupts error tracebacks raised from inside metamethods on the lazy interpreter path.
    • Suggestion: Thread frames into index_value/6, table_index, table_newindex, and invoke_metamethod/4 and wrap the call_function/3 calls at 2429/2520/2577/2707-2715 with the same inherited-stack/rebuild_call_stack(frames)/restore dance already used for __call at 1247. Add regression tests asserting debug.traceback / debug.getinfo / error tracebacks inside __index, __newindex, and an arithmetic metamethod still observe the enclosing Lua frame, on both the compiled and forced-lazy paths.

Minor

  • [MINOR] test/lua/vm/lazy_call_info_test.exs:1-377 — No regression test for the generic_for iterator materialization boundary (call_iterator/6)
    • The diff introduces call_iterator/6 (executor.ex:582-588) specifically to materialize lazy frames into state.call_stack for a generic_for iterator call, and rewires both for-loop call sites to it (executor.ex ~689 and ~974). This is one of the five read boundaries the PR's own moduledoc enumerates, yet no test in the new file (or recursion_depth_test.exs) drives a for ... in loop combined with debug.* — confirmed by grep returning nothing. The path is currently correct, so this is a missing regression test rather than a live bug; but finding Pull in code #1 demonstrates that exactly these untested materialization boundaries are where the lazy refactor silently truncates tracebacks.
    • Suggestion: Add a describe block driving for _, v in iter, nil, 0 do where iter calls debug.traceback / debug.getinfo(2,'nl').currentline, assert the enclosing frame's call-site line appears, cover both compiled and forced-lazy variants, and assert call_stack restores to [] afterward to pin the inherited-stack restore in call_iterator/6.

Nits

  • [NIT] test/lua/vm/lazy_call_info_test.exs:68-78 (and every forced-lazy test) — Lazy-path tests force the interpreter via the implicit 3 & 1 codegen quirk instead of the existing strip_bytecode/1 helper
    • Every forced-lazy test sprinkles local _ = 3 & 1 to force the :lua_closure path, relying on bitwise ops being non-bytecode-encodable. If a future codegen change makes & encodable, these tests silently revert to the compiled path and stop exercising the lazy branch they are named for, with no failing signal. The repo already has an intent-revealing alternative in pcall_state_preservation_test.exs:44-48 (strip_bytecode/1 recursively nils proto.bytecode) run under for engine <- [:compiled, :interpreted], which forces the interpreter path structurally and independent of opcode coverage.
    • Suggestion: Adopt the strip_bytecode/1 + per-engine for engine <- [:compiled, :interpreted] pattern from pcall_state_preservation_test.exs so the lazy path is forced structurally, making coverage robust to codegen changes and removing the duplicate golden line numbers per variant.

Metamethod dispatch that re-enters Lua via call_function/3 — __index,
__newindex, arithmetic/bitwise/comparison/concat/__len, and the
generic_for iterator — ran from inside the interpreter opcode clauses
where the in-flight Lua frames are tracked lazily in the do_execute
`frames` argument and are NOT in state.call_stack. The leaf helpers
never threaded `frames` in, so call_function/3 recursed with frames=[]
and a metamethod reading state.call_stack (debug.traceback /
debug.getinfo, error tracebacks) saw a truncated stack with the
enclosing Lua frame dropped.

Thread `frames` through index_value, table_index, table_newindex,
invoke_metamethod and the try_*/compare_le helpers, and route the
call_function/3 leaf sites through a new call_metamethod/4 that performs
the same inherited-stack / rebuild_call_stack / restore dance already
used for the __call, native-dispatch, dispatcher hand-off, and
generic_for boundaries. The public table_index/3, table_newindex/4 and
table_length/2 arities (reached from the dispatcher and stdlib, where
state.call_stack is already materialized) pass :materialized so the
dance is a no-op.

Add regressions in lazy_call_info_test asserting debug.traceback /
debug.getinfo currentline observe the enclosing Lua frame inside
__index, __newindex and an arithmetic metamethod, plus the generic_for
iterator boundary, on both the compiled and forced-lazy paths, and that
call_stack restores to [] afterward.
@davydog187

Copy link
Copy Markdown
Contributor Author

🤖 Ultracode review — round 1

  • [NIT] test/lua/vm/lazy_call_info_test.exs:262-296 — No regression test for a native-raised error's traceback relying on the line-1197 stack materialization

    The native-dispatch boundary materializes the lazy executor frames into state.call_stack at executor.ex:1197 (rebuild_call_stack(frames) ++ inherited). On the error-from-native path, that materialized state is attached to the raised exception via annotate_frame_state/2 (executor.ex:1213, fills :state only when nil). This is the sole source of traceback fidelity for that path: a forced-lazy chain outer -> inner -> ("x"):rep(true) raises with the exception's OWN :call_stack empty (length 0) while the attached :state.call_stack carries the 2 outer frames. This boundary is distinct from both the success-path getinfo/traceback reads (which read the live stack before any raise) and the executor's own raise sites at executor.ex:1224/1236/1248 (which use the error's own :call_stack and ARE covered by the "attempt to call a nil value" tests). A regression in the line-1197 materialization would silently truncate stdlib-error tracebacks with no failing test. The code is correct today; this is purely a coverage gap on a perf PR whose stated risk is exactly "a consumer that reads the stack observes a truncated stack".

    Suggestion: Add a small regression: a forced-lazy nested chain (outer -> inner) where inner triggers a native stdlib arg error (e.g. ("x"):rep(true)), and assert the rendered traceback / attached state.call_stack contains the outer frame. This pins the native-boundary materialization on the error path, which the existing success-path getinfo tests do not exercise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant