perf(vm): rebuild call_stack lazily on the Lua->Lua call path by davydog187 · Pull Request #352 · tv-labs/lua

davydog187 · 2026-06-10T21:32:25Z

What changed and why

Rebuilds state.call_stack lazily at native-call boundaries instead of materializing call_info + tuple frames on every Lua->Lua call. The hot recursive Lua path drops two list ops and a 4-key map allocation per call; the cost moves to the (comparatively rare) native boundary via rebuild_call_stack/1. Adds call_iterator/6 to cover the generic_for materialization gap.

Benchmark — interpreter path (the real signal)

The original fib(30) measurement below routed through the dispatcher, which does not exercise this branch's changed code path, so it showed nothing. Re-measured on the tree-walking interpreter (fib stripped to :lua_closure via the benchmarks/dispatcher_vs_interpreter.exs bytecode-strip mechanism), the allocation reduction is clear and deterministic:

ref	fib(28) ips	fib(28) mem/iter	fib(30) mem/iter
baseline `016fa35`	2.79 / 2.68	2.84 GB	7.43 GB
this branch	2.83 / 2.81	2.78 GB (-2.1%)	7.29 GB (-1.9%)

Memory per iteration: -2% on the hot Lua->Lua path, deterministic (0.00% deviation across runs) — exactly this branch's intent (no per-call call_info map). This is the noise-immune proof.
Wall-clock: this branch was faster in both repeat rounds (~1.5-5%), tracking the allocation win rather than thermal luck. Modest but consistently in-favor.

Config: Benchee time:10 warmup:2 memory_time:3, sequential runs, 2 repeats per ref, fib forced through Lua.VM.Executor, correctness asserted (fib(28)==317811, fib(30)==832040). Bench script not committed.

Original (dispatcher-path) measurement — superseded, retained for the record

Scenario: fib(30) recursive via pre-compiled chunk. delta_pct +0.4%, INCONCLUSIVE — this routed through the dispatcher and never touched the changed call path. Superseded by the interpreter-path numbers above.

Review

Motivated by an external code review of the perf branch. Verdict: ready (tests pass after; no fixes required). Two nits, both non-blocking:

nit — The __call metamethod dispatch path (executor.ex ~line 1237) invokes call_function(call_mm, ...) without materializing the lazily-tracked Lua frames into state.call_stack, so a __call callback that reads debug.traceback/getinfo or raises sees a truncated stack (missing in-flight executor frames). Differential testing against origin/main confirmed this is PRE-EXISTING behavior, byte-identical on both branches, NOT a regression. The author added call_iterator/6 to cover the analogous generic_for gap, but the __call path has the same gap on main too; out of scope for this perf change. Worth a follow-up but not a blocker.
nit — rebuild_call_stack/1 is O(depth) and is invoked at every native-call boundary (and prepended via ++ at three error raise sites). For native-call-heavy workloads at deep stacks this is more per-native-call work than the old design (which only read the already-materialized list). The net win still holds because the hot Lua->Lua path drops two list ops + a 4-key map alloc per call, and native calls are comparatively rare; but the cost moved from the common path to the native boundary rather than disappearing. Recommend running the run-bench harness on a call-heavy + debug-heavy workload to confirm no regression at the native boundary.

Merge gating

Human-gated merge. Marked ready for review; do NOT merge automatically — merge on maintainer go-ahead after reviewing the diff and the interpreter-path numbers above.

Stabilization + independent verification

Final verified numbers (interpreter path, bytecode-stripped, `:lua_closure` asserted; `MIX_ENV=benchmark LUA_BENCH_MODE=full`, Apple M4)

Allocation win (holds, independently reproduced vs main @016fa35):

workload	main alloc	branch alloc	delta	main time	branch time
fib(28)	2.84 GB	2.78 GB	-2.1%	380 ms	359 ms
fib(30)	7.43 GB	7.29 GB	-1.9%	997 ms	979 ms

Memory deviation ±0.00% on both (deterministic).

Note: the perf claim was re-measured against a freshly built 016fa35. An earlier "does-not-reproduce" reading was a baseline mismeasurement (the reported main figures were actually the branch's). A clean main build lands at 2.84 / 7.43 GB.

Native-boundary trade-off (documented, accepted): debug.traceback x50 at depth pays for the lazy approach — rebuild_call_stack/1 is O(depth) on demand.

workload	main	branch
traceback x50 depth40	148 µs / 595 KB	203 µs / 730 KB
traceback x50 depth60	230 µs / 826 KB	302 µs / 1030 KB

This native-boundary cost funds the hot-path allocation win on the Lua→Lua call path.

`__call` boundary fix (commit `cbf7068`)

Materializes the lazily-tracked executor frames into state.call_stack at the __call metamethod dispatch site, mirroring the native-dispatch / dispatcher / generic_for boundaries, and adds __call traceback/getinfo/balance coverage to lazy_call_info_test.exs.

Independent verification note: differential probing of the pre-fix commit found the __call path already byte-identical to main (debug.traceback line 7, debug.getinfo(2,'nl').currentline 7, including a 4-deep __call chain), so this change is behavior-neutral / defensive rather than correcting an observed divergence. It is harmless and keeps the lazy bookkeeping balanced across all stack-reading boundaries.

Suite

Full unit: 2259 passed, 19 skipped, 1 excluded. lua53 (--only lua53, errors/db guardrails): 17 passed, 12 skipped. lazy_call_info_test.exs: 12 passed. mix format clean.

The executor's hot :lua_closure call path eagerly allocated a 4-key call_info map per call and pushed it onto state.call_stack, even though that map is only read when the stack is actually inspected (debug.getinfo, debug.traceback, and error tracebacks). Drop the eager push: each frame now records the call-site line and name_hint (the only bits not derivable from the caller's proto), and rebuild_call_stack/1 synthesizes the identical 4-key entries on demand at the boundaries that read the stack — native-call dispatch, the dispatcher hand-off, generic_for iterator calls, and the :call error raise sites. The O(1) call_depth counter remains the sole per-call bookkeeping cost. Traceback and debug.* output are byte-identical to the eager implementation (verified against main); the lua53 errors.lua/db.lua suites are the guardrail. Plan: C

The lazy-call_info refactor covered every boundary that reads state.call_stack -- native dispatch, dispatcher hand-off, generic_for iterator calls, and the :call error raise sites -- except the __call metamethod branch of the :call opcode. That site dispatches through call_function/3, which can recurse into a Lua closure or run a native callback (debug.traceback / debug.getinfo) that reads the live stack, but the executor's in-flight Lua frames were never materialized. The result was a truncated traceback and a currentline of -1 inside a __call callback on the interpreter path, diverging from the eager implementation. Mirror the sibling boundaries: rebuild the frames into state.call_stack for the duration of the call and restore the inherited stack on return so the lazy bookkeeping stays balanced. Verified byte-identical to the pre-refactor executor (currentline 7, traceback line 7 for the regression probe). The lazy_call_info suite gains __call traceback/getinfo/balance coverage. Plan: C

davydog187 · 2026-06-11T10:55:07Z

🤖 Ultracode review — round 1

[MAJOR] lib/lua/vm/executor.ex:1136 (also :1077) — Stack-overflow traceback is empty on the lazy :lua_closure call path

Confirmed by direct reproduction against the PR head (cbf7068). State.check_call_depth!/1 raises the "stack overflow" error with call_stack: state.call_stack (state.ex:82). The PR stops pushing per-call frames on the :lua_closure path (the frames now live only in the lazy frames argument), so when check_call_depth! fires at executor.ex:1136 state.call_stack is empty. A 50-deep recursion forced down the pure-interpreter :lua_closure path (bytecode stripped — exactly the bytecode-fallback case from compiler/bytecode.ex:370) rendered call_stack length: 0 and emitted no "Stack trace:" section, raised from executor.ex:1136. The identical recursion through the normal compiled/dispatcher path rendered the full 50-frame traceback. The :compiled_closure branch shares the defect: check_call_depth! at line 1077 runs before the frames are materialized into state.call_stack at lines 1085-1091. The error still carries the correct message and value, so this is a traceback-fidelity regression, not a wrong result — but it is reachable by real Lua whenever a recursive function body contains any bytecode-unsupported instruction. The PR's own overflow test only matches /stack overflow/ and never inspects the traceback, so it does not catch this.

Suggestion: Materialize the lazy frames into the stack used for the overflow report at both depth-check sites. E.g. at line 1136 build overflow_stack = rebuild_call_stack(frames) ++ state.call_stack and have the depth check raise with that (pass it to a check_call_depth! arity that accepts an explicit stack); at line 1077 move/replicate the materialization before the check so the compiled-closure overflow also includes the rebuilt frames. Add a regression test on the pure-interpreter path that asserts the traceback contents (frame count / an in function 'rec' line), not just the message regex.

The executor's hot :lua_closure path tracks in-flight Lua frames lazily in its frames argument, not state.call_stack. When check_call_depth! fired on overflow it reported state.call_stack, which on that path is empty — the rendered stack-overflow error carried no traceback at all, unlike the eager dispatcher/compiled path which renders the full stack. Materialize the lazy frames into the reported stack at both depth-check sites: add check_call_depth!/2 taking an explicit overflow stack, and call_depth_ok?/1 so the :lua_closure hot path only rebuilds the stack on the rare failing branch (no success-path cost). The :compiled_closure path now checks against its already-materialized stack. Add a regression test on the pure-interpreter path asserting traceback contents (frame count and 'in function rec' lines), not just the message regex.

davydog187 · 2026-06-11T11:10:27Z

🤖 Ultracode review — round 2

[MAJOR] test/lua/vm/lazy_call_info_test.exs:1-216 — lazy_call_info_test.exs exercises the :compiled_closure dispatcher branch, not the lazy :lua_closure path it documents

The moduledoc claims these tests guard the lazily-rebuilt call_stack on the executor's Lua→Lua path where state.call_stack is left untouched — that is the :lua_closure clause at executor.ex:1101, the branch this PR optimizes (frame now carries :line/:name_hint). But every program here uses plain function f()...end/setmetatable closures, and those sub-prototypes are emitted with bytecode ([true, true]) per Compiler.compile. Per executor.ex:1029-1033 a non-nil-bytecode closure is tagged :compiled_closure, so nested calls route through the :compiled_closure hand-off at executor.ex:1068 — a different materialization site. Proof: the suite asserts debug.getinfo currentline == 0 (line 48, compiled path strips line info), whereas forcing the genuine lazy path with a bitwise op (3 & 1, which can't be bytecode-encoded) yields currentline == 3. So getinfo/traceback/__call fidelity and the call_stack-balance assertions on the real lazy :lua_closure branch are unguarded. The path is correct today (forced-lazy getinfo returns ["outer","global",3,"test.lua","Lua"] and call_stack balances to []), so this is a test-coverage gap rather than a live bug. recursion_depth_test.exs already uses the bitwise trick to cover the lazy overflow traceback, so that sub-case is genuinely guarded; the gap is the getinfo/traceback/__call/native-callback assertions. A future change to rebuild_call_stack/1 or the frame shape could break the lazy path with every test in this file still green.

Suggestion: Add the bitwise-op trick (as in recursion_depth_test.exs's @lazy_recursion) to each scenario so outer/inner/the __call callback/the native-callback deep/outer (lines 169-183) fall onto the :lua_closure path, OR add parallel lazy-variant tests. At minimum cover on the forced-lazy path: debug.getinfo(2,"nSl") fields for a nested chain, debug.traceback line content, the __call callback currentline/traceback, the native-dispatch save/restore (with state.call_stack == [] post-condition), and call_stack balance. Update the moduledoc to match what is actually exercised.
[MINOR] lib/lua/vm/executor.ex:1085-1089 — :compiled_closure hand-off materializes the full call_stack on every success, contradicting the rebuild_call_stack/1 "never on the hot success path" doc

Verified against PR head 9a50814. The :compiled_closure branch builds materialized_call_stack = [call_info | rebuild_call_stack(frames) ++ inherited_call_stack] unconditionally on every interpreter→dispatcher hand-off (lines 1085-1089), then passes it to check_call_depth!/2. The parallel :lua_closure branch (lines 1144-1151) instead gates rebuild_call_stack(frames) behind State.call_depth_ok?/1, so the O(depth) Enum.map walk only runs on overflow. Meanwhile rebuild_call_stack/1's own doc comment (executor.ex:555-557) states it is "Called only at the boundaries that actually read the stack ... never on the hot success path," and the frame-building comment claims lazy frames are "only materialized into a traceback when the limit is actually hit." Both statements are inaccurate for the :compiled_closure clause, which pays the walk eagerly on every hand-off success. Not a correctness bug — the stack is restored to inherited_call_stack on return (line 1098) and the suite is green — but it is a real per-hand-off cost the doc says is avoided.

Suggestion: Either gate the materialization behind State.call_depth_ok?/1 the way the :lua_closure clause does (only building materialized_call_stack when actually needed), OR — if the dispatcher genuinely needs a live stack for its whole duration because it can run native callbacks mid-execution — tighten the rebuild_call_stack/1 doc and the frame-building comment to acknowledge that the interpreter→dispatcher boundary materializes eagerly, so the "never on the hot success path" wording is not contradicted.

The existing lazy_call_info tests used plain function/setmetatable closures, whose sub-prototypes compile WITH bytecode and so route through the :compiled_closure dispatcher hand-off, not the lazy :lua_closure interpreter path the module name refers to. Add forced-lazy variants (sprinkling a non-bytecode-encodable `3 & 1` op into each function) for the getinfo, traceback, __call, native-callback balance, and error-raise scenarios, and update the moduledoc to name both materialization sites. Also tighten the rebuild_call_stack/1 doc: the interpreter -> dispatcher hand-off materializes the stack eagerly on every successful call (the dispatcher pushes its own entries and may run native callbacks that read it live), so the "never on the hot success path" wording was inaccurate for that one boundary.

davydog187 · 2026-06-11T11:25:01Z

🤖 Ultracode review — round 3

Blockers

[BLOCKER] lib/lua/vm/executor.ex:2429 (index_value __index), :2520 (table_index __index), :2577 (table_newindex __newindex), :2707/2711/2715 (invoke_metamethod) — Metamethod dispatch (__index/__newindex/arithmetic/concat/__len/comparison) drops enclosing Lua frames from state.call_stack on the lazy path
- Empirically confirmed regression, not speculation. The PR materializes lazy frames into state.call_stack at the __call boundary (line 1247), native dispatch (1176), the dispatcher hand-off (1090), and call_iterator/6 — but the other metamethod dispatch helpers that re-enter Lua via call_function/3 were missed. index_value/6 (2429), table_index (2520), table_newindex (2577), and invoke_metamethod/4 (2707/2711/2715) are all invoked from inside do_execute opcode clauses (e.g. index_value from 1859/1885/1892/1959/1966/2047) where frames is in scope and holds the live executor stack, but none thread frames in or materialize it. call_function/3 then recurses with frames=[] (executor.ex:157), so a metamethod that reads state.call_stack sees a truncated stack. Reproduced on the PR head (9a50814): a __index metamethod calling debug.traceback('idx') from a forced-lazy Lua function returns "idx\nstack traceback:" (enclosing driver frame GONE) and debug.getinfo(2,'nl').currentline returns -1. On merge-base 016fa35 the same program returns "idx\nstack traceback:\n\ttest.lua:9: in ?" and getinfo(2) returns 9. This breaks Lua 5.3 debug semantics and corrupts error tracebacks raised from inside metamethods on the lazy interpreter path.
- Suggestion: Thread frames into index_value/6, table_index, table_newindex, and invoke_metamethod/4 and wrap the call_function/3 calls at 2429/2520/2577/2707-2715 with the same inherited-stack/rebuild_call_stack(frames)/restore dance already used for __call at 1247. Add regression tests asserting debug.traceback / debug.getinfo / error tracebacks inside __index, __newindex, and an arithmetic metamethod still observe the enclosing Lua frame, on both the compiled and forced-lazy paths.

Minor

[MINOR] test/lua/vm/lazy_call_info_test.exs:1-377 — No regression test for the generic_for iterator materialization boundary (call_iterator/6)
- The diff introduces call_iterator/6 (executor.ex:582-588) specifically to materialize lazy frames into state.call_stack for a generic_for iterator call, and rewires both for-loop call sites to it (executor.ex ~689 and ~974). This is one of the five read boundaries the PR's own moduledoc enumerates, yet no test in the new file (or recursion_depth_test.exs) drives a for ... in loop combined with debug.* — confirmed by grep returning nothing. The path is currently correct, so this is a missing regression test rather than a live bug; but finding Pull in code #1 demonstrates that exactly these untested materialization boundaries are where the lazy refactor silently truncates tracebacks.
- Suggestion: Add a describe block driving for _, v in iter, nil, 0 do where iter calls debug.traceback / debug.getinfo(2,'nl').currentline, assert the enclosing frame's call-site line appears, cover both compiled and forced-lazy variants, and assert call_stack restores to [] afterward to pin the inherited-stack restore in call_iterator/6.

Nits

[NIT] test/lua/vm/lazy_call_info_test.exs:68-78 (and every forced-lazy test) — Lazy-path tests force the interpreter via the implicit 3 & 1 codegen quirk instead of the existing strip_bytecode/1 helper
- Every forced-lazy test sprinkles local _ = 3 & 1 to force the :lua_closure path, relying on bitwise ops being non-bytecode-encodable. If a future codegen change makes & encodable, these tests silently revert to the compiled path and stop exercising the lazy branch they are named for, with no failing signal. The repo already has an intent-revealing alternative in pcall_state_preservation_test.exs:44-48 (strip_bytecode/1 recursively nils proto.bytecode) run under for engine <- [:compiled, :interpreted], which forces the interpreter path structurally and independent of opcode coverage.
- Suggestion: Adopt the strip_bytecode/1 + per-engine for engine <- [:compiled, :interpreted] pattern from pcall_state_preservation_test.exs so the lazy path is forced structurally, making coverage robust to codegen changes and removing the duplicate golden line numbers per variant.

Metamethod dispatch that re-enters Lua via call_function/3 — __index, __newindex, arithmetic/bitwise/comparison/concat/__len, and the generic_for iterator — ran from inside the interpreter opcode clauses where the in-flight Lua frames are tracked lazily in the do_execute `frames` argument and are NOT in state.call_stack. The leaf helpers never threaded `frames` in, so call_function/3 recursed with frames=[] and a metamethod reading state.call_stack (debug.traceback / debug.getinfo, error tracebacks) saw a truncated stack with the enclosing Lua frame dropped. Thread `frames` through index_value, table_index, table_newindex, invoke_metamethod and the try_*/compare_le helpers, and route the call_function/3 leaf sites through a new call_metamethod/4 that performs the same inherited-stack / rebuild_call_stack / restore dance already used for the __call, native-dispatch, dispatcher hand-off, and generic_for boundaries. The public table_index/3, table_newindex/4 and table_length/2 arities (reached from the dispatcher and stdlib, where state.call_stack is already materialized) pass :materialized so the dance is a no-op. Add regressions in lazy_call_info_test asserting debug.traceback / debug.getinfo currentline observe the enclosing Lua frame inside __index, __newindex and an arithmetic metamethod, plus the generic_for iterator boundary, on both the compiled and forced-lazy paths, and that call_stack restores to [] afterward.

davydog187 · 2026-06-11T11:40:07Z

🤖 Ultracode review — round 1

[NIT] test/lua/vm/lazy_call_info_test.exs:262-296 — No regression test for a native-raised error's traceback relying on the line-1197 stack materialization

The native-dispatch boundary materializes the lazy executor frames into state.call_stack at executor.ex:1197 (rebuild_call_stack(frames) ++ inherited). On the error-from-native path, that materialized state is attached to the raised exception via annotate_frame_state/2 (executor.ex:1213, fills :state only when nil). This is the sole source of traceback fidelity for that path: a forced-lazy chain outer -> inner -> ("x"):rep(true) raises with the exception's OWN :call_stack empty (length 0) while the attached :state.call_stack carries the 2 outer frames. This boundary is distinct from both the success-path getinfo/traceback reads (which read the live stack before any raise) and the executor's own raise sites at executor.ex:1224/1236/1248 (which use the error's own :call_stack and ARE covered by the "attempt to call a nil value" tests). A regression in the line-1197 materialization would silently truncate stdlib-error tracebacks with no failing test. The code is correct today; this is purely a coverage gap on a perf PR whose stated risk is exactly "a consumer that reads the stack observes a truncated stack".

Suggestion: Add a small regression: a forced-lazy nested chain (outer -> inner) where inner triggers a native stdlib arg error (e.g. ("x"):rep(true)), and assert the rendered traceback / attached state.call_stack contains the outer frame. This pins the native-boundary materialization on the error path, which the existing success-path getinfo tests do not exercise.

davydog187 marked this pull request as ready for review June 10, 2026 22:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(vm): rebuild call_stack lazily on the Lua->Lua call path#352

perf(vm): rebuild call_stack lazily on the Lua->Lua call path#352
davydog187 wants to merge 5 commits into
mainfrom
perf/executor-lazy-callinfo

davydog187 commented Jun 10, 2026 •

edited

Loading

Uh oh!

davydog187 commented Jun 11, 2026

Uh oh!

davydog187 commented Jun 11, 2026

Uh oh!

davydog187 commented Jun 11, 2026

Uh oh!

davydog187 commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

davydog187 commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changed and why

Benchmark — interpreter path (the real signal)

Original (dispatcher-path) measurement — superseded, retained for the record

Review

Merge gating

Stabilization + independent verification

Final verified numbers (interpreter path, bytecode-stripped, :lua_closure asserted; MIX_ENV=benchmark LUA_BENCH_MODE=full, Apple M4)

__call boundary fix (commit cbf7068)

Suite

Uh oh!

davydog187 commented Jun 11, 2026

🤖 Ultracode review — round 1

Uh oh!

davydog187 commented Jun 11, 2026

🤖 Ultracode review — round 2

Uh oh!

davydog187 commented Jun 11, 2026

🤖 Ultracode review — round 3

Blockers

Minor

Nits

Uh oh!

davydog187 commented Jun 11, 2026

🤖 Ultracode review — round 1

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

davydog187 commented Jun 10, 2026 •

edited

Loading

Final verified numbers (interpreter path, bytecode-stripped, `:lua_closure` asserted; `MIX_ENV=benchmark LUA_BENCH_MODE=full`, Apple M4)

`__call` boundary fix (commit `cbf7068`)