perf(vm): compile bitwise ops and set_list multi-return tails#347
Conversation
The bytecode encoder bailed the entire prototype to the interpreter on
the first bitwise op or `{:multi, _}` set_list tail, so a single `n & 1`
or a `{x, f()}` constructor deoptimized a whole function.
Add dispatcher + encoder coverage for:
- band/bor/bxor (inline two-integer fast path, bridge to a new
Executor.dispatcher_bitwise/7 helper for coercion/metamethods)
- shl/shr/bnot (always bridge; shift masking lives in lua_shift_*)
- set_list with a {:multi, init_count} tail (folds init_count with
state.multi_return_count, mirroring executor.ex)
Once those prototypes compile, two latent dispatcher bugs surfaced and
are fixed here so the compiled path matches the interpreter:
- init_regs/init_callee_regs now carry the same +16 slack buffer the
interpreter uses; codegen undercounts the scratch registers a
multi-return constructor tail writes through.
- return_one grows the register tuple before writing a single value
into a multi-return expansion dest, mirroring ensure_regs_capacity.
goto/label stay on the interpreter (label-resolution semantics not yet
ported); they remain the encoder's fallback trigger in tests.
Plan: E
🤖 Ultracode review — round 1
|
…slow-path edges Add register_positions/1 cases for opcodes 53-59 (bitwise + shift + bnot + set_list_multi) to the max_registers invariant test, plus corpus programs that compile a bitwise op and all three set_list_multi tails (call, vararg, bare-vararg init_count==0). This removes the dead-accessor condition for the six bitwise op_* functions, which are consumed only by this invariant test, and refreshes the moduledoc now that the dispatcher carries a +16 slack buffer. In dispatcher_test, assert set_list_multi reached bytecode via a prototype-tree scan (encodes_op?/2) rather than first_sub/1, which only saw the sibling pair() prototype, and add the vararg-fed producers. Pin the bitwise slow-path edges named in the change's RISK statement through the compiled path: numeric-string coercion, the fractional-float and non-numeric-string raises, and the __shl/__bnot metamethod bridges.
🤖 Ultracode review — round 2All findings below are minor — test-coverage gaps on the PR's stated risk surface (compiled-path dispatcher tests). No blockers.
|
…eturn set_list tail Cover the __bor, __bxor, and __shr metamethod bridges on the compiled path so each clause's hardcoded metamethod name is pinned, add a negative-operand logical (unsigned) >> test, and a set_list multi-return tail whose call returns zero values.
🤖 Ultracode review — round 3
|
The compiled band/bor/bxor fast path stored the raw Bitwise result without the to_signed_int64 narrow the interpreter applies. A register can hold an integer outside [-2^63, 2^63-1] (a host int injected via Lua.set! / Value.encode/2, neither of which narrows), so the fast path diverged from the interpreter and leaked an out-of-range integer back into Lua state. Wrap the three results in Numeric.to_signed_int64 to match executor.ex, and add a regression test feeding an out-of-int64 operand through a compiled &/|/~.
🤖 Merge assessment — performance justificationVerdict: Merge. The "no perf improvement" concern is measuring the wrong A/B — the real win is a removed dispatch cliff, and it's large (~1.5x) on the path that changed. Measured impactI independently re-ran the comparison on a clean worktree of The whole-program The change's actual effect is compiled dispatcher path vs interpreter for the same bitwise code. Using the repo's bytecode-strip mechanism (
This confirms (and slightly exceeds) the PR body's corrected "1.35x / 1.16x" headline — my loop is bitwise-heavier per iteration, so a larger delta is expected. The deltas are an order of magnitude larger than the ±2-3% run deviation, so this is a real, repeatable speedup, not noise. Cost650/48 line diff, but most of it is mechanical and low-risk: 6 near-identical bitwise dispatch clauses, 6 executor bridge fns, 7 opcode constants, and ~300 lines of tests. Genuine risk surfaces:
Targeted tests all green on the branch tip: ReasoningA perf PR earns merge when it removes a dispatch cliff at acceptable risk — this does exactly that. Pre-PR, any function touching a bitwise op or a multi-return constructor tail ran fully interpreted; post-PR it compiles and runs ~1.5x faster with less memory. The "not much improvement" read comes only from the end-to-end Method caveats: single-machine (M4) Benchee run; the strip-bytecode A/B isolates the dispatcher vs interpreter on identical VM state, which is the right unit for this change but is a micro-benchmark — real-world gain depends on how hot the bitwise/multi-return code is in a given workload. |
🤖 Ultracode review — round 1
|
🤖 Ultracode review — round 2
|
🤖 Ultracode review — round 3Two minor test-coverage gaps, both on stated-risk axes. No blockers.
|
What changed and why
Closes dispatcher coverage gaps by compiling the bitwise opcodes (band/bor/bxor/shl/shr) and the
set_listmulti-return tail directly in the VM dispatcher, so these paths no longer fall back to the interpreter. Motivated by an external performance review flagging these as the remaining uncompiled dispatcher cases.Benchmark delta
Scenario: tight bitwise loop (200k-iteration numeric
fordoing band/bor/bxor/shl/shr per iter) and a forward-goto loop (200k-iteration numericforwith a forward goto/label dispatched every iteration). Note: backward-goto loops are NOT supported by this VM's executor —find_labelonly scans forward in the remaining instruction stream — so a goto-driven loop is not expressible;goto.luais skipped in the suite.origin/main@016fa35, clean rerun): bitwise 97.94 ms avg / 10.21 ips (±3.82%); goto 18.77 ms avg / 53.26 ips (±2.17%). Memory: bitwise 1022.9 MB, goto 195.0 MB.perf/dispatcher-coverage@968296c, clean rerun): bitwise 96.77 ms avg / 10.33 ips (±3.65%); goto 19.48 ms avg / 51.33 ips (±2.06%). Memory: bitwise 1022.9 MB, goto 195.0 MB (identical to baseline).Headline delta: bitwise -1.2% avg ms (ips +1.2%); goto +3.8% avg ms (ips -3.6%).
Verdict: INCONCLUSIVE / noise-dominated. Both deltas are smaller than the per-run deviation (±2-4%) and smaller than the baseline's own run-to-run swing (bitwise 102.45 -> 97.94 ms across its two runs). No measurable speedup or regression on these two micro-scenarios at n=200k; memory usage is identical on both refs. The -1.2% should not be read as a real gain. A true goto-driven loop could not be benchmarked because the executor does not support backward gotos.
Measured via
benchmarks/dispatcher_coverage.exs(left untracked in the worktree, not committed): loads two pre-compiled chunks viaLua.load_chunk!/2and timesLua.eval!/2on each with Benchee. RanMIX_ENV=benchmark LUA_BENCH_MODE=full mix run benchmarks/dispatcher_coverage.exs(2s warmup, 10s time, 1s memory_time), two runs each on016fa35and968296c.Review verdict
External review state: ready, tests pass. Two nits, both verified non-defects:
@op_*constants 53-59 by hand rather than referencingBytecode's accessor functions; parity enforced by comment + tests. Verified tags resolve to 53..59 in both files and consistent with the established mirroring pattern — latent duplication-drift risk only.to_signed_int64that the interpreter always applies. Correct because bitwise combination of two in-range int64s stays in range (no-op narrow); relies on the VM-wide invariant that registers only hold narrowed integers — same invariant the interpreter's other fast paths assume. Verified for int64-boundary AND and0xFFFF...FFFFmasking.Merge gating
Draft — human-gated merge. Do not merge automatically.
Stabilization + independent verification
Independently re-verified on a clean worktree of
perf/dispatcher-coverage@968296cagainstorigin/main@016fa35.Corrected headline metric (replaces the inconclusive top-level A/B)
The PR-body baseline-vs-branch comparison measures noise:
origin/maindeopts any bitwise-using prototype to the interpreter, so both refs ran the interpreter and the delta was jitter. The change's real effect is the compiled dispatcher path vs the interpreter for the same bitwise code, measured with the bytecode-strip harness (benchmarks/dispatcher_vs_interpreter.exsmechanism: recursivelyproto.bytecode = nil, assert:compiled_closurevs:lua_closure, then time).(
MIX_ENV=benchmark LUA_BENCH_MODE=full, 2s warmup / 10s time / 1s memory_time; 200k-iteration band/bor/bxor/shl/shr loop, Apple M4.)Honest caveat retained: a whole-program A/B at
eval!level shows no delta because the rest of the program already compiled on both refs.Scope clarification
:fallback; a bitwise/goto function on the branch resolves to:compiled_closure/:lua_closurerespectively). There is no dispatcher goto path; the PR covers bitwise +set_listmulti-return tail only.@reg_slack 16buffer +grow_regs/pad_nilsare required for theset_list_multitail ({x, f()}), which writes registers past codegen's reported peak.Correctness checks
set_listmulti shapes ({f()},{1,2,f()},{f(),99}) produce correct expansions/truncations.to_signed_int64skip is sound).<eval>:0vs interpreter's real line; line digit is the only observable divergence, consistent with arithmeticdispatcher_binop.Test status
Full suite green: 2262 passed, 19 skipped, 0 failures. lua53 suite (incl.
bitwise.lua,goto.lua),bitwise_test,dispatcher_test,bytecode_testall pass. No regressions.