feat(v2.1-rv64): switch to u16 limbs in deferral airs by shuklaayush · Pull Request #2808 · openvm-org/openvm

shuklaayush · 2026-05-22T13:31:29Z

PR 5 of the memory-bus-u16 split, stacked on #2794. Shrinks deferral trace columns whose constraints only operate on packed u16 values, and doubles the sponge absorption rate of the deferral output chip from one digest per row to two.

Why

#2794 still stores deferral commit / output-key columns byte-shaped while packing the memory-bus payloads. Shrinking those columns to u16 cells halves the affected trace width and lets the canonicity sub-AIR use per-u16-cell 16-bit range checks instead of byte-pair BitwiseOperationLookup checks.

Separately, the deferral output chip absorbs only DIGEST_SIZE = 8 bytes per Poseidon row even though the sponge state can hold 2 * DIGEST_SIZE = 16 bytes. Doubling the absorption rate halves the number of output rows for the same byte stream.

What changes

Trace columns to u16 cells

Files: extensions/deferral/circuit/src/{call,output,canonicity}/, extensions/deferral/circuit/src/utils.rs, extensions/deferral/circuit/cuda/.

Call core: input_commit, output_commit, output_len are u16 cells.
Output core: output_commit, output_len are u16 cells.
Canonicity sub-AIR operates over u16 cells; upper-bound checks become per-u16-cell 16-bit range checks (instead of byte-pair bitwise checks).
Deferral rd_val / rs_val register-pointer columns are u16-shaped.
Reuses #2794's split_byte_memory_ops / split_f_memory_ops and byte_memory_op_chunk / f_memory_op_chunk helpers; no new split_memory_ops-style helper is introduced.
CUDA tracegen mirrors the new column shapes.

Sponge absorption rate

Files: extensions/deferral/circuit/src/output/{air,trace,execution,tests}.rs, extensions/deferral/circuit/src/def_fn.rs, extensions/deferral/circuit/src/extension/mod.rs, extensions/deferral/circuit/cuda/src/output.cu.

SPONGE_BYTES_PER_ROW = 2 * DIGEST_SIZE = 16 (was DIGEST_SIZE = 8).
sponge_inputs cells hold u16 values on data rows, each carrying two output bytes; data-row sponge_inputs receive 16-bit range checks. The init row remains [deferral_idx, output_len, 0, ...] and is exempt.
Two memory-bus writes per output row instead of one — the row now consumes 16 bytes of output, not 8.
def_fn::hash_output_raw packs byte pairs into each sponge cell before absorbing.
Bitwise-lookup wiring removed from the output chip; the call chip still uses bitwise lookup for non-canonical commit checks.
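
As a rough illustration of the packing described above, here is a minimal sketch. It assumes little-endian byte pairs (low byte first), in line with the LE cell convention of #2794; `pack_sponge_row` and `data_rows` are hypothetical names for this example, not the helpers in def_fn.rs.

```rust
/// Bytes per digest, as stated in the PR description.
const DIGEST_SIZE: usize = 8;
/// Bytes absorbed per sponge row after this PR.
const SPONGE_BYTES_PER_ROW: usize = 2 * DIGEST_SIZE;

/// Hypothetical helper: packs one 16-byte row into DIGEST_SIZE u16 cells.
/// Assumption: each cell is a little-endian byte pair (low byte first).
fn pack_sponge_row(row: &[u8; SPONGE_BYTES_PER_ROW]) -> [u16; DIGEST_SIZE] {
    let mut cells = [0u16; DIGEST_SIZE];
    for (cell, pair) in cells.iter_mut().zip(row.chunks_exact(2)) {
        *cell = u16::from_le_bytes([pair[0], pair[1]]);
    }
    cells
}

/// Number of data rows needed for a raw output at a given absorption rate.
fn data_rows(output_len: usize, bytes_per_row: usize) -> usize {
    assert!(output_len % bytes_per_row == 0, "output must be row-aligned");
    output_len / bytes_per_row
}
```

With these definitions, a 64-byte output needs `data_rows(64, DIGEST_SIZE)` = 8 rows at the old rate and `data_rows(64, SPONGE_BYTES_PER_ROW)` = 4 rows at the new one.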

Output length contract

output_len remains a byte count.
Raw deferral outputs must be a multiple of 16 bytes; non-16-byte-aligned outputs are invalid with no implicit padding.
Host-side failure-path test asserts the panic when a non-16-aligned raw output is submitted.
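
The contract above can be sketched as a host-side check. This is only an illustration: `checked_output_rows` is a hypothetical name, and it returns an error where the PR's host path panics.

```rust
/// Bytes consumed per output row (2 * DIGEST_SIZE in the PR description).
const SPONGE_BYTES_PER_ROW: usize = 16;

/// Hypothetical check: returns the number of data rows for a raw output,
/// or an error when the length is not a multiple of 16 bytes.
/// No padding is applied, matching the "no implicit padding" rule.
fn checked_output_rows(raw_output: &[u8]) -> Result<usize, String> {
    if raw_output.len() % SPONGE_BYTES_PER_ROW != 0 {
        return Err(format!(
            "raw output length {} is not a multiple of {}",
            raw_output.len(),
            SPONGE_BYTES_PER_ROW
        ));
    }
    Ok(raw_output.len() / SPONGE_BYTES_PER_ROW)
}
```

output_len itself stays a byte count; only the row count is derived from it.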

Migration notes

Hosts producing deferral output must align raw outputs to a multiple of 16 bytes.
Output-chip configs that previously passed bitwise_lu should drop the argument.

resolves int-7834

## Summary

- Removed `openvm/` prefix from source paths across 8 documentation files (21 occurrences)
- Updated stale `v2-proof-system` repository reference in `docs/crates/recursion/README.md` to reflect the current openvm repo
- Updated `stark-backend` path in `docs/crates/recursion/verifier-mapping.md` to link to the [stark-backend GitHub repo](https://github.com/openvm-org/stark-backend)

Fixes issues identified in #2553 (comment)

## Test plan

- [x] Verified no remaining `openvm/crates/recursion/src/` prefixes exist
- [x] Verified no remaining `v2-proof-system` references exist
- [x] Verified no remaining `stark-backend/crates/` local path references exist

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Jonathan Wang <jonathanpwang@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

This resolves INT-6814

This resolves INT-6828. Now `Eq3bAir` propagates the `row_idx` by the correct amount for each AIR. Note that there is no need for this air to know `air_idx` or anything other than the number of interactions and the `n_lift` for this air, so no new columns are added (but we now do the interaction with proof shape air on the last row of the air instead of the first one).

…2563)

## Summary

- Add a `changes` detection job using `dorny/paths-filter@v3` to skip the `lint-cuda` job when no CUDA-related files (`*.cu`, `*.cuh`, `**/cuda/**`, `**/cuda*.rs`, etc.) are modified
- Request at least 8 CPUs on the GPU runner (`/cpu=8`) for faster builds
- The workflow file and CUDA cache action are also included as triggers so changes to CI itself still run the CUDA lint

## Test plan

- [ ] Open a PR that doesn't touch any CUDA files and verify `lint-cuda` is skipped
- [ ] Open a PR that touches a `.cu` or `.cuh` file and verify `lint-cuda` runs

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

~~- we need `cargo-openvm` from `develop-v2.0.0-rc.1` to run the reth benchmark for this branch since it adds new opcodes~~ resolves int-6949

## Summary

- Adds `scripts/pre-push.sh`, a lightweight bash script that runs formatting, clippy, and tests locally on only the crates changed compared to a target branch (default: `develop-v2.0.0-beta`)
- Auto-detects NVIDIA GPU via `nvidia-smi` and enables `--features cuda,touchemall` when present; CPU-only otherwise
- Uses `cargo nextest` with `--cargo-profile=fast` when available (falls back to `cargo test`), and applies the `heavy` nextest profile for integration test crates

## Details

The script follows three steps:

1. **`cargo +nightly fmt --all -- --check`**: full workspace format check (fast, nightly required per repo config)
2. **`cargo clippy`**: per changed crate, with appropriate feature flags and `-D warnings`
3. **`cargo nextest run`**: per changed crate, with `--cargo-profile=fast` and feature detection

Crate detection walks changed file paths upward to find the nearest `Cargo.toml` with a `[package]` section. Benchmark and guest program crates are skipped (they require nightly `rust-src`).

## Usage

```bash
./scripts/pre-push.sh                    # compare against develop-v2.0.0-beta
./scripts/pre-push.sh main               # compare against main
./scripts/pre-push.sh origin/my-feature  # any valid git ref
```

## Test plan

- [ ] Run on a branch with Rust changes and verify only affected crates are checked
- [ ] Run on a branch with no Rust changes (e.g., docs-only) and verify early exit
- [ ] Run on a machine with GPU and verify `cuda,touchemall` features appear in output
- [ ] Run on a machine without GPU and verify only `parallel` feature is used
- [ ] Verify formatting failures are caught and reported

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Jonathan Wang <jonathanpwang@users.noreply.github.com>

github-actions · 2026-05-26T18:49:32Z

Claude finished @shuklaayush's task in 2m 21s

Code Review: PR #2808 — Switch to u16 limbs in deferral AIRs

Gather context and read the diff
Review deferral call chip changes (air, trace, tests)
Review deferral output chip changes (air, trace, execution, tests)
Review canonicity sub-AIR changes
Review utility functions and helpers
Review CUDA tracegen changes
Review continuations and verify-stark changes
Review guest program and test changes
Post final review

This is a well-structured PR that consistently migrates deferral columns from byte cells to u16 cells while doubling the output sponge absorption rate. The changes are mechanically clean and internally consistent across Rust AIR, tracegen, CUDA, and the verify-stark subcircuit. I have a few observations:

Findings

1. combine_output_cells accepts OUTPUT_LEN_NUM_U16S but caller passes BLOCK_FE_WIDTH

In call/air.rs:467 and output/air.rs:305-306, the output length is expanded via expand_to_rv64_block into a [AB::Expr; BLOCK_FE_WIDTH], then passed to combine_output_cells, which expects [T; OUTPUT_LEN_NUM_U16S]. The signature is:

pub(crate) fn combine_output_cells<T>(
    output_commit: impl IntoIterator<Item = T>,
    output_len_cells: [T; OUTPUT_LEN_NUM_U16S],  // 4 elements
) -> [T; OUTPUT_TOTAL_NUM_U16S]

This compiles only because the two constants are equal: OUTPUT_LEN_NUM_U16S = OUTPUT_LEN_NUM_BYTES / U16_CELL_SIZE = 8 / 2 = 4, and BLOCK_FE_WIDTH = 4. Array lengths are checked at compile time, so the coupling is safe: if either constant changed independently, the call would stop compiling. It is still worth noting that the call sites rely on BLOCK_FE_WIDTH == OUTPUT_LEN_NUM_U16S. (No action needed, just an observation.)

2. Sponge init-row range checks are correctly exempted (Positive)

output/air.rs:332-337 — The data-row sponge cells get per-cell u16 range checks gated on local.is_valid - local.is_first, correctly exempting the init row (whose sponge_inputs are constrained via assert_array_eq against already-canonical deferral_idx and output_len columns). This is clean.

3. def_fn.rs:91: output_ref.len() is cast to F via from_usize, which for BabyBear can silently reduce values at or above the modulus (Low risk)

def_fn.rs:91:

state[1] = F::from_usize(output_ref.len());

If the output length ever reached the BabyBear modulus P = 2^31 - 2^27 + 1 (extremely unlikely given the 16-byte alignment requirement and the address_bits bound), the value would silently wrap modulo P. The address_bits range check in the AIR prevents this from happening in practice, so this is not a real bug, but using F::from_u32(u32::try_from(output_ref.len()).unwrap()) would make the intent clearer and catch lengths that do not fit in a u32 earlier.
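
To make the wrap-around concrete, here is a small sketch in plain integer arithmetic. `reduce_mod_p` is only a stand-in for a reducing conversion such as F::from_usize, and `checked_len` is a hypothetical stricter variant; neither is code from the PR.

```rust
/// BabyBear modulus: 2^31 - 2^27 + 1 = 2013265921.
const BABYBEAR_P: u64 = (1 << 31) - (1 << 27) + 1;

/// Stand-in for a reducing conversion: returns the canonical
/// representative of `len` modulo P.
fn reduce_mod_p(len: usize) -> u64 {
    (len as u64) % BABYBEAR_P
}

/// Hypothetical stricter conversion: accepts only lengths that fit in a u32
/// and are already below P, so the field element round-trips to `len`.
fn checked_len(len: usize) -> Option<u32> {
    let v = u32::try_from(len).ok()?;
    if (v as u64) < BABYBEAR_P {
        Some(v)
    } else {
        None
    }
}
```

A length of P + 5 would reduce to 5 under `reduce_mod_p`, while `checked_len` rejects it.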

4. Missing range check on input_commit cells in the call chip (Question)

In call/air.rs, the canonicity sub-AIR is applied to input_commit and the returned value is range-checked. But unlike output_commit and output_len (which get per-cell u16 range checks at lines 171-179), input_commit does not appear to get per-cell u16 range checks — only the canonicity remainder value is range-checked. For output_commit, both the canonicity RC and the per-cell RC are applied.

This is likely intentional: input_commit is read from memory (the memory bus verifies the cells are well-formed), while output_commit is written (and output_len is written), so the per-cell RC on writes ensures the values being committed to memory are canonical u16s. The canonicity sub-AIR on input_commit only needs its returned value range-checked since each limb is already guaranteed canonical by the read. Worth a quick confirmation this reasoning is correct.

5. Verify-stark range_inputs emptied, replaced by dedicated RangeCheckerAir (Structural change, looks correct)

trace.rs:145 now returns range_inputs: vec![] and instead passes the range_ctx from the output trace generator as a dedicated RangeCheckerAir<U16_BITS> proving context. This is a clean separation — the range checker now has its own trace in the subcircuit rather than being aggregated. The lib.rs correctly wires up the new RangeCheckerAir<U16_BITS> in both DeferredVerifyCircuit and the circuit AIR list.

6. RootCircuit refactored from derive_new to manual new with validation (Good defensive change)

The RootCircuit::new constructor now validates num_user_pvs via assert_user_pvs_shape, and the same assertion is used in commit/air.rs, commit/trace.rs, memory/air.rs, and memory/trace.rs. This replaces scattered debug_assert! calls with a single function and upgrades some to assert!. Good hygiene.

7. CUDA and Rust column layout consistency

The CUDA DeferralOutputCols<T> and DeferralCallCoreCols<T> / DeferralCallAdapterCols<T> structs mirror the Rust #[repr(C)] AlignedBorrow structs. The column shapes (RV64_PTR_U16S, COMMIT_NUM_U16S, F_NUM_U16S, SPONGE_ROW_MEMORY_OPS) are consistently defined in def_types.h to match utils.rs. The sizeof assertions (assert(width == sizeof(...))) guard against drift. Looks consistent.

8. Guest test expected outputs updated for 16-byte alignment (Correct)

All guest program expected outputs (single.rs, multiple.rs in both deferral/tests/ and continuations/) have been extended from 8-byte to 16-byte arrays. The make_result function in tests asserts the new alignment (SPONGE_BYTES_PER_ROW). The new test output_raw_len_must_be_sponge_row_aligned verifies the panic path for non-aligned outputs.

9. Minor: combine_output_cells function naming

combine_output_cells creates the concatenation [output_commit || output_len] from u16 cells. The name is clear but could be confused with combine_output (the byte version). Both coexist in utils.rs. This is fine but a single generic version parameterized by cell count could reduce the duplication in the future.

Summary

The PR is well-executed with consistent changes across the Rust AIR, tracegen, CUDA tracegen, and the verify-stark subcircuit. The key architectural change (u16 cells + doubled sponge rate) is cleanly factored through shared constants in utils.rs / def_types.h. The defensive validations (RootCircuit::new, alignment assertion test) are good additions. I didn't find any correctness bugs.

Verdict: Looks good to merge. The observations above are minor and don't block.

stephenh-axiom-xyz

Seems to be a lot of places that still use u8 instead of u16, I've pointed some out in comments

stephenh-axiom-xyz · 2026-05-26T20:31:40Z

        &self,
        builder: &mut AB,
-        x: &[AB::Var],
+        x: [AB::Expr; F_NUM_U16S],


nit: Is it possible to use a slice of Into<AB::Expr> or smth

stephenh-axiom-xyz · 2026-05-26T20:39:22Z

            output_commit_bus,
            def_idx: self.def_idx,
        };
+        let output_range_air = RangeCheckerAir::<U16_BITS> {


Given you converted the RISC-V chips to do 16-bit limb addition/multiplication, I think you could change the range checker in recursion to be 16-bits instead of adding it here?

hmm yeah, i was trying to not touch recursion but i guess maybe just bumping the recursion range checker to be 16-bits isn't that big of a deal

stephenh-axiom-xyz · 2026-05-26T20:44:43Z

+// The trace filler packs byte pairs into the u16-shaped columns.
+#[repr(C)]
+#[derive(Clone, Copy, Debug)]
+pub struct DeferralCallReadsBytes<F> {


Why not initialize the DeferralCallReads<u16, F> during execution?

stephenh-axiom-xyz · 2026-05-26T20:45:25Z

@@ -50,8 +49,8 @@ use crate::{
 #[repr(C)]
 #[derive(AlignedBorrow, StructReflection, Clone, Copy, Debug)]
 pub struct DeferralCallReads<B, F> {


nit: B stands for bytes

stephenh-axiom-xyz · 2026-05-26T21:00:10Z

                    .fold(idx as u8, |acc, byte| acc.wrapping_add(*byte));
                let num_chunks = (seed as usize % 4) + 1;
-                let len = num_chunks * DIGEST_SIZE;
+                let len = num_chunks * SPONGE_BYTES_PER_ROW;


Why is the length changed here? The input should be in u16s now no?

stephenh-axiom-xyz · 2026-05-26T21:02:54Z


    // First row sponge input is [deferral_idx, output_len, 0, ...].
-    // Later rows sponge and write the next DIGEST_SIZE output bytes.
+    // Later rows sponge and write the next SPONGE_BYTES_PER_ROW output bytes.


Why aren't we measuring this in u16s?

stephenh-axiom-xyz · 2026-05-26T21:06:00Z

@@ -78,7 +81,10 @@ fn hash_output_raw<F: VmField>(
    deferral_idx: u32,
    output_ref: &[u8],


Same comment, why not u16?

Add comprehensive test coverage for decoder crate. `cargo tarpaulin` reports:

```bash
|| crates/toolchain/decoder/src/instruction_formats.rs: 48/48 +0.00%
|| crates/toolchain/decoder/src/process_instruction.rs: 148/150 +0.00%
```

The remaining 2 uncovered lines are unreachable match arms (`_ => None`) for a fully exhaustive 3-bit funct3 match in `process_instruction.rs`.

towards INT-6467

GunaDD · 2026-05-29T20:31:15Z

was this accidentally included ?

GunaDD · 2026-05-29T20:42:12Z

-        debug_assert!(num_user_pvs >= DIGEST_SIZE);
-        debug_assert!(num_user_pvs.is_multiple_of(DIGEST_SIZE));
-        debug_assert!((num_user_pvs / DIGEST_SIZE).is_power_of_two());
+        assert_user_pvs_shape(num_user_pvs);
        let encoder = Encoder::new(num_user_pvs / DIGEST_SIZE, MAX_ENCODER_DEGREE, true);


I guess this is unrelated to the u16 memory bus change

GunaDD · 2026-05-29T21:31:22Z

+        memory_dimensions: MemoryDimensions,
+        num_user_pvs: usize,
+    ) -> Self {
+        assert_user_pvs_shape(num_user_pvs);


why was this assertion added now ? was this always needed or is this a new requirement

github-actions · 2026-06-04T09:08:19Z

Claude finished @shuklaayush's task in 2m 49s

Code Review: PR #2808 — Switch to u16 limbs in deferral AIRs

This is a clean, well-structured PR that consistently migrates deferral columns from byte cells to u16 cells and doubles the output sponge absorption rate. I reviewed all 40 changed files across Rust AIR, tracegen, CUDA, verify-stark, and continuations. Below are my findings, ordered by significance.

Findings

1. Missing per-cell u16 range checks on input_commit in call core AIR (Question — likely intentional)

In call/air.rs:170-173, the input_commit_rcs (canonicity remainder values) are range-checked. But unlike output_commit and output_len (which get per-cell u16 range checks at lines 181-191), input_commit cells themselves are not individually range-checked in the core AIR. The adapter AIR reads input_commit from RV64_MEMORY_AS via the memory bridge (line 411-424), so the memory bus guarantees the cells are well-formed u16 values — the write side of the bus must have range-checked them. This reasoning is sound provided the u16 memory bus in PR #2794 enforces cell canonicity on writes. Worth confirming this invariant holds.

The tracegen in call/trace.rs:182-184 correspondingly only range-checks output_commit and output_len cells — input_commit cells are omitted — which is consistent with the AIR.

2. DeferralCallReadsBytes / DeferralCallWritesBytes are not #[repr(C)] aligned to DeferralCallReads / DeferralCallWrites (By design, not a bug)

The new byte-shaped record types (call/air.rs:72-91) have different field types than the u16-cell AIR columns. The trace filler manually packs bytes into u16 cells (call/trace.rs:156-164). This is a reasonable pattern — the execution record stays in the natural byte representation, and the filler handles the packing. The #[repr(C)] on the byte structs ensures the layout is predictable for the AlignedBytesBorrow derive on DeferralCallCoreRecord.

3. output_len_full size coupling between expand_to_rv64_block and combine_output_cells (Safe, compile-time checked)

combine_output_cells expects [T; OUTPUT_LEN_NUM_U16S] (= 4). expand_to_rv64_block returns [T; BLOCK_FE_WIDTH] (= 4). These match because OUTPUT_LEN_NUM_BYTES / U16_CELL_SIZE = 8/2 = 4 = BLOCK_FE_WIDTH. If either constant changed independently, this would be a compile error, so it's safe. But the semantic meaning differs: the expand_to_rv64_block zero-extends the F_NUM_U16S = 2 output length cells to 4, and then combine_output_cells treats all 4 as u16 cells for the output key. The upper 2 cells are always zero since output_len is a u32 value stored in F_NUM_U16S = 2 cells. This is correct.
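
The zero-extension described in this finding can be sketched on plain integers. The function names are hypothetical stand-ins (the real expand_to_rv64_block works on AIR expressions), and little-endian cell order is assumed.

```rust
/// u16 cells needed for a u32 value.
const F_NUM_U16S: usize = 2;
/// Cells per memory-bus block.
const BLOCK_FE_WIDTH: usize = 4;

/// Splits a u32 into two little-endian u16 cells (low cell first).
fn u32_to_le_u16_cells(x: u32) -> [u16; F_NUM_U16S] {
    [x as u16, (x >> 16) as u16]
}

/// Hypothetical stand-in for the zero-extension step: pads the two
/// output-length cells to a full 4-cell block with zero upper cells.
fn zero_extend_to_block(cells: [u16; F_NUM_U16S]) -> [u16; BLOCK_FE_WIDTH] {
    [cells[0], cells[1], 0, 0]
}
```

Because output_len fits in a u32, the two upper cells of the block are always zero, which is what the finding relies on.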

4. Sponge init-row range-check exemption is correct (Positive)

output/air.rs:332-337 (in the diff) — Data-row sponge cells get per-cell u16 range checks gated on local.is_valid - local.is_first, correctly exempting the init row. The init row's sponge_inputs are pinned by assert_array_eq against already-canonical deferral_idx and output_len columns.
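
The gating expression can be checked with a small truth table. This sketch assumes both flags are boolean and that is_first is only set on valid rows; the function name is illustrative, not taken from the AIR.

```rust
/// Range-check selector for a row: `is_valid - is_first`.
/// Assumption: both flags are boolean and `is_first` implies `is_valid`,
/// so the result is 1 on data rows and 0 on the init row and on padding.
fn data_row_selector(is_valid: bool, is_first: bool) -> i32 {
    is_valid as i32 - is_first as i32
}
```

Under that assumption the selector never goes negative, so it can serve directly as the multiplicity of the per-cell range-check interaction.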

5. def_fn.rs:91 — F::from_usize(output_ref.len()) silent wrapping for huge lengths (Low risk)

state[1] = F::from_usize(output_ref.len());

For BabyBear, values >= P = 2^31 - 2^27 + 1 would silently reduce mod P. The address_bits range check in the AIR prevents this in practice, but using F::from_u32(u32::try_from(output_ref.len()).unwrap()) would make the intent clearer and catch lengths that do not fit in a u32 earlier in host code. This is a minor defensive improvement, not a soundness issue.

6. e2e.rs test refactored to use generate_deferral_results directly (Good cleanup)

crates/continuations/src/tests/e2e.rs:134-145 — The compute_output_f_commit function was previously a manual reimplementation of the sponge hash. It now delegates to generate_deferral_results, which is the canonical implementation. This eliminates the risk of the test's hash diverging from the production code when the sponge rate changes. Well done.

7. RootCircuit::new validation and assert_user_pvs_shape (Good defensive change)

crates/continuations/src/circuit/root/mod.rs:41-55 — The manual new constructor with assert_user_pvs_shape replaces derive_new, upgrading scattered debug_assert! calls to a single assert! function. This is used in commit/air.rs, commit/trace.rs, memory/air.rs, and memory/trace.rs. Good hygiene that catches misconfigurations in release builds.

8. CUDA / Rust constant and column struct consistency (Correct)

def_types.h mirrors every constant from utils.rs: COMMIT_NUM_U16S, OUTPUT_LEN_NUM_U16S, OUTPUT_TOTAL_NUM_U16S, SPONGE_BYTES_PER_ROW, SPONGE_ROW_MEMORY_OPS, F_NUM_U16S, RV64_PTR_U16S. The DeferralOutputCols<T> and DeferralCallCoreCols<T> / DeferralCallAdapterCols<T> structs in CUDA match the Rust #[repr(C)] layouts. sizeof assertions guard against drift. The helper functions (pack_u8_pairs_le, u32_bytes_to_le_u16_cells, scale_u16_high_cell, scale_output_len, scale_rv64_ptr_from_u32_bytes) mirror their Rust counterparts.

9. bitwise_op_lookup fully removed from deferral extension (Clean removal)

The BitwiseOperationLookupBus / BitwiseOperationLookupChip dependency has been completely removed from the deferral extension: extension/mod.rs no longer creates or looks up bitwise chips, cuda_abi.rs drops the d_bitwise parameter, and all tests drop bitwise periphery. The call chip still uses the canonicity sub-AIR (which was already not using bitwise), and the output chip's bitwise usage is replaced by VariableRangeCheckerBus range checks. This is a clean factoring.

10. is_eq.rs modular chip fix (Tangential but correct)

extensions/algebra/circuit/src/modular_chip/is_eq.rs — The setup-row handling is fixed: during setup, c may equal the modulus (it's the modulus itself), so the c < modulus assertion is correctly skipped. The c_diff_idx fallback finds the first differing limb or defaults to 0, and uses wrapping_sub to avoid panic when c == modulus. The c_lt_mark is forced to F::TWO during setup to avoid the single-diff-index optimization colliding with the setup row's different semantics. This looks like a necessary fix for setup correctness with u16 limbs.

11. Adapter ptr_to_u16_limbs renamed to ptr_to_field_u16_limbs (Minor, in riscv-adapters)

vec_heap_u16.rs and vec_heap_branch_u16.rs update the function name. The rename avoids confusion with the raw u16 version — ptr_to_field_u16_limbs returns [F; N] directly.

Addressing reviewer comments

@stephenh-axiom-xyz's comment about u8 vs u16: Several places (record types, output_len bytes) still use u8 in the record / host-side representation. This is intentional — the execution record captures byte-level data, and the trace filler packs into u16 cells. The column types (DeferralCallReads<T, T>, DeferralCallWrites<T, T>) are now u16-shaped.

@stephenh-axiom-xyz's comment about BitwiseOperationLookupAir for u16 ops: Correctly noted by @shuklaayush that the table size would need to be 2^32, making it impractical.

@stephenh-axiom-xyz's comment about recursion range checker: @shuklaayush noted this could be bumped to 16-bit. The current PR adds a dedicated RangeCheckerAir<U16_BITS> in the verify-stark subcircuit (lib.rs:108), which is clean but does add an extra AIR. A future PR could unify this with the recursion range checker.

@GunaDD's comment about multiple.elf: This is the pre-compiled ELF for the updated guest program — it's expected to be committed alongside the source change.

@GunaDD's comment about RootCircuit::new assertion: This upgrades existing debug_assert! checks to assert! and centralizes them. The requirements (num_user_pvs >= DIGEST_SIZE, multiple of DIGEST_SIZE, power-of-two leaf count) were always needed — they just weren't checked in release builds.
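
The three requirements listed above can be written as a single check. This is a sketch based on the removed debug_assert! lines, with DIGEST_SIZE = 8 as in the PR description; `check_user_pvs_shape` is a hypothetical Result-returning variant, whereas the PR's assert_user_pvs_shape asserts.

```rust
const DIGEST_SIZE: usize = 8;

/// Hypothetical non-panicking variant of the shape check.
fn check_user_pvs_shape(num_user_pvs: usize) -> Result<(), &'static str> {
    if num_user_pvs < DIGEST_SIZE {
        return Err("num_user_pvs must be at least DIGEST_SIZE");
    }
    if num_user_pvs % DIGEST_SIZE != 0 {
        return Err("num_user_pvs must be a multiple of DIGEST_SIZE");
    }
    if !(num_user_pvs / DIGEST_SIZE).is_power_of_two() {
        return Err("num_user_pvs / DIGEST_SIZE must be a power of two");
    }
    Ok(())
}
```

For example, 8 and 32 pass, while 4 (too small), 12 (not a multiple of 8) and 24 (three leaves) are rejected.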

Summary

The PR is well-executed with consistent changes across Rust AIR, tracegen, CUDA, verify-stark, and continuations. The migration from byte cells to u16 cells is mechanically clean, and the doubled sponge absorption rate is correctly threaded through all layers. The is_eq.rs fix is a necessary companion change for setup-row correctness with u16 limbs. I didn't find any correctness bugs.

Verdict: Looks good to merge. The only actionable suggestion is the minor defensive F::from_u32(u32::try_from(...).unwrap()) in def_fn.rs:91.

# Switch memory bus blocks to u16 cells PR 1 of the memory-bus-u16 split. Atomic flip of the memory-bus block from `8 u8 cells` to `4 u16 cells`. The bus interaction is now fixed-shape (`BLOCK_FE_WIDTH = 4`), and byte-shaped chips pack their u8 columns into bus cells via a single helper. ## Why `MemoryBus` interactions previously carried 8 field-element-encoded u8s per message. RV64 cell-typed memory (registers, RV memory, public values) is naturally u16-aligned, and packing two bytes per bus cell halves the bus payload width without losing constraint power. This PR makes the bus shape a single hard constant (`BLOCK_FE_WIDTH`) so the bridge API and downstream chips no longer need a `const N` generic for it. Follow-up PRs will use the wider cells to narrow trace columns, prune redundant range checks, reshape SHA-2 / Keccak state, and revisit deferral. ## What changes ### Memory layout constants Files: `crates/vm/src/arch/config.rs`, `crates/vm/cuda/include/system/memory/params.cuh`. - Single source of truth: `U16_CELL_SIZE = size_of::<u16>()` (intrinsic) drives `BUS_PTR_SCALE`; `BLOCK_FE_WIDTH = 4` is the design choice; `MEMORY_BLOCK_BYTES`, `BUS_BLOCK_STRIDE`, `BUS_LEAF_STRIDE`, `DIGEST_WIDTH` all derive. - Cross-constant equality asserts removed where the derivation makes them true by construction. - One divisibility guard kept on `DIGEST_WIDTH.is_multiple_of(BLOCK_FE_WIDTH)`. - CUDA-side `params.cuh` mirrors the constants and includes ASCII diagrams for the u16-celled AS layout (RV64) and the F-celled AS layout (`DEFERRAL_AS`), showing how `byte_ptr`, `cell_idx`, and `bus_ptr` relate. ### Layout reference Terminology: - **Cell** — one storage word in an address space (`u16` for RV64 ASes, `F` for `DEFERRAL_AS`). - **Block** — one memory-bus message: `BLOCK_FE_WIDTH` cells = `MEMORY_BLOCK_BYTES` bytes. - **Digest** — output of one Poseidon2 compression, `DIGEST_WIDTH` cells; also one merkle leaf. 
- **Leaf** — one merkle-tree leaf = one Poseidon2 half = `DIGEST_WIDTH` cells = `BLOCKS_PER_LEAF` blocks. <details> <summary><b>u16-celled AS layout</b> (<code>RV64_REGISTER_AS</code>, <code>RV64_MEMORY_AS</code>, <code>PUBLIC_VALUES_AS</code>)</summary> One merkle leaf = 16 bytes = 8 u16 cells = 2 bus blocks: ```text byte_ptr: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐ u8 storage: │b0 │b1 │b2 │b3 │b4 │b5 │b6 │b7 │b8 │b9 │b10│b11│b12│b13│b14│b15│ └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘ ╰──────╯╰──────╯╰──────╯╰──────╯╰──────╯╰──────╯╰──────╯╰──────╯ u16 cells (LE) cell_idx: 0 1 2 3 4 5 6 7 bus_ptr: 0 2 4 6 8 10 12 14 ╰─────────── block 0 ──────────╯╰─────────── block 1 ──────────╯ ╰────────────────── one merkle leaf / digest ──────────────────╯ ``` - `byte_ptr = U16_CELL_SIZE * cell_idx` (coincides with `bus_ptr` for u16 ASes) - `bus_ptr = BUS_PTR_SCALE * cell_idx` - Block stride on bus: `BUS_BLOCK_STRIDE = 8 = BUS_PTR_SCALE * BLOCK_FE_WIDTH` - Leaf stride on bus: `BUS_LEAF_STRIDE = 16 = BUS_PTR_SCALE * DIGEST_WIDTH` </details> <details> <summary><b>F-celled AS layout</b> (<code>DEFERRAL_AS</code>)</summary> Each cell holds one `F` element (`size_of::<F>()` bytes on host; 4 for BabyBear). The bus addresses cells directly, so cell / block / leaf counts and `bus_ptr` / `cell_idx` mappings match the u16 AS. 
```text
byte_ptr:     0       4       8       12      16      20      24      28
           ┌───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┐
F storage: │  F0   │  F1   │  F2   │  F3   │  F4   │  F5   │  F6   │  F7   │
           └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
cell_idx:     0       1       2       3       4       5       6       7
bus_ptr:      0       2       4       6       8       10      12      14
           ╰─────────── block 0 ──────────╯╰─────────── block 1 ──────────╯
           ╰────────────────── one merkle leaf / digest ──────────────────╯
```

- `byte_ptr = size_of::<F>() * cell_idx` (host-memory stride; ≠ `bus_ptr`)
- `bus_ptr = BUS_PTR_SCALE * cell_idx` (same formula as u16 ASes)
- `BUS_BLOCK_STRIDE` and `BUS_LEAF_STRIDE` unchanged: the bus addresses cells, not bytes.

</details>

### Address-space cell types

- `RV64_REGISTER_AS`, `RV64_MEMORY_AS`, `PUBLIC_VALUES_AS` switch to u16-celled.
- `DEFERRAL_AS` stays F-celled (moves to u16 in a follow-up).

### Persistent memory and merkle

- Boundary AIR, merkle tree, merkle public-values, and CUDA inventory rewritten for 4-cell-per-block u16 leaves (`BLOCKS_PER_LEAF = DIGEST_WIDTH / BLOCK_FE_WIDTH = 2`).
- `assert_public_values_shape::<DIGEST_WIDTH>(num_public_values_bytes)` and `SystemConfig::public_values_cell_count()` formalize the public-values byte-count → cell-count contract at construction and at every read site.

### Offline-checker bridge

File: `crates/vm/src/system/memory/offline_checker/bridge.rs`.

- `MemoryBridge::read` / `write` drop the `const N` generic; bus payload is always `[T; BLOCK_FE_WIDTH]`.
- Unified `MemoryWriteAuxInput` lets `write` accept either an `&MemoryWriteAuxCols<V, BLOCK_FE_WIDTH>` (standard case) or `from_prev_data_exprs(&base_aux, prev_data_expr)` for chips that materialize prev_data in their own columns (Keccak, SHA-2, LoadStore).
- Added `pack_u8_block` (AB::Expr) and `pack_u8_block_value` (F-value form) so byte-shaped chips and testing helpers produce bus messages identical to what u16 chips emit natively.
- Removed `read_or_immediate` / `MemoryReadOrImmediateOperation` (an RV32 pattern superseded by `RV64_IMM_AS`).

### TracingMemory / GuestMemory byte view

- `read_bytes` / `write_bytes` (cell-type-aware, BLOCK-aligned) for raw byte access against u16-celled storage.
- Deferral-specific helpers (`timed_read_deferral`, `tracing_read_deferral`, …) for the F-celled AS.

### Chip call-sites

~50 sites across riscv, riscv-adapters, keccak256, sha2, deferral.

- Every `memory_bridge.read` / `.write` now hands the bridge a `BLOCK_FE_WIDTH`-shaped payload.
- Byte-shaped chips wrap their `[T; 8]` u8 columns with `pack_u8_block`.
- Register-read paths still zero-extend the low 32 bits to 8 bytes, then pack.
- Per-AS metered leaf-bit shift in `execution_mode/metered/memory_ctx.rs` (`U16_AS_LEAF_PTR_BITS = 4` for byte-pointer ASes, `CELL_AS_LEAF_PTR_BITS = 3` for cell-pointer ASes), so RV64 byte pointers produce the right leaf label.

### Algebra / ECC adapter cleanup

No-op semantic; finishes upstream removal.

- Dropped phantom `BLOCK_SIZE` const generic from `ModularAir`/`ModularChip`, `Fp2Air`/`Fp2Chip`, `WeierstrassAir`/`WeierstrassChip` and their `get_*_air` / `get_*_chip` helpers.
- The generic became unused after `Rv64VecHeapAdapter*` dropped its `BLOCK_SIZE` parameter; keeping it was a phantom that broke inference at call sites.

### Testing infrastructure

- `MemoryTester` / `DeviceMemoryTester` accept either `N = BLOCK_FE_WIDTH` (cell index) or `N = MEMORY_BLOCK_BYTES` (byte pointer in a u16 AS) and route accordingly; the byte path packs the result so the dummy chip's bus message matches a byte-shaped chip's exactly.
- `pack_u8_block_value` is shared between the testing helper and the CUDA boundary-chip test that previously hand-rolled the same packing.

### CUDA

- `params.cuh` constants + static_asserts kept in sync.
- Kernel memory-aux fill paths updated for the new block shape.

## Out of scope (follow-up PRs)

- Trace column narrowing now that u16-celled storage matches column width.
- Redundant range-check removal on byte-shaped chips.
- Dead bitwise-lookup cleanup.
- SHA-2 / Keccak state reshaping to natively-u16 layouts.
- Moving `DEFERRAL_AS` to u16.
- Deferral commit / output-hash changes.

## Migration notes

- `MemoryBridge::read::<N>` / `::write::<N>` calls: drop `<N>`; pass a `BLOCK_FE_WIDTH`-shaped payload. Byte-shaped data goes through `pack_u8_block` (constraint side) and `pack_u8_block_value` (witness side).
- `read_or_immediate` callers: switch to `RV64_IMM_AS`-based addressing.
- `MemoryWriteAuxCols<V, N>` with `N != BLOCK_FE_WIDTH`: use `MemoryWriteAuxInput::from_prev_data_exprs(&base_aux, prev_data)` to supply prev_data inline.
- `ModularAir`, `ModularChip`, `Fp2Air`, `Fp2Chip`, `WeierstrassAir`, `WeierstrassChip` now take one fewer const generic (`BLOCK_SIZE` dropped).

Resolves INT-7829

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Jonathan Wang <jonathanpwang@users.noreply.github.com>
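The packing and addressing rules in the commit message above can be sketched as follows. This is a minimal illustration, not the repository's code: `pack_u8_block_sketch` is a hypothetical stand-in for `pack_u8_block_value`, little-endian byte pairing is an assumption, and `BUS_PTR_SCALE = 2` is read off the diagram (where `bus_ptr` advances by 2 per cell).

```rust
/// Number of u16 cells in one memory-bus block (value stated in the PR text).
const BLOCK_FE_WIDTH: usize = 4;
/// Bytes covered by one block of a u16-celled address space.
const MEMORY_BLOCK_BYTES: usize = 2 * BLOCK_FE_WIDTH;
/// Assumed from the diagram: bus_ptr = 0, 2, 4, ... for cell_idx = 0, 1, 2, ...
const BUS_PTR_SCALE: u32 = 2;

/// Hypothetical stand-in for `pack_u8_block_value`: packs 8 bytes into
/// 4 u16 cells. Little-endian pairing is an assumption of this sketch.
fn pack_u8_block_sketch(bytes: [u8; MEMORY_BLOCK_BYTES]) -> [u16; BLOCK_FE_WIDTH] {
    core::array::from_fn(|i| u16::from_le_bytes([bytes[2 * i], bytes[2 * i + 1]]))
}

/// Bus pointer of a cell, following `bus_ptr = BUS_PTR_SCALE * cell_idx`.
fn bus_ptr(cell_idx: u32) -> u32 {
    BUS_PTR_SCALE * cell_idx
}
```

Under these assumptions, a byte-shaped chip and a natively u16-shaped chip would put the same four cells on the bus for the same eight bytes.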

PR 3 of the memory-bus-u16 split, stacked on **#2794** (disjoint from PR 2). Shrinks SHA-2 and Keccak chip trace columns that are only consumed as packed u16 cells through the memory bus.

## Why

#2794 still stores SHA-2 main-chip state / message bytes and Keccak state / pointer columns byte-shaped, packing them at memory-bus call sites via `pack_u8_block`. Because these columns are never inspected byte-by-byte by the AIR (the constraints operate on packed u16 values, or on byte-XOR buffers that intentionally stay byte-shaped), they can be stored u16-shaped directly, halving the affected trace columns and removing one `pack_u8_block` per row.

This is the second of the four cell-narrowing PRs (after PR 2 for RV64 control-flow) and is disjoint from it: the only files in this PR live under `extensions/sha2`, `extensions/keccak256`, and `crates/circuits/sha2-air`, plus one helper addition in `extensions/riscv/circuit/src/adapters/mod.rs`.

## What changes

### SHA-2 section

Files: `extensions/sha2/circuit/src/sha2_chips/{main_chip,block_hasher_chip}/`, `crates/circuits/sha2-air/src/{air,columns,trace}.rs`, `extensions/sha2/circuit/cuda/`.

Main chip:

- `prev_state`, `new_state` are u16 cells.
- `message_bytes` reshapes to `message_u16s` (byte pairs packed at the call site).
- Pointer limbs `dst_ptr_limbs`, `state_ptr_limbs`, `input_ptr_limbs` reshape to u16 where the AIR only consumes them as packed pointer cells; the high u16 cell gets the scaled 16-bit range check.

Block hasher:

- `prev_hash`, `final_hash` are u16 cells.

`sha2_bus`:

- `MESSAGE_1` / `MESSAGE_2` payloads are packed to u16 on both sender and receiver.

Stays byte-shaped (8-bit bounded by construction):

- `carry_a`, `carry_e`, `carry_or_buffer` in `sha2-air`.

### Keccak section

Files: `extensions/keccak256/circuit/src/{keccakf_op,xorin}/`, `extensions/keccak256/circuit/cuda/`.

`keccakf_op`:

- `preimage`, `postimage` reshape to u16 cells.
- `buffer_ptr_limbs` reshape to u16; pointer composition uses the scaled 16-bit bound.
- `keccakf_state_bus` payload between `keccakf_op` and `keccakf_perm` is packed to u16.

`xorin`:

- Pointer limbs u16-shaped (same treatment as `keccakf_op`).
- Byte-XOR buffers (`preimage_buffer_bytes`, `input_bytes`, `postimage_buffer_bytes`) stay byte-shaped: those columns still carry byte-level XOR data and the AIR inspects them directly.

CUDA tracegen mirrors the new shapes; `keccakf_op` drops the `bitwise_lookup` parameter from its kernel and ABI signatures now that pointer bytes no longer need byte-level checks. The call chip still wires `bitwise_lu`.

### Adapter helper (cross-cut)

File: `extensions/riscv/circuit/src/adapters/mod.rs`.

- Add `expand_to_rv64_block`: zero-pads `N` u16 limbs to one RV64 bus block (`BLOCK_FE_WIDTH` cells). Used by `keccakf_op` and `xorin` for pointer reads against u16-celled `RV64_REGISTER_AS`.
- `BLOCK_FE_WIDTH` imported into the adapters module to support the new helper.

The existing `expand_to_rv64_register` (8 u8 limbs, byte-shaped) is kept for byte-shaped chips and the legacy paths.

## Migration notes

- Chips passing SHA-2 main-chip state through the bus should hand the bus u16 cells directly (no `pack_u8_block`).
- Pointer reads against u16-celled `RV64_REGISTER_AS` from chips storing only the low 32 bits of the register should use `expand_to_rv64_block` (BLOCK_FE_WIDTH cells, zero-padded) instead of `expand_to_rv64_register` (8 u8s).
- Keccak pointer composition should use the scaled 16-bit bound on the high u16 cell; the previous byte-pair range check on `bitwise_lookup_bus` is removed for `keccakf_op`.

resolves int-7832, int-7833

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
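One plausible reading of the "scaled 16-bit bound on the high u16 cell" mentioned above: for a pointer split as `lo + hi * 2^16`, shifting `hi` left by `32 - PTR_BITS` and range-checking the result to 16 bits forces `hi < 2^(PTR_BITS - 16)`, so the composed pointer stays below `2^PTR_BITS`. The sketch below checks that arithmetic on plain integers. `PTR_BITS = 29` and all function names are hypothetical; the real AIR expresses this as a range-check interaction rather than a boolean test.

```rust
/// Hypothetical pointer width for this sketch (must satisfy 16 < PTR_BITS <= 32).
const PTR_BITS: u32 = 29;

/// Value that would be sent to a 16-bit range check for the high cell.
fn scaled_high_cell(hi: u16) -> u32 {
    (hi as u32) << (32 - PTR_BITS)
}

/// True when the scaled high cell still fits in 16 bits,
/// which is equivalent to hi < 2^(PTR_BITS - 16).
fn passes_scaled_check(hi: u16) -> bool {
    scaled_high_cell(hi) < (1 << 16)
}

/// Recompose the low 32 bits of a pointer from its two u16 cells.
fn compose_ptr(lo: u16, hi: u16) -> u32 {
    (lo as u32) | ((hi as u32) << 16)
}
```

With `PTR_BITS = 29`, the largest accepted high cell is `0x1fff`, and the largest accepted pointer is `0x1fff_ffff`.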

PR 2 of the memory-bus-u16 split. This migrates selected RV64 chips from byte-shaped register/memory columns to u16-shaped columns after #2794 made the memory bus operate on fixed `BLOCK_FE_WIDTH = 4` u16 cells.

The scope is intentionally split into two classes:

- For equality/comparison-style chips, `hintstore`, and the new u16 adapters, the main change is the limb shape: `u8` limbs become `u16` limbs, byte-pair bitwise range checks become per-limb `VariableRangeChecker` checks, and the core logic remains essentially the same.
- For control-flow instructions that previously used byte-level decompositions for PC/result arithmetic (`auipc`, `jalr`, `jal_lui`), the constraints were reworked more substantially because the old carry structure was tied to 8-bit limbs.

Subsequent PRs build on the u16 adapter variants and bigint limb constants introduced here.

## Why

#2794 made the memory bus fixed-shape over packed u16 cells. Some chips still materialized 8 byte columns per RV64 register value, packed them into u16 bus cells at the memory interaction, and carried `BitwiseOperationLookup` only to prove those byte limbs were canonical.

For chips that do not inspect byte-level structure, storing u16 limbs directly halves the affected register-value trace width and removes those byte-pair lookup dependencies. The bus sees the native u16 payload, while the AIR uses `VariableRangeChecker` to prove each materialized u16 limb is bounded.

For PC/result arithmetic, the old byte AIRs also avoided field-modulus aliasing by carrying through small limbs. The u16 rewrites preserve that property with fewer, wider limbs; they do not replace 32-bit arithmetic with one unconstrained field equality.

## What Changes

### Mostly Mechanical Limb-Shape Migrations

These chips keep the same core argument, but change the limb type/width:

- `branch_eq`: compares u16 limbs directly. Equality inverse logic is unchanged.
- `branch_lt`: signed/unsigned comparison logic is unchanged, with `LIMB_BITS = 16` and u16 MSB handling.
- `less_than`: same comparison migration, wired through the u16 ALU adapter.
- `hintstore`: stores `mem_ptr`, `rem_words`, and hint data as u16-shaped columns. `rem_words` fits in one u16 cell, and pointer bounds are checked through the high u16 limb.
- Bigint `BranchEqual256`, `BranchLessThan256`, and `LessThan256`: inherit the u16 shape with `INT256_NUM_U16_LIMBS = 16`.

The new u16 adapters follow the same pattern: register pointer values and heap/register payloads are represented as u16 cells directly, with range checks moving from byte-pair lookup to `VariableRangeChecker`.

Byte-shaped bigint chips stay byte-shaped:

- `BaseAlu256`, `Mul256`, `Shift256`, and `Sll256` still use `INT256_NUM_U8_LIMBS = 32` because their AIRs rely on byte-level boolean, carry, or shift structure.

### Constraint Rewrites

These chips needed more than a limb-size substitution:

- `auipc`: the old AIR decomposed the PC and immediate into byte limbs and carried the addition byte-by-byte. The new AIR uses u16 limbs instead: it derives/range-checks `pc_low` from `from_pc` and a witnessed `pc_high`, then constrains `rd = from_pc + (imm << 8)` with low-u16 and high-u16 carry bits. This keeps each arithmetic relation below the field modulus while dropping the extra byte-limb PC columns.
- `jalr`: the old AIR used low-32-bit byte limbs for `rs1`, `to_pc`, and `rd = pc + 4`. The new AIR keeps u16 limbs, derives the low rd limb from `from_pc`, range-checks the high rd limb against `PC_BITS`, and carries `rs1 + sign_extend(imm)` over two u16 limbs while separately enforcing the cleared target low bit.
- `jal_lui`: JAL derives the low rd limb from `from_pc + 4` and the witnessed high u16 limb, then range-checks the high limb against `PC_BITS`. LUI uses a u16-specific split because `imm << 12` crosses a u16 boundary: `rd[0] = imm_low_4 << 12` and `imm = imm_low_4 + (rd[1] << 4)`.

The old byte-limb form would still be possible, but the u16 form uses fewer witness columns/range checks while preserving limb-by-limb integer soundness in the BabyBear field.

### New u16 Adapter Variants

- `Rv64BaseAluU16Adapter`: reads/writes one `[u16; BLOCK_FE_WIDTH]` register block directly and handles immediates as u16-low-limb plus sign bit.
- `Rv64VecHeapU16Adapter`: u16-cell heap reads/writes with u16-shaped register pointer values.
- `Rv64VecHeapBranchU16Adapter`: read-only branch variant for u16-shaped heap operands.
- `VecToFlatAluAdapter` gains a `BLOCK_VALUE_WIDTH` parameter so both byte-shaped and u16-shaped heap adapters can share the flattening logic.

The legacy byte-shaped adapters remain for chips that still need byte-level constraints.

### Wiring

- RISC-V and bigint extension wiring now threads `VariableRangeChecker` into the u16-shaped chips.
- `BitwiseOperationLookup` remains wired only for byte-shaped chips that still need byte-level constraints.
- Rv64Io no longer creates a bitwise lookup solely for `hintstore`.

### CUDA Parity

CUDA tracegen mirrors the Rust changes:

- migrated core records use `uint16_t` arrays where applicable;
- kernels use `VariableRangeChecker.add_count(limb, LIMB_BITS)` for u16 limbs;
- bitwise lookup parameters are removed from migrated kernels;
- bigint CUDA wiring separates byte-shaped chips from u16-shaped chips.

### Helper Cleanup

- `INT256_NUM_U8_LIMBS` and `INT256_NUM_U16_LIMBS` make bigint limb width explicit.
- Shared RV64 helpers cover byte/u16 conversion, low-32-bit pointer splitting, pointer-bound range-check values, and zero-extension to one u16 memory block.
- Byte-view memory helpers use explicit `read_bytes` / `write_bytes` where the access is intentionally byte-addressed.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
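The LUI split quoted above (`rd[0] = imm_low_4 << 12` and `imm = imm_low_4 + (rd[1] << 4)`) can be checked on plain integers. The sketch below assumes a 20-bit `imm` and models only the low 32 bits of `rd`; the function names are hypothetical, and in the real AIR these are field constraints with range checks rather than Rust comparisons.

```rust
/// Witness generation for the low 32 bits of `rd = imm << 12`.
/// Returns (rd[0], rd[1], imm_low_4). Assumes `imm` is a 20-bit immediate.
fn lui_split(imm: u32) -> (u16, u16, u32) {
    assert!(imm < (1 << 20), "LUI immediate is assumed to be 20 bits");
    let imm_low_4 = imm & 0xf;
    let rd0 = (imm_low_4 << 12) as u16; // low nibble of imm lands in bits 12..16
    let rd1 = (imm >> 4) as u16; // remaining 16 bits form the high u16 cell
    (rd0, rd1, imm_low_4)
}

/// The two relations from the commit message, plus the 4-bit bound on imm_low_4.
fn lui_relations_hold(imm: u32, rd0: u16, rd1: u16, imm_low_4: u32) -> bool {
    imm_low_4 < 16
        && rd0 as u32 == imm_low_4 << 12
        && imm == imm_low_4 + ((rd1 as u32) << 4)
}
```

For example, `imm = 0xABCDE` gives `imm << 12 = 0xABCDE000`, so the low cell is `0xE000` and the high cell is `0xABCD`.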

…rs (#2805)

PR 4 of the memory-bus-u16 split, stacked on **PR 2**. Shrinks adapter trace columns that only reconstruct packed memory-bus payloads, and migrates the algebra modular `is_eq` chip to u16 limbs.

This PR sits on PR 2 (not #2794) because the PR 2 squash extended the shared vec-heap u16 adapter family and dropped the `INT256_NUM_U8_LIMBS` from `riscv/circuit::adapters`; PR 4 builds on those changes.

## Why

#2794 already reshaped these adapters' memory-aux columns and bus payloads to packed u16 blocks, but the **trace columns** that reconstruct those payloads were still byte-shaped:

- `alu_w` / `mul_w` stored `rs1_high` / `rs2_high` as 4 u8 limbs even though the bus consumes them as 2 u16 cells.
- `loadstore` stored `rs1_data` as 8 u8 limbs even though pointer arithmetic only needs the low 32 bits as 2 u16 cells.
- The byte-shaped vec-heap adapters (`vec_heap`, `vec_heap_branch`, `eq_mod`) reconstructed register pointers through byte limbs and bitwise byte-pair range checks.

All of these can store the column as u16 cells directly and replace the high-byte range check with a scaled 16-bit range check on the high u16 cell, removing the `BitwiseOperationLookup` traffic.

Separately, algebra's modular `is_eq` chip operates on packed u16 operands and only needs limb-level range checks at `LIMB_BITS = 16`, so it can migrate to the new u16 adapter introduced in PR 2.

## What changes

### RISC-V W-adapters and loadstore

Files: `extensions/riscv/circuit/src/adapters/{alu_w,mul_w,loadstore,mod}.rs`, `extensions/riscv/circuit/src/loadstore/{execution,tests}.rs`, `extensions/riscv/circuit/src/{base_alu_w,divrem_w,mul_w,shift_w}/tests.rs`, `extensions/riscv/circuit/cuda/include/riscv/adapters/{alu_w,mul_w,loadstore}.cuh`.

- `alu_w` / `mul_w`: `rs1_high`, `rs2_high` are packed as two u16 cells; bus reads zero-extend the lower 32 bits and pull u16-shaped upper cells directly from the trace.
- `loadstore`: `rs1_data` shrinks to `RS1_DATA_U16S = 2` u16 cells (the low 32 bits of the rs1 register). Bus reads zero-extend the upper half on the wire.
- STORE-to-`DEFERRAL_AS` is explicitly unsupported in this adapter and the now-dead `timed_write_deferral` path is removed: #2794 already dropped that opcode, and the remaining call would have instantiated `TracingMemory::write::<F, 8>`, which violates `BLOCK_FE_WIDTH`. The F-celled path needs a separate two-message bus shape if it is ever reintroduced.
- Shared helpers handle u8-pair packing, low-32-bit u16 composition, high-u16 pointer scaling, and 16-bit immediate sign-extension.
- W-suffix tests (`base_alu_w`, `divrem_w`, `mul_w`, `shift_w`) updated to prank u16 cells where the column is migrated.

### Shared heap-data adapters

Files: `extensions/riscv-adapters/src/{vec_heap,vec_heap_branch,eq_mod,eq_mod_u16,lib}.rs`, `extensions/riscv-adapters/cuda/include/riscv-adapters/{constants,vec_heap,vec_heap_branch}.cuh`.

- `vec_heap`, `vec_heap_branch`, and `eq_mod` materialize register pointer values (`rs_val` / `rd_val`) as u16 cells.
- Pointer bounds use `VariableRangeChecker` on the high u16 cell instead of byte-pair `BitwiseOperationLookup` send_range.
- Adds the byte-shaped vec-heap adapter `READ_SIZE` / `WRITE_SIZE` const generics so byte-shaped 256-bit chips (`BaseAlu256`, `Mul256`, `Shift256`) and u16-shaped 256-bit chips (`LessThan256`) can share the same wrapper without losing the block-size information needed for `MemoryWriteAuxCols` and pointer arithmetic.
- Adds `Rv64IsEqualModU16Adapter*` for the new u16-shaped modular `is_eq`.
- CUDA mirrors the new column shapes and packing helpers.

### Algebra modular `is_eq`

Files: `extensions/algebra/circuit/src/modular_chip/{is_eq,addsub,muldiv,mod,tests}.rs`, `extensions/algebra/circuit/src/extension/{modular,hybrid,fp2}.rs`, `extensions/algebra/circuit/src/{execution,fields,preflight,lib}.rs`, `extensions/algebra/circuit/src/fp2_chip/{addsub,mod,muldiv,tests}.rs`.

- `LIMB_BITS` 8 → 16 where the chip's limbs are only consumed as packed u16 values.
- Wires through the new `Rv64IsEqualModU16` adapter variant.
- Core uses `VariableRangeChecker` for `b_lt_diff - 1` and `c_lt_diff - 1` instead of byte-pair bitwise checks.
- CPU, hybrid, and CUDA registration paths all use `Rv64IsEqualModU16Adapter*`.
- Negative tests that previously pranked `[F; 4]` byte arrays now prank `[F; 2]` u16 cells where the column is migrated; positive coverage preserved.
- Mod-builder generated AIRs are untouched.

`is_eq` executor memory reads were already adjusted in #2794 for `MEMORY_BLOCK_BYTES`; PR 4 does not rework that API.

### Cascade through dependents

Files: `extensions/bigint/circuit/src/{lib,extension/mod,tests}.rs`, `extensions/bigint/circuit/cuda/src/bigint.cu`, `extensions/ecc/circuit/src/weierstrass_chip/{add_ne,double}/mod.rs`.

- Bigint type aliases and ecc weierstrass call sites pick up the new `READ_SIZE` / `WRITE_SIZE` generic positions on `Rv64VecHeapAdapter*`. Byte-shaped 256-bit chips pass `MEMORY_BLOCK_BYTES`; u16-shaped chips pass `BLOCK_FE_WIDTH`.
- Bigint pulls `INT256_NUM_U8_LIMBS` from the bigint crate root (per PR 2's reviewer feedback) instead of `riscv/circuit::adapters`.

## Migration notes

- `Rv64VecHeapAdapter*` now takes 5 const generics (`NUM_READS, BLOCKS_PER_READ, BLOCKS_PER_WRITE, READ_SIZE, WRITE_SIZE`). Byte-shaped 256-bit chips: pass `MEMORY_BLOCK_BYTES` for both sizes. u16-shaped chips: use `Rv64VecHeapU16Adapter`.
- `alu_w` / `mul_w` adapter callsites that materialized `rs_high` bytes should pull u16-shaped upper cells from the new column slot.
- `loadstore` `rs1_data` columns shrink from 8 u8s to 2 u16 cells; trace-fillers that previously wrote 8 byte cells should write 2 u16 cells (low 32 bits) and let the AIR zero-extend on the bus.
- `is_eq` modular chip callers should switch from `Rv64IsEqualModAdapter*` to `Rv64IsEqualModU16Adapter*` and pass `LANE_SIZE = BLOCK_FE_WIDTH` (= 4 u16 cells per heap-read block).
- STORE-to-`DEFERRAL_AS` is no longer routed through `Rv64LoadStoreAdapter`; the F-celled deferral-store path was already removed in #2794 and any reintroduction needs a dedicated two-message bus shape.

resolves int-7830, int-7831

PR 5 of the memory-bus-u16 split, stacked on PR 1. Two independent parts.

Part 1: internal u16 reshape, OutputKey byte layout unchanged:

- Call core: input_commit, output_commit, output_len trace columns reshape to u16 cells.
- Output core: output_commit, output_len reshape to u16 cells.
- Canonicity sub-AIR over u16 cells.
- Deferral rd_val / rs_val register-pointer columns become u16-shaped.
- CPU + CUDA tracegen updates.
- Reuses PR1's byte-memory and field-memory chunk helpers; no new generic split_memory_ops-style code is introduced.

Part 2: output commitment-format change (breaks vkey compatibility):

- SPONGE_BYTES_PER_ROW = 2 * DIGEST_SIZE = 16; sponge_inputs cells hold u16 values on data rows and receive 16-bit range checks. Init row stays `[deferral_idx, output_len, 0, ...]` and is exempt from u16 checks.
- Two memory-bus writes per output row instead of one.
- def_fn::hash_output_raw packs byte pairs into each sponge cell.
- Bitwise-lookup wiring removed from the output chip only, since it no longer needs byte-level range checks. The call chip still uses bitwise.
- CUDA + tests updated; expected output commits regenerated.

Output length contract:

- output_len remains a byte count.
- Raw deferral outputs must be a multiple of 16 bytes; non-16-byte aligned raw outputs are invalid (no implicit padding).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
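Part 2 and the output length contract above can be illustrated with a small sketch of how a raw output might be cut into sponge data rows. This is not `def_fn::hash_output_raw`: the function name is hypothetical, little-endian pairing is an assumption, the init row and the Poseidon permutation are omitted, and misaligned input is reported as `None` here purely for illustration.

```rust
/// Digest width in field elements (value stated in the PR description).
const DIGEST_SIZE: usize = 8;
/// Bytes absorbed per output row after this PR.
const SPONGE_BYTES_PER_ROW: usize = 2 * DIGEST_SIZE;

/// Hypothetical helper: splits a raw output into data rows of DIGEST_SIZE
/// sponge cells, each cell holding one packed byte pair (a u16 value).
/// Returns None when the length is not a multiple of 16 bytes, since the
/// contract above allows no implicit padding.
fn sponge_data_rows(output: &[u8]) -> Option<Vec<[u16; DIGEST_SIZE]>> {
    if output.len() % SPONGE_BYTES_PER_ROW != 0 {
        return None;
    }
    let rows = output
        .chunks_exact(SPONGE_BYTES_PER_ROW)
        .map(|row| -> [u16; DIGEST_SIZE] {
            // Little-endian pairing is an assumption of this sketch.
            core::array::from_fn(|i| u16::from_le_bytes([row[2 * i], row[2 * i + 1]]))
        })
        .collect();
    Some(rows)
}
```

A 32-byte output therefore occupies two data rows under the new rate, where the previous rate of 8 bytes per row would have needed four.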

github-actions · 2026-06-04T10:27:56Z

Claude finished @shuklaayush's task in 3m 33s.

Code Review: PR #2808 — Switch to u16 limbs in deferral AIRs

Reviewed all 35 changed files across Rust AIR, tracegen, CUDA, verify-stark, continuations, riscv-adapters, and the algebra is_eq fix. The migration from byte cells to u16 cells is mechanically clean and the doubled sponge absorption rate is correctly threaded through all layers. No correctness bugs found.

Findings

1. COMMIT_MEMORY_OPS derivation is semantically stale (Nit — numerically correct)

utils.rs:36: COMMIT_MEMORY_OPS is still defined as num_byte_memory_ops(COMMIT_NUM_BYTES) = 32 / MEMORY_BLOCK_BYTES = 4. But its only uses now call split_f_memory_ops::<_, COMMIT_NUM_U16S, COMMIT_MEMORY_OPS>, which operates on u16 cells — the const-assert inside split_f_memory_ops verifies COMMIT_NUM_U16S == COMMIT_MEMORY_OPS * BLOCK_FE_WIDTH (16 == 4×4). The value is correct because COMMIT_NUM_BYTES / MEMORY_BLOCK_BYTES == COMMIT_NUM_U16S / BLOCK_FE_WIDTH, but the derivation formula doesn't reflect the u16 migration. Changing to num_f_memory_ops(COMMIT_NUM_U16S) would be cleaner and prevent confusion if these constants ever diverged.
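The coincidence described in this finding can be made explicit with compile-time assertions. The constant values come from the review text; the bodies of `num_byte_memory_ops` and `num_f_memory_ops` are assumptions inferred from their names, and the `_BYTE_FORM` / `_CELL_FORM` constants are hypothetical names used only in this sketch.

```rust
const MEMORY_BLOCK_BYTES: usize = 8;
const BLOCK_FE_WIDTH: usize = 4;
const COMMIT_NUM_BYTES: usize = 32;
const COMMIT_NUM_U16S: usize = COMMIT_NUM_BYTES / 2;

/// Assumed body: number of bus blocks needed for `num_bytes` bytes.
const fn num_byte_memory_ops(num_bytes: usize) -> usize {
    num_bytes / MEMORY_BLOCK_BYTES
}

/// Assumed body: number of bus blocks needed for `num_cells` cells.
const fn num_f_memory_ops(num_cells: usize) -> usize {
    num_cells / BLOCK_FE_WIDTH
}

/// Current derivation (byte-based) and the suggested one (cell-based).
const COMMIT_MEMORY_OPS_BYTE_FORM: usize = num_byte_memory_ops(COMMIT_NUM_BYTES);
const COMMIT_MEMORY_OPS_CELL_FORM: usize = num_f_memory_ops(COMMIT_NUM_U16S);

// Both derivations agree today; a future divergence would fail the build.
const _: () = assert!(COMMIT_MEMORY_OPS_BYTE_FORM == COMMIT_MEMORY_OPS_CELL_FORM);
// Shape of the check the review attributes to split_f_memory_ops.
const _: () = assert!(COMMIT_NUM_U16S == COMMIT_MEMORY_OPS_CELL_FORM * BLOCK_FE_WIDTH);
```

Both forms evaluate to 4 because a block holds 8 bytes and 4 u16 cells, so dividing bytes by 8 and cells by 4 gives the same count.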

2. Missing per-cell u16 range checks on input_commit reads is intentional (Confirmed correct)

call/air.rs:167-176 range-checks only the canonicity remainder values for input_commit, while output_commit and output_len get both canonicity RCs and per-cell u16 RCs (lines 181-191). This is correct: input_commit is read from RV64_MEMORY_AS via the memory bridge, so the u16 memory bus guarantees cell canonicity. output_commit and output_len are written, so the chip must prove the written values are canonical u16s.

3. is_eq.rs setup-row fix is correct and necessary (Companion fix)

extensions/algebra/circuit/src/modular_chip/is_eq.rs:402-427: During setup, the record's c carries the modulus value itself (not a value < modulus). With u16 limbs, the previous assert!(c_cmp) would fire since c == modulus. The fix:

- Skips c < modulus assertion during setup.
- Forces b_diff_idx = READ_LIMBS (out-of-range sentinel) and searches c_diff_idx via reverse scan with .unwrap_or(0).
- Forces c_lt_mark = F::TWO during setup, avoiding the single-diff-index optimization path that would conflate b and c canonicity.
- Uses wrapping_sub for c_lt_diff since c[i] == modulus[i] during setup yields 0 via wrapping.

The execution side (execute_e12_impl) is also correctly guarded with if !IS_SETUP (line 730).
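The reverse scan and wrapping subtraction from finding 3 might look roughly like the sketch below. It is a loose reconstruction from the review text only: the function name, the little-endian limb order, and the direction of the subtraction are all assumptions, and the real trace filler works on field elements and extra marker columns not modelled here.

```rust
/// Hypothetical helper: finds the most significant limb where `c` differs
/// from `modulus` (limbs assumed little-endian, so the scan runs in reverse)
/// and returns that index with the limb difference.
/// During setup `c == modulus`, so no limb differs: the index falls back
/// to 0 and the difference is 0.
fn c_diff_sketch(c: &[u16], modulus: &[u16]) -> (usize, u16) {
    assert_eq!(c.len(), modulus.len());
    assert!(!c.is_empty());
    let idx = (0..c.len())
        .rev()
        .find(|&i| c[i] != modulus[i])
        .unwrap_or(0);
    // wrapping_sub never panics, even if the limb of `c` were the larger one.
    (idx, modulus[idx].wrapping_sub(c[idx]))
}
```

In the ordinary case `c < modulus` the returned difference is strictly positive at the chosen limb; the setup row is the one case where it is 0, which is why the review notes that the `c < modulus` assertion has to be skipped there.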

4. def_fn.rs:91 — F::from_usize upgraded to F::from_u32 (Addressed)

The latest commit now uses F::from_u32(u32::try_from(output_ref.len()).unwrap()), which catches unexpected sizes earlier. This addresses the concern from previous reviews.

5. Sponge init-row range check exemption is correct (Positive)

output/air.rs:334-339: Data-row sponge cells get per-cell u16 range checks gated on local.is_valid - local.is_first, correctly exempting the init row. The init row's sponge_inputs are pinned by assert_array_eq against already-canonical deferral_idx and output_len columns — no separate range check needed.
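The gating expression `is_valid - is_first` from finding 5 can be tabulated on plain integers. The sketch assumes both flags are boolean and that a first (init) row is always a valid row; the function name is hypothetical, and in the AIR the expression is a field element used as the range-check multiplicity.

```rust
/// Multiplicity of the per-cell u16 range check on a row:
/// 1 on data rows, 0 on the init row and on padding rows.
fn sponge_range_check_multiplicity(is_valid: u32, is_first: u32) -> u32 {
    assert!(is_valid <= 1 && is_first <= 1, "flags are assumed boolean");
    assert!(is_first <= is_valid, "an init row is assumed to be a valid row");
    is_valid - is_first
}
```

The three reachable cases are: data row `(1, 0)` gives 1, init row `(1, 1)` gives 0, and padding row `(0, 0)` gives 0.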

6. Verify-stark RangeCheckerAir<U16_BITS> is a clean addition

guest-libs/verify-stark/circuit/src/lib.rs:108-110 and trace.rs:47-50: The output commit range checks now use a dedicated RangeCheckerAir<U16_BITS> proving context instead of being aggregated into the range_inputs vec. The trace generator uses RangeCheckerCpuTraceGenerator to collect counts and generates the range checker trace row-major. trace.rs:142 correctly returns range_inputs: vec![] since the range checker now has its own AIR. As noted in reviewer comments, a future PR could unify this with the recursion range checker by bumping it to 16-bit.

7. combine_output_cells + expand_to_rv64_block size coupling is compile-time safe

Both call/air.rs:451 and output/air.rs:307 pass output_len_full: [AB::Expr; BLOCK_FE_WIDTH] (= 4 elements, zero-extended from F_NUM_U16S = 2) to combine_output_cells which expects [T; OUTPUT_LEN_NUM_U16S] (= 4). The equality OUTPUT_LEN_NUM_U16S == BLOCK_FE_WIDTH holds because 8 / 2 == 4, and any mismatch would be a compile error.
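The compile-time coupling in finding 7 follows from how Rust treats array lengths: `[T; BLOCK_FE_WIDTH]` and `[T; OUTPUT_LEN_NUM_U16S]` are the same type only while the two constants are equal. The sketch below uses hypothetical stand-ins (`expand_to_block` for `expand_to_rv64_block`, `OUTPUT_LEN_NUM_BYTES` as an assumed name for the 8-byte length) and plain `u16` values instead of AIR expressions.

```rust
const BLOCK_FE_WIDTH: usize = 4;
const F_NUM_U16S: usize = 2;
/// Assumed name for the 8-byte output length behind "8 / 2 == 4".
const OUTPUT_LEN_NUM_BYTES: usize = 8;
const OUTPUT_LEN_NUM_U16S: usize = OUTPUT_LEN_NUM_BYTES / 2;

/// Hypothetical stand-in for `expand_to_rv64_block`: zero-pads N u16 limbs
/// to one bus block of BLOCK_FE_WIDTH cells.
fn expand_to_block<const N: usize>(limbs: [u16; N]) -> [u16; BLOCK_FE_WIDTH] {
    assert!(N <= BLOCK_FE_WIDTH);
    let mut block = [0u16; BLOCK_FE_WIDTH];
    block[..N].copy_from_slice(&limbs);
    block
}

/// The return type names OUTPUT_LEN_NUM_U16S while the body produces
/// BLOCK_FE_WIDTH cells, so this only type-checks while the two are equal.
fn output_len_cells(len_limbs: [u16; F_NUM_U16S]) -> [u16; OUTPUT_LEN_NUM_U16S] {
    expand_to_block(len_limbs)
}
```

Changing either constant so that they differ turns `output_len_cells` into a type error, which is the sense in which the review calls the coupling compile-time safe.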

8. CUDA and Rust constant + struct consistency (Correct)

def_types.h mirrors every u16-cell constant from utils.rs: COMMIT_NUM_U16S, OUTPUT_LEN_NUM_U16S, OUTPUT_TOTAL_NUM_U16S, SPONGE_BYTES_PER_ROW, SPONGE_ROW_MEMORY_OPS, F_NUM_U16S, RV64_PTR_U16S. The helper functions (pack_u8_pairs_le, u32_bytes_to_le_u16_cells, scale_u16_high_cell, scale_output_len, scale_rv64_ptr_from_u32_bytes) have equivalent Rust counterparts. The CUDA column structs (DeferralOutputCols, DeferralCallCoreCols, DeferralCallAdapterCols) match the Rust #[repr(C)] layouts, and sizeof assertions guard against drift.

9. BitwiseOperationLookup fully removed from deferral (Clean removal)

extension/mod.rs no longer creates/looks up bitwise chips, cuda_abi.rs drops d_bitwise, extension/cuda.rs drops get_or_create_bitwise_op_lookup, all tests drop bitwise periphery. The canonicity sub-AIR and per-cell range checks now use VariableRangeCheckerBus exclusively.

10. e2e.rs test correctly delegates to generate_deferral_results (Good cleanup)

crates/continuations/src/tests/e2e.rs:134-145: compute_output_f_commit now delegates to generate_deferral_results + deferral_poseidon2_chip() instead of reimplementing the sponge hash. This eliminates the risk of the test diverging from production when the sponge rate changes.

11. Guest test outputs correctly updated for 16-byte alignment

All guest programs (single.rs, multiple.rs in both deferral/tests/ and continuations/) extend expected outputs from 8 to 16 bytes. The new test output_raw_len_must_be_sponge_row_aligned verifies the panic path. The deferral/tests/src/lib.rs input arrays are extended to 16 bytes.

Addressing reviewer comments

- @stephenh-axiom-xyz (u8 vs u16 in records): The byte-shaped record types (DeferralCallReadsBytes, DeferralCallWritesBytes) are intentional. The execution record captures byte-level data; the trace filler packs bytes into u16 column cells. The column types are now u16-shaped.
- @stephenh-axiom-xyz (BitwiseOperationLookupAir for u16 ops): Correctly noted by @shuklaayush that the table would need 2^32 entries, making it impractical.
- @stephenh-axiom-xyz (recursion range checker): The current PR adds RangeCheckerAir<U16_BITS> in verify-stark. A future PR could unify with the recursion range checker.
- @GunaDD (multiple.elf): Pre-compiled ELF for the updated guest program, expected to be committed alongside the source change.
- @GunaDD (RootCircuit::new assertion): Upgrades existing debug_assert! to assert! via centralized assert_user_pvs_shape. The requirements were always needed but not checked in release builds.
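The 2^32 figure in the BitwiseOperationLookupAir reply follows from counting operand pairs, assuming the lookup table holds one entry per `(x, y)` pair of limb values. The helper name below is hypothetical and the sketch only does the arithmetic.

```rust
/// Number of (x, y) operand pairs for a pairwise lookup over `limb_bits`-bit limbs.
fn pair_lookup_entries(limb_bits: u32) -> u64 {
    assert!(limb_bits <= 31, "2 * limb_bits must fit in a u64 shift");
    1u64 << (2 * limb_bits)
}
```

For byte limbs this is 65,536 entries; for u16 limbs it is 4,294,967,296, which is why the u16 checks go through a range checker instead.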

Summary

The PR is well-executed. The u16 cell migration is consistent across all layers, the doubled sponge rate is correctly factored through shared constants, and the is_eq.rs companion fix is a necessary correction for setup-row correctness with u16 limbs. The only actionable nit is the COMMIT_MEMORY_OPS derivation formula (finding #1), which doesn't affect correctness.

Verdict: Looks good to merge.

github-actions · 2026-06-04T10:32:59Z

| group | app.proof_time_ms | app.cycles | leaf.proof_time_ms |
| --- | ---: | ---: | ---: |
| fibonacci | 1,555 | 4,000,051 | 438 |
| keccak | 13,965 | 14,365,133 | 2,378 |
| sha2_bench | 9,193 | 11,167,961 | 1,426 |
| regex | 1,586 | 4,090,656 | 356 |
| ecrecover | 488 | 112,210 | 268 |
| pairing | 596 | 592,827 | 253 |
| kitchen_sink | 1,991 | 1,979,971 | 404 |

Note: cells_used metrics omitted because CUDA tracegen does not expose unpadded trace heights.

Commit: d457878

Benchmark Workflow

Golovanov399 and others added 30 commits March 13, 2026 21:02

chore: Remove the now redundant DagCommitBus (#2551)

e9e190e

docs: migrate recursion docs to OpenVM (#2553)

15481ae

fix: constrain is_last in ProofShapeAir (#2557)

9dd11af

feat: verify-stark guest library and SDK integration (#2555)

5693ea9

fix: Make it impossible for a row to be first and second (#2556)

617dfe1

This resolves INT-6814

fix: assert there is one row when flag in {0, 2} (#2561)

a0036c9

fix: deduplicate MerkleTreeSubAir and UserPvsCommitSubAir (#2559)

03337ef

fix: constrain row_idx_flags to row_idx in UserPvsCommitAir (#2562)

331aec5

fix: constrain is_valid start value in DeferralCircuitCountAir (#2567)

896c733

fix: constrain initial + final merkle proofs are equal up to address_height (#2564)

f452060

fix: add num_def_circuit_proofs to DeferralAggregationPvs (#2571)

60aa225

fix: final deferral input commit should be hash slice of commits (#2568)

cec4fb8

fix: verify-stark guest lib findings (#2569)

c3c0682

docs: explain the correspondence between verifier circuit and verify() (#2570)

959d78f

feat: deferral support for SDK root prover (#2573)

7b4a355

fix: deferral output_commit should be a true hash (#2574)

732eb66

fix: constrain F values in rv32 memory are canonically decomposed (#2576)

14d26b1

fix: range check shifted top byte of deferral heap pointers + output_len (#2579)

5fa9a72

chore: differentiate def_hook_cached_commit and def_hook_vk_commit (#2580)

f43ae34

fix: host verify crate should match root circuit w.r.t. recursion_flag (#2582)

50e7f47

chore: add continuations, SDK, and deferral workflows (#2589)

f7ceb3f

chore: remove native extension-specific code (#2590)

9f6721a

fix(v2-beta): get cargo-openvm compiling (#2588)

717654b

~~- we need `cargo-openvm` from `develop-v2.0.0-rc.1` to run the reth benchmark for this branch since it adds new opcodes~~ resolves int-6949

chore: feature-gate root prover and Bn254 in continuations (#2593)

11be8b6

docs: update continuations and deferral documentation (#2596)

06fe9ed

fix: host verify checks Proof cached commit (#2598)

28238f7

shuklaayush force-pushed the feat/memory-bus-u16-pr5 branch from eef7099 to 5d9fd3b Compare May 26, 2026 18:49

stephenh-axiom-xyz reviewed May 26, 2026

GunaDD reviewed May 29, 2026

Comment thread crates/continuations/programs/examples/multiple.elf

GunaDD May 29, 2026

Contributor

Was this accidentally included?

GunaDD reviewed May 29, 2026

Base automatically changed from feat/memory-bus-u16-pr1 to develop-v2.1.0-rv64 June 3, 2026 00:14

shuklaayush force-pushed the feat/memory-bus-u16-pr5 branch from 5d9fd3b to d29250f Compare June 4, 2026 09:08

shuklaayush and others added 5 commits June 4, 2026 11:59

fix(cli): run toolchain command off async runtime

067eafa

shuklaayush force-pushed the develop-v2.1.0-rv64 branch from 72d313f to e6803a8 Compare June 4, 2026 10:04

shuklaayush and others added 3 commits June 4, 2026 12:05

fix(algebra): avoid setup is-eq modulus assert

74ddeba

fix(deferral): clean u16 CUDA helpers and comments

d457878

shuklaayush force-pushed the feat/memory-bus-u16-pr5 branch from d29250f to d457878 Compare June 4, 2026 10:27

shuklaayush force-pushed the develop-v2.1.0-rv64 branch 4 times, most recently from 2995a72 to e582d73 Compare June 7, 2026 11:31

jonathanpwang removed their request for review June 9, 2026 01:08

shuklaayush force-pushed the develop-v2.1.0-rv64 branch from dac2c10 to db32ec4 Compare June 10, 2026 18:01

		@@ -78,7 +81,10 @@ fn hash_output_raw<F: VmField>(
		deferral_idx: u32,
		output_ref: &[u8],

15 participants