Skip to content

feat(v2.1-rv64): switch to u16 limbs in deferral airs#2808

Open
shuklaayush wants to merge 615 commits into
develop-v2.1.0-rv64from
feat/memory-bus-u16-pr5
Open

feat(v2.1-rv64): switch to u16 limbs in deferral airs#2808
shuklaayush wants to merge 615 commits into
develop-v2.1.0-rv64from
feat/memory-bus-u16-pr5

Conversation

@shuklaayush

@shuklaayush shuklaayush commented May 22, 2026

Copy link
Copy Markdown
Collaborator

PR 5 of the memory-bus-u16 split, stacked on #2794. Shrinks deferral trace columns whose constraints only operate on packed u16 values, and doubles the sponge absorption rate of the deferral output chip from one digest per row to two.

Why

#2794 still stores deferral commit / output-key columns byte-shaped while packing the memory-bus payloads. Shrinking those columns to u16 cells halves the affected trace width and lets the canonicity sub-AIR use per-u16-cell 16-bit range checks instead of byte-pair BitwiseOperationLookup checks.

Separately, the deferral output chip absorbs only DIGEST_SIZE = 8 bytes per Poseidon row even though the sponge state can hold 2 * DIGEST_SIZE = 16 bytes. Doubling the absorption rate halves the number of output rows for the same byte stream.

What changes

Trace columns to u16 cells

Files: extensions/deferral/circuit/src/{call,output,canonicity}/, extensions/deferral/circuit/src/utils.rs, extensions/deferral/circuit/cuda/.

  • Call core: input_commit, output_commit, output_len are u16 cells.
  • Output core: output_commit, output_len are u16 cells.
  • Canonicity sub-AIR operates over u16 cells; upper-bound checks become per-u16-cell 16-bit range checks (instead of byte-pair bitwise checks).
  • Deferral rd_val / rs_val register-pointer columns are u16-shaped.
  • Reuses feat(v2.1-rv64): switch memory bus blocks to u16 cells #2794's split_byte_memory_ops / split_f_memory_ops and byte_memory_op_chunk / f_memory_op_chunk helpers; no new split_memory_ops-style helper is introduced.
  • CUDA tracegen mirrors the new column shapes.

Sponge absorption rate

Files: extensions/deferral/circuit/src/output/{air,trace,execution,tests}.rs, extensions/deferral/circuit/src/def_fn.rs, extensions/deferral/circuit/src/extension/mod.rs, extensions/deferral/circuit/cuda/src/output.cu.

  • SPONGE_BYTES_PER_ROW = 2 * DIGEST_SIZE = 16 (was DIGEST_SIZE = 8).
  • sponge_inputs cells hold u16 values on data rows, each carrying two output bytes; data-row sponge_inputs receive 16-bit range checks. The init row remains [deferral_idx, output_len, 0, ...] and is exempt.
  • Two memory-bus writes per output row instead of one — the row now consumes 16 bytes of output, not 8.
  • def_fn::hash_output_raw packs byte pairs into each sponge cell before absorbing.
  • Bitwise-lookup wiring removed from the output chip; the call chip still uses bitwise lookup for non-canonical commit checks.

Output length contract

  • output_len remains a byte count.
  • Raw deferral outputs must be a multiple of 16 bytes; non-16-byte-aligned outputs are invalid with no implicit padding.
  • Host-side failure-path test asserts the panic when a non-16-aligned raw output is submitted.

Migration notes

  • Hosts producing deferral output must align raw outputs to a multiple of 16 bytes.
  • Output-chip configs that previously passed bitwise_lu should drop the argument.

resolves int-7834

Golovanov399 and others added 30 commits March 13, 2026 21:02
## Summary
- Removed `openvm/` prefix from source paths across 8 documentation
files (21 occurrences)
- Updated stale `v2-proof-system` repository reference in
`docs/crates/recursion/README.md` to reflect the current openvm repo
- Updated `stark-backend` path in
`docs/crates/recursion/verifier-mapping.md` to link to the
[stark-backend GitHub repo](https://github.com/openvm-org/stark-backend)

Fixes issues identified in
#2553 (comment)

## Test plan
- [x] Verified no remaining `openvm/crates/recursion/src/` prefixes
exist
- [x] Verified no remaining `v2-proof-system` references exist
- [x] Verified no remaining `stark-backend/crates/` local path
references exist

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Jonathan Wang <jonathanpwang@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This resolves INT-6828.

Now `Eq3bAir` propagates the `row_idx` by the correct amount for each
AIR. Note that there is no need for this air to know `air_idx` or
anything other than the number of interactions and the `n_lift` for this
air, so no new columns are added (but we now do the interaction with
proof shape air on the last row of the air instead of the first one).
…2563)

## Summary
- Add a `changes` detection job using `dorny/paths-filter@v3` to skip
the `lint-cuda` job when no CUDA-related files (`*.cu`, `*.cuh`,
`**/cuda/**`, `**/cuda*.rs`, etc.) are modified
- Request at least 8 CPUs on the GPU runner (`/cpu=8`) for faster builds
- The workflow file and CUDA cache action are also included as triggers
so changes to CI itself still run the CUDA lint

## Test plan
- [ ] Open a PR that doesn't touch any CUDA files and verify `lint-cuda`
is skipped
- [ ] Open a PR that touches a `.cu` or `.cuh` file and verify
`lint-cuda` runs

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
~~- we need `cargo-openvm` from `develop-v2.0.0-rc.1` to run the reth
benchmark for this branch since it adds new opcodes~~

resolves int-6949
## Summary

- Adds `scripts/pre-push.sh`, a lightweight bash script that runs
formatting, clippy, and tests locally on only the crates changed
compared to a target branch (default: `develop-v2.0.0-beta`)
- Auto-detects NVIDIA GPU via `nvidia-smi` and enables `--features
cuda,touchemall` when present; CPU-only otherwise
- Uses `cargo nextest` with `--cargo-profile=fast` when available (falls
back to `cargo test`), and applies the `heavy` nextest profile for
integration test crates

## Details

The script follows three steps:
1. **`cargo +nightly fmt --all -- --check`** — full workspace format
check (fast, nightly required per repo config)
2. **`cargo clippy`** — per changed crate, with appropriate feature
flags and `-D warnings`
3. **`cargo nextest run`** — per changed crate, with
`--cargo-profile=fast` and feature detection

Crate detection walks changed file paths upward to find the nearest
`Cargo.toml` with a `[package]` section. Benchmark and guest program
crates are skipped (they require nightly `rust-src`).

## Usage

```bash
./scripts/pre-push.sh                          # compare against develop-v2.0.0-beta
./scripts/pre-push.sh main                     # compare against main
./scripts/pre-push.sh origin/my-feature        # any valid git ref
```

## Test plan

- [ ] Run on a branch with Rust changes and verify only affected crates
are checked
- [ ] Run on a branch with no Rust changes (e.g., docs-only) and verify
early exit
- [ ] Run on a machine with GPU and verify `cuda,touchemall` features
appear in output
- [ ] Run on a machine without GPU and verify only `parallel` feature is
used
- [ ] Verify formatting failures are caught and reported

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Jonathan Wang <jonathanpwang@users.noreply.github.com>
@github-actions

This comment was marked as outdated.

@shuklaayush shuklaayush force-pushed the feat/memory-bus-u16-pr5 branch from eef7099 to 5d9fd3b Compare May 26, 2026 18:49
@github-actions

github-actions Bot commented May 26, 2026

Copy link
Copy Markdown

Claude finished @shuklaayush's task in 2m 21s —— View job


Code Review: PR #2808 — Switch to u16 limbs in deferral AIRs

  • Gather context and read the diff
  • Review deferral call chip changes (air, trace, tests)
  • Review deferral output chip changes (air, trace, execution, tests)
  • Review canonicity sub-AIR changes
  • Review utility functions and helpers
  • Review CUDA tracegen changes
  • Review continuations and verify-stark changes
  • Review guest program and test changes
  • Post final review

This is a well-structured PR that consistently migrates deferral columns from byte cells to u16 cells while doubling the output sponge absorption rate. The changes are mechanically clean and internally consistent across Rust AIR, tracegen, CUDA, and the verify-stark subcircuit. I have a few observations:


Findings

1. combine_output_cells accepts OUTPUT_LEN_NUM_U16S but caller passes BLOCK_FE_WIDTH

In call/air.rs:467 and output/air.rs:305-306, the output-len is expanded via expand_to_rv64_block into a [AB::Expr; BLOCK_FE_WIDTH] (8 elements), then passed to combine_output_cells which expects [T; OUTPUT_LEN_NUM_U16S] (4 elements). This works because combine_output_cells takes output_len_cells: [T; OUTPUT_LEN_NUM_U16S] and the array size is checked at compile time — but only because OUTPUT_LEN_NUM_U16S == 4 and BLOCK_FE_WIDTH == 8. Wait — actually this should be a compile error or the function signature accepts BLOCK_FE_WIDTH.

Let me re-read: combine_output_cells signature is:

pub(crate) fn combine_output_cells<T>(
    output_commit: impl IntoIterator<Item = T>,
    output_len_cells: [T; OUTPUT_LEN_NUM_U16S],  // 4 elements
) -> [T; OUTPUT_TOTAL_NUM_U16S]

But the callers pass output_len_full: [AB::Expr; BLOCK_FE_WIDTH] (8 elements). This should fail to compile unless OUTPUT_LEN_NUM_U16S == BLOCK_FE_WIDTH (both happen to be 4 since U16_CELL_SIZE=2, so OUTPUT_LEN_NUM_BYTES/U16_CELL_SIZE = 8/2 = 4 and BLOCK_FE_WIDTH = 4). With BLOCK_FE_WIDTH = 4 this works. The type-level coupling is safe since both are 4, but worth noting this relies on BLOCK_FE_WIDTH == OUTPUT_LEN_NUM_U16S. (No action needed, just an observation.)

2. Sponge init-row range checks are correctly exempted (Positive)

output/air.rs:332-337 — The data-row sponge cells get per-cell u16 range checks gated on local.is_valid - local.is_first, correctly exempting the init row (whose sponge_inputs are constrained via assert_array_eq against already-canonical deferral_idx and output_len columns). This is clean.

3. def_fn.rs:91output_ref.len() is cast to F via from_usize, which for BabyBear can silently reduce values ≥ 2^31 (Low risk)

def_fn.rs:91:

state[1] = F::from_usize(output_ref.len());

If the output length ever exceeded P ≈ 2^31 - 1 (extremely unlikely given the 16-byte-aligned constraint + address_bits bound), this would silently wrap. The address_bits range check in the AIR prevents this from happening in practice, so this is not a real bug, but using F::from_u32(u32::try_from(output_ref.len()).unwrap()) would make the intent clearer and catch unexpected sizes earlier.

4. Missing range check on input_commit cells in the call chip (Question)

In call/air.rs, the canonicity sub-AIR is applied to input_commit and the returned value is range-checked. But unlike output_commit and output_len (which get per-cell u16 range checks at lines 171-179), input_commit does not appear to get per-cell u16 range checks — only the canonicity remainder value is range-checked. For output_commit, both the canonicity RC and the per-cell RC are applied.

This is likely intentional: input_commit is read from memory (the memory bus verifies the cells are well-formed), while output_commit is written (and output_len is written), so the per-cell RC on writes ensures the values being committed to memory are canonical u16s. The canonicity sub-AIR on input_commit only needs its returned value range-checked since each limb is already guaranteed canonical by the read. Worth a quick confirmation this reasoning is correct.

5. Verify-stark range_inputs emptied, replaced by dedicated RangeCheckerAir (Structural change, looks correct)

trace.rs:145 now returns range_inputs: vec![] and instead passes the range_ctx from the output trace generator as a dedicated RangeCheckerAir<U16_BITS> proving context. This is a clean separation — the range checker now has its own trace in the subcircuit rather than being aggregated. The lib.rs correctly wires up the new RangeCheckerAir<U16_BITS> in both DeferredVerifyCircuit and the circuit AIR list.

6. RootCircuit refactored from derive_new to manual new with validation (Good defensive change)

The RootCircuit::new constructor now validates num_user_pvs via assert_user_pvs_shape, and the same assertion is used in commit/air.rs, commit/trace.rs, memory/air.rs, and memory/trace.rs. This replaces scattered debug_assert! calls with a single function and upgrades some to assert!. Good hygiene.

7. CUDA and Rust column layout consistency

The CUDA DeferralOutputCols<T> and DeferralCallCoreCols<T> / DeferralCallAdapterCols<T> structs mirror the Rust #[repr(C)] AlignedBorrow structs. The column shapes (RV64_PTR_U16S, COMMIT_NUM_U16S, F_NUM_U16S, SPONGE_ROW_MEMORY_OPS) are consistently defined in def_types.h to match utils.rs. The sizeof assertions (assert(width == sizeof(...))) guard against drift. Looks consistent.

8. Guest test expected outputs updated for 16-byte alignment (Correct)

All guest program expected outputs (single.rs, multiple.rs in both deferral/tests/ and continuations/) have been extended from 8-byte to 16-byte arrays. The make_result function in tests asserts the new alignment (SPONGE_BYTES_PER_ROW). The new test output_raw_len_must_be_sponge_row_aligned verifies the panic path for non-aligned outputs.

9. Minor: combine_output_cells function naming

combine_output_cells creates the concatenation [output_commit || output_len] from u16 cells. The name is clear but could be confused with combine_output (the byte version). Both coexist in utils.rs. This is fine but a single generic version parameterized by cell count could reduce the duplication in the future.


Summary

The PR is well-executed with consistent changes across the Rust AIR, tracegen, CUDA tracegen, and the verify-stark subcircuit. The key architectural change (u16 cells + doubled sponge rate) is cleanly factored through shared constants in utils.rs / def_types.h. The defensive validations (RootCircuit::new, alignment assertion test) are good additions. I didn't find any correctness bugs.

Verdict: Looks good to merge. The observations above are minor and don't block.

@github-actions

This comment has been minimized.

@stephenh-axiom-xyz stephenh-axiom-xyz left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be a lot of places that still use u8 instead of u16, I've pointed some out in comments

&self,
builder: &mut AB,
x: &[AB::Var],
x: [AB::Expr; F_NUM_U16S],

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Is it possible to use a slice of Into<AB::Expr> or smth

output_commit_bus,
def_idx: self.def_idx,
};
let output_range_air = RangeCheckerAir::<U16_BITS> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given you converted the RISC-V chips to do 16-bit limb addition/multiplication, I think you could change the range checker in recursion to be 16-bits instead of adding it here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm yeah, i was trying to not touch recursion but i guess maybe just bumping the recursion range checker to be 16-bits isn't that big of a deal

// The trace filler packs byte pairs into the u16-shaped columns.
#[repr(C)]
#[derive(Clone, Copy, Debug)]
pub struct DeferralCallReadsBytes<F> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not initialize the DeferralCallReads<u16, F> during execution?

@@ -50,8 +49,8 @@ use crate::{
#[repr(C)]
#[derive(AlignedBorrow, StructReflection, Clone, Copy, Debug)]
pub struct DeferralCallReads<B, F> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: B stands for bytes

Comment thread extensions/deferral/circuit/src/call/air.rs
.fold(idx as u8, |acc, byte| acc.wrapping_add(*byte));
let num_chunks = (seed as usize % 4) + 1;
let len = num_chunks * DIGEST_SIZE;
let len = num_chunks * SPONGE_BYTES_PER_ROW;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the length changed here? The input should be in u16s now no?


// First row sponge input is [deferral_idx, output_len, 0, ...].
// Later rows sponge and write the next DIGEST_SIZE output bytes.
// Later rows sponge and write the next SPONGE_BYTES_PER_ROW output bytes.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why aren't we measuring this in u16s?

@@ -78,7 +81,10 @@ fn hash_output_raw<F: VmField>(
deferral_idx: u32,
output_ref: &[u8],

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment, why not u16?

Add comprehensive test coverage for decoder crate.

`cargo tarpaulin` reports:
```bash
|| crates/toolchain/decoder/src/instruction_formats.rs: 48/48 +0.00%
|| crates/toolchain/decoder/src/process_instruction.rs: 148/150 +0.00%
```

The remaining 2 uncovered lines are unreachable match arms (`_ => None`)
for a fully exhaustive 3-bit funct3 match in `process_instruction.rs`.



towards INT-6467

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this accidentally included ?

Comment on lines 45 to 48
debug_assert!(num_user_pvs >= DIGEST_SIZE);
debug_assert!(num_user_pvs.is_multiple_of(DIGEST_SIZE));
debug_assert!((num_user_pvs / DIGEST_SIZE).is_power_of_two());
assert_user_pvs_shape(num_user_pvs);
let encoder = Encoder::new(num_user_pvs / DIGEST_SIZE, MAX_ENCODER_DEGREE, true);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is unrelated to the u16 memory bus change

memory_dimensions: MemoryDimensions,
num_user_pvs: usize,
) -> Self {
assert_user_pvs_shape(num_user_pvs);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why was this assertion added now ? was this always needed or is this a new requirement

Base automatically changed from feat/memory-bus-u16-pr1 to develop-v2.1.0-rv64 June 3, 2026 00:14
@shuklaayush shuklaayush force-pushed the feat/memory-bus-u16-pr5 branch from 5d9fd3b to d29250f Compare June 4, 2026 09:08
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown

Claude finished @shuklaayush's task in 2m 49s —— View job


Code Review: PR #2808 — Switch to u16 limbs in deferral AIRs

  • Gather context and read the full diff
  • Review deferral call chip changes
  • Review deferral output chip changes
  • Review canonicity sub-AIR changes
  • Review utility functions and helpers
  • Review CUDA tracegen changes
  • Review continuations and verify-stark changes
  • Review guest program and test changes
  • Address reviewer comments
  • Post final review

This is a clean, well-structured PR that consistently migrates deferral columns from byte cells to u16 cells and doubles the output sponge absorption rate. I reviewed all 40 changed files across Rust AIR, tracegen, CUDA, verify-stark, and continuations. Below are my findings, ordered by significance.


Findings

1. Missing per-cell u16 range checks on input_commit in call core AIR (Question — likely intentional)

In call/air.rs:170-173, the input_commit_rcs (canonicity remainder values) are range-checked. But unlike output_commit and output_len (which get per-cell u16 range checks at lines 181-191), input_commit cells themselves are not individually range-checked in the core AIR. The adapter AIR reads input_commit from RV64_MEMORY_AS via the memory bridge (line 411-424), so the memory bus guarantees the cells are well-formed u16 values — the write side of the bus must have range-checked them. This reasoning is sound provided the u16 memory bus in PR #2794 enforces cell canonicity on writes. Worth confirming this invariant holds.

The tracegen in call/trace.rs:182-184 correspondingly only range-checks output_commit and output_len cells — input_commit cells are omitted — which is consistent with the AIR.

2. DeferralCallReadsBytes / DeferralCallWritesBytes are not #[repr(C)] aligned to DeferralCallReads / DeferralCallWrites (By design, not a bug)

The new byte-shaped record types (call/air.rs:72-91) have different field types than the u16-cell AIR columns. The trace filler manually packs bytes into u16 cells (call/trace.rs:156-164). This is a reasonable pattern — the execution record stays in the natural byte representation, and the filler handles the packing. The #[repr(C)] on the byte structs ensures the layout is predictable for the AlignedBytesBorrow derive on DeferralCallCoreRecord.

3. output_len_full size coupling between expand_to_rv64_block and combine_output_cells (Safe, compile-time checked)

combine_output_cells expects [T; OUTPUT_LEN_NUM_U16S] (= 4). expand_to_rv64_block returns [T; BLOCK_FE_WIDTH] (= 4). These match because OUTPUT_LEN_NUM_BYTES / U16_CELL_SIZE = 8/2 = 4 = BLOCK_FE_WIDTH. If either constant changed independently, this would be a compile error, so it's safe. But the semantic meaning differs: the expand_to_rv64_block zero-extends the F_NUM_U16S = 2 output length cells to 4, and then combine_output_cells treats all 4 as u16 cells for the output key. The upper 2 cells are always zero since output_len is a u32 value stored in F_NUM_U16S = 2 cells. This is correct.

4. Sponge init-row range-check exemption is correct (Positive)

output/air.rs:332-337 (in the diff) — Data-row sponge cells get per-cell u16 range checks gated on local.is_valid - local.is_first, correctly exempting the init row. The init row's sponge_inputs are pinned by assert_array_eq against already-canonical deferral_idx and output_len columns.

5. def_fn.rs:91F::from_usize(output_ref.len()) silent wrapping for huge lengths (Low risk)

state[1] = F::from_usize(output_ref.len());

For BabyBear, values >= 2^31 would silently reduce mod P. The address_bits range check in the AIR prevents this in practice, but using F::from_u32(u32::try_from(output_ref.len()).unwrap()) would make the intent clearer and catch unexpected sizes earlier in host code. This is a minor defensive improvement, not a soundness issue.

6. e2e.rs test refactored to use generate_deferral_results directly (Good cleanup)

crates/continuations/src/tests/e2e.rs:134-145 — The compute_output_f_commit function was previously a manual reimplementation of the sponge hash. It now delegates to generate_deferral_results, which is the canonical implementation. This eliminates the risk of the test's hash diverging from the production code when the sponge rate changes. Well done.

7. RootCircuit::new validation and assert_user_pvs_shape (Good defensive change)

crates/continuations/src/circuit/root/mod.rs:41-55 — The manual new constructor with assert_user_pvs_shape replaces derive_new, upgrading scattered debug_assert! calls to a single assert! function. This is used in commit/air.rs, commit/trace.rs, memory/air.rs, and memory/trace.rs. Good hygiene that catches misconfigurations in release builds.

8. CUDA / Rust constant and column struct consistency (Correct)

def_types.h mirrors every constant from utils.rs: COMMIT_NUM_U16S, OUTPUT_LEN_NUM_U16S, OUTPUT_TOTAL_NUM_U16S, SPONGE_BYTES_PER_ROW, SPONGE_ROW_MEMORY_OPS, F_NUM_U16S, RV64_PTR_U16S. The DeferralOutputCols<T> and DeferralCallCoreCols<T> / DeferralCallAdapterCols<T> structs in CUDA match the Rust #[repr(C)] layouts. sizeof assertions guard against drift. The helper functions (pack_u8_pairs_le, u32_bytes_to_le_u16_cells, scale_u16_high_cell, scale_output_len, scale_rv64_ptr_from_u32_bytes) mirror their Rust counterparts.

9. bitwise_op_lookup fully removed from deferral extension (Clean removal)

The BitwiseOperationLookupBus / BitwiseOperationLookupChip dependency has been completely removed from the deferral extension: extension/mod.rs no longer creates or looks up bitwise chips, cuda_abi.rs drops the d_bitwise parameter, and all tests drop bitwise periphery. The call chip still uses the canonicity sub-AIR (which was already not using bitwise), and the output chip's bitwise usage is replaced by VariableRangeCheckerBus range checks. This is a clean factoring.

10. is_eq.rs modular chip fix (Tangential but correct)

extensions/algebra/circuit/src/modular_chip/is_eq.rs — The setup-row handling is fixed: during setup, c may equal the modulus (it's the modulus itself), so the c < modulus assertion is correctly skipped. The c_diff_idx fallback finds the first differing limb or defaults to 0, and uses wrapping_sub to avoid panic when c == modulus. The c_lt_mark is forced to F::TWO during setup to avoid the single-diff-index optimization colliding with the setup row's different semantics. This looks like a necessary fix for setup correctness with u16 limbs.

11. Adapter ptr_to_u16_limbs renamed to ptr_to_field_u16_limbs (Minor, in riscv-adapters)

vec_heap_u16.rs and vec_heap_branch_u16.rs update the function name. The rename avoids confusion with the raw u16 version — ptr_to_field_u16_limbs returns [F; N] directly.


Addressing reviewer comments

@stephenh-axiom-xyz's comment about u8 vs u16: Several places (record types, output_len bytes) still use u8 in the record / host-side representation. This is intentional — the execution record captures byte-level data, and the trace filler packs into u16 cells. The column types (DeferralCallReads<T, T>, DeferralCallWrites<T, T>) are now u16-shaped.

@stephenh-axiom-xyz's comment about BitwiseOperationLookupAir for u16 ops: Correctly noted by @shuklaayush that the table size would need to be 2^32, making it impractical.

@stephenh-axiom-xyz's comment about recursion range checker: @shuklaayush noted this could be bumped to 16-bit. The current PR adds a dedicated RangeCheckerAir<U16_BITS> in the verify-stark subcircuit (lib.rs:108), which is clean but does add an extra AIR. A future PR could unify this with the recursion range checker.

@GunaDD's comment about multiple.elf: This is the pre-compiled ELF for the updated guest program — it's expected to be committed alongside the source change.

@GunaDD's comment about RootCircuit::new assertion: This upgrades existing debug_assert! checks to assert! and centralizes them. The requirements (num_user_pvs >= DIGEST_SIZE, multiple of DIGEST_SIZE, power-of-two leaf count) were always needed — they just weren't checked in release builds.


Summary

The PR is well-executed with consistent changes across Rust AIR, tracegen, CUDA, verify-stark, and continuations. The migration from byte cells to u16 cells is mechanically clean, and the doubled sponge absorption rate is correctly threaded through all layers. The is_eq.rs fix is a necessary companion change for setup-row correctness with u16 limbs. I didn't find any correctness bugs.

Verdict: Looks good to merge. The only actionable suggestion is the minor defensive F::from_u32(u32::try_from(...).unwrap()) in def_fn.rs:91.

shuklaayush and others added 5 commits June 4, 2026 11:59
# Switch memory bus blocks to u16 cells

PR 1 of the memory-bus-u16 split. Atomic flip of the memory-bus block
from `8 u8 cells` to `4 u16 cells`. The bus interaction is now
fixed-shape (`BLOCK_FE_WIDTH = 4`), and byte-shaped chips pack their u8
columns into bus cells via a single helper.

## Why

`MemoryBus` interactions previously carried 8 field-element-encoded u8s
per message. RV64 cell-typed memory (registers, RV memory, public
values) is naturally u16-aligned, and packing two bytes per bus cell
halves the bus payload width without losing constraint power. This PR
makes the bus shape a single hard constant (`BLOCK_FE_WIDTH`) so the
bridge API and downstream chips no longer need a `const N` generic for
it.

Follow-up PRs will use the wider cells to narrow trace columns, prune
redundant range checks, reshape SHA-2 / Keccak state, and revisit
deferral.

## What changes

### Memory layout constants

Files: `crates/vm/src/arch/config.rs`,
`crates/vm/cuda/include/system/memory/params.cuh`.

- Single source of truth: `U16_CELL_SIZE = size_of::<u16>()` (intrinsic)
drives `BUS_PTR_SCALE`; `BLOCK_FE_WIDTH = 4` is the design choice;
`MEMORY_BLOCK_BYTES`, `BUS_BLOCK_STRIDE`, `BUS_LEAF_STRIDE`,
`DIGEST_WIDTH` all derive.
- Cross-constant equality asserts removed where the derivation makes
them true by construction.
- One divisibility guard kept on
`DIGEST_WIDTH.is_multiple_of(BLOCK_FE_WIDTH)`.
- CUDA-side `params.cuh` mirrors the constants and includes ASCII
diagrams for the u16-celled AS layout (RV64) and the F-celled AS layout
(`DEFERRAL_AS`), showing how `byte_ptr`, `cell_idx`, and `bus_ptr`
relate.

### Layout reference

Terminology:

- **Cell** — one storage word in an address space (`u16` for RV64 ASes,
`F` for `DEFERRAL_AS`).
- **Block** — one memory-bus message: `BLOCK_FE_WIDTH` cells =
`MEMORY_BLOCK_BYTES` bytes.
- **Digest** — output of one Poseidon2 compression, `DIGEST_WIDTH`
cells; also one merkle leaf.
- **Leaf** — one merkle-tree leaf = one Poseidon2 half = `DIGEST_WIDTH`
cells = `BLOCKS_PER_LEAF` blocks.

<details>
<summary><b>u16-celled AS layout</b> (<code>RV64_REGISTER_AS</code>,
<code>RV64_MEMORY_AS</code>, <code>PUBLIC_VALUES_AS</code>)</summary>

One merkle leaf = 16 bytes = 8 u16 cells = 2 bus blocks:

```text
  byte_ptr:     0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
               ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
  u8 storage:  │b0 │b1 │b2 │b3 │b4 │b5 │b6 │b7 │b8 │b9 │b10│b11│b12│b13│b14│b15│
               └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
               ╰──────╯╰──────╯╰──────╯╰──────╯╰──────╯╰──────╯╰──────╯╰──────╯   u16 cells (LE)
  cell_idx:        0       1       2       3       4       5       6       7
  bus_ptr:         0       2       4       6       8      10      12      14
               ╰─────────── block 0 ──────────╯╰─────────── block 1 ──────────╯
               ╰────────────────── one merkle leaf / digest ──────────────────╯
```

- `byte_ptr = U16_CELL_SIZE * cell_idx` (coincides with `bus_ptr` for
u16 ASes)
- `bus_ptr  = BUS_PTR_SCALE  * cell_idx`
- Block stride on bus: `BUS_BLOCK_STRIDE = 8 = BUS_PTR_SCALE *
BLOCK_FE_WIDTH`
- Leaf stride on bus: `BUS_LEAF_STRIDE = 16 = BUS_PTR_SCALE *
DIGEST_WIDTH`

</details>

<details>
<summary><b>F-celled AS layout</b> (<code>DEFERRAL_AS</code>)</summary>

Each cell holds one `F` element (`size_of::<F>()` bytes on host; 4 for
BabyBear). The bus addresses cells directly, so cell / block / leaf
counts and `bus_ptr` / `cell_idx` mappings match the u16 AS.

```text
  byte_ptr:       0       4       8       12      16      20      24      28
               ┌───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┐
  F storage:   │  F0   │  F1   │  F2   │  F3   │  F4   │  F5   │  F6   │  F7   │
               └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
  cell_idx:        0       1       2       3       4       5       6       7
  bus_ptr:         0       2       4       6       8      10      12      14
               ╰─────────── block 0 ──────────╯╰─────────── block 1 ──────────╯
               ╰────────────────── one merkle leaf / digest ──────────────────╯
```

- `byte_ptr = size_of::<F>() * cell_idx` (host-memory stride; ≠
`bus_ptr`)
- `bus_ptr  = BUS_PTR_SCALE   * cell_idx` (same formula as u16 ASes)
- `BUS_BLOCK_STRIDE` and `BUS_LEAF_STRIDE` unchanged — the bus addresses
cells, not bytes.

</details>

### Address-space cell types

- `RV64_REGISTER_AS`, `RV64_MEMORY_AS`, `PUBLIC_VALUES_AS` switch to
u16-celled.
- `DEFERRAL_AS` stays F-celled (moves to u16 in a follow-up).

### Persistent memory and merkle

- Boundary AIR, merkle tree, merkle public-values, and CUDA inventory
rewritten for 4-cell-per-block u16 leaves (`BLOCKS_PER_LEAF =
DIGEST_WIDTH / BLOCK_FE_WIDTH = 2`).
- `assert_public_values_shape::<DIGEST_WIDTH>(num_public_values_bytes)`
and `SystemConfig::public_values_cell_count()` formalize the
public-values byte-count → cell-count contract at construction and at
every read site.

### Offline-checker bridge

File: `crates/vm/src/system/memory/offline_checker/bridge.rs`.

- `MemoryBridge::read` / `write` drop the `const N` generic; bus payload
is always `[T; BLOCK_FE_WIDTH]`.
- Unified `MemoryWriteAuxInput` lets `write` accept either an
`&MemoryWriteAuxCols<V, BLOCK_FE_WIDTH>` (standard case) or
`from_prev_data_exprs(&base_aux, prev_data_expr)` for chips that
materialize prev_data in their own columns (Keccak, SHA-2, LoadStore).
- Added `pack_u8_block` (AB::Expr) and `pack_u8_block_value` (F-value
form) so byte-shaped chips and testing helpers produce bus messages
identical to what u16 chips emit natively.
- Removed `read_or_immediate` / `MemoryReadOrImmediateOperation` (an
RV32 pattern superseded by `RV64_IMM_AS`).

### TracingMemory / GuestMemory byte view

- `read_bytes` / `write_bytes` (cell-type-aware, BLOCK-aligned) for raw
byte access against u16-celled storage.
- Deferral-specific helpers (`timed_read_deferral`,
`tracing_read_deferral`, …) for the F-celled AS.

### Chip call-sites

~50 sites across riscv, riscv-adapters, keccak256, sha2, deferral.

- Every `memory_bridge.read` / `.write` now hands the bridge a
`BLOCK_FE_WIDTH`-shaped payload.
- Byte-shaped chips wrap their `[T; 8]` u8 columns with `pack_u8_block`.
- Register-read paths still zero-extend the low 32 bits to 8 bytes, then
pack.
- Per-AS metered leaf-bit shift in
`execution_mode/metered/memory_ctx.rs` (`U16_AS_LEAF_PTR_BITS = 4` for
byte-pointer ASes, `CELL_AS_LEAF_PTR_BITS = 3` for cell-pointer ASes),
so RV64 byte pointers produce the right leaf label.

### Algebra / ECC adapter cleanup

No-op semantic; finishes upstream removal.

- Dropped phantom `BLOCK_SIZE` const generic from
`ModularAir`/`ModularChip`, `Fp2Air`/`Fp2Chip`,
`WeierstrassAir`/`WeierstrassChip` and their `get_*_air` / `get_*_chip`
helpers.
- The generic became unused after `Rv64VecHeapAdapter*` dropped its
`BLOCK_SIZE` parameter; keeping it was a phantom that broke inference at
call sites.

### Testing infrastructure

- `MemoryTester` / `DeviceMemoryTester` accept either `N =
BLOCK_FE_WIDTH` (cell index) or `N = MEMORY_BLOCK_BYTES` (byte pointer
in a u16 AS) and route accordingly; the byte path packs the result so
the dummy chip's bus message matches a byte-shaped chip's exactly.
- `pack_u8_block_value` is shared between the testing helper and the
CUDA boundary-chip test that previously hand-rolled the same packing.

### CUDA

- `params.cuh` constants + static_asserts kept in sync.
- Kernel memory-aux fill paths updated for the new block shape.

## Out of scope (follow-up PRs)

- Trace column narrowing now that u16-celled storage matches column
width.
- Redundant range-check removal on byte-shaped chips.
- Dead bitwise-lookup cleanup.
- SHA-2 / Keccak state reshaping to natively-u16 layouts.
- Moving `DEFERRAL_AS` to u16.
- Deferral commit / output-hash changes.

## Migration notes

- `MemoryBridge::read::<N>` / `::write::<N>` calls: drop `<N>`; pass a
`BLOCK_FE_WIDTH`-shaped payload. Byte-shaped data goes through
`pack_u8_block` (constraint side) and `pack_u8_block_value` (witness
side).
- `read_or_immediate` callers: switch to `RV64_IMM_AS`-based addressing.
- `MemoryWriteAuxCols<V, N>` with `N != BLOCK_FE_WIDTH`: use
`MemoryWriteAuxInput::from_prev_data_exprs(&base_aux, prev_data)` to
supply prev_data inline.
- `ModularAir`, `ModularChip`, `Fp2Air`, `Fp2Chip`, `WeierstrassAir`,
`WeierstrassChip` now take one fewer const generic (`BLOCK_SIZE`
dropped).

Resolves INT-7829

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Jonathan Wang <jonathanpwang@users.noreply.github.com>
PR 3 of the memory-bus-u16 split, stacked on **#2794** (disjoint from PR
2). Shrinks SHA-2 and Keccak chip trace columns that are only consumed
as packed u16 cells through the memory bus.

## Why

#2794 still stores SHA-2 main-chip state / message bytes and Keccak
state / pointer columns byte-shaped, packing them at memory-bus call
sites via `pack_u8_block`. Because these columns are never inspected
byte-by-byte by the AIR (the constraints operate on packed u16 values,
or on byte-XOR buffers that intentionally stay byte-shaped), they can be
stored u16-shaped directly — halving the affected trace columns and
removing one `pack_u8_block` per row.

This is the second of the four cell-narrowing PRs (after PR 2 for RV64
control-flow) and is disjoint from it: the only files in this PR live
under `extensions/sha2`, `extensions/keccak256`, and
`crates/circuits/sha2-air`, plus one helper addition in
`extensions/riscv/circuit/src/adapters/mod.rs`.

## What changes

### SHA-2 section

Files:
`extensions/sha2/circuit/src/sha2_chips/{main_chip,block_hasher_chip}/`,
`crates/circuits/sha2-air/src/{air,columns,trace}.rs`,
`extensions/sha2/circuit/cuda/`.

Main chip:

- `prev_state`, `new_state` are u16 cells.
- `message_bytes` reshapes to `message_u16s` (byte pairs packed at the
call site).
- Pointer limbs `dst_ptr_limbs`, `state_ptr_limbs`, `input_ptr_limbs`
reshape to u16 where the AIR only consumes them as packed pointer cells;
the high u16 cell gets the scaled 16-bit range check.

Block hasher:

- `prev_hash`, `final_hash` are u16 cells.

`sha2_bus`:

- `MESSAGE_1` / `MESSAGE_2` payloads are packed to u16 on both sender
and receiver.

Stays byte-shaped (8-bit bounded by construction):

- `carry_a`, `carry_e`, `carry_or_buffer` in `sha2-air`.

### Keccak section

Files: `extensions/keccak256/circuit/src/{keccakf_op,xorin}/`,
`extensions/keccak256/circuit/cuda/`.

`keccakf_op`:

- `preimage`, `postimage` reshape to u16 cells.
- `buffer_ptr_limbs` reshape to u16; pointer composition uses the scaled
16-bit bound.
- `keccakf_state_bus` payload between `keccakf_op` and `keccakf_perm` is
packed to u16.

`xorin`:

- Pointer limbs u16-shaped (same treatment as `keccakf_op`).
- Byte-XOR buffers (`preimage_buffer_bytes`, `input_bytes`,
`postimage_buffer_bytes`) stay byte-shaped — those columns still carry
byte-level XOR data and the AIR inspects them directly.

CUDA tracegen mirrors the new shapes; `keccakf_op` drops the
`bitwise_lookup` parameter from its kernel and ABI signatures now that
pointer bytes no longer need byte-level checks. The call chip still
wires `bitwise_lu`.

### Adapter helper (cross-cut)

File: `extensions/riscv/circuit/src/adapters/mod.rs`.

- Add `expand_to_rv64_block`: zero-pads `N` u16 limbs to one RV64 bus
block (`BLOCK_FE_WIDTH` cells). Used by `keccakf_op` and `xorin` for
pointer reads against u16-celled `RV64_REGISTER_AS`.
- `BLOCK_FE_WIDTH` imported into the adapters module to support the new
helper.

The existing `expand_to_rv64_register` (8 u8 limbs, byte-shaped) is kept
for byte-shaped chips and the legacy paths.

## Migration notes

- Chips passing SHA-2 main-chip state through the bus should hand the
bus u16 cells directly (no `pack_u8_block`).
- Pointer reads against u16-celled `RV64_REGISTER_AS` from chips storing
only the low 32 bits of the register should use `expand_to_rv64_block`
(BLOCK_FE_WIDTH cells, zero-padded) instead of `expand_to_rv64_register`
(8 u8s).
- Keccak pointer composition should use the scaled 16-bit bound on the
high u16 cell; the previous byte-pair range check on
`bitwise_lookup_bus` is removed for `keccakf_op`.

resolves int-7832, int-7833

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR 2 of the memory-bus-u16 split. This migrates selected RV64 chips from
byte-shaped register/memory columns to u16-shaped columns after #2794
made the memory bus operate on fixed `BLOCK_FE_WIDTH = 4` u16 cells.

The scope is intentionally split into two classes:

- For equality/comparison-style chips, `hintstore`, and the new u16
adapters, the main change is the limb shape: `u8` limbs become `u16`
limbs, byte-pair bitwise range checks become per-limb
`VariableRangeChecker` checks, and the core logic remains essentially
the same.
- For control-flow instructions that previously used byte-level
decompositions for PC/result arithmetic (`auipc`, `jalr`, `jal_lui`),
the constraints were reworked more substantially because the old carry
structure was tied to 8-bit limbs.

Subsequent PRs build on the u16 adapter variants and bigint limb
constants introduced here.

## Why

#2794 made the memory bus fixed-shape over packed u16 cells. Some chips
still materialized 8 byte columns per RV64 register value, packed them
into u16 bus cells at the memory interaction, and carried
`BitwiseOperationLookup` only to prove those byte limbs were canonical.

For chips that do not inspect byte-level structure, storing u16 limbs
directly halves the affected register-value trace width and removes
those byte-pair lookup dependencies. The bus sees the native u16
payload, while the AIR uses `VariableRangeChecker` to prove each
materialized u16 limb is bounded.

For PC/result arithmetic, the old byte AIRs also avoided field-modulus
aliasing by carrying through small limbs. The u16 rewrites preserve that
property with fewer, wider limbs; they do not replace 32-bit arithmetic
with one unconstrained field equality.

## What Changes

### Mostly Mechanical Limb-Shape Migrations

These chips keep the same core argument, but change the limb type/width:

- `branch_eq`: compares u16 limbs directly. Equality inverse logic is
unchanged.
- `branch_lt`: signed/unsigned comparison logic is unchanged, with
`LIMB_BITS = 16` and u16 MSB handling.
- `less_than`: same comparison migration, wired through the u16 ALU
adapter.
- `hintstore`: stores `mem_ptr`, `rem_words`, and hint data as
u16-shaped columns. `rem_words` fits in one u16 cell, and pointer bounds
are checked through the high u16 limb.
- Bigint `BranchEqual256`, `BranchLessThan256`, and `LessThan256`:
inherit the u16 shape with `INT256_NUM_U16_LIMBS = 16`.

The new u16 adapters follow the same pattern: register pointer values
and heap/register payloads are represented as u16 cells directly, with
range checks moving from byte-pair lookup to `VariableRangeChecker`.

Byte-shaped bigint chips stay byte-shaped:

- `BaseAlu256`, `Mul256`, `Shift256`, and `Sll256` still use
`INT256_NUM_U8_LIMBS = 32` because their AIRs rely on byte-level
boolean, carry, or shift structure.

### Constraint Rewrites

These chips needed more than a limb-size substitution:

- `auipc`: the old AIR decomposed the PC and immediate into byte limbs
and carried the addition byte-by-byte. The new AIR uses u16 limbs
instead: it derives/range-checks `pc_low` from `from_pc` and a witnessed
`pc_high`, then constrains `rd = from_pc + (imm << 8)` with low-u16 and
high-u16 carry bits. This keeps each arithmetic relation below the field
modulus while dropping the extra byte-limb PC columns.
- `jalr`: the old AIR used low-32-bit byte limbs for `rs1`, `to_pc`, and
`rd = pc + 4`. The new AIR keeps u16 limbs, derives the low rd limb from
`from_pc`, range-checks the high rd limb against `PC_BITS`, and carries
`rs1 + sign_extend(imm)` over two u16 limbs while separately enforcing
the cleared target low bit.
- `jal_lui`: JAL derives the low rd limb from `from_pc + 4` and the
witnessed high u16 limb, then range-checks the high limb against
`PC_BITS`. LUI uses a u16-specific split because `imm << 12` crosses a
u16 boundary: `rd[0] = imm_low_4 << 12` and `imm = imm_low_4 + (rd[1] <<
4)`.

The old byte-limb form would still be possible, but the u16 form uses
fewer witness columns/range checks while preserving limb-by-limb integer
soundness in the BabyBear field.

### New u16 Adapter Variants

- `Rv64BaseAluU16Adapter`: reads/writes one `[u16; BLOCK_FE_WIDTH]`
register block directly and handles immediates as u16-low-limb plus sign
bit.
- `Rv64VecHeapU16Adapter`: u16-cell heap reads/writes with u16-shaped
register pointer values.
- `Rv64VecHeapBranchU16Adapter`: read-only branch variant for u16-shaped
heap operands.
- `VecToFlatAluAdapter` gains a `BLOCK_VALUE_WIDTH` parameter so both
byte-shaped and u16-shaped heap adapters can share the flattening logic.

The legacy byte-shaped adapters remain for chips that still need
byte-level constraints.

### Wiring

- RISC-V and bigint extension wiring now threads `VariableRangeChecker`
into the u16-shaped chips.
- `BitwiseOperationLookup` remains wired only for byte-shaped chips that
still need byte-level constraints.
- Rv64Io no longer creates a bitwise lookup solely for `hintstore`.

### CUDA Parity

CUDA tracegen mirrors the Rust changes:

- migrated core records use `uint16_t` arrays where applicable;
- kernels use `VariableRangeChecker.add_count(limb, LIMB_BITS)` for u16
limbs;
- bitwise lookup parameters are removed from migrated kernels;
- bigint CUDA wiring separates byte-shaped chips from u16-shaped chips.

### Helper Cleanup

- `INT256_NUM_U8_LIMBS` and `INT256_NUM_U16_LIMBS` make bigint limb
width explicit.
- Shared RV64 helpers cover byte/u16 conversion, low-32-bit pointer
splitting, pointer-bound range-check values, and zero-extension to one
u16 memory block.
- Byte-view memory helpers use explicit `read_bytes` / `write_bytes`
where the access is intentionally byte-addressed.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rs (#2805)

PR 4 of the memory-bus-u16 split, stacked on **PR 2**. Shrinks adapter
trace columns that only reconstruct packed memory-bus payloads, and
migrates the algebra modular `is_eq` chip to u16 limbs.

This PR sits on PR 2 (not #2794) because the PR 2 squash extended the
shared vec-heap u16 adapter family and dropped the `INT256_NUM_U8_LIMBS`
from `riscv/circuit::adapters`; PR 4 builds on those changes.

## Why

#2794 already reshaped these adapters' memory-aux columns and bus
payloads to packed u16 blocks, but the **trace columns** that
reconstruct those payloads were still byte-shaped:

- `alu_w` / `mul_w` stored `rs1_high` / `rs2_high` as 4 u8 limbs even
though the bus consumes them as 2 u16 cells.
- `loadstore` stored `rs1_data` as 8 u8 limbs even though pointer
arithmetic only needs the low 32 bits as 2 u16 cells.
- The byte-shaped vec-heap adapters (`vec_heap`, `vec_heap_branch`,
`eq_mod`) reconstructed register pointers through byte limbs and bitwise
byte-pair range checks.

All of these can store the column as u16 cells directly and replace the
high-byte range check with a scaled 16-bit range check on the high u16
cell, removing the `BitwiseOperationLookup` traffic.

Separately, algebra's modular `is_eq` chip operates on packed u16
operands and only needs limb-level range checks at `LIMB_BITS = 16`, so
it can migrate to the new u16 adapter introduced in PR 2.

## What changes

### RISC-V W-adapters and loadstore

Files:
`extensions/riscv/circuit/src/adapters/{alu_w,mul_w,loadstore,mod}.rs`,
`extensions/riscv/circuit/src/loadstore/{execution,tests}.rs`,
`extensions/riscv/circuit/src/{base_alu_w,divrem_w,mul_w,shift_w}/tests.rs`,
`extensions/riscv/circuit/cuda/include/riscv/adapters/{alu_w,mul_w,loadstore}.cuh`.

- `alu_w` / `mul_w`: `rs1_high`, `rs2_high` are packed as two u16 cells;
bus reads zero-extend the lower 32 bits and pull u16-shaped upper cells
directly from the trace.
- `loadstore`: `rs1_data` shrinks to `RS1_DATA_U16S = 2` u16 cells (the
low 32 bits of the rs1 register). Bus reads zero-extend the upper half
on the wire.
- STORE-to-`DEFERRAL_AS` is explicitly unsupported in this adapter and
the now-dead `timed_write_deferral` path is removed — #2794 already
dropped that opcode, and the remaining call would have instantiated
`TracingMemory::write::<F, 8>`, which violates `BLOCK_FE_WIDTH`. The
F-celled path needs a separate two-message bus shape if it is ever
reintroduced.
- Shared helpers handle u8-pair packing, low-32-bit u16 composition,
high-u16 pointer scaling, and 16-bit immediate sign-extension.
- W-suffix tests (`base_alu_w`, `divrem_w`, `mul_w`, `shift_w`) updated
to prank u16 cells where the column is migrated.

### Shared heap-data adapters

Files:
`extensions/riscv-adapters/src/{vec_heap,vec_heap_branch,eq_mod,eq_mod_u16,lib}.rs`,
`extensions/riscv-adapters/cuda/include/riscv-adapters/{constants,vec_heap,vec_heap_branch}.cuh`.

- `vec_heap`, `vec_heap_branch`, and `eq_mod` materialize register
pointer values (`rs_val` / `rd_val`) as u16 cells.
- Pointer bounds use `VariableRangeChecker` on the high u16 cell instead
of byte-pair `BitwiseOperationLookup` send_range.
- Adds the byte-shaped vec-heap adapter `READ_SIZE` / `WRITE_SIZE` const
generics so byte-shaped 256-bit chips (`BaseAlu256`, `Mul256`,
`Shift256`) and u16-shaped 256-bit chips (`LessThan256`) can share the
same wrapper without losing the block-size information needed for
`MemoryWriteAuxCols` and pointer arithmetic.
- Adds `Rv64IsEqualModU16Adapter*` for the new u16-shaped modular
`is_eq`.
- CUDA mirrors the new column shapes and packing helpers.

### Algebra modular `is_eq`

Files:
`extensions/algebra/circuit/src/modular_chip/{is_eq,addsub,muldiv,mod,tests}.rs`,
`extensions/algebra/circuit/src/extension/{modular,hybrid,fp2}.rs`,
`extensions/algebra/circuit/src/{execution,fields,preflight,lib}.rs`,
`extensions/algebra/circuit/src/fp2_chip/{addsub,mod,muldiv,tests}.rs`.

- `LIMB_BITS` 8 → 16 where the chip's limbs are only consumed as packed
u16 values.
- Wires through the new `Rv64IsEqualModU16` adapter variant.
- Core uses `VariableRangeChecker` for `b_lt_diff - 1` and `c_lt_diff -
1` instead of byte-pair bitwise checks.
- CPU, hybrid, and CUDA registration paths all use
`Rv64IsEqualModU16Adapter*`.
- Negative tests that previously pranked `[F; 4]` byte arrays now prank
`[F; 2]` u16 cells where the column is migrated; positive coverage
preserved.
- Mod-builder generated AIRs are untouched.

`is_eq` executor memory reads were already adjusted in #2794 for
`MEMORY_BLOCK_BYTES`; PR 4 does not rework that API.

### Cascade through dependents

Files: `extensions/bigint/circuit/src/{lib,extension/mod,tests}.rs`,
`extensions/bigint/circuit/cuda/src/bigint.cu`,
`extensions/ecc/circuit/src/weierstrass_chip/{add_ne,double}/mod.rs`.

- Bigint type aliases and ecc weierstrass call sites pick up the new
`READ_SIZE` / `WRITE_SIZE` generic positions on `Rv64VecHeapAdapter*`.
Byte-shaped 256-bit chips pass `MEMORY_BLOCK_BYTES`; u16-shaped chips
pass `BLOCK_FE_WIDTH`.
- Bigint pulls `INT256_NUM_U8_LIMBS` from the bigint crate root (per PR
2's reviewer feedback) instead of `riscv/circuit::adapters`.

## Migration notes

- `Rv64VecHeapAdapter*` now takes 5 const generics (`NUM_READS,
BLOCKS_PER_READ, BLOCKS_PER_WRITE, READ_SIZE, WRITE_SIZE`). Byte-shaped
256-bit chips: pass `MEMORY_BLOCK_BYTES` for both sizes. u16-shaped
chips: use `Rv64VecHeapU16Adapter`.
- `alu_w` / `mul_w` adapter callsites that materialized `rs_high` bytes
should pull u16-shaped upper cells from the new column slot.
- `loadstore` `rs1_data` columns shrink from 8 u8s to 2 u16 cells;
trace-fillers that previously wrote 8 byte cells should write 2 u16
cells (low 32 bits) and let the AIR zero-extend on the bus.
- `is_eq` modular chip callers should switch from
`Rv64IsEqualModAdapter*` to `Rv64IsEqualModU16Adapter*` and pass
`LANE_SIZE = BLOCK_FE_WIDTH` (= 4 u16 cells per heap-read block).
- STORE-to-`DEFERRAL_AS` is no longer routed through
`Rv64LoadStoreAdapter`; the F-celled deferral-store path was already
removed in #2794 and any reintroduction needs a dedicated two-message
bus shape.

resolves int-7830, int-7831
@shuklaayush shuklaayush force-pushed the develop-v2.1.0-rv64 branch from 72d313f to e6803a8 Compare June 4, 2026 10:04
shuklaayush and others added 3 commits June 4, 2026 12:05
PR 5 of the memory-bus-u16 split, stacked on PR 1. Two independent parts.

Part 1 — internal u16 reshape, OutputKey byte layout unchanged:
- Call core: input_commit, output_commit, output_len trace columns
  reshape to u16 cells.
- Output core: output_commit, output_len reshape to u16 cells.
- Canonicity sub-AIR over u16 cells.
- Deferral rd_val / rs_val register-pointer columns become u16-shaped.
- CPU + CUDA tracegen updates.
- Reuses PR1's byte-memory and field-memory chunk helpers; no new
  generic split_memory_ops-style code is introduced.

Part 2 — output commitment-format change (breaks vkey compatibility):
- SPONGE_BYTES_PER_ROW = 2 * DIGEST_SIZE = 16; sponge_inputs cells hold
  u16 values on data rows and receive 16-bit range checks. Init row stays
  `[deferral_idx, output_len, 0, ...]` and is exempt from u16 checks.
- Two memory-bus writes per output row instead of one.
- def_fn::hash_output_raw packs byte pairs into each sponge cell.
- Bitwise-lookup wiring removed from the output chip only — it no longer
  needs byte-level range checks. The call chip still uses bitwise.
- CUDA + tests updated; expected output commits regenerated.

Output length contract:
- output_len remains a byte count.
- Raw deferral outputs must be a multiple of 16 bytes; non-16-byte
  aligned raw outputs are invalid (no implicit padding).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@shuklaayush shuklaayush force-pushed the feat/memory-bus-u16-pr5 branch from d29250f to d457878 Compare June 4, 2026 10:27
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown

Claude finished @shuklaayush's task in 3m 33s —— View job


Code Review: PR #2808 — Switch to u16 limbs in deferral AIRs

  • Gather context and read the full diff
  • Review deferral call chip changes (air, trace, tests)
  • Review deferral output chip changes (air, trace, execution, tests)
  • Review canonicity sub-AIR changes
  • Review utility functions and helpers
  • Review CUDA tracegen changes
  • Review verify-stark and continuations changes
  • Review guest programs and integration tests
  • Address reviewer comments from previous reviews
  • Post final review

Reviewed all 35 changed files across Rust AIR, tracegen, CUDA, verify-stark, continuations, riscv-adapters, and the algebra is_eq fix. The migration from byte cells to u16 cells is mechanically clean and the doubled sponge absorption rate is correctly threaded through all layers. No correctness bugs found.


Findings

1. COMMIT_MEMORY_OPS derivation is semantically stale (Nit — numerically correct)

utils.rs:36: COMMIT_MEMORY_OPS is still defined as num_byte_memory_ops(COMMIT_NUM_BYTES) = 32 / MEMORY_BLOCK_BYTES = 4. But its only uses now call split_f_memory_ops::<_, COMMIT_NUM_U16S, COMMIT_MEMORY_OPS>, which operates on u16 cells — the const-assert inside split_f_memory_ops verifies COMMIT_NUM_U16S == COMMIT_MEMORY_OPS * BLOCK_FE_WIDTH (16 == 4×4). The value is correct because COMMIT_NUM_BYTES / MEMORY_BLOCK_BYTES == COMMIT_NUM_U16S / BLOCK_FE_WIDTH, but the derivation formula doesn't reflect the u16 migration. Changing to num_f_memory_ops(COMMIT_NUM_U16S) would be cleaner and prevent confusion if these constants ever diverged.

2. Missing per-cell u16 range checks on input_commit reads is intentional (Confirmed correct)

call/air.rs:167-176 range-checks only the canonicity remainder values for input_commit, while output_commit and output_len get both canonicity RCs and per-cell u16 RCs (lines 181-191). This is correct: input_commit is read from RV64_MEMORY_AS via the memory bridge, so the u16 memory bus guarantees cell canonicity. output_commit and output_len are written, so the chip must prove the written values are canonical u16s.

3. is_eq.rs setup-row fix is correct and necessary (Companion fix)

extensions/algebra/circuit/src/modular_chip/is_eq.rs:402-427: During setup, the record's c carries the modulus value itself (not a value < modulus). With u16 limbs, the previous assert!(c_cmp) would fire since c == modulus. The fix:

  • Skips c < modulus assertion during setup.
  • Forces b_diff_idx = READ_LIMBS (out-of-range sentinel) and searches c_diff_idx via reverse scan with .unwrap_or(0).
  • Forces c_lt_mark = F::TWO during setup, avoiding the single-diff-index optimization path that would conflate b and c canonicity.
  • Uses wrapping_sub for c_lt_diff since c[i] == modulus[i] during setup yields 0 via wrapping.

The execution side (execute_e12_impl) is also correctly guarded with if !IS_SETUP (line 730).

4. def_fn.rs:91F::from_usize upgraded to F::from_u32 (Addressed)

The latest commit now uses F::from_u32(u32::try_from(output_ref.len()).unwrap()), which catches unexpected sizes earlier. This addresses the concern from previous reviews.

5. Sponge init-row range check exemption is correct (Positive)

output/air.rs:334-339: Data-row sponge cells get per-cell u16 range checks gated on local.is_valid - local.is_first, correctly exempting the init row. The init row's sponge_inputs are pinned by assert_array_eq against already-canonical deferral_idx and output_len columns — no separate range check needed.

6. Verify-stark RangeCheckerAir<U16_BITS> is a clean addition

guest-libs/verify-stark/circuit/src/lib.rs:108-110 and trace.rs:47-50: The output commit range checks now use a dedicated RangeCheckerAir<U16_BITS> proving context instead of being aggregated into the range_inputs vec. The trace generator uses RangeCheckerCpuTraceGenerator to collect counts and generates the range checker trace row-major. trace.rs:142 correctly returns range_inputs: vec![] since the range checker now has its own AIR. As noted in reviewer comments, a future PR could unify this with the recursion range checker by bumping it to 16-bit.

7. combine_output_cells + expand_to_rv64_block size coupling is compile-time safe

Both call/air.rs:451 and output/air.rs:307 pass output_len_full: [AB::Expr; BLOCK_FE_WIDTH] (= 4 elements, zero-extended from F_NUM_U16S = 2) to combine_output_cells which expects [T; OUTPUT_LEN_NUM_U16S] (= 4). The equality OUTPUT_LEN_NUM_U16S == BLOCK_FE_WIDTH holds because 8 / 2 == 4, and any mismatch would be a compile error.

8. CUDA and Rust constant + struct consistency (Correct)

def_types.h mirrors every u16-cell constant from utils.rs: COMMIT_NUM_U16S, OUTPUT_LEN_NUM_U16S, OUTPUT_TOTAL_NUM_U16S, SPONGE_BYTES_PER_ROW, SPONGE_ROW_MEMORY_OPS, F_NUM_U16S, RV64_PTR_U16S. The helper functions (pack_u8_pairs_le, u32_bytes_to_le_u16_cells, scale_u16_high_cell, scale_output_len, scale_rv64_ptr_from_u32_bytes) have equivalent Rust counterparts. The CUDA column structs (DeferralOutputCols, DeferralCallCoreCols, DeferralCallAdapterCols) match the Rust #[repr(C)] layouts, and sizeof assertions guard against drift.

9. BitwiseOperationLookup fully removed from deferral (Clean removal)

extension/mod.rs no longer creates/looks up bitwise chips, cuda_abi.rs drops d_bitwise, extension/cuda.rs drops get_or_create_bitwise_op_lookup, all tests drop bitwise periphery. The canonicity sub-AIR and per-cell range checks now use VariableRangeCheckerBus exclusively.

10. e2e.rs test correctly delegates to generate_deferral_results (Good cleanup)

crates/continuations/src/tests/e2e.rs:134-145: compute_output_f_commit now delegates to generate_deferral_results + deferral_poseidon2_chip() instead of reimplementing the sponge hash. This eliminates the risk of the test diverging from production when the sponge rate changes.

11. Guest test outputs correctly updated for 16-byte alignment

All guest programs (single.rs, multiple.rs in both deferral/tests/ and continuations/) extend expected outputs from 8 to 16 bytes. The new test output_raw_len_must_be_sponge_row_aligned verifies the panic path. The deferral/tests/src/lib.rs input arrays are extended to 16 bytes.


Addressing reviewer comments

  • @stephenh-axiom-xyz — u8 vs u16 in records: The byte-shaped record types (DeferralCallReadsBytes, DeferralCallWritesBytes) are intentional. The execution record captures byte-level data; the trace filler packs bytes into u16 column cells. The column types are now u16-shaped.

  • @stephenh-axiom-xyzBitwiseOperationLookupAir for u16 ops: Correctly noted by @shuklaayush that the table would need 2^32 entries, making it impractical.

  • @stephenh-axiom-xyz — recursion range checker: The current PR adds RangeCheckerAir<U16_BITS> in verify-stark. A future PR could unify with the recursion range checker.

  • @GunaDDmultiple.elf: Pre-compiled ELF for the updated guest program — expected to be committed alongside the source change.

  • @GunaDDRootCircuit::new assertion: Upgrades existing debug_assert! to assert! via centralized assert_user_pvs_shape. The requirements were always needed but not checked in release builds.


Summary

The PR is well-executed. The u16 cell migration is consistent across all layers, the doubled sponge rate is correctly factored through shared constants, and the is_eq.rs companion fix is a necessary correction for setup-row correctness with u16 limbs. The only actionable nit is the COMMIT_MEMORY_OPS derivation formula (finding #1), which doesn't affect correctness.

Verdict: Looks good to merge.

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
group app.proof_time_ms app.cycles leaf.proof_time_ms
fibonacci 1,555 4,000,051 438
keccak 13,965 14,365,133 2,378
sha2_bench 9,193 11,167,961 1,426
regex 1,586 4,090,656 356
ecrecover 488 112,210 268
pairing 596 592,827 253
kitchen_sink 1,991 1,979,971 404

Note: cells_used metrics omitted because CUDA tracegen does not expose unpadded trace heights.

Commit: d457878

Benchmark Workflow

@shuklaayush shuklaayush force-pushed the develop-v2.1.0-rv64 branch 4 times, most recently from 2995a72 to e582d73 Compare June 7, 2026 11:31
@jonathanpwang jonathanpwang removed their request for review June 9, 2026 01:08
@shuklaayush shuklaayush force-pushed the develop-v2.1.0-rv64 branch from dac2c10 to db32ec4 Compare June 10, 2026 18:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.